What Are Education Tests For, Anyway?

Pay attention to this piece. There's going to be a test at the end.

Did that trigger scary memories of the 10th grade? Or are you just curious how you'll measure up?

If the answer is "C: Either of the above," keep reading.

Tests have existed throughout the history of education. Today they're being used more than ever before — but not necessarily as designed.

Different types of tests are best for different purposes. Some help students learn better. Some are there to sort individuals. Others help us understand how a whole population is doing.

But these types of tests are easily confused, and more easily misused. As the U.S. engages in another debate over how — and how much — we test kids, it might be helpful to do a little anatomy of assessment, or a taxonomy of tests.

Teachers divide tests into two big categories: formative and summative.

Formative assessment, aka formative feedback, is the name given to the steady little nudges that happen throughout the school day — when the teacher calls on someone, or sends a student up to the board to solve a problem, or pops a quiz to make sure you did the reading.

Any test given for purely diagnostic reasons can also be formative. Say a new student comes to school and teachers need to see what math class she should be in. What distinguishes formative assessments is that they're not there to judge you as a success or failure. The primary purpose is to guide both student and teacher.

Nobody really argues against formative tests, so let's forget about them for now.

Summative assessment, on the other hand, sums up all your learning on one big day: the unit test, the research paper, the final exam, the exhibition.

When it comes to summative tests, U.S. schools really love a particular subcategory of them: psychometrically validated and standardized tests. Psychometrics, which literally means "mind measurement," is the discipline of test-making. It's a statistical pursuit, which means it's mostly math. Giant chunks of social science are based on the work of 19th-century psychometricians, who came up with tools and concepts like correlation and regression analysis.

But the most famous of those tools is the bell curve. Almost any aspect of the human condition, when plotted on a graph, tends to assume this famous shape: crime, poverty, disease, marriage, suicide, weight, height, births, deaths. And, of course, when Alfred Binet developed the first widely used intelligence tests in the early 1900s, he made sure that the results conformed to that same bell curve.

Why does it matter that most of our tests are written by specialists in statistics? Well, when psychometricians write a test, they spend a lot of time ensuring standardization and reliability.

Reliability means if you give the same test to the same person on two different occasions, her scores should not be wildly different. And standardization means that, across a broad population, the results of the test will conform to an expected distribution — that bell curve, or something like it. If you give the same test to 20,000 people and they all score a 75, that's not a very useful test.
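For readers who want to see what those two checks look like in numbers, here is a minimal sketch. It assumes reliability is estimated as the correlation between two sittings of the same test, which is one common approach and not necessarily the one any particular test maker uses. The scores and the helper name `pearson` are made up for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five students on two occasions.
first_sitting = [62, 70, 75, 81, 90]
second_sitting = [65, 68, 77, 80, 92]

# Reliability check: a value near 1.0 means scores barely moved between sittings.
test_retest_r = pearson(first_sitting, second_sitting)

# Spread check: if all 20,000 test takers score 75, the standard deviation
# is zero and the test cannot tell anyone apart.
flat_scores = [75] * 20000
flat_spread = pstdev(flat_scores)

print(f"test-retest correlation: {test_retest_r:.2f}")
print(f"spread when everyone scores 75: {flat_spread}")
```

A high correlation in the first check says the test is consistent. It says nothing yet about whether the test measures anything worth knowing.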

These rules are the reason that 4 million U.S. students are taking extra tests this year, not for their own practice but to test the tests themselves. These are the new tests developed to align with the Common Core State Standards. Large field tests are required to establish their standardization and reliability.

A psychometric test is historically grounded, mathematically precise and suitable for ranking large human populations. But those strengths can also be weaknesses.

A reliable test doesn't change much from year to year. That can make it easier to coach students for.
The need for reliable scoring often drives designers to use multiple-choice questions, to avoid ambiguity. But that format has a hard time measuring a whole range of crucial human abilities: creativity, problem-solving, communication, teamwork and leadership, to name a few.
The multiple-choice format and the need for predictability mean psychometric tests, whether a state third-grade reading test or an SAT, all somewhat resemble each other. And so they can end up testing a student's test-taking ability more than actual subject knowledge.

Reliability and standardization can be at odds with a third key concern in psychometrics: validity. That is, does this test actually tell us anything important? Especially, is it predictive of future performance in the real world? Validity ideally is established by comparing students' test scores with some sort of ground truth, such as grades in school, or later success in college. But that takes a long time and a lot of number crunching. And in practice the process is often pretty circular: the validity of test results tends to be based on their correlation with other test results.
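The circularity is easier to see with numbers. The sketch below uses invented scores for five students on two hypothetical tests, plus an invented later outcome (college GPA), and compares the two correlations. None of these figures come from real data, and the helper `pearson` is just an illustrative name.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: the same five students, in the same order in each list.
test_a = [50, 60, 70, 80, 90]             # the test being validated
test_b = [52, 61, 69, 83, 88]             # another standardized test
college_gpa = [3.1, 2.6, 3.4, 2.9, 3.3]   # a later real-world outcome

# The circular check: does the test agree with another test?
r_with_other_test = pearson(test_a, test_b)

# The ground-truth check: does the test predict the later outcome?
r_with_outcome = pearson(test_a, college_gpa)

print(f"correlation with another test: {r_with_other_test:.2f}")
print(f"correlation with college GPA:  {r_with_outcome:.2f}")
```

In this made-up case the two tests agree with each other almost perfectly while predicting the later outcome only weakly, which is the pattern the circular approach would never reveal.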

So, those are the keys to how test makers see their tests. But in the world of education, what matters is not just how tests are written, but how they're used.

That brings us to the tests that so many Americans love to hate: so-called high-stakes tests. These are the ones that decide whether our kids move up to the 4th grade, get a full-ride scholarship or, someday, a job.

In practice, we accept one kind of high-stakes test: the standalone gatekeeper test. Everyone wants a pilot who passed her licensing exam, or a lawyer who passed the bar. We like transparent, objective standards, especially when it's other people who have to meet them.

No, it's the other kind of high-stakes test that draws the most ire: accountability tests. They get this name because they are given to judge the performance of schools, teachers and states, not just students. Accountability tests determine school reorganization and closure decisions, teacher evaluations and state funding.

So, got all that? Good. Now here's your essay question:

Under the federal No Child Left Behind law, passed in 2001, public school accountability has rested largely on the results of psychometrically validated and standardized, largely multiple-choice summative assessments covering math and English only, given annually in 3rd through 8th grades and once in high school. Given what you've just read about the strengths and weaknesses of this test format, is it wise to attach so many consequences to the results? State the reasons for your response.