
Testing the Tests
Tests are everywhere these days. They are required for admission to schools and for admission to professional and occupational groups, and they are increasingly being required of applicants for jobs. Throughout North America, the educational accountability movement has created a tidal wave of jurisdiction-wide testing in schools.

So I thought a brief non-technical review of the characteristics of good tests would be helpful. Well, I'll try to be as brief and non-technical as I can be.

Tests are often described as standardized. In the beginning all that meant was that they were standard, but over the years the term standardized has taken on a narrower and more useful meaning. It is now used by many people in testing to describe a test whose scores have been set by comparison with the performance of a norm group, that is, a group considered to be similar to the people for whom the test is intended. Scores for a standardized test of academic achievement, then, would be set by comparison with a group of students of an appropriate age who had taken the test. These scores often take the form of percentiles, which are simply statements of the percentage of the norm group who answered fewer questions correctly. So if you're told that you scored at the seventieth percentile, that means you got more questions right than did 70% of the norm group.
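For readers who like to see the arithmetic spelled out, here is a rough sketch, in Python, of a percentile rank as just described. The norm-group scores are made up for illustration; no testing company computes its norms from a list this small.

    def percentile_rank(raw_score, norm_group_scores):
        """Percentage of the norm group who answered fewer questions correctly."""
        below = sum(1 for s in norm_group_scores if s < raw_score)
        return 100.0 * below / len(norm_group_scores)

    # Hypothetical raw scores (number of questions correct) for a small norm group.
    norm_group = [12, 15, 18, 18, 20, 22, 23, 25, 27, 30]

    print(percentile_rank(23, norm_group))  # 60.0: better than 60% of the norm group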

A couple of important considerations about percentiles are worth mentioning. The first is that differences in percentile scores usually represent smaller differences in performance the closer you get to the average score. Usually there are more scores near the average than away from it, so a difference between the 50th and 55th percentiles, say, usually represents a smaller difference in the number of questions correct than does the difference between the 80th and 85th percentiles. The second consideration is important in academic testing. Tests of academic achievement usually produce grade-equivalent scores. A grade-equivalent score of 6.3 means that your score is equivalent to that of a student in the third month of grade 6. Testing companies, however, do not test children in every month of every grade. They simply interpolate the scores by drawing lines between the performances in the grades and months in which they did test children.
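Again for the curious, here is a rough sketch of that interpolation. The anchor points are invented; in practice the lines are drawn between the scores of the groups actually tested.

    # Invented anchor points: (grade expressed as a decimal, median raw score there).
    anchors = [(6.0, 38), (6.9, 47), (7.9, 55)]

    def grade_equivalent(raw_score):
        """Convert a raw score to a grade equivalent by drawing straight lines
        between the anchor points (linear interpolation)."""
        for (g1, s1), (g2, s2) in zip(anchors, anchors[1:]):
            if s1 <= raw_score <= s2:
                return g1 + (g2 - g1) * (raw_score - s1) / (s2 - s1)
        return None  # outside the range of the tested grades

    print(round(grade_equivalent(41), 1))  # 6.3: third month of grade 6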

Standardized tests have been widely criticized, often with good reason and not infrequently with bad. The most serious question about standardized testing is its value for assessing individuals. If an individual's score on a test varies widely with repeated testing, either the test is not accurate or it is attempting to measure something which is not stable. Either way, its scores are useless. Tests of intelligence usually claim to be highly stable, but the proof of the pudding is in the eating. Other tests, such as tests of academic achievement, although useful in other ways, are rarely accurate enough for a single assessment to be dependable. If a decision about an individual is made with a test which does not provide accurate individual assessments, then you're likely to get an incorrect decision. Research in Ontario has shown that teachers tend to use individual academic achievement scores as additional sources of information rather than as wholly reliable scores, which is a responsible way to use them.

The chief competitor of standardized testing these days is performance assessment, also known as authentic testing. The idea of performance assessment is simply that the best way to assess someone's mastery of a skill is to have them perform the skill. For example, people aren't allowed to drive on the public highways until they've passed the driver's test; we wouldn't let them go on the road just because they passed the written test.

The problem with performance assessment is that it's difficult to devise dependable tests of this type; you need only spend a little while in traffic to realize that. Research on educational performance assessments has shown that tests of similar topics often produce dissimilar results. In fact, standardized test results have been found to be more closely related to performance assessment scores than the performance assessment scores are to each other. That isn't surprising, since any individual performance assessment samples only a limited range of behaviour (the driving test, for example, doesn't assess performance when a government agent is not in the car).

Performance assessments are often useful when you can make a lot of them. Schoolteachers, for example, make lots of performance assessments and keep records of them. They also compare performance against a criterion rather than a norm-referenced standard, which is probably also an important factor in the successful use of performance assessments.

Tests are scored more or less objectively. In testing, objectivity means simply universal agreement. A test is objectively scored if everyone scoring it arrives at the same score. Tests like essay examinations will obviously fall short of this ideal. Proper training of scorers followed by proper monitoring of scoring can move essay examinations toward objectivity, but they won't approach it closely, and the interpretation of essay test results, or of the results of any test which is less than completely objective, should include consideration of a statistical analysis of its objectivity.
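One simple statistic of that kind is the proportion of papers on which two scorers award exactly the same mark. The marks below are made up for illustration; a fuller analysis would also look at how far apart the disagreements are.

    # Invented marks awarded by two scorers to the same eight essays.
    scorer_a = [4, 3, 5, 2, 4, 3, 5, 1]
    scorer_b = [4, 3, 4, 2, 4, 2, 5, 1]

    agreement = sum(a == b for a, b in zip(scorer_a, scorer_b)) / len(scorer_a)
    print(f"Exact agreement: {agreement:.0%}")  # 75% for these invented marks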

Holistic scoring is scoring which involves consideration of the entire performance of the tested person. That is, a person performs some required tasks (answering questions, for example) or creates a product, and the holistic score is arrived at by consideration of the performance or product as a whole, and not by individual ratings of different aspects of the performance (individual questions, for example, or the ability to make left turns) or product. Objectivity will of course be an especially important consideration with this type of scoring.

Two words often trotted out in discussions of tests are reliability and validity. Reliability is simply consistency. Different types of reliability have been defined. For example, internal consistency is the extent to which different parts of a test produce the same assessment. Obviously, the more internally consistent a test is, the more accurate it can be. Stability is the degree of similarity between scores on repeated administrations of the test between which the scores ought to be similar (what I mean is that you wouldn't have much faith in your ruler if it kept giving you the same estimate of the height of a growing child; the idea of stability assumes that nothing has really changed between administrations). Equivalence is the degree of similarity between scores on different forms of the test. If you're going to do repeated testing you need different forms of the test so that familiarity does not increase scores on the second administration. The forms, of course, should produce similar results.
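Here is a rough sketch of one way of looking at internal consistency: split the items into halves and correlate the half scores. The item responses are invented, and a real analysis would report a proper coefficient (a corrected split-half, say, or coefficient alpha) rather than this bare correlation.

    def pearson(x, y):
        """Ordinary correlation coefficient between two lists of numbers."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    # Invented item scores (1 = correct, 0 = wrong), one row per test taker.
    responses = [
        [1, 1, 1, 0, 1, 1, 0, 1],
        [1, 0, 1, 0, 1, 0, 0, 1],
        [0, 0, 1, 0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1, 0, 0, 0],
    ]

    odd_half = [sum(row[0::2]) for row in responses]
    even_half = [sum(row[1::2]) for row in responses]
    print(round(pearson(odd_half, even_half), 2))  # about .79 for these invented data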

The reliability of a test is often expressed in various reliability coefficients which have a maximum value of 1. The absolute minimum standard for any of these coefficients is .71 (an argument can be made that one type of coefficient, the split-half, should be at least .83). Any lower figure, for good mathematical reasons, means that the test is simply inadequate. A coefficient of .71 can be taken as representing an improvement of 50% over complete inconsistency, a coefficient of .80 as representing an improvement of 64%, and a coefficient of .90 as representing an improvement of 81%.
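Those improvement figures, by the way, are just the squares of the coefficients:

    for r in (0.71, 0.80, 0.90):
        print(f"coefficient {r:.2f} -> improvement of about {r * r:.0%}")
    # .71 -> 50%, .80 -> 64%, .90 -> 81%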

Reliability should be assessed at every administration of the test. Tests are used if they have been reliable in the past, but they are only useful to you if they are reliable when you use them. Groups of people tested can differ in many important ways, some of which can affect reliability. It is not unusual to find that a highly acclaimed test fails to live up to its history of reliability when you use it. Usually this is not a reflection on the quality of the test (or of you as a test administrator), but simply a reflection of the facts of life: no test is appropriate for everyone. For example, the reliability of many tests varies markedly with the age of the people taking them.

The validity of a test is simply its relevance. A driving test, for example, is valid if it predicts ability to drive. Ability to drive can be measured in a number of ways: the number of traffic tickets received, for example.

Recent years have seen a revival of the reputation of content or face validity, a revival I consider wholly unwarranted. A test has content validity if its items refer to the area of knowledge or to the skills which the test is intended to assess. Obviously, you want a math test to have questions about math, and a driving test to have something to do with driving. However, content validity is no guarantee that the test will have any value. If the items on a math test are too difficult, for example, everyone will end up with low scores and it will be difficult to assess differences in mathematical ability. If everyone gets zero, you know nothing about their relative mathematical abilities, even though the test has content validity. These days, though, people often talk as if content validity were an adequate substitute for predictive or concurrent validity.

Predictive validity is a test's ability to predict how a person will perform at a later date on a different assessment of ability: performance in school or on a job, for example. Concurrent validity assesses how well a test agrees with a concurrent assessment of a different type; this type of validity is important if you want to use the test as a substitute for a less convenient measure. Construct validity is a more complicated concept which involves explaining test scores as the results of psychological concepts: intelligence, for example, or motivation. It is chiefly useful as a guide to research, and interpreting the results of research in construct validation requires some statistical knowledge.

The results of the assessment of predictive and concurrent validity are expressed as correlation coefficients, and again the absolute minimum standard is .71. Often evidence of predictive validity is impossible to obtain. For example, university entrance examinations notoriously fail to predict success in university. The reason, though, is not necessarily the inadequacy of the examinations but the inadequacy of the sample. When you compare scores on the entrance examination with success in university, you're looking at the success only of the highest-scoring students on the entrance examination. If people with middling or low scores on the entrance examination were admitted to university then you could well find a relationship between the entrance examination results and success in university.
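You can see the problem with a simulation. The numbers below are entirely made up; the point is only that when you look at the top scorers alone, the correlation shrinks, even though the examination predicts well across the full range of applicants.

    import random

    def pearson(x, y):
        """Ordinary correlation coefficient between two lists of numbers."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    random.seed(1)
    # Simulated entrance-examination scores and later university performance.
    exam = [random.gauss(500, 100) for _ in range(2000)]
    success = [0.004 * e + random.gauss(0, 0.4) for e in exam]

    # Admit only the top fifth of scorers, as a selective university might.
    cutoff = sorted(exam)[int(0.8 * len(exam))]
    admitted = [(e, s) for e, s in zip(exam, success) if e >= cutoff]

    print(round(pearson(exam, success), 2))    # all applicants: around .7
    print(round(pearson(*zip(*admitted)), 2))  # admitted students only: much lower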

That's my brief non-technical guide. If you've found it too brief or too technical, send me some e-mail (there's an e-mail link a few lines farther down).

Testing the Tests © 1999, John FitzGerald
Home page | Decisionmakers' index | E-mail