The Error Rate Problem

As we saw in the article about statistical significance, the less probable it is that a difference occurred by accident, the more significant the difference is. If a difference is unlikely to have occurred by accident, then it's considered statistically significant.

Nevertheless, even if there is only a slight probability that a difference is accidental, it can still be accidental. An accidental difference which is statistically significant is known formally as a Type I error, and less formally as a spurious difference. Spurious differences are not likely to be a huge problem if you're only testing one difference, but as the number of differences you're testing increases, the likelihood of detecting a spurious difference grows alarmingly, especially if you're using the most popular significance criterion.

For example, if you're assessing ten independent differences with a significance criterion of 5% (p < .05), and none of the differences is real, you have a 40% chance of detecting at least one spurious difference. If you assess twenty differences, the probability is 64%.

The problem is attenuated considerably if you use the 1% criterion I prefer, but it's still a problem. The equivalent probabilities are 10% in ten comparisons and 18% in twenty.
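
The percentages above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the comparisons are independent and that no real differences exist; the function name is made up for illustration. The chance of at least one spurious difference is one minus the chance that every comparison avoids one.

```python
def chance_of_spurious_difference(alpha, comparisons):
    """Chance of at least one spurious difference.

    Assumes independent comparisons and no real differences,
    so each comparison avoids a Type I error with probability 1 - alpha.
    """
    return 1 - (1 - alpha) ** comparisons


# Print the chance for the 5% and 1% criteria at ten and twenty comparisons.
for alpha in (0.05, 0.01):
    for comparisons in (10, 20):
        rate = chance_of_spurious_difference(alpha, comparisons)
        print(f"criterion {alpha:.0%}, {comparisons} comparisons: {rate:.0%}")
```

If the comparisons are correlated, as ratings of related opinion items often are, the true figures will differ somewhat from these.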

One implication of this problem can be seen in the common practice of comparing opinion items individually. For example, people might be asked to rate their agreement with ten statements of opinion before they go into a program, and then to rate it again afterwards. If you compare the ratings of each individual item before and after and find one significant difference, you really cannot accept that as evidence of any change in opinion whatever. If you find two significant differences, you still have little reason to argue for a change in opinion. So what can you do about this problem?

If all else fails you can always reduce the significance criterion for each comparison to a value that produces an acceptable overall risk of spurious differences. The best solution, though, is usually scaling. For example, the ten opinion items may all be intended to measure the same opinion, so scaling will allow you to work out a single attitude score for the ten items (as well as telling you whether you're justified in combining the ratings into a single score). We'll look at how to do that next week.
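
As for the fallback of reducing the significance criterion, one common way to choose the reduced value is to run the arithmetic of the error rate backwards. The sketch below uses what is usually called the Šidák correction; it is offered as an illustration, not necessarily the method intended here, and it again assumes independent comparisons. The function name is hypothetical.

```python
def reduced_criterion(acceptable_risk, comparisons):
    """Per-comparison criterion that holds the overall risk at acceptable_risk.

    Sidak-style correction: solves 1 - (1 - c) ** comparisons = acceptable_risk
    for c. Assumes the comparisons are independent.
    """
    return 1 - (1 - acceptable_risk) ** (1 / comparisons)


# Ten comparisons, with an overall risk of a spurious difference held to 5%.
criterion = reduced_criterion(0.05, 10)

# Check: applying this criterion to all ten comparisons gives back the 5% risk.
overall_risk = 1 - (1 - criterion) ** 10

print(f"criterion for each comparison: {criterion:.4f}")
print(f"overall risk: {overall_risk:.2%}")
```

For ten comparisons the reduced criterion comes out at roughly .005, close to what you get by simply dividing 5% by ten (the Bonferroni approach). The price is that real differences also become harder to detect.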

The Error Rate Problem © 1999, John FitzGerald