Perils of Eyeballing
Drawing conclusions from data without performing a statistical test is referred to by researchers as eyeballing. The problem is that you can't trust your eyeballs.
Let's suppose we have a test that is supposed to predict whether children will have difficulty learning to read. We administer it to 100 children just before they begin to receive their first instruction in reading. The test identifies ten children as likely to have difficulty.
At the end of the school year we test the children's ability to read and find that eight of the ten children who were identified as likely to have difficulty reading actually do have difficulty. Should we decide to give all beginning pupils the test before the next school year and then provide extra help to the children the test identifies as likely to have difficulty?
No, we should not. We do not have enough evidence, and we have not even fully examined the evidence we have. First of all, we need to consider the children who were not identified but nevertheless ended up having difficulty reading. Let's say there are four such children. That means the test made six mistakes in identification all told – the two children who were supposed to have difficulty but didn't, and the four who were not supposed to have difficulty but did.
If we hadn't used the test we would in effect have been assuming that no children would have difficulty. Therefore, we would have made twelve incorrect predictions – for the twelve children who did have difficulty. Can we conclude that the test reduced the number of errors by half?
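The bookkeeping above can be tallied in a few lines of Python. The cell counts come straight from the example; the variable names are mine:

```python
# Counts from the example: 100 children, 10 flagged by the test,
# 8 of the flagged and 4 of the unflagged had difficulty reading.
flagged_with_difficulty = 8      # flagged, and did have difficulty
flagged_without_difficulty = 2   # flagged, but read fine (false positives)
missed_with_difficulty = 4       # not flagged, but had difficulty (false negatives)
total_children = 100

children_with_difficulty = flagged_with_difficulty + missed_with_difficulty

# Errors made by the test: false positives plus false negatives.
test_errors = flagged_without_difficulty + missed_with_difficulty

# Errors made by the "assume nobody will struggle" baseline: every
# child who did have difficulty is an incorrect prediction.
baseline_errors = children_with_difficulty

print(test_errors, baseline_errors)  # 6 12
```

The halving of the raw error count, from 12 to 6, is exactly the figure the question above asks about.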
Before we can draw that conclusion we have to assess the statistical significance of the difference between the numbers of correct and incorrect predictions under the two approaches. We have to use the right test to assess it, too.
If we perform what is called a contingency test, which is essentially a test of the correspondence between prediction and outcome, we will find a statistically significant relationship. However, the relationship is statistically significant chiefly because of the success of the test in identifying children who will not have problems (it correctly identifies 86 of the 88 such children). We, however, are interested in identifying children who will have problems.
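A minimal sketch of such a contingency test, done here as a hand-rolled chi-square test of independence on the 2-by-2 table the example implies (the 3.84 cutoff is the 0.05 critical value for one degree of freedom):

```python
# Prediction (rows) by outcome (columns), from the example.
table = [
    [8, 2],    # flagged:     had difficulty, no difficulty
    [4, 86],   # not flagged: had difficulty, no difficulty
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

chi_sq = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (observed - expected) ** 2 / expected

print(round(chi_sq, 2))  # about 48.65, far above the 3.84 cutoff
```

One caveat worth noting: the smallest expected count here is 1.2, below the usual rule-of-thumb minimum of 5, so in practice an exact test would be a safer choice than the chi-square approximation. Either way, the "significant" verdict is driven by the bulk of the table, the children without problems.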
A better approach is to compare the two procedures' numbers of correct identifications of children with reading problems using a proportions test, or to perform a goodness-of-fit test. Either one will tell us that the test has not significantly increased the number of children correctly identified as likely to have difficulty learning to read.
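One plausible framing of that proportions test (the choice of which proportions to compare is my assumption, not spelled out above) is to compare the overall error rates of the two approaches, 12 errors in 100 without the test versus 6 in 100 with it, using a two-proportion z-test:

```python
import math

# Error counts from the example: 12/100 without the test, 6/100 with it.
errors_without_test, errors_with_test, n = 12, 6, 100

p1 = errors_without_test / n
p2 = errors_with_test / n
pooled = (errors_without_test + errors_with_test) / (2 * n)

# Standard error of the difference under the pooled null hypothesis.
se = math.sqrt(pooled * (1 - pooled) * (2 / n))
z = (p1 - p2) / se

print(round(z, 2))  # about 1.48, below the 1.96 needed for p < 0.05
```

Under this framing the difference falls short of significance at the 0.05 level, consistent with the conclusion above: halving the error count in a sample of 100 looks impressive to the eyeball but is well within what chance could produce.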
When you try to predict membership in a class, prediction becomes harder as the class shrinks below 50% of the population. Demonstrating effective prediction of membership in a small class usually requires a large sample, and the necessary sample size is inversely proportional to the size of the class. As usual, the best decision will be made not with the eyeballs but with the intellect.
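A back-of-the-envelope illustration of that inverse proportionality (my construction, under the simple assumption that you need some fixed number of class members in the sample before you can say anything about them):

```python
import math

def sample_size_for_expected_cases(k, prevalence):
    """Smallest n whose expected number of class members is at least k."""
    return math.ceil(k / prevalence)

# Requiring, say, 30 expected class members:
for p in (0.50, 0.12, 0.01):
    print(p, sample_size_for_expected_cases(30, p))
```

At 50% prevalence a sample of 60 suffices; at the 12% prevalence of the reading example you need 250; at 1% you need 3,000. Halving the prevalence doubles the required sample.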