Last week we looked at how correlations can be inflated, so this week we'll look at how they can be deflated.
Suppose you want to find out how well students' marks on graduation from high school predict their marks in the first year of university. You select a sample of university students and correlate their high school marks with their university marks. You will probably find a much weaker correlation than you expected, and with a modest sample it may well fail to reach statistical significance.
This result is counterintuitive, but the reason for it is simple. Only the best students get into university, and even if they do as well in university as they did in high school, their marks will fall in a very restricted range. That is, there is simply less difference in ability among these students than there would be if the full range of ability had been sampled, so it is difficult to observe a correlation between their scores. This problem is usually called restriction of range.
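To make the effect concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption chosen for illustration rather than real data: marks are standardized and normally distributed, the population correlation between high school and university marks is set to 0.6, and the hypothetical admission rule lets in only the top 20% of high school marks. The helper names `pearson_r` and `simulate_marks` are invented for this example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_marks(n, true_r, rng):
    """Draw n (high_school, university) pairs of standardized marks
    whose population correlation is true_r (an assumed value)."""
    pairs = []
    for _ in range(n):
        hs = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Mix the high school mark with independent noise so that
        # corr(hs, uni) equals true_r in the population.
        uni = true_r * hs + math.sqrt(1 - true_r ** 2) * noise
        pairs.append((hs, uni))
    return pairs


rng = random.Random(42)  # fixed seed so the run is repeatable
everyone = simulate_marks(20_000, true_r=0.6, rng=rng)

# Hypothetical admission rule: only the top 20% of high school marks get in.
cutoff = sorted(hs for hs, _ in everyone)[int(0.8 * len(everyone))]
admitted = [(hs, uni) for hs, uni in everyone if hs >= cutoff]

r_all = pearson_r(*zip(*everyone))
r_admitted = pearson_r(*zip(*admitted))

print(f"all school leavers: r = {r_all:.2f} (n = {len(everyone)})")
print(f"admitted only:      r = {r_admitted:.2f} (n = {len(admitted)})")
```

Under these assumptions, the correlation among the admitted students should come out markedly lower than the correlation in the full group, even though the relationship between the two sets of marks is identical for everyone. Nothing about the students changed; only the range of high school marks that was sampled did.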
The distribution of marks will probably be skewed as well, which further militates against finding a correlation. Problems like these are why I oppose the idea that people with no training in inferential statistics can conduct data mining.