In the last article on data mining, I mentioned the issue of sampling. As I noted in the original article in this series, you don't want to mine all your data, especially if you have a lot of it.
First, you want to use only a sample of your data to develop your model, so that you can confirm the model on at least one other sample drawn from the same dataset (I have used as many as 10 such samples). Second, if you have a huge dataset, pretty well every relationship is going to turn out to be statistically significant, however weak it may be. You don't want to be modelling weak effects, so you draw a sample of reasonable size (according to both statistical and practical criteria).
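Both points can be sketched in a few lines of Python. This is only an illustration under assumptions of my own choosing: the function names draw_samples and correlation_t are hypothetical, the records are stand-in row numbers, and the sample sizes are arbitrary. The significance check uses the standard t statistic for a Pearson correlation, which is just one kind of "relationship" you might be testing.

```python
import math
import random

def draw_samples(records, sample_size, n_confirmation, seed=0):
    """Draw one development sample and several confirmation samples.

    All samples are disjoint: a record used to build the model is never
    reused to confirm it. (Hypothetical helper, not from the article.)
    """
    needed = sample_size * (n_confirmation + 1)
    if needed > len(records):
        raise ValueError("dataset too small for the requested samples")
    rng = random.Random(seed)
    # One draw without replacement, then slice it into equal chunks.
    drawn = rng.sample(records, needed)
    chunks = [drawn[i:i + sample_size] for i in range(0, needed, sample_size)]
    return chunks[0], chunks[1:]

def correlation_t(r, n):
    """t statistic for testing a Pearson correlation r against zero, n cases."""
    return r * math.sqrt((n - 2) / (1 - r * r))

records = list(range(1_000_000))  # stand-in for the rows of a huge dataset
development, confirmation = draw_samples(records, sample_size=1_000,
                                         n_confirmation=3)

weak_r = 0.02  # an assumed, very weak correlation
t_full = correlation_t(weak_r, len(records))        # about 20: "significant"
t_sample = correlation_t(weak_r, len(development))  # about 0.6: not significant
print(f"t on all data: {t_full:.1f}, t on the sample: {t_sample:.2f}")
```

With a million cases the weak correlation sails past the usual cutoff of about 1.96; in a sample of 1,000 it does not, which is the behaviour you want if you are trying to avoid modelling weak effects.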
A second consideration in sampling is stratification. I recently conducted two surveys of people throughout Ontario. A goal of both surveys was to compare the responses of people from different regions. If you know Ontario, you'll know that there are far fewer people in Northern Ontario than in the south. I therefore oversampled in the north to obtain regional groups that were more or less the same size. If the size of the groups had been proportional to their representation in the population, the larger groups would have given more statistical power, and it would have been easier to detect differences between the more populous regions than between the less populous ones. If groups are unequally represented in your data, you may want to oversample the smaller ones, too.
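Here is a minimal sketch of that kind of oversampling, again in Python. The function name equal_size_stratified_sample is hypothetical, and the region labels and counts are made up for the example; they are not figures from the surveys.

```python
import random
from collections import defaultdict

def equal_size_stratified_sample(records, stratum_of, per_stratum, seed=0):
    """Draw the same number of records from every stratum.

    Small strata end up oversampled relative to their share of the data.
    (Hypothetical helper, not from the article.)
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = {}
    for name, members in strata.items():
        if len(members) < per_stratum:
            raise ValueError(f"stratum {name!r} has only {len(members)} records")
        sample[name] = rng.sample(members, per_stratum)
    return sample

# Made-up sampling frame: 800 northern and 9,200 southern respondents.
frame = ([("north", i) for i in range(800)]
         + [("south", i) for i in range(9_200)])

sample = equal_size_stratified_sample(frame, stratum_of=lambda rec: rec[0],
                                      per_stratum=200)
for region, members in sample.items():
    print(region, len(members))
```

In this made-up frame the north contributes a quarter of its records and the south about one in 46, so the two regional groups come out the same size.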
In the next article we'll look at what may be the most serious problem of data mining.
Down the Data Mine – Sampling © 2000, John FitzGerald