Perils of Data Mining
"Like a fiend for his dope or a drunkard his wine, a man will have lust for the lure of the mine."– Merle Travis
One of the consequences of widespread computerization has been widespread data. Computers can store and analyze huge quantities of numbers, and we find ourselves these days pretty well neck deep in them. Naturally, people started to think of mining these huge accretions of numbers to find relationships which might help them in their business.
Data mining has now become a catchphrase. It often seems to stand for exploratory statistical analysis in general, although data mining software is usually intended to construct statistical models. In whatever sense the term is used, however, it refers to exploratory rather than experimental research.
What's so important about that distinction? Well, experimental research follows a set of rules which allow conclusions to be drawn from data analysis easily and with a known degree of confidence. In other words, experimental research is analytical. Data mining, like all exploratory research, is speculative. The relationships it finds are often spurious, and conclusions drawn from data mining are far less dependable than those drawn from experimental research.
That's why it's important to follow the strategy recommended by software companies for the use of data mining software and involve business managers in the conduct of data mining research. Ideally the exercise would involve managers proposing hypotheses, technical staff testing them, and both evaluating the results. This type of procedure was used by the erstwhile Metropolitan Separate School Board of Toronto to establish its list of so-called special needs schools. This list ranked schools according to a number of quantitative criteria (pupil turnover, for example), and the schools with the highest ranks got extra money. A committee of principals was struck each year to propose new criteria and new ways of using the old criteria. This committee also included the research director and another member of the research department (for several years, me). The committee's suggestions could be quickly modelled (usually during the meeting) and then evaluated in the light of both the research members' technical knowledge and the principals' professional knowledge and experience.
The speculative character of data mining also makes it important to take with a grain of salt any suggestions that data mining software does not require any sort of statistical sophistication. Especially when models are being constructed, important technical considerations (definition of outliers, for example, or dealing with multicollinearity) have to be dealt with. To deal with them you first of all have to know what they are. Assurances that "you are insulated from technology details and can concentrate on higher level problem solving", as one company says about its data mining software, should be regarded as dangers rather than as advantages. You might as well be told that your car has no speedometer so that you may be insulated from the Highway Act. In modelling you simply have to know the details, and attempting to solve problems without knowing those details should not be described as "higher level" but as shooting in the dark.
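To make the point concrete, here is a minimal sketch of the kind of technical detail the paragraph above alludes to: checking predictors for multicollinearity before modelling. This example is not from the original article; the variables and figures are invented for illustration, and it assumes Python with NumPy.

```python
import numpy as np

# Invented data: two predictors that are nearly redundant (income and
# spending), and one that is independent of them (age).
rng = np.random.default_rng(0)
n = 500
income = rng.normal(50, 10, n)
spending = 0.9 * income + rng.normal(0, 2, n)  # almost a copy of income
age = rng.normal(40, 12, n)

# The correlation matrix of the predictors: off-diagonal values near 1
# flag multicollinearity, which destabilizes model coefficients.
X = np.column_stack([income, spending, age])
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

Software that "insulates" you from details like this will happily fit a model to all three predictors; only an analyst who knows what the correlation matrix means will notice that two of them are telling the same story.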
Even in less complicated forms of data mining – simple comparisons of subgroups in a sample, say – problems can arise if the correct approach is not taken. For example, if you find a correlation between age and the probability of buying your product, you'd like to know which age group is best to target. Arriving at that decision requires careful analysis. Age groups have to be carefully defined independently of propensity to buy. If you simply pick the ages with the highest observed probability, you're almost certain to find when you start marketing that you don't get the results you expected (for one thing, regression to the mean will pull the performance of those ages back toward the average).
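The regression-to-the-mean trap described above can be demonstrated with a small simulation. This is an illustration only; every number in it is invented, and for simplicity it assumes every age actually buys at the same underlying rate, so any "best" age is best purely by chance.

```python
import random

random.seed(1)
ages = list(range(20, 70))
true_rate = 0.10  # assume every age really buys at the same 10% rate

def observed_rate(n=100):
    # Estimated buying rate for one age group, from a sample of n people.
    return sum(random.random() < true_rate for _ in range(n)) / n

# First wave: estimate a rate for each age and pick the apparent winner.
first_wave = {age: observed_rate() for age in ages}
best_age = max(first_wave, key=first_wave.get)

# Second wave: an independent sample of that same "best" age group.
# Its rate falls back toward the true 10%, because the first estimate
# was high mostly through sampling luck.
second_wave = observed_rate()
print(first_wave[best_age], second_wave)
```

Picking the maximum of fifty noisy estimates almost guarantees you selected luck rather than substance, which is exactly why the marketing results disappoint.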
In any form of data mining you should also be working with samples rather than your entire data set. For one thing, you want to be able to verify conclusions drawn from one sample with data from another. For another thing, you don't want to be using a sample so large that almost any relationship, no matter how weak, becomes statistically significant. For yet another, you don't want to waste valuable data. Once modelled, data are used up. Judicious sampling will enable you to get the most analytical value out of your data. To sample judiciously you need to know something about sampling.
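The sampling discipline recommended above can be sketched in a few lines. This is a hypothetical example, not a procedure from the article; the record counts and split proportions are arbitrary choices for illustration.

```python
import random

random.seed(2)
records = list(range(10_000))  # stand-ins for customer records
random.shuffle(records)

explore = records[:2_000]       # mine this sample for candidate relationships
verify = records[2_000:4_000]   # check whether those relationships hold up
reserve = records[4_000:]       # untouched data, saved for later questions

# The exploration and verification samples must not overlap, or the
# "verification" merely rediscovers the same sampling accidents.
assert not set(explore) & set(verify)
```

A relationship that survives the verification sample is worth acting on; one that appears only in the exploration sample was probably an accident of that sample, and the reserve means you haven't used up all your data finding that out.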
In general you should no more conduct data mining without considering technical details than you should conduct real mining without considering technical details. In real mining, for example, you wouldn't do anything that would cause your tunnels to collapse, and in data mining you don't want to do anything that will cause your conclusions to collapse when they're brought into the cold light of day.
Of course, that's easy for me to say. I make my living selling statistical services and I believe statistical services are valuable, so of course I have my own prejudices. However, the opinions I have expressed here are based on experience. I've wrestled with these issues myself, and I think the advice presented here will reduce the likelihood of your being pinned if you decide to wrestle with them yourself.
Perils of Data Mining © John FitzGerald, 1999