Down the Data Mine

Down the Data Mine, part 3 – The Data

One of the most serious problems in data mining is that the data being mined often have not been properly prepared for analysis. They were not collected for research purposes, and consequently they were not prepared by someone familiar with analytical techniques.
For example, data such as age may be categorized in arbitrary ways that reduce the likelihood of detecting a relationship with other information. The usual problem is that too many people fall into one or two categories, which leaves little variation to analyze.
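As a rough illustration of the categorization problem, the sketch below counts what share of a set of ages lands in each category under two different schemes. The ages, the category boundaries, and the function name bin_shares are all invented for this example; they do not come from any real data file.

```python
from bisect import bisect_right
from collections import Counter


def bin_shares(values, edges):
    """Share of values in each half-open bin [edges[i], edges[i + 1])."""
    counts = Counter()
    for value in values:
        index = bisect_right(edges, value) - 1
        if 0 <= index < len(edges) - 1:  # ignore values outside every bin
            counts[index] += 1
    total = sum(counts.values())
    return {
        (edges[i], edges[i + 1]): (counts[i] / total if total else 0.0)
        for i in range(len(edges) - 1)
    }


# Hypothetical ages, invented for illustration only.
ages = [22, 24, 27, 29, 31, 33, 34, 36, 38, 41, 45, 52]

coarse = bin_shares(ages, [0, 18, 65, 120])     # arbitrary broad bands
finer = bin_shares(ages, [20, 30, 40, 50, 60])  # bands chosen after looking at the data

for name, shares in (("coarse", coarse), ("finer", finer)):
    print(name, {band: round(share, 2) for band, share in shares.items()})
```

With the broad bands every person in this made-up sample falls into the 18 to 65 category, so age cannot be related to anything else. The finer bands spread the same people over four categories.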
Similarly, data such as percentages may have been improperly calculated (believe me – it happens more often than you'd think possible). You also often find that data have been categorized inaccurately. To use the example of age again, you might find that the age categories overlap. If one age category is for people 20 to 25 years old, and the next is for people 25 to 30, in which category do you put a 25-year-old? Again, believe me – it happens more often than you'd think possible.
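Both problems can be checked mechanically before any mining starts. The sketch below is one possible approach, assuming categories are stored as inclusive (low, high) ranges and that the raw counts behind each stored percentage are still available. The function names, category labels, and figures are hypothetical.

```python
def matching_categories(age, categories):
    """Labels of every inclusive (low, high) range that contains age."""
    return [label for label, (low, high) in categories.items() if low <= age <= high]


def overlapping_pairs(categories):
    """Pairs of labels whose inclusive ranges share at least one value."""
    items = sorted(categories.items(), key=lambda item: item[1])
    pairs = []
    for i, (label_a, (_, high_a)) in enumerate(items):
        for label_b, (low_b, _) in items[i + 1:]:
            if low_b <= high_a:  # the later range starts before this one ends
                pairs.append((label_a, label_b))
    return pairs


def percentage_mismatches(rows, tolerance=0.05):
    """Rows whose stored percentage differs from 100 * count / total."""
    bad = []
    for label, count, total, stored in rows:
        recomputed = 100.0 * count / total
        if abs(recomputed - stored) > tolerance:
            bad.append((label, stored, round(recomputed, 1)))
    return bad


# Hypothetical category scheme with the overlap described above.
age_categories = {"20 to 25": (20, 25), "25 to 30": (25, 30), "31 to 35": (31, 35)}
print(matching_categories(25, age_categories))  # a 25-year-old fits two categories
print(overlapping_pairs(age_categories))

# Hypothetical rows: (label, count, total, stored percentage).
rows = [("A", 50, 200, 25.0), ("B", 30, 200, 30.0)]
print(percentage_mismatches(rows))  # row B stores the count where the percentage belongs
```

A check like this only finds the inconsistency. Deciding which category the 25-year-olds really belong in still takes someone who knows how the data were collected.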
A data file should also be pruned and primped for data mining. Small categories of data, for example, should be removed to keep them from contaminating the analysis, and non-normal data should be identified and decisions made about how to deal with them (you might decide to use a statistical technique designed for non-normal data, for example, or you might transform the data so that they are closer to normally distributed). Failure to do these things will make it more difficult to detect relationships.
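The pruning steps might look something like the following sketch. It assumes a minimum category size of five and uses skewness as a crude signal of non-normality, with a log transform as one possible remedy. Those choices, the function names, and the data are assumptions made for illustration, and other thresholds or transformations may suit a given file better.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def small_categories(labels, min_count):
    """Categories that occur fewer than min_count times, with their counts."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}


def skewness(values):
    """Population skewness: near 0 for symmetric data, positive for a long right tail."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return 0.0  # all values identical, so there is no asymmetry to measure
    return mean([((v - m) / s) ** 3 for v in values])


# Hypothetical category labels, invented for illustration only.
regions = ["North"] * 40 + ["South"] * 35 + ["East"] * 23 + ["Island"] * 2
too_small = small_categories(regions, min_count=5)
kept = [r for r in regions if r not in too_small]
print(too_small, len(kept))

# Hypothetical right-skewed values (incomes in thousands, say).
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]
logged = [math.log(v) for v in incomes]  # a log transform needs positive values
print(round(skewness(incomes), 2), round(skewness(logged), 2))
```

In this made-up example the log transform reduces the skewness but does not remove it, which is a reminder that a transformation has to be checked rather than assumed to have worked.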
I will emphasize again that I am not opposed to data mining. What I am opposed to is the idea promoted by some software companies that you do not need technical expertise to conduct effective data mining. Simply buying a set of golf clubs will not enable you to play golf as well as Tiger Woods, and simply buying data mining software will not turn you into a data miner. To be effective and accurate, data mining requires the participation both of analytically trained people and of people who understand the field being mined.
Down the Data Mine – The Data © 2000, John FitzGerald