Research, analysis, evaluation

Data Makeovers
A common belief these days is that the terms data and information are synonymous. Data are considered to be naturally informative.

In fact, data need not be informative, for several reasons discussed on this site. For example, data may be unreliable or invalid, in the sense that these two terms are defined in the article on testing.

Data may also simply refuse to behave the way a statistical technique needs them to behave if it is going to work. A common characteristic which makes data difficult to analyze is skew, which is discussed in the article on averages. Briefly, as a rule of thumb, data are skewed when their mean and median (which are also discussed in the article on averages) are different. If the mean (the arithmetic average) is higher than the median (the score in the middle of the distribution) the data are positively skewed, and if the mean is lower than the median the data are negatively skewed.
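The comparison of mean and median can be sketched in a few lines of Python. This is only an illustration of the rule of thumb described here, not a formal test of skew; the function name skew_direction and the sample scores are hypothetical, invented for the example.

```python
from statistics import mean, median

def skew_direction(values):
    """Classify skew by comparing the mean with the median (rule of thumb only)."""
    m = mean(values)
    med = median(values)
    if m > med:
        return "positive"   # mean pulled above the median by high scores
    if m < med:
        return "negative"   # mean pulled below the median by low scores
    return "none"

# Hypothetical scores: one very high value drags the mean upward.
scores = [20, 22, 25, 27, 30, 35, 120]
print(mean(scores), median(scores), skew_direction(scores))
```

A small difference between mean and median will occur by chance in almost any sample, so in practice you would want to judge how large the difference is rather than only its direction.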

So what, you may ask? The problem is that skew reduces the power of many statistical tests (so-called parametric tests) to detect differences or relationships. You could use non-parametric tests instead, but these techniques are usually less able than parametric ones to detect a difference or relationship, so you would usually be trying to solve a problem of loss of power by using a less powerful technique.

A better way to deal with this problem is to transform the data. For example, if your data have a negative skew, you can analyze their squares instead. Squaring the data reduces the skew, often to where it's within acceptable limits. If your data have a positive skew, you can analyze their logarithms instead (provided all the values are greater than zero, since the logarithm is not defined otherwise).
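The effect of the two transformations can be checked with a short Python sketch. It assumes a moment-based skewness coefficient (the average cubed standardized deviation) as the measure of skew, which is one common choice and not necessarily the one intended in this article; the helper name skewness and both samples are hypothetical.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: mean of cubed standardized deviations."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical positively skewed sample: a long tail of high values.
positive = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
logged = [math.log(v) for v in positive]   # requires every value > 0

# Hypothetical negatively skewed sample: a long tail of low values.
negative = [1, 5, 7, 8, 8, 9, 9, 10, 10, 10]
squared = [v ** 2 for v in negative]

print("positive skew:", skewness(positive), "after log:", skewness(logged))
print("negative skew:", skewness(negative), "after squaring:", skewness(squared))
```

For these two samples the transformed values have a skewness closer to zero than the raw ones. Whether the result is close enough to zero for a given test is a separate judgment, and analyses carried out on transformed data describe the transformed scale, not the original one.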

Other transformations can be used to solve other problems. To know whether or not to use such transformations, though, you first have to be aware of the problems they're intended to solve. That's one of the reasons I'm against conducting data mining without the help of someone trained in statistical analysis. There is a real possibility of missing important relationships in the data if you do not transform them appropriately.


Data Makeovers © 2000, John FitzGerald