In the earlier article on data mining, I mentioned multicollinearity and outliers as two technical details that can make your data mine collapse. This week I thought I'd provide some more detail about these problems.
Multicollinearity arises when you are trying to assess the relationships between a dependent or criterion variable – sales, for example – and several independent variables. The independent variables will often be correlated – that is, a change in one will be related to corresponding changes in the others. For example, in educational research much effort has gone into explaining academic achievement as the result of socio-economic status (SES). SES is assessed by several different variables (income, education, job, etc.) which tend to be correlated.
So what's wrong with that? The problem arises when you try to perform your analysis using an automated procedure for constructing a multiple linear regression equation (stepwise regression, for example), which is how people often try to perform this type of analysis. If any independent variables are correlated both with each other and with the criterion, and you construct your regression equation automatically, the regression equation will tend to be unstable from one set of data to the next. The order in which the correlated independent variables enter the regression equation will be largely due to small random differences in their correlations with the criterion. To make a long story short, although you may have isolated the independent variables that explain the dependent variable, the equation itself (which variables it includes and the weights it gives them) will not be a dependable account of how they explain it.
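To see how little it takes to reorder the variables, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the variable names, the noise levels, and the simplification that an automated procedure enters first whichever predictor has the largest absolute correlation with the criterion (the first step of forward selection). Two predictors are built from the same underlying factor, so neither is truly the better one.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def first_entered(rng, n=100):
    """Simulate one data set and report which predictor would enter first."""
    # Hypothetical setup: one latent SES factor drives both predictors.
    ses = [rng.gauss(0, 1) for _ in range(n)]
    income = [s + rng.gauss(0, 0.3) for s in ses]
    education = [s + rng.gauss(0, 0.3) for s in ses]
    achievement = [s + rng.gauss(0, 1) for s in ses]
    r_income = pearson(income, achievement)
    r_education = pearson(education, achievement)
    return "income" if abs(r_income) > abs(r_education) else "education"


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {"income": 0, "education": 0}
for _ in range(200):
    counts[first_entered(rng)] += 1
print(counts)
```

Because the two predictors are interchangeable by construction, each should come out ahead in roughly half of the simulated data sets, which is the instability described above.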
When I discover correlations between independent variables I often just combine them into a scale and enter the scale score as a single variable. There also may be good theoretical reasons to justify a particular order of entry of independent variables into the regression equation, and if there are you should use that order of entry.
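There are several ways to build a scale score. One simple approach, shown here only as a sketch and not necessarily the right one for your data, is to standardize each correlated variable and average the standardized values for each case. The function names and the income and education figures below are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize a list to mean 0 and (sample) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def composite_scale(*variables):
    """Average the z-scores of several variables, case by case."""
    standardized = [z_scores(v) for v in variables]
    return [statistics.fmean(case) for case in zip(*standardized)]


# Hypothetical data for six people.
income = [32, 45, 51, 60, 78, 95]      # thousands per year
education = [10, 12, 12, 14, 16, 18]   # years of schooling

ses_scale = composite_scale(income, education)
print([round(s, 2) for s in ses_scale])
```

The single `ses_scale` variable can then go into the regression in place of the separate measures. Standardizing first keeps a variable measured in large units, such as income, from dominating the composite.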
Outliers are simply deviant data: observations that fall far from the rest. For example, if you go bowling four nights in a row, but have a migraine on the fourth night, your performance on the fourth night will bias any estimate of your bowling prowess based on all four nights' results.
Similar phenomena happen in regression analysis. Outliers are said to bump the regression line – that is, they can distort the estimate of the relationship between the independent and dependent variables. Outlying data should be identified so that you can investigate the circumstances in which they were collected. If your investigation shows that any data were likely to have been affected by extraneous factors, you should omit them from your analysis.
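As a rough illustration of both the bump and one way to find its cause, the sketch below fits a simple least-squares line, flags points whose residuals exceed two standard deviations of the residuals, and compares the slope with and without them. The data are made up, and the two-standard-deviation cutoff is an assumed rule of thumb rather than a fixed standard. The refit without the flagged point is there only to show the size of the distortion; whether to omit a point still depends on what your investigation turns up.

```python
import statistics


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def flag_outliers(x, y, threshold=2.0):
    """Indices of points whose residual exceeds threshold * SD of residuals."""
    slope, intercept = fit_line(x, y)
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    sd = statistics.stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sd]


# Hypothetical data: y is roughly 2 * x, except for the last observation.
x = list(range(1, 13))
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 15.9, 18.2, 20.1, 21.8, 50.0]

flagged = flag_outliers(x, y)
slope_all, _ = fit_line(x, y)

# Refit without the flagged points to see how far they moved the line.
keep = [i for i in range(len(x)) if i not in flagged]
slope_clean, _ = fit_line([x[i] for i in keep], [y[i] for i in keep])

print("flagged indices:", flagged)
print("slope with all data:", round(slope_all, 2))
print("slope without flagged points:", round(slope_clean, 2))
```

Note that a large outlier also inflates the residual standard deviation it is judged against, so a simple cutoff like this can miss outliers when several are present.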
I also mentioned in the earlier article the necessity of analyzing samples rather than your entire set of data. The next article will go into more detail about that important consideration.