I probably say somewhere on this site that multiple regression analysis is overused, and indeed it is. Nevertheless, it does have valuable uses which I don't want to frighten people away from, so here's an article about one of them.
I recently used regression analysis to clarify for a client the factors affecting satisfaction with training programs the client offers. The client had a measure of consumers' attitudes toward programs on entry to them, and knew that their attitudes were correlated with final satisfaction with the programs – the more enthusiastic consumers were on entry, the more satisfied they were at the end. The question was whether final satisfaction or dissatisfaction with the programs was simply a self-fulfilling prophecy – did consumers say they were satisfied or dissatisfied with the programs simply to justify their initial attitudes?
The client also collected information about consumers' opinions of various characteristics of their programs. This information was not correlated with initial attitude, nor were different types of this information correlated with each other. It was therefore easy, using multiple linear regression analysis, to estimate what proportion of the variation in final satisfaction could be explained by initial attitude toward the programs, and then to see whether characteristics of the programs explained the remaining variation (the residual, as it's known in regression analysis). It turned out that characteristics of the programs were twice as important as initial attitude in determining satisfaction with the programs.
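To make the two-step logic concrete, here is a minimal sketch in Python. It is not the client's analysis: the data are synthetic, the variable names (attitude, features, satisfaction) are invented for illustration, and the weights were chosen by hand so that program characteristics carry about twice the weight of initial attitude. It fits a regression on initial attitude alone, then adds the program characteristics and looks at how much more variation is explained.

```python
import numpy as np

def r_squared(X, y):
    """Proportion of the variance in y explained by a least-squares fit on X."""
    design = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

# Synthetic, uncorrelated predictors (an assumption made for this sketch).
rng = np.random.default_rng(0)
n = 500
attitude = rng.normal(size=n)        # hypothetical initial-attitude score
features = rng.normal(size=(n, 2))   # two hypothetical program characteristics
noise = rng.normal(scale=0.5, size=n)
satisfaction = attitude + features[:, 0] + features[:, 1] + noise

# Step 1: initial attitude alone. Step 2: attitude plus program characteristics.
r2_attitude = r_squared(attitude, satisfaction)
r2_full = r_squared(np.column_stack([attitude, features]), satisfaction)
gain = r2_full - r2_attitude  # extra variation explained by the characteristics

print(f"attitude alone: {r2_attitude:.2f}")
print(f"added by program characteristics: {gain:.2f}")
```

Because the predictors here are uncorrelated, the gain in explained variation can be credited to the program characteristics without much argument. With correlated predictors the split depends on the order in which they are entered, which is why the lack of correlation mattered.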
So not only did multiple linear regression analysis determine that satisfaction with the programs was not a self-fulfilling prophecy, it also estimated the relative importance of initial attitude and of the actual characteristics of the programs. The analysis was made easier by the lack of correlation between the different types of information collected, but correlated information can be analyzed with more complicated designs. The possible existence of correlation, though, is the chief reason you shouldn't try this at home. Statistical and database software makes it easy to do multiple linear regression analysis, but if you don't know how to deal with correlated variables or how to identify outliers (extreme observations which distort the results), you'll often get the wrong results when you use that software.
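As a rough illustration of the two checks mentioned above, the following Python sketch looks for correlated predictors and for influential outliers. Everything in it is an assumption made for the example: the data are synthetic, with one nearly duplicated predictor and one planted extreme observation, and Cook's distance is simply one common influence measure, not necessarily the one used in the analysis described here.

```python
import numpy as np

def max_abs_correlation(X):
    """Largest absolute correlation between any two different predictors."""
    corr = np.corrcoef(X, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.abs(off_diagonal).max()

def cooks_distance(X, y):
    """Cook's distance of each observation in a least-squares fit of y on X."""
    design = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    n, p = design.shape
    hat = design @ np.linalg.pinv(design.T @ design) @ design.T
    leverage = np.diag(hat)
    resid = y - hat @ y
    mse = resid @ resid / (n - p)
    return resid**2 / (p * mse) * leverage / (1.0 - leverage) ** 2

# Synthetic data with two deliberate problems (assumptions for this sketch).
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # almost a copy of x1
x3 = rng.normal(size=n)
x3[0] = 6.0                          # observation 0 gets an extreme value
y = x1 + x3 + 0.3 * rng.normal(size=n)
y[0] = -10.0                         # ...and a response far off the trend
X = np.column_stack([x1, x2, x3])

print("largest predictor correlation:", round(max_abs_correlation(X), 3))
distances = cooks_distance(X, y)
# 4/n is a commonly quoted rule of thumb for flagging, not a hard cutoff.
print("observations to inspect:", np.flatnonzero(distances > 4 / n))
```

A flag from either check is a prompt to look at the data, not a verdict. Whether to drop an observation or combine correlated variables is a judgment about the study, not something the software can decide.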
Of course, it is also important that you use a proper hypothesis-testing design. Just turning multiple linear regression loose on a set of data, as in data mining, is almost certain to produce a large proportion of unhelpful or misleading results.
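A small simulation suggests why undirected searching misleads. The Python sketch below uses invented settings (200 observations, 20 candidate predictors, 300 repetitions) and data that are pure noise, so no predictor has any real relation to the outcome. It counts how often at least one predictor nonetheless looks "significant" at the conventional 5% level, using 1.96 as an approximate cutoff for the t statistic.

```python
import numpy as np

def t_statistics(X, y):
    """t statistics of the slope coefficients in a least-squares fit of y on X."""
    design = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    n, p = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    coef = xtx_inv @ design.T @ y
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - p)           # residual variance estimate
    std_err = np.sqrt(sigma2 * np.diag(xtx_inv))
    return (coef / std_err)[1:]                # drop the intercept

rng = np.random.default_rng(2)
n, k, trials = 200, 20, 300   # hypothetical sizes chosen for this sketch
hits = 0
for _ in range(trials):
    X = rng.normal(size=(n, k))
    y = rng.normal(size=n)    # outcome is unrelated to every predictor
    if np.any(np.abs(t_statistics(X, y)) > 1.96):
        hits += 1

share_with_false_positive = hits / trials
print(f"datasets with at least one spurious 'finding': {share_with_false_positive:.0%}")
```

If the 20 tests were independent, about 1 - 0.95^20, or roughly 64%, of such datasets would be expected to produce at least one spurious result. A hypothesis stated before looking at the data keeps the number of tests, and so the number of accidental findings, small.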
Better Living through Regression Analysis © 1999, John FitzGerald