Correlation and Explanation

In the article about regression analysis I talked about satisfaction being explained by other variables, and here I'm going into a bit more detail about what it means to say that. The phenomenon investigated in the example used in that article was a correlation between i) attitude to a training program on entry to it and ii) satisfaction with the program at the end. To understand what explanation means in regression analysis, we need to look at correlation first.

When we say that two variables are correlated, we mean that knowledge of one enables you to predict the other with known accuracy. Correlation may be measured with several different statistics, but the one used for this purpose is the Pearson product-moment correlation coefficient. It ranges from -1 to 1. A positive coefficient means that high values of one variable tend to go with high values of the other; a negative coefficient means that high values of one tend to go with low values of the other. Of course, even when the variables are unlike in this way you may still be able to predict one from the other: what matters for prediction is how far the coefficient is from 0. A coefficient of 1 and a coefficient of -1 both imply perfect correlation. A coefficient of 0 implies that the value of one variable cannot be predicted from knowledge of the other, at least not with a straight-line rule.
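As a rough illustration of how the coefficient is computed, here is a minimal Python sketch. The function name pearson_r and the numbers are made up for the illustration; they are not the data from the training program example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each value from its own mean
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of cross-products, and sums of squares for each variable
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical values, chosen so the relationships are perfectly linear
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # goes up exactly as hours goes up
falling = [10, 8, 6, 4, 2]   # goes down exactly as hours goes up

print(pearson_r(hours, rising))    # 1.0
print(pearson_r(hours, falling))   # -1.0
```

Both made-up series can be predicted perfectly from hours, which is why the coefficients come out at the two extremes of the range.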

The accuracy with which you can predict one variable from knowledge of another is measured with the square of the correlation coefficient. This square tells you how much you have improved your prediction of the second variable by knowing the first, expressed as the proportion (or percentage) of the second variable's variance that you can now predict. It's assumed, of course, that if you don't know the first you have no ability to predict the second.

For an example I went back to some achievement test results collected in a project I worked on years ago. The achievement test had four scales: vocabulary, reading, math concepts, and math problem solving. Unsurprisingly, there was a statistically significant correlation between vocabulary and reading scores. The correlation coefficient was .77, and the square of the correlation coefficient is .77 × .77 = .59. So, if we're trying to predict reading scores from the vocabulary scores, we can say that knowledge of the vocabulary score improved predictions of the reading score by 59%: we could predict 0% of the variance to start with and now can predict 59% (there are other measures of accuracy you could use if you need a measure that is more easily interpreted).
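The arithmetic, and the sense in which the squared coefficient measures improvement, can be sketched in Python. The original test scores are not available here, so the vocab and reading lists below are invented for the illustration, and the helper names are hypothetical. The sketch compares two predictions of reading: always guessing the mean (knowing nothing about vocabulary) and using a fitted straight line.

```python
def mean(values):
    return sum(values) / len(values)

def sums_of_squares(xs, ys):
    """Return (sxy, sxx, syy): cross-products and squares of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy, sxx, syy

def r_squared(xs, ys):
    """Square of the Pearson correlation coefficient."""
    sxy, sxx, syy = sums_of_squares(xs, ys)
    return sxy ** 2 / (sxx * syy)

def improvement(xs, ys):
    """Proportional drop in squared error when a fitted line replaces the mean as the prediction."""
    sxy, sxx, syy = sums_of_squares(xs, ys)
    mx, my = mean(xs), mean(ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    baseline_error = syy  # squared error when every prediction is the mean of ys
    line_error = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - line_error / baseline_error

# The figure quoted in the text
print(round(0.77 ** 2, 2))   # 0.59

# Invented scores, for illustration only
vocab = [12, 15, 9, 20, 17, 11]
reading = [14, 13, 10, 19, 18, 9]
print(r_squared(vocab, reading))
print(improvement(vocab, reading))   # same value as r_squared, up to rounding error
```

For a straight-line prediction the two quantities are the same number, which is why the squared coefficient can be read as the share of variance predicted.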

In fact, the scores on all four scales were significantly correlated. I should note that educators have not ignored this type of correlation, although it does not necessarily imply any problem.

People often note that correlation does not imply causation. Well, there's not much, if anything, you can do in research that will imply causation, but the point is that a correlation does not imply that if you manipulate one variable you will control the other. For example, ice cream sales are correlated with the frequency of drowning, but if you outlawed sales of ice cream you wouldn't lower the number of drownings. Ice cream sales and drownings are correlated because they are both correlated with the air temperature – when it's hot people are more likely to eat ice cream and to go swimming. When one goes up so does the other, and voilà – there's your significant correlation.
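The ice cream example can be imitated with a small simulation. Everything here is assumed for the sake of the sketch: the temperature range, the slopes, and the noise levels are invented, and the residuals helper is a hypothetical name. Temperature drives both simulated variables and neither affects the other, yet they come out correlated; once the straight-line effect of temperature is removed from each, the correlation should shrink to roughly zero.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def residuals(ys, xs):
    """What is left of ys after removing its straight-line relationship with xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Invented model: air temperature is the only common influence
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [5.0 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

raw = pearson_r(ice_cream, drownings)
controlled = pearson_r(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))

print("ice cream vs drownings:", round(raw, 2))
print("same, with temperature removed:", round(controlled, 2))
```

In this made-up world, banning ice cream would change nothing about drownings, because the only link between the two runs through temperature.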

Textbooks often classify different strengths of correlation. For example, a coefficient of .90 or above (or -.90 or below) will be said to be strong, correlations of .70 to .89 and -.70 to -.89 will be said to be moderate, and so on. However, the importance of a correlation depends on its practical implications rather than on arbitrary numerical standards. For example, if the correlation is with expenses, and the expenses are large, a small but statistically significant correlation can still have important practical consequences.

Correlation and Explanation © 1999, John FitzGerald