Research, analysis, evaluation

Fat-Free Research

(an article featuring probably the first use in human history of the phrase
"amorphous mass of analytical blubber")

One of our favourite occupations is accumulating things. First of all, we like to accumulate money, which is usually understandable. But we also like to collect art, antiques, postage stamps, old coins, hockey cards, baseball cards, and Barbies, all the way down to items like elastic bands and string. The collecting urge is quite simply one of the defining characteristics of our culture.

This urge is often reflected in our research, much of which seems to be conducted on the assumption that if one piece of information is useful, then ten thousand pieces will be ten thousand times as useful. Improvements in computer technology have greatly increased our ability to gratify this urge. We can produce hundreds of analyses of enormous data files in a few seconds.

The result has been an increase in the production of massive research reports, which often consist largely of page after page of frequency tables and crosstabulations. Of course, there is nothing intrinsically wrong with a long research report. I've produced the occasional epic myself.

However, a great many long research reports get that way simply because they are full of unnecessary analytical fat, that is, misleading information. Because of all this fat, the actual implications of the research are obscured, and decisions become more difficult and more time-consuming to make. Decisionmakers are in effect forced to spend most of their decision-making time searching for needles in haystacks.

Fat doesn't get into these reports by accident, though. It's the result of several types of inefficient analysis. If these types of analysis are either avoided or supplemented with other types of analysis, research reports can be shaped up into pithy, hard-hitting summaries of crucial information. Here are some of the sources of analytical fat.

1. Undercomparison. Strange as it may seem, reports which "analyze" data without using statistical tests are still common. Here are some actual research results that I recently re-analyzed for a client (who has given me permission to reproduce them). To protect the analytically flabby, I will disguise them as the percentages of respondents in four regions reporting that they were very satisfied with service:

Region A - 75% (52 respondents)
Region B - 79% (170 respondents)
Region C - 84% (108 respondents)
Region D - 92% (60 respondents)

Originally, these data were provided for more regions, but I have combined them geographically to simplify the example.

The producers of the original report "analyzed" the data arbitrarily. In the example, they would simply have reported that Region D had the highest satisfaction and Region A the lowest. Differences of this type were even referred to as significant.

However, if you apply a simple statistical test to these data, you find that the difference between regions is not large enough to justify concluding that it is due to anything other than sampling error. You might want to collect more data to see if the trend would hold up in a larger sample and thereby pass muster with the statistical test, but you would not want to conclude from these results that respondents in Region D were more satisfied.

In other words, this is fat. A difference has been reported where none exists, and that obviously is not a good basis for decision-making.
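The article does not name the "simple statistical test" it applied. As a minimal sketch, assuming a Pearson chi-square test of homogeneity is a reasonable stand-in, the check can be reproduced as follows. The counts are reconstructed by rounding the published percentages against the sample sizes, which is an assumption, and the function names are illustrative only.

```python
import math

# (share very satisfied, respondents) per region, as published above.
# The underlying counts are NOT given in the article; they are
# reconstructed here by rounding, which is an assumption.
regions = {
    "A": (0.75, 52),
    "B": (0.79, 170),
    "C": (0.84, 108),
    "D": (0.92, 60),
}


def chi_square_homogeneity(table):
    """Pearson chi-square statistic for a list of (yes, no) counts per group."""
    total_yes = sum(yes for yes, _ in table)
    total_no = sum(no for _, no in table)
    grand = total_yes + total_no
    stat = 0.0
    for yes, no in table:
        n = yes + no
        exp_yes = n * total_yes / grand  # expected count if all regions agree
        exp_no = n * total_no / grand
        stat += (yes - exp_yes) ** 2 / exp_yes + (no - exp_no) ** 2 / exp_no
    return stat


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


table = []
for share, n in regions.values():
    yes = round(share * n)
    table.append((yes, n - yes))

statistic = chi_square_homogeneity(table)
p_value = chi2_sf_df3(statistic)  # 4 regions x 2 outcomes gives 3 degrees of freedom

print(f"chi-square = {statistic:.2f}, df = 3, p = {p_value:.3f}")
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

With these reconstructed counts the statistic falls short of the 5% critical value for 3 degrees of freedom (about 7.81), which agrees with the conclusion above that the regional differences could be due to sampling error.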

2. Overcomparison. Most research reports do use statistical tests, but problems can still arise when multiple tests are performed. Let's say that you maintain a database of consumer satisfaction ratings, and that you want to compare the percentages of men and women who like a new product you're selling. You compare the percentages of men and women who like your product with a statistical test. You decide to conclude that there is a significant difference between men and women if the statistical test tells you that the probability of the difference occurring through sampling error is less than 5%.

Now, if there is no real difference between men's and women's liking for your product in the entire population of people who have tried it, there is still a 5% chance that your test will tell you that the difference in your sample is significant. A homely example may help illustrate this point. If you're playing cards, the probability of cutting the deck to the ace of spades is about 2%, a value which satisfies the most common criterion for statistical significance. If you do cut the deck to the ace of spades, though, you don't usually leap to the conclusion that someone's slipped a rigged deck into the game.

This type of difference is usually referred to as a spurious difference, and it's a serious enough problem when you're making only one statistical test. After all, you're concluding that the false is true, which is not the best strategy for going through life, no matter how popular it is. It is especially not a good strategy for spending your money.

What happens, though, if you decide to assess men's and women's liking for two products? What happens is that your chances of taking a spurious difference for a real one soar. If you make two statistical tests, your chances of finding a spurious difference jump to almost 10%.

If you make five tests, your chances of finding at least one spurious difference jump to 23%. With ten tests, your chances jump to 40%. With more tests, it becomes almost a certainty.

A common length for a questionnaire is 30 items. If you perform one cross-tabulation of each item, you have a 79% chance of detecting a spurious difference in the responses to at least one item. Perhaps more importantly, your chances of finding spurious differences on 10% or more of the items (that is, on three or more of the 30) are about 19%, assuming the tests are independent.
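Figures like those in the preceding paragraphs can be checked with a binomial model. The sketch below assumes that every test uses the 5% criterion, that no real differences exist, and that the tests are independent of one another. Independence is an assumption made here for simplicity; items on a single questionnaire are often correlated. The function name is illustrative.

```python
from math import comb

ALPHA = 0.05  # significance criterion used in the text


def prob_at_least(m, tests, alpha=ALPHA):
    """P(at least m of `tests` independent tests are spuriously significant)."""
    # Sum the binomial probabilities of 0 .. m-1 spurious results,
    # then take the complement.
    below = sum(
        comb(tests, k) * alpha**k * (1 - alpha) ** (tests - k)
        for k in range(m)
    )
    return 1 - below


for tests in (1, 2, 5, 10, 30):
    print(f"{tests:>2} tests: P(at least one spurious) = {prob_at_least(1, tests):.1%}")

# 10% of a 30-item questionnaire is 3 items.
print(f"30 tests: P(3 or more spurious) = {prob_at_least(3, 30):.1%}")
```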

Spurious differences, of course, can persuade you that something is happening that really isn't. If there are also real differences between groups, spurious differences can make the real results more difficult to interpret by introducing misleading, contradictory results. In other words, they can persuade you that something isn't happening that really is. Either way, they make life difficult. And costly.

The best solution to this problem is mathematical scaling, but if you can't do that you have to reduce the significance criterion. Although the reduction will often have to be dramatic, I have found in conducting repeated surveys for clients that reducing the criterion eliminates a large number of differences which turn out to be either spurious or transitory.
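The article does not say how far the criterion should be reduced. Two widely used rules, named here as assumptions rather than as the author's method, are the Bonferroni correction (divide the criterion by the number of tests) and the Šidák correction (solve for the per-test criterion that keeps the overall risk at the target, assuming independent tests). A minimal sketch:

```python
def bonferroni_criterion(alpha, tests):
    """Per-test criterion under the Bonferroni rule: alpha / number of tests."""
    return alpha / tests


def sidak_criterion(alpha, tests):
    """Per-test criterion under the Sidak rule (assumes independent tests)."""
    return 1 - (1 - alpha) ** (1 / tests)


def familywise_error(per_test, tests):
    """Chance of at least one spurious result across independent tests."""
    return 1 - (1 - per_test) ** tests


TESTS = 30   # one cross-tabulation per item on a 30-item questionnaire
ALPHA = 0.05  # overall risk of a spurious difference you are willing to accept

for name, rule in (("Bonferroni", bonferroni_criterion), ("Sidak", sidak_criterion)):
    per_test = rule(ALPHA, TESTS)
    overall = familywise_error(per_test, TESTS)
    print(f"{name}: per-test criterion = {per_test:.5f}, overall risk = {overall:.3f}")
```

For 30 tests both rules push the per-test criterion well below 1%, which illustrates why the reduction "will often have to be dramatic".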

3. Overdependence on satisfaction items. Ratings of satisfaction are important, but they do not constitute a sufficient basis for evaluation. Unfortunately, many studies solicit only satisfaction ratings.

What's wrong with rating satisfaction? The first problem is that self-reported ratings are not pure measures of satisfaction. For example, responses to satisfaction items are often affected by concerns such as unwillingness to report a rating which might cost someone his or her job. Similarly, respondents asked to rate free or low-cost services may report unduly high satisfaction because they fear loss of the service if satisfaction is low.

A second problem is that satisfaction ratings do not necessarily predict behaviour. People who are generally unsatisfied with a store, for example, may continue to shop there because it's cheaper. On the other hand, people who are highly satisfied with a service may never recommend it to anyone else.

Ratings of satisfaction are most informative when they are supplemented by information about other aspects of a product or service -- durability, waiting time, price, and so on -- and by appropriate open-ended questions. They are especially informative when the additional information is not exclusively self-reported.

4. Procrustean method. In the long run, one of the biggest obstacles to effective decision-making is the natural tendency to re-use strategies and procedures which have been successful in the past. However, the very success of these strategies and procedures often makes them less valuable in future.

For example, one of the reasons for the widespread use of satisfaction ratings is that their use initially did alert decisionmakers to important deficiencies which, when corrected, greatly improved efficiency and profitability. However, now that organizations are on the alert for these deficiencies, satisfaction ratings are less likely to detect serious problems. Instead, concentrating on satisfaction items will probably lead to other problems being overlooked.

Similar problems arise with the unscientific use of explanatory models. A questionnaire, for example, may be based on a logical or empirical model of the phenomenon being investigated. In science, the point of any research conducted with this questionnaire would be to discover circumstances in which the model is ineffective. When you discover such circumstances, you can improve the model by modifying it to take these circumstances into account.

Sometimes, though, a model is taken to be a full, perfect, and sufficient representation of the phenomenon being investigated. The incompatibility of results with the model is therefore taken to be a sign of a problem in the sample rather than of the inadequacy of the model. Of course, the problem may well be in the sample, but assuming arbitrarily that it is will inevitably lead to error.

Conclusion. As I have said, there is nothing intrinsically wrong with a long report. However, a little careful consideration in the design stage of a project, and the judicious application of research skill, will ensure that at the end of a big research project you don't find yourself wrestling with a huge amorphous mass of analytical blubber.

Fat-Free Research © 1997, John FitzGerald
