Research, evaluation, analysis

Data Refining

Many people assume that the more data they collect, the better. Unfortunately, they often end up collecting data that duplicate what they have already collected.

A classic example is socioeconomic data. Many socioeconomic variables turn out to be, in large part, measures of income. Income is correlated with schooling, for example: the higher the income, the more years of schooling, on average. So when you collect information about schooling as well as income, you are to a great extent collecting information you already have.
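One rough way to see how much two variables overlap is to compute their correlation and square it, which gives the proportion of variance they share. The sketch below is only an illustration: the income and schooling figures are invented, and the function name `pearson` is an assumption of this example, not something from a particular package.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Invented figures, for illustration only:
# income in thousands of dollars, schooling in years.
income = [22, 31, 38, 45, 52, 60, 75, 90]
schooling = [10, 12, 12, 13, 14, 16, 16, 18]

r = pearson(income, schooling)
shared = r ** 2  # proportion of variance the two variables have in common
print(f"r = {r:.2f}, shared variance = {shared:.0%}")
```

The closer the shared variance is to 100%, the less new information the second variable adds to the first.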

The best approach with data like these is to treat them the same way you would an academic test. For example, if you were to give students 30 questions about history to answer, you wouldn't consider that you had given them 30 tests. You'd consider that you had given them one test consisting of 30 questions. Furthermore, if you'd designed your questions properly, you could then check statistically whether the questions really do form a single test. Using psychometric techniques, you could find which items seemed to be measuring the same thing and which seemed to be measuring something else. If only a few items were measuring something else, you could disregard them in your scoring. If a large number were measuring something else, you could calculate separate scores for each of the factors the test appeared to be measuring.
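One simple psychometric check of this kind is the corrected item-total correlation: each item is correlated with the total of all the other items, and an item with a low correlation is probably measuring something else. The sketch below assumes invented right/wrong answers from eight students on five questions; the 0.3 cutoff is a common rule of thumb adopted here as an assumption, and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def item_rest_correlations(responses):
    """Correlate each item with the sum of the remaining items."""
    n_items = len(responses[0])
    correlations = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]  # total without item i
        correlations.append(pearson(item, rest))
    return correlations

# Invented data: one row per student, one column per question (1 = correct).
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]

CUTOFF = 0.3  # assumed rule-of-thumb threshold, not a fixed standard
correlations = item_rest_correlations(responses)
flagged = [i for i, r in enumerate(correlations) if r < CUTOFF]
print("item-rest correlations:", [round(r, 2) for r in correlations])
print("items to review (0-based):", flagged)
```

With these invented responses, only the last question falls below the cutoff, so it would be the one to leave out of the score. When many items are flagged, a factor analysis is the usual next step for finding the separate scores.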

If you do this type of analysis with any large set of information you intend to use in decisionmaking, you will almost certainly reduce the complexity of the decision. You will furthermore avoid giving undue weight to variables which are measures of the same thing, and, through scaling, get more accurate estimates of factors which are measured by several variables.
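A minimal sketch of scaling, assuming the variables have already been shown to measure the same factor: convert each variable to standard scores so that its units do not matter, then average the standard scores for each case. The data are invented and the names `z_scores` and `composite` are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    return [(v - m) / s for v in values]

def composite(*variables):
    """Average the standardized variables case by case into one scale."""
    standardized = [z_scores(v) for v in variables]
    return [mean(case) for case in zip(*standardized)]

# Invented figures, for illustration only:
# income in thousands of dollars, schooling in years.
income = [22, 31, 38, 45, 52, 60, 75, 90]
schooling = [10, 12, 12, 13, 14, 16, 16, 18]

ses = composite(income, schooling)  # one score per case instead of two variables
print([round(score, 2) for score in ses])
```

The decision then rests on a single score for the factor, so income and schooling are not counted twice.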

Data Refining © 2000, John FitzGerald