Research, analysis, evaluation

PC Database Fitness: Part 2

The second of three excerpts from a report on getting the most out of the data in your database.

Missing Data. Let's say that you have surveyed 200 people by questionnaire, and want to record their answers in a database. You convert all the data to numeric codes, and start entering the data. While coding the questionnaires, you notice that several people had not answered all the questions, so when you enter their responses you leave the field blank for any question they didn't answer.

The problem is that the database software will often treat those blanks as zeros. If the questions are rating scales, and you calculate the mean rating, the software will treat all those missing records as valid records with a value of zero. As a result, the mean ratings calculated by the software will be misleadingly low. If the codes are simply designations of response alternatives, the inclusion of the blank records will invalidate the calculation of the percentages of respondents giving each alternative.

Data can also go missing for other reasons. For example, in relational databases, a record in one file which has an erroneous linking code which does not appear in the other files in the database will often be omitted from reports. The simplest way to avoid this problem is to verify the data when entering it so that erroneous codes don't get entered.

If the data have not been verified, they can be carefully cleaned. Then, to avoid errors due to missing data, you can produce reports of calculated fields. The simplest approach would be to calculate a field which tells you whether or not the data in the original field is valid. The summary information can then be broken down by the calculated field.

Alternatively, the calculated field can estimate a value for the missing data (if the data are not missing, it will simply report the correct value). You can even calculate a number of estimates by different methods and compare them, which helps improve the reliability of the estimates.

To make it easy to calculate the new field, you can use an unmistakable code for missing data. One commonly used code for numerical data consists of filling the field with 9s. For example, if the field has four digits or columns, you enter 9999 for missing data. If 9999 is a possible legitimate value, use -999 as the code for missing data. If -999 is a legitimate value, increase the field to five digits and make 99999 the code for missing data. You can then calculate a new variable which tells you whether or not the field has the code for missing data in it, or one which estimates a value for any field with the code for missing data in it.

Another commonly used code for missing data is zero. This code can often be inconvenient, though. For example, many people code the answers to Yes/No questions by using 1 for Yes, 2 for No, and zero for an unanswered question. This practice, however. complicates the calculation of the percentages of people who answered Yes to each question. If you use zero as the code for No, and 9 as the code for an unanswered question, all you have to do to get these percentages is calculate the mean for each field.

If you have relatively small amounts of missing information, you can just remove the records with missing information from your analyses. If you are using your database to calculate ratings, and you have relatively large amounts of missing data, the most effective approach is probably to have the ratings done with statistical software. Professional statistical packages have powerful options for dealing with missing data, and can also evaluate the effectiveness of your ratings.

Data Petrification. For data to be informative, they must vary adequately. For example, if you have a database of customers' responses to a satisfaction survey, and all the customers are from Ontario, you won't find any relationship between province of residence and satisfaction.

The same problem will arise when there is very little variation. For example, if 95% of the customers in your database are from Ontario, it still will be very difficult to find a relationship between province of residence and satisfaction. There wouldn't be enough data from outside Ontario to affect the results.

This problem often arises unnecessarily. Usually it arises from premature classification of data into categories. For example, let's suppose that a company is collecting information about its workforce. One of the items of information it is collecting is the percentage of full-time staff in each department. This percentage is recorded in one of these five categories: 0% to 19%, 20% to 39%, 40% to 59%, 60% to 79%, 80% to 100%.

After collecting the information, the company finds that every department reported that 80% or more of its staff were full-time. In other words, the company can't use this information to distinguish one department from another.

What could the company have done to avoid this result? The simplest and best procedure is to collect the actual numbers of full-time and part-time staff, and let the database calculate the percentages. Then the company's researchers could inspect the distribution of percentages and see what the most useful classification would be.

For example, after inspecting their data, they might decide to have two groups: departments in which 90% or more of the employees are full-time, and departments in which fewer than 90% of the employees are full-time.

Another type of information which can create this problem is age. For example, if you ask respondents to a survey to check off the age range appropriate to them, you may end up with the great majority of the respondents in one age range. Even if it's the age range you want most of your respondents to be in, you would still have obtained more useful information if you had collected the actual age. For example, you might find that within that age range, most respondents were at one end of the range or the other. In general, you will be much better off if you avoid classifying any data until it's in the database.

Ideally, any groups you set up will all be the same size, or roughly the same size. However, that is often not possible. To decide how big the largest and smallest groups should be, I use a psychometric rule of thumb: no more than 70% of the sample in the largest group, no fewer than 30% in the smallest.

Further advice about percentages. When you're doing a survey, never ask people to calculate percentages. First of all, many of them won't do it correctly. A large proportion of your questionnaires will come back with the percentages adding up to more than 100%, or to less. And many of the people who do know how to calculate percentages will calculate them using different percentage bases than the ones you expected them to use.

Your database, however, will calculate percentages exactly the way you want it to. The best approach is to ask for the data you need to calculate the percentage, then have the database calculate it. It's also less time-consuming for the people who fill out the surveys.

Weight Problems. Weighting is one of most widespread practices in data analysis. It is also one of the least advisable. If you're not familiar with weighting, it comes in two forms. The first is the practice of assigning more importance to some records than to others. This form of weighting is used to make subgroups in the file appear to be of equal size, or of a size proportionate to their representation in the population. If, say, you have 200 satisfaction questionnaires from men and 400 from women, you might decide to multiply the men's results by two. That way you could more easily estimate figures for the general population.

The second form of weighting is the practice of giving more importance to some fields than to others. This is often done when several fields in a database are combined into ratings.

To weight fields mathematically, you multiply them by a value. For example, if you think one field is twice as important as another, you would multiply it by two. We will look at this type of weighting in the next section. For the moment we will look only at weighting of records.

If you just want to use the data in your database to arrive at rough estimates of characteristics of the population, weighting records is probably not a disastrous idea, but there is probably no reason to make any stronger recommendation than that. Furthermore, if you want to estimate the accuracy of these estimates, you cannot use the usual formula for the confidence interval.

Most importantly, if the sample is that much different from the population, you would probably be better off to find out why it's so different before you started treating your data as if they were representative. If the differences between subgroups are due to differences in the representativeness of the sampling of each subgroup, weighting the results will just compound the bias in your sample.

If you wanted to use statistical tests to compare subgroups of your database -- men and women, for example -- you most definitely could not weight your results. People do perform statistical tests on weighted results, but when they do, they are pretty well ensuring that they will draw incorrect conclusions about their data. A simple example will show why.

Let's suppose you flipped a Canadian quarter 100 times, and got 50 heads and 50 tails. Then let's suppose you flipped an American quarter ten times, and got six heads and four tails.

If you multiply your results with the American quarter by ten, you get a sample of the same size as the Canadian one, but in which you observed 60 heads and 40 tails. Applying a two-tailed chi-square test, you find that the difference between the Canadian sample and the weighted American one is statistically significant. You conclude that Canadian coins are honest while American ones are crooked.

That conclusion is false. No one would draw a conclusion about the fairness of a coin after observing six heads in ten tosses. The chi-square test finds a significant difference only because it assumes that a sample of 100 has 100 cases in it, not 10. If the test is performed on the actual results, no significant difference is found.

Statistical tests are designed to compare groups of varying sizes, so weighting records does not make comparisons easier, it makes them questionable. If you still want two groups of equal size to compare, you are better off not to weight one of the groups but either to discard data randomly from one group, or to collect more data for the other.

(next: rating and scaling, incoherent data, weighting of fields; click on the link below)

PC Database Fitness © John Fitzgerald, 1995

 Part 3  | Home pageDecisionmakers' page | E-mail