Computers seem to get bigger by the day. These days (2020) a laptop with 32 gigabytes of RAM is commonplace, and still a bit stupefying to old geezers like me who learned to program on a machine with 64 kilobytes of RAM. A lot of the RAM ends up getting used for those pretty graphics that everyone is so fond of, and a lot of the giant hard drive gets used up by the bloated software packages that create those pretty graphics, but you still end up with tremendous power for analyzing enormous files of data.
And because you can analyze enormous files of data, many people do. We do tend to believe that both bigger and more are better. In fact, bigger files are often very desirable. Nevertheless, big files often create big problems, and in this article we will look at some ways to overcome those problems and get the most out of big files.
Files can be big both vertically and horizontally. A file is big vertically if it has a large number of cases (or records, in database terminology). A file is big horizontally when it has a large number of variables (fields).
Problems of vertical size. One of the most important problems with files which are big vertically is the non-sampling error (also known as a mistake). The more cases or records there are in a file, the more likely non-sampling errors become, especially if the increase in the size of the file reduces the time available for the collection of each individual case. For example, if people are under pressure to provide a long list of information, they may record inaccurate estimates or even fabricate information. Big files need to be audited for errors of this type.
Another problem with files with enormous numbers of cases is that statistical tests become so powerful that a statistically significant result can be practically meaningless. Almost any difference becomes significant at astoundingly low levels.
For example, let's suppose you have a sample of 100 people, and you want to know if women are over or under-represented in it. To find out, you are going to perform a chi-square test with a significance criterion of .05. If we assume that women make up 51% of the general population, the percentage of women in your sample would, to satisfy the chi-square test, have to be 10 percentage points higher or lower than 51% for you to conclude that they were over or under-represented. That is, if 61 of your sample were women, you could conclude that women were over-represented, and if only 41 were women, you could conclude that women were under-represented. Those seem like reasonable standards, but if you use large samples, the standards become far less demanding.
For example, if your sample had 10,000 members, the chi-square test would tell you that women were over-represented if they made up as little as 52% of the sample. If your sample had 100,000 members, the test would tell you that women were over-represented if they made up just over 51.3% of the sample -- less than one-half of one percentage point more than the figure for the population. Those may be real differences, but they probably wouldn't be the types of difference you were hoping to find.
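The arithmetic behind those thresholds can be sketched in a few lines of Python. This is only an illustration: it relies on the fact that a two-category chi-square test with one degree of freedom is equivalent to a two-sided z-test on the proportion, and the function name and its defaults (51% of the population, a .05 criterion) are my own choices.

```python
from math import sqrt
from statistics import NormalDist


def smallest_significant_share(n, population_share=0.51, alpha=0.05):
    """Smallest sample share that a two-sided test calls over-representation.

    With two categories, the 1-df chi-square statistic is the square of the
    z statistic, so the threshold is population_share + z * standard error.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    standard_error = sqrt(population_share * (1 - population_share) / n)
    return population_share + z * standard_error


for n in (100, 10_000, 100_000):
    share = smallest_significant_share(n)
    print(f"n = {n:>7,}: significant at {share:.1%} women or more")
```

The three thresholds come out at roughly 60.8%, 52.0% and 51.3%, which is where the figures above come from.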
The solution to these problems is to make use of sampling theory. First, you can use sampling theory to determine the most statistically appropriate sample size. For example, if you're conducting a survey and want a 95% confidence interval of ±5% around a percentage, you need a random sample of only 385 people, even in the worst case of a percentage near 50%.
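Here is a minimal sketch of where the 385 comes from, using the usual formula for estimating a proportion from a simple random sample. The function name is hypothetical, and the defaults assume the worst case (a proportion of 50%) and a population large enough that no finite-population correction is needed.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_margin(margin, confidence=0.95, proportion=0.5):
    """Cases needed so the confidence interval is +/- margin.

    Uses n = z^2 * p * (1 - p) / margin^2, rounded up to a whole case.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * proportion * (1 - proportion) / margin ** 2)


print(sample_size_for_margin(0.05))  # 95% confidence, +/- 5 points
```

Tightening the margin is expensive: halving it roughly quadruples the number of cases you need.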
Of course, other considerations may make collecting more data advisable. If you're collecting huge amounts of data, though, you can still use sampling theory to select a subsample of data to analyze.
In psychometric research, for example, it is often necessary to administer huge numbers of tests. Scaling and reliability analysis, however, are often performed on smaller random samples drawn from the main one, so that statistical tests give more meaningful results. The smaller samples can also be analyzed much more quickly. If you want to check the validity of the results obtained with the small sample, you can draw a second small sample and do the same analyses. You'll still be finished in less time than it would have taken to analyze the entire sample.
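Drawing the two small samples might look something like this. It is a sketch under my own assumptions: the cases fit in a list, the two subsamples should not overlap, and a fixed seed is used so the draw can be repeated. The function name is hypothetical.

```python
import random


def draw_two_subsamples(cases, size, seed=None):
    """Return two non-overlapping simple random samples of `size` cases each."""
    if 2 * size > len(cases):
        raise ValueError("not enough cases for two non-overlapping subsamples")
    rng = random.Random(seed)
    # Sample positions without replacement, then split them in half.
    picked = rng.sample(range(len(cases)), 2 * size)
    first = [cases[i] for i in picked[:size]]
    second = [cases[i] for i in picked[size:]]
    return first, second


# Stand-in for a big file: 100,000 case numbers.
all_cases = list(range(100_000))
analysis_sample, validation_sample = draw_two_subsamples(all_cases, 500, seed=1)
print(len(analysis_sample), len(validation_sample))
```

You run the analyses on the first sample and repeat them on the second as a check.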
If drawing a smaller subsample is not possible, you can adjust your significance criterion. To do that, you have to determine how big a difference or how strong a relationship you're looking for, and the power you want the statistical test to have to detect those differences or relationships.
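One way to do that adjustment for the earlier example (the share of women in a sample) is sketched below. It is a rough normal approximation, not a definitive procedure: it uses the standard error under the population figure for both hypotheses and ignores the negligible chance of rejecting in the wrong direction. The function name and the 80% default for power are my assumptions.

```python
from math import erfc, sqrt
from statistics import NormalDist


def adjusted_alpha(n, population_share, smallest_difference, power=0.80):
    """Two-sided criterion giving roughly `power` to detect `smallest_difference`.

    `smallest_difference` is in proportion units (0.02 = 2 percentage points).
    """
    standard_error = sqrt(population_share * (1 - population_share) / n)
    z_power = NormalDist().inv_cdf(power)
    # Critical z that leaves the requested power at the chosen difference.
    z_critical = smallest_difference / standard_error - z_power
    if z_critical <= 0:
        raise ValueError("sample too small to reach that power for that difference")
    # Two-sided tail area beyond z_critical; erfc stays accurate for tiny tails.
    return erfc(z_critical / sqrt(2))


# 100,000 cases, and only differences of 2 points or more matter to you.
print(adjusted_alpha(100_000, 0.51, 0.02))
```

With 100,000 cases the criterion that comes out is far smaller than .05, which is the point: the test stays able to find the difference you care about while ignoring the trivial ones.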
Problems of horizontal size. Having large numbers of variables or fields is not a problem if you know that they all measure different things. Problems arise when different variables measure the same thing but the data analyst assumes they are independent.
These problems are quite common nowadays because statistical packages have given everyone the ability to perform statistical analyses. With every good intention, people enter large numbers of variables into multiple linear regressions without inspecting either correlations or residuals.
The problem with doing that is that it produces unstable solutions. If several variables are correlated with each other, and about equally correlated with the dependent variable, the order in which they are entered into the equation is determined by small and random differences in the size of correlations with the dependent variable. If you perform the analysis on a second set of data (which you often have to do if data are collected yearly, for example), the variables will often be entered in a different order.
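Inspecting the correlations before running the regression does not take much. The sketch below flags pairs of predictors that are strongly correlated with each other. The 0.8 cut-off, the function names and the variable names are all my assumptions, and the numbers are made up purely for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(columns, threshold=0.8):
    """List (name, name, r) for predictor pairs with |r| at or above threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Made-up predictor values, for illustration only.
predictors = {
    "income": [30, 42, 55, 61, 78, 90],
    "spending": [28, 40, 50, 63, 75, 88],
    "shoe_size": [9, 7, 10, 8, 7, 9],
}
for name_a, name_b, r in highly_correlated_pairs(predictors):
    print(f"{name_a} and {name_b}: r = {r:.2f}")
```

Any pair that turns up here is a candidate for the kind of instability described above.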
The solution I prefer to this problem is to scale the variables. Variables which measure the same thing can be aggregated to produce a single measure.
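A simple way to aggregate, sketched here, is to convert each variable to standard scores and average them case by case. That is only one of several scaling methods, and the function names are hypothetical. It assumes the variables are all scored in the same direction; anything scored the other way round would have to be reversed first.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw values to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite_score(*variables):
    """Average the z-scores of several variables, case by case."""
    standardized = [standardize(v) for v in variables]
    return [mean(case) for case in zip(*standardized)]


# Two made-up variables on very different scales, for illustration only.
scores = composite_score([1, 2, 3, 4], [10, 25, 30, 45])
print([round(s, 2) for s in scores])
```

Standardizing first keeps a variable measured in large units from swamping the others in the aggregate.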
Reducing the number of variables also helps with another problem: interaction effects, which are often ignored in multiple linear regression analysis. An interaction effect is one which cannot be predicted from the individual (or main) effects of two or more variables. For example, hair loss increases with age, and it is far more common among men – those are what we call main effects of single variables. However, the relationship between age and hair loss is much stronger among men. That is an interaction effect of two variables (you can have higher-order interactions as well). If you don't assess interaction effects you will usually miss important information about the topic you're investigating. To assess interaction effects, you examine residuals and introduce multiplicative terms into your regression equation.
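To show what a multiplicative term looks like in practice, here is a small sketch using the hair-loss example. The data are synthetic: they are generated from a formula I made up (with no random error) so that the fitted coefficients can be checked against it, and they say nothing about real hair loss. The little least-squares routine is there only to keep the example self-contained; in real work you would use a statistical package.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data: hair loss rises with age, and rises faster for men.
rows, hair_loss = [], []
for male in (0, 1):
    for age in (20, 30, 40, 50, 60, 70):
        # Columns: intercept, age, male, and the multiplicative term age * male.
        rows.append([1.0, age, male, age * male])
        hair_loss.append(1 + 0.5 * age + 2 * male + 0.3 * age * male)

coefficients = ols(rows, hair_loss)
for name, value in zip(("intercept", "age", "male", "age x male"), coefficients):
    print(f"{name:>10}: {value:.3f}")
```

The coefficient on the age x male column is the interaction: it is the extra slope for age among men, over and above the two main effects. Leave that column out and the equation is forced to give men and women the same slope.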
Big files can have big benefits. To obtain those benefits, though, you have to be circumspect.
Plethoratology © 1995, 1999, 2003, 2006, 2020 John FitzGerald