Aspects of Datasets - Part 2

Tim Brock / Friday, July 31, 2015

This is the second (and final) article looking at key aspects of datasets. Having previously covered relevance, accuracy, and precision, here we will consider consistency, completeness, and size.

Consistency

On the 23rd of September 1999, NASA's Mars Climate Orbiter entered the Martian atmosphere and burned up. This $125 million mistake was down to inconsistent use of units between two different pieces of software controlling the spacecraft.

The Mars Climate Orbiter is not the only example of costly confusion over units (the "Gimli Glider" is another), but it's probably the most expensive. Because of this, it has become a textbook example of why it's important to understand your units of measurement and to use them correctly. Needless to say, consistent use of units, and a clear record of which units were used, should be considered key features of a useful dataset.
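To make that concrete, here is a minimal sketch in Python of what "a clear record of which units were used" can look like in practice. The helper function and unit labels are my own invention, not anything from the mission software; the only hard fact used is the standard conversion of one pound-force second to roughly 4.448 newton-seconds.

```python
# Minimal sketch: store the unit alongside the value so it can always
# be converted explicitly instead of being guessed at later.
LBF_S_TO_N_S = 4.4482216152605  # 1 pound-force second in newton-seconds

def impulse_in_si(value, unit):
    """Convert an impulse reading to newton-seconds, refusing to guess."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_S_TO_N_S
    raise ValueError(f"Unknown unit: {unit!r}")

print(impulse_in_si(1.0, "lbf*s"))  # 4.4482216152605
```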

Consistency also matters in how values are recorded. Painstakingly measuring and recording 99 values to five decimal places could be a massive waste of time if the hundredth value has been rounded to the nearest whole number.

Another consideration is whether the same basic procedure was used for every record or if things were changed part-way through. For demographic data it's important to know whether data from different countries or administrative bodies was recorded at around the same time or years apart (e.g. if you are comparing data from two censuses) and whether those values are really comparable (e.g. how does each country define who is a permanent and who is a temporary resident?).

Completeness

Closely related to consistency is completeness. Ideally you, or someone else, have collected every data point you planned to collect. But this target can be difficult to live up to in practice. Clearly, the first task is to determine whether all data was collected. This may or may not be a trivial task. In the event that some data is missing, the next task is to determine what to do about it. It may be that a few missing values are not a major problem, or it may be disastrous.
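If the data arrive as a table, that first pass can be as simple as counting the gaps. Here is a rough Python/pandas sketch; the file name and columns are placeholders, not a real dataset.

```python
import pandas as pd

# Hypothetical file of survey responses; swap in your own data.
df = pd.read_csv("survey_responses.csv")

# How many values are missing in each column?
print(df.isna().sum())

# How many records are affected at all?
print("Incomplete rows:", df.isna().any(axis=1).sum())
```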

Assuming you carry on with incomplete data, you need to consider possible sources of bias. A census still tells us useful information even if it doesn't truly record every member of the population, but the characteristics of those missing may not match the characteristics of those present. And if your data collection involved measuring the depth of a river and the measuring equipment was washed away when the river flooded, you can't just interpolate (as below) between the measurement taken before the flood and the one taken afterwards, once you've replaced your equipment and the river is flowing normally again.
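To illustrate, here is a small Python sketch with made-up river-depth readings. The numbers are invented purely to show what naive interpolation does when the gap coincides with the flood itself.

```python
import numpy as np
import pandas as pd

# Daily depth readings (metres); the gauge was lost during the flood,
# so the flood days are simply missing.
depth = pd.Series(
    [1.2, 1.3, 1.2, np.nan, np.nan, np.nan, 1.4, 1.3],
    index=pd.date_range("2015-06-01", periods=8, freq="D"),
)

# Linear interpolation fills the gap with values between the
# neighbouring readings (roughly 1.2 to 1.4 m)...
print(depth.interpolate(method="time"))

# ...but the missing days are exactly when the river was at its
# highest, so the filled-in values quietly erase the flood.
```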

Methods for dealing with data that isn't "missing at random" can be complex. If you're using someone else's dataset and there are only a few missing values, it may be possible to do some fieldwork of your own to fill in the gaps. The result is likely an improvement in terms of completeness but a detriment to consistency.

Finally, one has to worry about sample size. Even if the sample as a whole is not greatly affected by a few missing values, smaller subsamples you were keen to analyze may be. In short, not all missing data points are necessarily of equal concern.

Size

Small datasets have very limited statistical power. We run the risk of trying to draw conclusions we just don't have enough data to support. And even if we understand the limitations of what we can really say from our little data, there's no guarantee that people outside the data team will. Nevertheless, when time or budgets are tight (the two often go hand in hand), a small dataset may be all you get.
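A quick simulation makes the point. This is just a toy Python example with an invented 10% conversion rate, but it shows how unstable estimates from small samples can be:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.10  # the "real" value we are trying to estimate

for n in (20, 200, 2000):
    # 1,000 repeated surveys of size n, each estimating the rate
    estimates = rng.binomial(n, true_rate, size=1000) / n
    print(f"n={n:4d}: estimates range from "
          f"{estimates.min():.3f} to {estimates.max():.3f}")
```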

Recent years have seen the rise in popularity of Big Data. Or at least we've seen the rise in popularity of the phrase "Big Data". Sometimes a big dataset is the only option. Still, while proponents will tell you of its virtues, critics say the term isn't well-defined or is just a marketing gimmick. And with so much data to play with, spurious correlations become all but inevitable if due care and attention aren't paid.
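The spurious-correlation problem is easy to demonstrate with random numbers. The sketch below generates 1,000 completely unrelated "metrics" and still finds pairs that look strongly related; everything here is simulated, nothing comes from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# 50 observations of 1,000 independent, meaningless variables.
data = rng.normal(size=(50, 1000))

# Correlate every pair of columns and find the strongest relationship.
corr = np.corrcoef(data, rowvar=False)
np.fill_diagonal(corr, 0)  # ignore each variable's correlation with itself
print(f"Strongest pairwise correlation: {np.abs(corr).max():.2f}")
# With roughly 500,000 pairs to choose from, correlations above 0.5
# appear purely by chance.
```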

My foremost concern with the buzz we've seen in recent years around Big Data is that it might encourage us to come at things from the wrong angle: rather than collecting as much data as is necessary, we end up collecting as much data as possible. There's no guarantee that piling up the data will pile up the insights (visualization can get tricky too), but you will certainly need to worry more about storage, protection, and privacy.

As I've hinted at previously, I don't think we should be concerned about whether we have "Big Data" but about whether we have Big Enough Data.