Connect the Dots

Tim Brock / Tuesday, July 7, 2015

Hopefully you're familiar with the standard scatter plot - a selection of points where each x coordinate is determined by one variable and each y coordinate by another. Below is a simple example, showing the number of countries winning one or more medals versus the total number of medals awarded for all Summer Olympic Games from Athens in 1896 to London in 2012. The data was assembled from Wikipedia's articles on Olympic medal tables (such as this one for the 1896 games).

As one might expect, we see a positive correlation between the number of medals given out and the number of different countries winning medals. (For those interested in such things, the correlation coefficient is 0.95.) It's not exactly a groundbreaking chart that will make you reconsider your view on the Olympic Games.
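For anyone who wants to check a figure like that against their own copy of the data, here is a minimal sketch. It assumes the coefficient quoted above is the usual Pearson one (I haven't said which, so treat that as an assumption), and the function name `pearson_r` is just an illustrative choice. No medal counts are included; you would pass in the two columns yourself.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences.

    Assumption: this is the kind of coefficient quoted in the text.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")

    return cov / sqrt(var_x * var_y)
```

Calling `pearson_r(medals, countries)` with one value per games returns a number between -1 and 1; values near 1 indicate the kind of strong positive relationship seen in the chart.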

You may suspect that the points near the bottom left — a few countries winning a relatively small number of medals — represent early games and the points near the top right represent more recent games. And you'd be right. We can directly label points to illustrate this.

We can now see, for instance, that the point furthest to the left is the 1896 Olympics, while the two points at the top-right are 2008 and 2012, the two most recent events. We can also see that there were bigger changes between 1988 and 2008 than there were between 1948 and 1968. But the labels are the same color as the points and, as such, the actual positions of the points get lost somewhat. Using a lighter gray color for the text helps the labels to fade into the background.

It should be clear that these points form a time series. We could, in principle, plot this series in 3D. But interpreting three-dimensional lines on a two-dimensional screen is difficult. We can, though, still join the points together in chronological order on the 2D plane we used previously to create a "connected scatter plot". This is done below; I've moved many of the year labels and removed a few altogether to stop them getting in the way. With the strong line now being the focus I think it's OK to use black text for year labels again.
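The idea of joining points in time order can be sketched in a few lines of Python. This is not how the charts in this article were produced; it is a rough illustration using matplotlib, and the function names, the choice of medals on the x axis and countries on the y axis, and the `label_years` parameter are all assumptions made for the example. The data itself is left for you to supply.

```python
def ordered_points(years, medals, countries):
    """Return (year, medals, countries) tuples sorted chronologically.

    The sort is what turns a scatter plot into a connected one: the line
    must visit the points in time order, not in x order.
    """
    if not (len(years) == len(medals) == len(countries)):
        raise ValueError("all three sequences must have the same length")
    return sorted(zip(years, medals, countries), key=lambda point: point[0])


def connected_scatter(years, medals, countries, label_years=()):
    """Draw a connected scatter plot and return the (figure, axes) pair.

    label_years is a hypothetical convenience: only the listed years are
    labelled, so that labels do not get in the way of the line.
    """
    # Third-party dependency, imported here so ordered_points works without it.
    import matplotlib.pyplot as plt

    points = ordered_points(years, medals, countries)
    xs = [p[1] for p in points]  # assumption: total medals on the x axis
    ys = [p[2] for p in points]  # assumption: medal-winning countries on the y axis

    fig, ax = plt.subplots()
    # One line through the points in chronological order, with a marker at each.
    ax.plot(xs, ys, marker="o", color="black", linewidth=1.5)

    for year, x, y in points:
        if year in label_years:
            ax.annotate(str(year), (x, y),
                        textcoords="offset points", xytext=(5, 5))

    ax.set_xlabel("Total medals awarded")
    ax.set_ylabel("Countries winning at least one medal")
    return fig, ax
```

Something like `connected_scatter(years, medals, countries, label_years={1896, 2012})` followed by `fig.savefig(...)` would give a first draft; hand-placing the labels, as I did above, is still likely to be needed for a readable result.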

While the data is the same, the focus of the chart has changed somewhat. We're no longer concentrating on the correlation between medals awarded and the number of countries winning medals. That was never likely to give us any particularly remarkable insight anyway. Instead we now see a general progression with time (despite time not being plotted directly). Moreover, we also see the games that do not conform to the progression. We might wonder why so few countries won medals at the 1904 Olympics, what happened in 1976 and 1980, or why the number of medal-winning nations increased so dramatically between 1988 and 1996. The answers to these questions actually lie in the geographical and political history of the games.

I could, at this point, try to explain in paragraphs what happened. Instead, I'll use some annotations to succinctly describe broad/partial causes of some of the interesting changes in the series. It should be noted that the data used isn't necessarily the best data to unambiguously highlight the historical events I describe. For example, the number of countries competing, rather than the number winning a medal, might provide a clearer visual indication that boycotts took place in the 1970s and 80s. For this example, however, the annotations are meant to inform us about the data, not the other way around.

We've moved from what I would generally refer to as a chart to what I might label an infographic (though the distinction is probably not all that relevant). And while scatter plots are frequently seen in scientific literature, infographics are more popular in newspapers. And that's where some of the best examples of connected scatter plots can be found. A few years ago, Alberto Cairo wrote about several examples he was fond of. I particularly like this one by Hannah Fairfield, showing how and why the auto fatality rate in the USA has (mostly) decreased despite an increase in the number of miles driven.

None of the above is intended to suggest that we should replace all our traditional scatter plots with connected scatter plots. Even if time is relevant to the data, an (unconnected) scatter plot may well still be the best visual representation for the task at hand. The general relationship between the two directly plotted variables is clearest in the first example, and subtle labeling of points may be all that's needed to describe a third variable. But in some cases, connecting points in order may help to convey a richer and more interesting story.
