Statistics and figures are often used to make arguments and provide evidence for certain points of view. Sadly, the overwhelming majority of these figures and graphs are misrepresentative, misleading, or outright wrong. Knowing what to look for in a statistic is extremely important in order to know whether you can trust its stance at all. This isn't going to be an entry about making statistics, but more about determining whether you can trust the data presented to you. I'll also explain some of the important terminology used to discuss statistics.
The first important part of any statistic is the sample size. A sample is a token of a whole, chosen at random in order to test some kind of hypothesis. The sample size therefore refers to how many such random tokens were picked to test against. In more basic terms, it's how many different things you took a look at. If you're conducting a survey, that would be the number of people you surveyed. If you're testing for water pollution, it would be the number of water samples you gathered to test.
The sample size is vital to know because it gives you an idea of how representative the conclusion is of the whole. If you have a million people but only five of them were surveyed, the conclusion is unlikely to be representative of all million. Sample sizes need to be as large as is practical: it's easy to assume a small sample is representative when it may not be at all, because the statistic could simply have been unlucky or biased in its choice of samples. As a concrete example, a national survey of a few dozen people won't hold much weight. A properly random sample of roughly a thousand respondents can already achieve a margin of error of about ±3%, but anything smaller, or anything not truly random, quickly becomes unreliable.
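To make the effect of sample size concrete, here's a small Python sketch. The population of one million and the 30% opinion share are made up purely for illustration; the point is how wildly the estimate swings at small sample sizes:

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of one million people, 30% of whom hold some opinion.
population = [1] * 300_000 + [0] * 700_000

def surveyed_share(n):
    """Estimate the share holding the opinion from n randomly chosen people."""
    sample = random.sample(population, n)
    return sum(sample) / n

for n in (5, 100, 10_000):
    print(n, surveyed_share(n))
```

With only five respondents the estimate can easily land at 0% or 60%; with ten thousand it settles very close to the true 30%.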
Aside from the number of samples, it's vital to know how the samples were selected. After all, if there is a bias in the selection, practically any desired result can be produced by just picking whatever fits the idea. This is a very popular strategy in climate change denial, where only certain windows of time that fit the hypothesis are considered. In surveys, you could ask only people you know will agree with your viewpoint. Knowing how the samples were selected is therefore vital to judging whether the resulting data can be trusted at all.
Evaluating this is a bit trickier, since you cannot just rely on numbers; instead, you need to look into how the statistic was put together and conducted. Generally, it's important that the samples come from a large variety of sources. After all, they should be random. Usually, a large variety in location, measurement technique, and time within the constraints of the experiment is indicative of an unbiased result. If you notice that only specific points in time or locations, or a single way of measuring, were used, you should be suspicious and look for more information that fills in those gaps. You might just prove the opposite of the original statistic!
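Selection bias is easy to demonstrate in code. In this hypothetical sketch, a population is split evenly between "yes" and "no" answers; a random sample recovers that split, while a deliberately biased selection produces whatever result you want:

```python
import random

random.seed(0)

# Hypothetical population: half would answer "yes", half "no".
population = ["yes"] * 5_000 + ["no"] * 5_000

# Fair sampling: pick 500 respondents uniformly at random.
fair = random.sample(population, 500)

# Biased sampling: only survey people already known to answer "yes".
biased = [p for p in population if p == "yes"][:500]

print(sum(r == "yes" for r in fair) / 500)    # close to 0.5
print(sum(r == "yes" for r in biased) / 500)  # exactly 1.0
```

Both "surveys" have the same sample size of 500, yet only one of them tells you anything about the population.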
Average and Median
The average and median are two ways of boiling down a set of samples to a single number. The average is computed by summing up all samples and dividing by the number of samples. The median, on the other hand, is computed by sorting all samples by magnitude and picking the middle one (with an even number of samples, the two middle values are averaged). Both are valuable in giving a quick overview. The average allows you to smooth out the result of a noisy sample set: if there's a lot of tiny variation, the average will give you a good idea of what the standard is. The median is important because it gives you a better picture when there are outliers in your samples. As an example, if you have the sample set (1 8 1 1000 5 2 3 4 2 1), the average is 102.7, which is rather misrepresentative. The median, however, is 2.5, which fits the given data much better.
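Both numbers are easy to check with Python's built-in statistics module (for this ten-element set, the median is the mean of the two middle values):

```python
from statistics import mean, median

samples = [1, 8, 1, 1000, 5, 2, 3, 4, 2, 1]

print(mean(samples))    # 102.7
print(median(samples))  # 2.5 — the single outlier barely affects it
```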
Standard Deviation
The average and median can still give you a very skewed image. The standard deviation is the next vital piece of information you need. Sadly, it's also the one that is omitted in almost all infographics and newspapers. Without the standard deviation, the average and median (and thus, in effect, the graphs) are practically meaningless.
Computing the standard deviation is a bit more involved: first, compute the average of your samples. For each sample, calculate its difference from the average and square the result. Next, compute the average of these squared differences, which gives you the variance. The square root of the variance is the standard deviation.
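The steps above translate directly into a short function. This is the population standard deviation (equivalent to Python's statistics.pstdev), applied to the sample set from the previous section:

```python
from math import sqrt

def std_dev(samples):
    """Population standard deviation, following the steps described above."""
    avg = sum(samples) / len(samples)                  # 1. compute the average
    squared_diffs = [(x - avg) ** 2 for x in samples]  # 2. square each difference
    variance = sum(squared_diffs) / len(samples)       # 3. average the squared differences
    return sqrt(variance)                              # 4. take the square root

print(std_dev([1, 8, 1, 1000, 5, 2, 3, 4, 2, 1]))  # roughly 299 — huge compared to the median of 2.5
```

A standard deviation of roughly 299 on data whose median is 2.5 is an immediate red flag that the average of 102.7 doesn't describe the samples well.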
The standard deviation gives you an idea of how noisy the data set is and how far the samples stray from each other. The higher the standard deviation, the less representative the average is of the individual samples, and thus the less trustworthy it is on its own. Without the standard deviation, you cannot know whether the graphs presented to you bear any weight at all, because the sample points might lie so far apart from each other that there is no way to form an accurate conclusion.
Be very wary of percentages, especially if you lack any of the above information. If you just ask three people on the street, you could already make a claim like “One in three people ..” and be technically correct according to your statistic. Using percentages is a very, very popular way of pushing an agenda because it gives a misleadingly simple view of the results of a statistic. Always be sceptical, and look at all of the above-mentioned points before forming a conclusion on whether the presented results are accurate and trustworthy.
Correlation and Causation
This is related to the conclusions that can be formed based on a statistic. Oftentimes you will see a claim based on the correlation of two statistics: because the two correlate, they must somehow be causally related as well. This is a very nasty fallacy. Simply put, things are not that easy. Things can be correlated without any causal link between them whatsoever, or at the very least the causal relationship may be so complicated, intertwined, and influenced by other factors that it might as well not exist at all.
The causal relationship could also run the other way from what is claimed. As an example, it has been found that open-mindedness is correlated with being well-read. While reading a lot could lead to being open-minded, it might just as well be that being open-minded leads to reading more. But again, it could also be coincidental, caused by different factors that were not taken into account. However, the converse, that causation implies correlation, is much more likely to hold: if two things are causally related, it is very likely that they will also correlate in some fashion.
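A quick illustration of spurious correlation: two made-up yearly series (the names and numbers here are entirely hypothetical) that both happen to trend upward over time will correlate strongly, even though neither causes the other:

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Two hypothetical series that each simply grow over time, independently.
years = range(20)
ice_cream_sales = [100 + 5 * t + random.gauss(0, 3) for t in years]
shark_sightings = [10 + 2 * t + random.gauss(0, 2) for t in years]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(pearson(ice_cream_sales, shark_sightings))  # close to 1, yet there is no causal link
```

The shared upward trend (here, simply the passage of time) is the hidden factor driving the correlation, not any relationship between the two quantities themselves.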
If you want some examples that illustrate this fallacy, look at the Google Images results for “correlation causation”. You can make up completely ridiculous examples that are obviously unrelated, but by the same method you can also make very convincing-looking arguments. Be very wary of claims like this.
Simply be suspicious of any infographics you encounter, especially on the net, even more so when the source is a newspaper or blog rather than a scientific paper, and absolutely so when no source is linked at all. When you do have a statistic on your hands, try to get a hold of the above points of information before forming a conclusion on whether the statistic holds any merit at all.
If you're interested in learning more about how to evaluate whether a conclusion based on a statistic is sound, look into learning about argumentation and reasoning in general. I can recommend A Rulebook for Arguments as a quick start. I can promise you that learning to properly reason and argue will be a very worthwhile endeavour.
Written by shinmera