Five standard deviations are cute.
However, Tommaso Dorigo wrote the first part of his two-part "tirade against the five sigma",
Demistifying The Five-Sigma Criterion
I mostly disagree with his views. The disagreement begins with the first word of the title ;-) which I would personally write as "demystifying" because what we're removing is mystery rather than mist (although the two are related words for "fog").
He "regrets" that the popular science writers tried to explain the five-sigma criterion to the public – I think they should be praised for this particular thing because the very idea that the experimental data are uncertain and scientists must work hard and quantitatively to find out when the certainty is really sufficient is one of the most universal insights that people should know about the real-world science.
When I was a high school kid, I mostly disliked all this science about error margins, uncertainties, standard deviations, and noise. This sentiment of mine must have been a rather general symptom of a budding theorist. Error margins are messy. They're the cup of tea of sloppy experimenters while the pure and saintly theorist only works with perfect theories making perfect predictions about a perfectly behaving Universe.
Of course, sometime early in college, I was forced to get my hands dirty a little bit, too. You can't really do any empirical research without some attention paid to error margins and to the probabilities that the disagreements are coincidental. As far as I know, the calculation of standard deviations was one of the things that I did not learn from any self-study – these topics just didn't previously look beautiful and important to me – and the official institutionalized education system had to improve my views. The introduction to error margins and probabilistic distributions in physics was a theoretical introduction to our experimental lab courses. It was taught by experimenters and I suppose that it was no accident because they were more competent in this business than most of the typical theorists.
At any rate, I found out that the manipulations with the probability distributions were a nice and exact piece of maths by themselves – even though they were developed to describe other, real things, that were not certain or sharp – and I enjoyed finding my own derivations of the formulae (the standard deviations for the coefficients resulting from linear regression were the most complex outcomes of this fun research).
So, in simplified semi-layman's conventions, hypotheses predict that a quantity \(X\) should be equal to \(\bar X\pm \Delta X\). The error margin – well, the standard deviation – \(\Delta X\) is never zero because our knowledge of the theory, its implications, or the values of the parameters we have to insert into the theory is never perfect.
Similarly, the experimenters measure the value to be \(X_{\rm obs}\) where the subscript stands for "observed". The measurement also has its error margin. The error margin has two main components, the "statistical error" and the "systematic error". The "total error" for a single experiment may always be calculated (using the Pythagorean theorem) as the hypotenuse of the triangle whose legs are the statistical error and the systematic error, respectively.
The difference between the statistical error and the systematic error is that the statistical error contains all the contributions to the error that tend to "average out" when you're repeating the measurement many times. They're averaging out because they're not correlated with each other, so about one-half of the individual results lie above the mean and one-half of them lie below it, and most of the errors cancel. In particular, if you repeat the same dose of experiments \(N\) times, the statistical error decreases \(\sqrt{N}\) times. For example, the LHC has to collect many collisions because the certainty of its conclusions and discoveries is usually limited by the "statistics" – by its having an insufficient number of events that can only draw a noisy caricature of the exact graphs – so it has to keep on collecting new data. If you want the relative accuracy (or the number of sigmas) to be improved \(K\) times, you have to collect \(K^2\) times more collisions. It's that simple.
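To make this concrete, here is a minimal numerical sketch with made-up numbers (the 0.3 and 0.4 error components and the 0.4 offset are arbitrary): it combines the two error components of a single measurement in quadrature and shows that repetition beats down the statistical error as \(1/\sqrt{N}\) while a constant systematic offset – more on it below – survives.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single toy measurement: the statistical and systematic components are
# combined in quadrature, i.e. as the hypotenuse of a right triangle.
stat, syst = 0.3, 0.4                      # made-up error components
print("total error of one measurement:", np.hypot(stat, syst))   # 0.5

# Repeating the measurement beats the statistical error of the mean down as
# 1/sqrt(N); a constant systematic offset does not average out.
true_value, syst_offset = 10.0, 0.4        # hypothetical numbers
for N in (100, 10_000, 1_000_000):
    data = true_value + syst_offset + rng.normal(0.0, 1.0, size=N)
    stat_error_of_mean = data.std(ddof=1) / np.sqrt(N)
    bias = data.mean() - true_value        # hovers near syst_offset = 0.4
    print(N, round(stat_error_of_mean, 5), round(bias, 4))
```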
On the other hand, the systematic error is an error that always stays the same if you repeat the experiment. If the CERN folks had incorrectly measured the circumference of the LHC to be 27 kilometers rather than 24.5 kilometers, this would influence most of the calculations and the 10% error wouldn't go away even after you perform quadrillions of collisions. All of them are affected in the same way. Averaging over many collisions doesn't help you. Even the opinions of two independent teams – ATLAS and CMS – are incapable of fixing the bug because the teams aren't really independent in this respect as both of them use the wrong circumference of the LHC. (This example is a joke, of course: the circumference of the LHC is known much, much more accurately; but the universal message holds.)
When you're adding error margins from two "independent" experiments, like from the ATLAS collisions and the CMS collisions, you may add the statistical errors for "extensive" quantities (e.g. the total number of all collisions or collisions of some kind seen by both detectors) by the Pythagorean theorem. It means that the statistical errors in "intensive" quantities (like fractions of the events that have a property) decrease as \(1/\sqrt{N}\) where \(N\) is the number of "equal detectors". However, the systematic errors have to be added linearly, so the systematic errors of "intensive" quantities don't really drop and stay constant when you add more detectors. Only once you calculate the total systematic and statistical errors in this non-uniform way may you add them (total statistical and total systematic) via the Pythagorean theorem (physicists say "add them in quadrature").
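A tiny sketch of that combination rule, assuming (as the simplest case) that the systematic errors of the two detectors are fully correlated so that they add linearly; the numbers are invented:

```python
import numpy as np

def combine(stat_errors, syst_errors):
    """Combination rule sketched above: statistical errors add in quadrature,
    fully correlated systematic errors add linearly, and the two totals are
    then combined in quadrature ("added in quadrature") at the very end."""
    total_stat = np.sqrt(np.sum(np.square(stat_errors)))
    total_syst = np.sum(syst_errors)
    return total_stat, total_syst, np.hypot(total_stat, total_syst)

# Two detectors measuring the same extensive quantity (invented numbers):
print(combine([3.0, 4.0], [1.0, 1.0]))   # (5.0, 2.0, sqrt(29) ~ 5.39)
```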
So far, all the mean values and standard deviations are given by universal formulae that don't depend at all on the character or shape of the probabilistic distribution. For a distribution \(\rho(X)\), the normalization condition, the mean value, and the standard deviation are given by\[
\eq{
1 & = \int dX\,\rho(X) \\
\bar X &= \int dX\,X\cdot \rho(X) \\
(\Delta X)^2 &= \int dX\,(X-\bar X)^2\cdot\rho (X)
}
\] Note that the integral \(\int dX\,\rho(X)\) with the extra insertion of any quadratic function of \(X\) is a combination of these three quantities. The Pythagorean rules for the standard deviations may be shown to hold independently of the shape of \(\rho(X)\) – it doesn't have to be Gaussian.
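For readers who prefer to see the three integrals at work, here is a short numerical check with an arbitrarily chosen, manifestly non-Gaussian density (an exponential, for which the mean and the standard deviation both equal one):

```python
import numpy as np

# rho(X) = exp(-X) for X >= 0: normalized, with mean 1 and standard deviation 1.
X, dX = np.linspace(0.0, 60.0, 600_001, retstep=True)
rho = np.exp(-X)

norm = np.sum(rho) * dX                        # -> 1   (normalization)
mean = np.sum(X * rho) * dX                    # -> 1   (mean value, "bar X")
var  = np.sum((X - mean) ** 2 * rho) * dX      # -> 1   ((Delta X)^2)
print(norm, mean, np.sqrt(var))
```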
However, we often want to calculate the probability that the difference between the theory and the experiment would be "this high" just by chance (i.e. whether that probability is still high enough for the deviation to be dismissed as a fluke) – this is the ultimate reason why we talk about the standard deviations at all. And to translate the "number of sigmas" to "probabilities" or vice versa is something that requires us to know the shape of \(\rho(X)\) – e.g. whether it is Gaussian.
There's a 32% risk that the deviation from the central value exceeds 1 standard deviation (in either direction), a 4.6% risk that it exceeds 2 standard deviations, a 0.27% risk that it exceeds 3 standard deviations, a 0.0063% risk that it exceeds 4 standard deviations, and a 0.000057% risk – about 1 part in 1.7 million – that it exceeds five standard deviations.
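These numbers follow from the Gaussian shape alone; a quick check using Python's standard-library complementary error function reproduces them:

```python
from math import erfc, sqrt

# Two-sided Gaussian tail: the probability of deviating from the mean by more
# than n standard deviations in either direction is erfc(n / sqrt(2)).
for n in range(1, 6):
    p = erfc(n / sqrt(2))
    print(f"{n} sigma: {100 * p:.6f}%   (about 1 in {1 / p:,.0f})")
```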
So far, Dorigo has offered roughly four criticisms of the five-sigma criterion:
1. five sigma is a pure convention;
2. the systematic errors may be underestimated, which results in a dramatic exaggeration of our certainty (we shouldn't be this sure!);
3. the distributions are often non-Gaussian, which also means that we should be less sure than we are;
4. systematic errors don't drop when datasets are combined, and some people think that they do.
You see that this set of complaints is a mixed bag, indeed.
Concerning the first one, yes, five sigma is a pure convention but an important point is that it is damn sensible to have a fixed convention. Particle physics and a few other hardcore disciplines of (usually) physical science require 5 sigma, i.e. a risk of 1 in 1.7 million that we have a false positive, and that's a reasonably small risk that allows us to build on previous experimental insights.
The key point is that it's healthy to have the same standards for discoveries of anything (e.g. 1 in 1.7 million) so that we don't lower the requirements in the case of potential discoveries we would be happy about; the required certainty can't be too low because science would be flooded with wrong results obtained from noise and subsequent scientific work building on such wrong results would be ever more rotten; and the certainty can't ever be "quite" 100% because that would require an infinite number of infinitely large and accurate experiments and that's impossible in this Universe, too.
We're gradually getting certain that a claim is right but this "getting certain" is a vague subjective process. Science in the sociological or institutionalized sense has formalized it so that particle physics allows you to claim a discovery once your certainty surpasses a particular threshold. It's a sensible choice. If the convention were six sigma, many experiments would have to run for a time longer by 44% or so (the \(K^2\) rule above) before they would reach the discovery levels but the qualitative character of the scientific research wouldn't be too different. However, if the standard in high-energy physics were 30 sigma, we would still be waiting for the Higgs discovery today (even though almost everyone would know that we're waiting for a silly formality). If the standard were 2 sigma, particle physics would start to resemble soft sciences such as medical research or climatology and particle physicists would melt into stinky decaying jellyfish, too. (This isn't meant to be an insulting comparison of climatology to other scientific disciplines because this comparison can't be made at all; a more relevant comparison is the comparison of AGW to other religions and psychiatric diseases.)
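The running-time estimates above are just the \(K^2\) rule in action; a two-line sanity check, assuming the measurement is purely statistics-limited:

```python
# K^2 rule: pushing a fixed signal from 5 sigma to n sigma needs (n / 5) ** 2
# times more data, hence roughly that much more running time.
for n in (6, 30):
    print(f"5 sigma -> {n} sigma: {(n / 5) ** 2:.2f} times more data")
```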
Concerning Tommaso's second objection, namely that some people underestimate systematic errors, well, he is right and this blunder may shoot their certainty about a proposition through the roof even though the proposition is wrong. But you can't really blame this bad outcome – whenever it occurs – on the statistical methods and conventions themselves because you need some statistical methods and conventions. You must only blame it on the incorrect quantification of the systematic error.
The related point is the third one, namely that the systematic errors don't have to be normally distributed (i.e. with a distribution looking like the Gaussian). When the distributions have thick tails and you have several ways to calculate the standard deviations, you had better choose the largest one.
However, I need to say that Tommaso heavily underestimates the Gaussian, normal distribution. While he says that it has "some merit", he thinks that it is "just a guess". Well, this sentence of his is inconsistent and I will explain a part of the merit below – the central limit theorem that says that pretty much any sufficiently complicated quantity influenced by many factors will be normally distributed.
Concerning Tommaso's last point, well, yes, some people don't understand that the systematic errors don't become any better when many more events or datasets are merged. However, the right solution is to make them learn how to deal with the systematic errors; the right solution is not to abandon the essential statistical methods just because someone didn't learn them properly. Empirical science can't really be done without them. Moreover, while one may err on the side of hype – one may underestimate the error margins and overestimate his certainty – he may err on the opposite, cautious side, too. He may overstate the error margins and \(p\)-values and deny the evidence that is actually already available. Both errors may turn one into a bad scientist.
Now, let me return to the Gaussian, normal distribution. What I want to tell you about – if you haven't heard of it – is the central limit theorem. It says that if a quantity \(X\) is a sum of many (\(M\to\infty\)) terms whose distribution is arbitrary but has a finite variance (the distributions for individual terms may actually differ but I will only demonstrate a weaker theorem that assumes that the distributions coincide), then the distribution of \(X\) is Gaussian i.e. normal i.e. \[
\rho(X) = C\exp\zav{ - \frac{(X-\bar X)^2}{2(\Delta X)^2} }
\] i.e. the exponential of a quadratic function of \(X\). If you need to know, the normalization factor is \(C=1/(\Delta X\sqrt{2\pi})\). Why is this central limit theorem true?
Recall that we are assuming\[
X = \sum_{i=1}^M S_i.
\] You may just add some bars (i.e. integrate both sides of the equation over \(X\) with the measure \(dX\,\rho(X)\): the integration is a linear operation) to see that \[
\bar X = \sum_{i=1}^M \bar S_i.
\] It's almost equally straightforward (trivial manipulations with integrals whose measure is still \(dX\,\rho(X)\) or similarly for \(S_i\) and that have some extra insertions that are quadratic in \(S_i\) or \(X\)) to prove that\[
(\Delta X)^2 = \sum_{i=1}^M (\Delta S_i)^2
\] assuming that \(S_i,S_j\) are independent of each other for \(i\neq j\) i.e. that the probability distribution for all \(S_i\) factorizes to the product of probability distributions for individual \(S_i\) terms. Here we're assuming that the error included in \(S_i\) is a "statistical error" in character.
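Both identities are easy to check numerically for independent terms drawn from deliberately non-Gaussian distributions (the exponential and uniform choices below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Means add and variances add for independent terms, with no Gaussian input:
S1 = rng.exponential(scale=2.0, size=1_000_000)   # mean 2, variance 4
S2 = rng.uniform(-1.0, 1.0, size=1_000_000)       # mean 0, variance 1/3
X = S1 + S2

print(X.mean(), S1.mean() + S2.mean())            # both close to 2
print(X.var(),  S1.var()  + S2.var())             # both close to 4 + 1/3
```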
So the mean value and the standard deviation of \(X\), the sum, are easily determined from the mean values and the standard deviations of the terms \(S_i\). These identities don't require any distribution to be Gaussian, I have to emphasize again.
Without loss of generality, we may linearly redefine all the variables \(S_i\) and \(X\) so that their mean values are zero and the standard deviation of each \(S_i\) is one. Recall that we are assuming that all \(S_i\) have the same distribution that doesn't have to be Gaussian. We want to know the shape of the distribution of \(X\).
An important fact to realize is that the probabilistic distribution for a sum is given by the convolution of the probability distributions of individual terms. Imagine that \(X=S_1+S_2\); the arguments below hold for many terms, too. Then the probability that \(X\) is between \(X_K\) and \(X_K+dX\) is given by the integral over \(S_1\) of the probability that \(S_1\) is in an infinitesimal interval and \(S_2\) is in some other corresponding interval for which \(S_1+S_2\) belongs to the desired interval for \(X\). The overall probability distribution is given by \[
\rho(X_K) = \int dS_1 \rho_S(S_1) \rho_S(X_K-S_1).
\] You should think about why this is the case. At any rate, the integral on the right hand side is called the convolution. If you know some maths, you must have heard that there's a nice identity involving convolutions and the Fourier transform: the Fourier transform of a convolution is the product of the Fourier transforms!
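A minimal numerical sketch of both statements – the density of a sum is the convolution of the densities, and a Fourier transform turns that convolution into a pointwise product – using two arbitrarily chosen densities on a discrete grid:

```python
import numpy as np

N, dx = 8192, 0.01
x = np.arange(N) * dx                              # grid covering [0, ~82)

rho1 = np.where(x < 2.0, 0.5, 0.0)                 # uniform density on (0, 2)
rho2 = np.exp(-x)                                  # exponential density, rate 1

# Discretized convolution: the (approximate) density of the sum S1 + S2.
direct = np.convolve(rho1, rho2)[:N] * dx

# Same thing through Fourier space: transform, multiply pointwise, transform back.
via_fft = np.real(np.fft.ifft(np.fft.fft(rho1) * np.fft.fft(rho2))) * dx

print("maximal difference between the two routes:", np.abs(direct - via_fft).max())
```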
So instead of \(\rho(X_K)\), we may calculate its Fourier transform and it will be given by a simple product (we return to the general case of \(M\) terms immediately)\[
\tilde \rho_X(P) = \prod_{i=1}^M \tilde\rho_S(P).
\] Here, \(P\) is the Fourier momentum-like dual variable to \(X\); because each term \(S_i\) only enters through the sum, every factor ends up being evaluated at the same dual variable \(P\). However, now we're almost finished because the product of many (\(M\)) equal factors may be rewritten in terms of an exponential. If \(\tilde\rho_S(P)=\exp(W(P))\), then the product of \(M\) equal factors is just \(\exp(MW(P))\) and the funny thing is that this becomes totally negligible away from \(P=0\) once \(M\) is large, because \(|\tilde\rho_S(P)|\) is below one everywhere except at \(P=0\) where it equals one. So we only need to know how the right hand side behaves in the vicinity of the maximum of \(W\), i.e. near \(P=0\). A generic function \(W(P)\) may be approximated by a quadratic function over there, which means that both sides of the equation above will be well approximated by \(C_1\exp(-MC_2 P^2)\) for \(M\to\infty\).
It's the Gaussian and if you make the full calculation, the Gaussian will inevitably come out as shifted, stretched or shrunk, and renormalized so that \(X\), the sum, has the previously determined mean value and standard deviation and so that the probability distribution for \(X\) is normalized. Just to be sure, the Fourier transform of a Gaussian is another Gaussian so the Gaussian shape rules regardless of the variables (or dual variables) we use.
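The same mechanism can be seen numerically: below, the Fourier transform (characteristic function) of a single standardized uniform term – an arbitrary non-Gaussian choice – is raised to the \(M\)-th power and compared with the Gaussian \(\exp(-MP^2/2)\); the mismatch keeps shrinking as \(M\) grows.

```python
import numpy as np

P = np.linspace(-1.0, 1.0, 2001)        # dual (momentum-like) variable
a = np.sqrt(3.0)                        # uniform on [-a, a] has mean 0, variance 1
char_one = np.sinc(a * P / np.pi)       # its characteristic function sin(aP)/(aP)

for M in (10, 100, 1000):
    gap = np.abs(char_one ** M - np.exp(-M * P ** 2 / 2)).max()
    print(M, gap)                        # the gap decreases as M grows
```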
So there's a very important – especially in the real world – class of situations in which the quantity \(X\) may be assumed to be normally distributed. The normal distribution isn't just a random distribution chosen by some people who liked its bell-like shape or wanted to praise Gauss. It's the result of normal operations we experience in the real world – that's why it's called normal. The more complicated factors influencing \(X\) you consider, and they may be theoretical or experimental factors of many kinds, the more likely it is that the Gaussian distribution becomes a rather accurate approximation for the distribution for \(X\).
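As a final sketch of the theorem in action (with an arbitrarily chosen, very non-Gaussian ingredient), summing a few dozen uniform variables already produces a histogram that hugs the Gaussian with the corresponding mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(1)

M, trials = 50, 200_000
sums = rng.uniform(0.0, 1.0, size=(trials, M)).sum(axis=1)   # X = sum of 50 terms

mean, sigma = M * 0.5, np.sqrt(M / 12.0)     # exact mean and Delta X of the sum
density, edges = np.histogram(sums, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
gauss = np.exp(-(centers - mean) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print("largest gap between the histogram and the Gaussian:",
      np.abs(density - gauss).max())
```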
Whenever \(X\) may be written as the sum of many terms with their own error margins, the normal distribution is probably legitimate (even though the inner structure of these terms may have a different, nonlinear character etc.; and the sum itself may be replaced by a more general function because if it has many variables and the relevant vicinity of \(X\) is narrow, the linearization becomes OK and the function may be well approximated by a linear combination, i.e. effectively a sum, anyway). Only if the last operation to get \(X\) is "nonlinear" – if \(X\) is a nonlinear function of a sum of many terms etc. – or if you have another specific reason to think that \(X\) is not normally distributed, should you point this fact out and take it into account.
But Tommaso's fight against the normal distribution as the "default reaction" is completely misguided because pretty much no confidence levels in science could be calculated without the – mostly justifiable – assumption that the distribution is normal. Tommaso decided to throw the baby out with the bathwater. He doesn't want an essential technique to be used. He pretty much wants to discard some key methodology but as a typical whining leftist, he has nothing constructive or sensible to offer for the science he wants to ban.