Justifying Normality Assumption


… the statistician knows … that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.

As a Statistics major, I have been taught for years to test for normality prior to applying procedures like Student's t-test, ANOVA, and linear regression, because they all rely on the normality assumption. In this post, I will question this practice.

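To begin with, normality tests are designed to tell you whether the data come from a normal distribution, but you already know the answer. Think about it: the normal distribution is continuous and the rationals are countable, so the probability of getting a rational number from it is exactly zero! Every value you can record with finite precision is rational, so no real dataset can be a sample from a perfectly normal distribution.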

Failure to reject the null hypothesis means you could not find enough evidence against the normality assumption. However, absence of evidence is not evidence of absence[1]. To elaborate, one of three things might have happened:

  1. The data came from a perfect normal distribution.
  2. Your sample size is too small for the deviation from perfect normality to manifest itself.
  3. A type-II error occurred, or if you prefer, the significance level α is too small.

The first possibility has been ruled out, and let's ignore the last one since it's intrinsic to all procedures concerning randomness. Now, if you look closely at the second possibility, it unrolls into a spectrum. At one end, the distribution is not even nearly normal but you only have a miserably small amount of data; at the other end, you have an enormous sample but the underlying distribution is a t-distribution with 5000 degrees of freedom or a rounded normal distribution. Consequently, if your data pass a normality test at a certain significance level, then you know you are somewhere on the spectrum, but you have no idea about the exact location[2]. With sufficient data and arithmetic precision, every rational dataset will be found non-normal (pun intended). This isn't super helpful.
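
The two ends of the spectrum can be sketched in Python. This is only an illustration under assumptions that are not in the post: the Jarque-Bera test is used because it fits in a few lines of standard-library code, a uniform sample of size 10 stands in for the "not even nearly normal" end, and a sum of 12 uniforms stands in for a nearly normal distribution (a t-distribution with 5000 degrees of freedom would need an absurdly large sample to show up). The seed, the sample sizes, and the names `jarque_bera_pvalue` and `near_normal` are hypothetical choices.

```python
import math
import random


def jarque_bera_pvalue(data):
    """Approximate p-value of the Jarque-Bera normality test.

    JB = n/6 * (S^2 + K^2/4), with S the sample skewness and K the sample
    excess kurtosis. Under normality JB is asymptotically chi-square with
    2 degrees of freedom, whose survival function is exp(-x/2). The
    approximation is rough for small n.
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    return math.exp(-jb / 2.0)


rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability


def near_normal():
    # Sum of 12 uniforms, centered: close to N(0, 1) but with slightly
    # lighter tails (excess kurtosis of -0.1). An assumed stand-in.
    return sum(rng.random() for _ in range(12)) - 6.0


# One end: clearly non-normal (uniform) data, tiny sample.
small_uniform = [rng.random() for _ in range(10)]

# Other end: nearly normal data, large sample.
large_near_normal = [near_normal() for _ in range(200_000)]

p_small = jarque_bera_pvalue(small_uniform)
p_large = jarque_bera_pvalue(large_near_normal)
print(f"n = 10, uniform:          p = {p_small:.3f}")
print(f"n = 200000, near-normal:  p = {p_large:.3g}")
```

With so few points the test will typically fail to reject even for uniform data, while the large sample is rejected despite being very close to normal. The p-value alone therefore does not tell you where on the spectrum you are.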

If a normality test is not a fair justification for the normality assumption, what should we use instead? I personally believe the answer is our experience and wisdom. Seriously, we must carefully investigate the data generating process in question, as opposed to relying on any automagical black-box procedure. For example, if you have learned that the response is the sum of many independent noises, each of which contributes only a tiny share of the total variance, then it could be reasonable to invoke the Central Limit Theorem and assume normality. In addition, sometimes you only need "partial" normality, e.g. when conducting a t-test. In these cases, you can test for that particular property of the normal distribution instead. See this excellent answer for more about that.
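
The Central Limit Theorem argument above can be checked with a small simulation. The sketch below assumes a made-up data generating process that is not from the post: the response is a sum of independent exponential noises with mean 1, and sample skewness is used as a crude measure of departure from normality. The names `response` and `sample_skewness`, the seed, and the sample size are hypothetical.

```python
import random


def sample_skewness(data):
    """Sample skewness m3 / m2^1.5; it is 0 for a symmetric distribution."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


rng = random.Random(1)  # fixed seed, an arbitrary choice for repeatability


def response(num_noises):
    # Assumed process: the response is the sum of `num_noises` independent,
    # strongly skewed noises (exponential with mean 1).
    return sum(rng.expovariate(1.0) for _ in range(num_noises))


skew_by_k = {}
for k in (1, 10, 100):
    draws = [response(k) for _ in range(5000)]
    skew_by_k[k] = sample_skewness(draws)
    print(f"{k:>3} noises: sample skewness = {skew_by_k[k]:.2f}")
```

For this particular process the theoretical skewness is 2 / sqrt(k), so the printed values should shrink toward zero as more noises are summed. That is the sense in which knowledge of the data generating process, rather than a test, can justify the assumption.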

Statistics is a science based on assumptions like linearity, independence, and normality, but abstracting such assumptions from real world scenarios is an art, and arts are inherently subjective. More often than not, it's OK to not reach a consensus because the same evidence is not equally convincing to each mind: perhaps a Q-Q plot looks straight to Alice, but Bob thinks it's closer to a curve, or perhaps a significance level of 0.05 is enough for Bob, but Alice considers it too aggressive. If it's impossible to justify our assumptions to everyone, would it be a good idea to simply leave the judgements to the readers?

1. In fact, absence of evidence may be evidence of absence when you have conducted a power analysis. However, power analyses are effectively disregarded in all Statistics courses I have taken thus far. Maybe it's just me? It's tricky to analyze the power of a normality test anyway, because such analyses are done with respect to a specific alternative hypothesis, but the alternative hypothesis of a normality test (i.e. the distribution is not normal) is too vague to be useful.

2. Technically, you can somehow estimate the real distribution's departure from normality (which is called "effect size" in statistical jargon) with the sample size, but I don't think a lot of people are doing so in practice. I can see developing effect estimators for