## How to translate Math into plain English

Hypothesis testing is probably the most rewarding stage of a statistical study. It is the point where you reach conclusions, where you (hopefully) find answers to the questions you set out to investigate. These questions should be in the form of simple yes/no questions. Hypothesis testing is the stage of a statistical study in which you are *translating math into English*.

Hypothesis testing is, most probably, what your boss wants from you if you are in the role of a statistician. If you just report something like “µ1 = 135 and µ2 = 140”, he or she may say: “OK. But… what do these numbers mean for my business? Is it good? Is it bad?”. It would be much better to say something like: “I found statistical evidence that your new marketing strategy brought us more profit than the previous one”.

At this point of hypothesis testing, it is time for you to turn all those nasty numbers obtained so far from your dataset into easy-to-understand statements about your problem of interest.

### A comparison with Proof by contradiction

Hypothesis testing has some similarities with one of the simplest methods of proof from mathematical logic: proof by contradiction. Hypothesis testing can be thought of as the probabilistic counterpart of proof by contradiction.

Let us refresh our memories first about this method. *How does proof by contradiction work?*

Let us say we want to prove a statement, call it A. When we use proof by contradiction, we start by assuming the opposite of A (denoted ~A, or "not A") to be true, and then we try to reach a contradiction (a statement that is always false). If we reach a contradiction, then our assumption that ~A is true must be false, and therefore A must be true.

The idea of hypothesis testing is similar to that of proof by contradiction, but this time ~A is called the “null hypothesis”, denoted by H0, and A is called the “alternative hypothesis”, denoted by H1. Reaching a contradiction (that is, obtaining a result that **cannot be true**) now becomes reaching a result that **can be true only with low probability**. How low should this probability be? An often-used value is 0.05; this threshold value is called a **significance level** and is denoted by α. So, if the probability of your observed data given the null hypothesis is ≤ α, then you consider this probability too low for H0 to be true; so, you *reject* H0 and therefore *accept* H1. In the case of hypothesis testing, instead of **proving H1** (that is, with 100% confidence; this is the case for mathematical logic proofs), you are **1-α confident that H1 is true**.
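In code, this decision rule is just a comparison. A minimal sketch (the function name `decide` is ours, not a standard API):

```python
def decide(p_value, alpha=0.05):
    """Reject H0 when the probability of the observed data
    under H0 is at most the significance level alpha."""
    return "reject H0 (accept H1)" if p_value <= alpha else "fail to reject H0"

print(decide(0.01))  # strong evidence against H0
print(decide(0.20))  # not enough evidence
```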

Note that we can only accept H1, never H0. Just because we may sometimes fail to reject H0, this does not mean that we accept it. As in the case of proof by contradiction: just because we may not have a good idea of how to reach a contradiction once we have assumed ~A to be true, this does not necessarily mean that ~A is true.

Concluding that H0 is true based on the initial assumption that H0 is true would be a paradox, therefore we can only reject H0.

### Statistical formulation

For doing hypothesis testing we need to make some assumptions first: *we assume our observations X1, X2, …, Xn to be i.i.d. (independent and identically distributed) with a probability distribution **Pθ**; where θ is a fixed but unknown parameter (it can also be a vector of more scalar parameters) belonging to ϴ (capital theta; the set of all possible parameters for **Pθ** family of probability distributions).*

In the context of hypothesis testing, we want to make statements about the parameter θ. For that, we take 2 disjoint subsets of ϴ, ϴ0 and ϴ1, and then consider the following 2 hypotheses:

H0: θ ∈ ϴ0

H1: θ ∈ ϴ1

By convention, H0 is the one that should be rejected, and H1 is the one that we want to accept.

As an example consider the following scenario:

You want to study whether university graduates end up having the job they wanted before applying to university, or whether they tend to change their minds. For that, you go out there and collect some data about randomly selected people that graduated more than 5 years ago, and for each of them you write down a 1 if he/she has the job they wanted before university, or a 0 otherwise. Then you consider this data as being i.i.d. Bernoulli(p) and want to test whether p > ½ or not. So, we have these hypotheses:

H0: p ≤ ½

H1: p > ½

So, how do we decide whether to reject the null hypothesis or not? We use a *random variable to make this decision*. This r.v. (random variable) is a function of our data that outputs only 0 (do not reject) or 1 (reject). For a fixed parameter θ, this r.v. has a Bernoulli distribution. *This r.v. that we use for making the decision is called a **test**, and is denoted by Ψ.*

A test usually has one of the following forms:

Ψ = 1{Tn > c}

Or:

Ψ = 1{|Tn| > c}

Where: Ψ is a function of the observed data X1, …, Xn, and the notation **1{.}** denotes an indicator random variable that takes value 1 when the expression inside the braces is true, and 0 otherwise. Tn is a random variable called a **test statistic**, and c is a constant.

The first form is called a **one-sided test** because the test is 1 only when Tn is on the right side of the constant c. The second form is called a **two-sided test** because the test can be 1 when Tn is on both sides of c; either Tn > c or Tn < -c will result in the test being 1.
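The two forms can be sketched directly as indicator functions, with Tn and c passed in as plain numbers:

```python
def one_sided_test(t_n, c):
    # psi = 1{Tn > c}: reject only when Tn is to the right of c
    return 1 if t_n > c else 0

def two_sided_test(t_n, c):
    # psi = 1{|Tn| > c}: reject when Tn is far from 0 on either side
    return 1 if abs(t_n) > c else 0
```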

### Errors of a test

As you can see above, our test depends on the data that we observe, which is random; we model it as the random variables X1, …, Xn. So, the result of our test is, well…, random. We have no guarantee that it will always give us a 1 when we should reject H0, or that it will always give us a 0 when we should not reject H0. Whatever test we may choose, it can make some errors.

There are 2 types of errors that may occur in hypothesis testing:

**Type I error** occurs when we **reject H0**, but in reality, it is true. **Type II error** is when we **fail to reject H0**, but it is actually false.

The probabilities of both type I and type II errors depend on the true parameter θ, so they are functions of θ.

Probability of type I error, denoted by αΨ:

αΨ(θ) = Pθ(Ψ = 1), for θ ∈ ϴ0

This is the probability that our test indicates 1 when the true θ belongs to ϴ0. Each different θ from ϴ0 can give us a different probability of type I error.

Probability of type II error, denoted by βΨ:

βΨ(θ) = Pθ(Ψ = 0), for θ ∈ ϴ1

This is the probability that our test indicates 0 when the true θ belongs to ϴ1. Each different θ from ϴ1 can give us a different probability of type II error.
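Both error probabilities can be estimated by simulation. The sketch below invents a concrete setting for illustration: i.i.d. Bernoulli(p) data with H0: p ≤ 0.5, and an arbitrary test Ψ = 1{sample mean > 0.58}; the threshold 0.58, the sample size, and the repetition count are our own choices, not from any standard recipe:

```python
import random

random.seed(0)

def psi(sample, c=0.58):
    # psi = 1{sample mean > c}
    return 1 if sum(sample) / len(sample) > c else 0

def rejection_rate(p, n=100, reps=20000):
    # fraction of simulated datasets on which the test rejects
    rejections = 0
    for _ in range(reps):
        sample = [1 if random.random() < p else 0 for _ in range(n)]
        rejections += psi(sample)
    return rejections / reps

type_1 = rejection_rate(0.5)      # true p in Theta0: P(psi = 1)
type_2 = 1 - rejection_rate(0.6)  # true p in Theta1: P(psi = 0)
print(type_1, type_2)
```

Note how moving the true p inside ϴ1 changes the relevant error from type I to type II, and that each is a function of the (unknown) true parameter.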

### Significance level & p-values

We want our test to have a small probability of error. So, we want to choose the test statistic Tn and the constant c in such a way that minimizes the probability of each of these 2 types of errors. But minimizing both types of errors at the same time is a hard problem. In the worst case in which the true parameter θ is at the boundary between ϴ0 and ϴ1, the probabilities of the 2 types of errors are complementary; when one is small, the other is large and there’s no way we can make both of them small.

What we do instead is to prioritize type I error. We set an upper bound on the probability of type I error such that we also get a reasonable type II error.

Why do we prioritize type I error? Type I error occurs when H0 is true and we reject it. In hypothesis testing, rejecting H0 is the thing that leads to a “discovery”, to proving our initial guess (H1). *A large probability of type I error means that we “prove” many false things.* And we do not want that. We prefer to conclude H1 less often, but when we do so, to be pretty confident that what we concluded is actually true.

The upper bound that we fix for the probability of type I error is called the **significance level** of our test, and is denoted by α, therefore:

αΨ(θ) ≤ α, for all θ ∈ ϴ0

*The smallest level α at which the test Ψ rejects H0 is called the **p-value** of the test Ψ.*

This may sound a little confusing at first. What is the difference between α and the p-value? α is a number that *you choose*; you fix it ahead of time, before collecting any data or doing any experiment. A p-value is computed from the observed data: it is the smallest value you could have fixed for α and still be able to reject H0 after computing the test on the observed data. The smaller the p-value, the more confident you are in rejecting H0.
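For a right-sided test whose statistic is (approximately) standard normal under H0, the p-value is just the upper-tail area, which Python's standard library can compute:

```python
from statistics import NormalDist

def right_sided_p_value(t_n):
    # P(Z > t_n) for Z ~ N(0, 1): the smallest alpha at which we reject
    return 1 - NormalDist().cdf(t_n)

p = right_sided_p_value(1.96)
print(round(p, 4))  # about 0.025
alpha = 0.05
print(p <= alpha)   # True: at level 0.05 we reject H0
```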

### 2-sample test

Based on the form of a test that we showed previously:

Ψ = 1{Tn > c}

we may be tempted to think that what we discussed so far applies only when we want to compare a parameter of a single sample to a *constant*, and that Tn should be some kind of estimator for that parameter.

What if we do not want to test how the parameter of some sample compares to a *constant*, but instead test how it compares to the parameter of *another sample*? Do we need to come up with another formula for a test? No. We can use the same formulation of a test and keep valid all we discussed so far. The only thing that we should do is to impose a particular form on Tn:

Tn = Un − Vn

Where Un and Vn are computed based on different samples. By using this Tn, we can test whether Un is greater than Vn. If you are interested only in the absolute difference between Un and Vn, just take the absolute value of their difference and use a 2-sided test.
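One common concrete choice (not the only one) is to build Tn from the two sample means, scaled by their estimated standard error, which gives a Welch-style two-sample statistic; the sketch below uses that choice:

```python
from math import sqrt
from statistics import mean, variance

def two_sample_statistic(xs, ys):
    # Tn = (mean(xs) - mean(ys)) / sqrt(var(xs)/n1 + var(ys)/n2)
    n1, n2 = len(xs), len(ys)
    return (mean(xs) - mean(ys)) / sqrt(variance(xs) / n1 + variance(ys) / n2)

a = [5.1, 4.9, 5.3, 5.0, 5.2]
b = [4.4, 4.6, 4.5, 4.3, 4.7]
print(two_sample_statistic(a, b))  # positive: sample a has the larger mean
```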

### Using Central Limit Theorem for Hypothesis Testing

So far, we discussed hypothesis testing just in general, without describing a specific way of doing it. In the formula of a test that we have shown, Tn can be any random variable. Any weird and non-intuitive random variable that is based on our data and can reject H0 with at most α probability of error will be a good candidate for a test statistic.

However, a commonly used approach is to use the Central Limit Theorem (CLT). If we use an average of i.i.d. random variables (our data) as an estimator of the parameter that we want to test, then the CLT implies that this average will have an approximately normal distribution as the sample size n → ∞. We then standardize this quantity (subtract the mean and divide by the standard deviation) and take it as our test statistic Tn. A Tn computed in this way will have approximately a *standard normal distribution*. This is a well-known probability distribution that we can use to determine the constant c that our test needs in order to meet our chosen level α.

Tn = √n (X̅n − µ) / σ →(d) N(0, 1)

Where:

- µ is the expected value of Xi; this is typically taken from the null hypothesis
- σ is the standard deviation of Xi; in practice it is typically an estimate based on our sample, which gets very close to the true standard deviation as n goes to infinity
- the arrow with a “d” above it means *convergence in distribution*
- N(0, 1) denotes a standard normal distribution

Below is an image of this in the one-sided case:

The constant c is the threshold value between the rejection region (colored in blue) and the non-rejection region. If Tn > c (Tn falls in the rejection region) we reject H0, otherwise we fail to reject. That c is computed based on the chosen value for α; it is calculated so that the area of the rejection region is α. We can find c either by using a table or a calculator/software package.
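In Python, c is the normal quantile at 1 − α (or at 1 − α/2 for a two-sided test); the standard library is enough, no table needed:

```python
from statistics import NormalDist

alpha = 0.05
c_one_sided = NormalDist().inv_cdf(1 - alpha)      # area to the right of c is alpha
c_two_sided = NormalDist().inv_cdf(1 - alpha / 2)  # alpha split between the 2 tails
print(round(c_one_sided, 3))  # about 1.645
print(round(c_two_sided, 3))  # about 1.96
```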

A p-value, in this case, is the area of the region to the right of the value we obtain for Tn. We can also test for Tn < -c, in which case things would be similar, but the rejection region would be to the left of -c.

Below is illustrated a two-sided test:

In the case of a 2-sided test, things would be similar, but α is now split between the 2 sub-regions: Tn < -c and Tn > c. Each one of these 2 regions has an area of α/2.

If you have a large enough sample size (say, of a few hundred), then your test statistic Tn will be very close to a standard normal distribution, and the sample variance will be very close to the true variance. Therefore, this method is one of the easiest and best ways to do hypothesis testing.

### Student’s t-test

The downside of approximating Tn by a standard normal is that it will be inaccurate for small sample sizes.

There are mainly two sources of errors in the case of a small sample:

- The shape of the distribution of X̅n may not be close enough to a normal one
- The estimated variance σ̂² may not be close enough to the true variance σ² of Xi

So, what can we do if we have a relatively small sample size?

One method that we can use to overcome the problem of estimating the variance is the **Student’s t-test**. Student’s t-test is a test in which Tn follows a Student’s t-distribution (the name Student comes from the fact that it was first published under the pseudonym “Student” by William Sealy Gosset in 1908).

The t-distribution is similar to a standard normal one, but it has a parameter called the **degrees of freedom (df)**, which is just the sample size minus one (df = n-1). When df is large (in the order of hundreds) a t-distribution is pretty close to a standard normal distribution. As df gets smaller, the bell-shaped curve starts flattening a little, putting more probability in its tails. As a consequence, when we have a small sample, we need a bigger value of Tn to reject H0, in order to account for the errors that may arise from variance estimation.

Theoretically, for Tn to have a t-distribution, the random variables X1, …, Xn need to be normally distributed. But, in practice, the t-distribution should model well-enough the test statistic Tn if n > 30 (the rule of thumb in CLT) or if X1, …, Xn are close to a normal distribution.
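The heavier tails for small n can be seen by simulation with the standard library alone; the sample size n = 5 and the cutoff 2.0 below are arbitrary choices for illustration:

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(1)
n, reps, cutoff = 5, 20000, 2.0

exceed = 0
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n)]  # data drawn under H0: mu = 0
    t = sqrt(n) * mean(xs) / stdev(xs)           # stdev uses the Bessel correction
    exceed += abs(t) > cutoff

sim_tail = exceed / reps                          # estimates P(|T| > 2) for df = 4
normal_tail = 2 * (1 - NormalDist().cdf(cutoff))  # P(|Z| > 2), about 0.046
print(sim_tail, round(normal_tail, 3))
```

The simulated tail probability comes out noticeably larger than the normal one, which is exactly why a small-sample test needs a larger threshold c.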

Tn = √n (X̅n − µ) / σ̂ ∼ t(n−1)

Where:

- σ̂² = (1/(n−1)) Σi (Xi − X̅n)² is the Bessel-corrected sample variance. We divide by n−1 in order to obtain an **unbiased** estimator; if we follow our intuition and divide just by n, the estimator will be biased.
- t(n−1) is the t-distribution with n−1 degrees of freedom.
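Python's statistics module exposes both divisors, which makes the distinction easy to check (`variance` divides by n−1, `pvariance` by n):

```python
from statistics import variance, pvariance

data = [2.0, 4.0, 6.0]  # mean 4, squared deviations sum to 8
print(variance(data))   # 8 / (n-1) = 4.0  (Bessel-corrected, unbiased)
print(pvariance(data))  # 8 / n    ~ 2.667 (biased when used on a sample)
```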

### An example

Let us come back to an example stated a little above: we want to study whether university graduates end up having the job they wanted before applying to university, or whether they tend to change their minds.

We will model our data as n i.i.d. Bernoulli random variables with parameter p. Each one of these variables will have a value of 1 if that person has the job he/she wanted before university, or 0 otherwise. The parameter p represents the probability of getting a 1.

We want to conclude that most people change their minds. In terms of our assumed model, that is: p < ½. So, we will state our hypotheses as follows:

H0: p ≥ ½

H1: p < ½

For this hypothesis test, we want to use a significance level of α = 0.05.

I do not have any real data for this, so I will just come up with some numbers as an example. Let us say the sample size is 81, sample mean is 0.4 and sample variance is 0.24. For this sample size, we want to use a Student’s t-test.

Now let us assume H0 to be true. This means that p ≥ ½. But we need only one value for p to use in computing Tn. We will choose the value of p from the interval [1/2, 1] that can lead to the biggest error: p = ½. We want to be cautious, that is why we assume the worst value of p for our test. In this way, if we reject, we can be really confident of our result.

Now, let us compute the test statistic:

Tn = √n (X̅n − p) / σ̂ = √81 · (0.4 − 0.5) / √0.24 ≈ −1.84

To see if this test statistic falls in the rejection region we will use a t-distribution table (like the one here):

We look in the table’s header (on the line where it says “one-tail”, because we are doing a one-sided test) for our chosen α, in our case 0.05. Then we look for the row with our degrees of freedom (80). The value that we find at the intersection of this column and row is our constant c, the threshold between the rejection and non-rejection regions; in our case this value is 1.664. But these tables are made considering the rejection region on the right, while in our problem the rejection region is on the left side. Because the t-distribution is symmetric about 0, this does not matter much; we just take the negative of what we find in the table. So, c = −1.664 is our delimiter. If Tn < −1.664 we reject H0, otherwise we fail to reject.

Our test statistic was Tn ≈ −1.84 < −1.664, therefore we reject the null hypothesis and accept the alternative hypothesis H1. So, we conclude that p < ½. In English, this would be: *“We have statistical evidence that most people end up having jobs other than the ones they wanted before university”*.

In our case, the p-value is the area under the t-distribution curve where t ≤ -1.84. For computing this quantity you can use an online calculator like the one here.

Our p-value is 0.0347 and this is the smallest α for which we could reject H0.
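The whole worked example fits in a few lines of standard-library Python. The p-value below uses the normal approximation (with df = 80 the t-distribution is very close to standard normal), so it comes out near, but not exactly at, the 0.0347 read from the t-calculator:

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, sample_var = 81, 0.4, 0.24
p0 = 0.5                         # worst-case value of p under H0: p >= 1/2
t_n = sqrt(n) * (sample_mean - p0) / sqrt(sample_var)
print(round(t_n, 3))             # about -1.837

c = -1.664                       # from the t-table, df = 80, alpha = 0.05
print(t_n < c)                   # True: reject H0

p_value = NormalDist().cdf(t_n)  # left-tail area, normal approximation
print(round(p_value, 4))         # about 0.033, close to the exact t p-value 0.0347
```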

*I hope you found this information useful and thanks for reading!*

This article is also posted on Medium here. Feel free to have a look!
