Statistics Basics for A/B Testing: Part 2
In my last post, I covered some basic concepts important for A/B testing such as population, sample, measures of central tendency, standard deviation and confidence intervals. I will continue discussing other statistical concepts important to A/B testing in this post.
A hypothesis refers to the researcher’s initial belief about the situation before the study. This initial theory is known as the alternative hypothesis and the opposite is known as the null hypothesis. The null hypothesis is typically the “accepted fact”.
Hypothesis testing allows us to determine which theory, the null or alternative, is better supported by the evidence. So, we are basically testing whether the results are valid by figuring out the odds that the results have happened by chance. If the results have happened by chance, the experiment won’t be repeatable and so has little use.
Hypothesis testing is an important method of statistical inference and is widely used in a variety of studies — from medical trials to assess drug effectiveness to observational studies evaluating exercise plans to even randomized controlled experiments (aka A/B testing). What all studies have in common is that they are concerned with making comparisons, either between two groups or between one group and the entire population.
Continuing with the coffee cup temperature example from my previous post, let’s say that Starbucks makes the claim that their coffee is hotter than McDonald’s. Given what we know about hypothesis statements, our hypotheses in this case will be:
Alternative Hypothesis: Starbucks coffee is hotter than McDonald’s.
Null Hypothesis: Starbucks coffee is not hotter than McDonald’s.
Since we can’t measure all the coffee cups at both places without some expensive fancy gizmo, we will use samples and, hence, hypothesis testing to check whether this is really the case.
We take a sample of 50 cups of coffee each from Starbucks and McDonald’s. That sure is a lot of caffeine in our imaginary test! When we plot the temperatures of each of the coffee cups from Starbucks, we get a histogram that resembles a bell.
Here is the histogram for McDonald’s:
The y-axis shows the frequency of observing each temperature value. The values in the middle (near the mean) are more likely to be observed than the values at the extreme. So, we are more likely to observe coffee cup temperatures to be around 178ºF than 168ºF or 192ºF for Starbucks. This curve is called a Normal Distribution.
Much of the data we see around us (weights, heights, scores, IQ levels etc.) is distributed roughly in the shape of a bell. So the normal distribution is a very common and convenient way to think about your data. Normal distributions can be described using only two values: Mean and Standard Deviation. Mean tells us the value most likely to occur, while standard deviation tells us how spread out the other data points or observations are, i.e. the variability in the data.
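To make the "two values" idea concrete, here is a minimal Python sketch using the standard library's statistics.NormalDist. The mean of 178ºF comes from the Starbucks example above; the standard deviation of 4ºF is an assumption picked only for illustration, not a measured value.

```python
from statistics import NormalDist

# Mean taken from the example above; the standard deviation is an ASSUMED value.
starbucks = NormalDist(mu=178.0, sigma=4.0)

def density(temp_f: float) -> float:
    """Height of the bell curve at a given temperature (in degrees F)."""
    return starbucks.pdf(temp_f)

# Temperatures near the mean have a higher density than those at the extremes.
for temp in (168, 178, 192):
    print(f"{temp}F: density = {density(temp):.5f}")
```

With these two numbers fixed, the whole curve is determined, which is why the mean and standard deviation are all we need to describe it.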
We will use a z-test for our hypothesis testing. Our next step is to standardize the distribution by transforming it into a z distribution. A z-score gives an idea of how far from the mean a data point is. More technically, it is a measure of how many standard deviations below or above the mean a raw score is.
Here is the z-score formula, where x is the raw score, x̄ is the sample mean and s is the sample standard deviation: z = (x − x̄) / s
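As a quick illustration, the standardization z = (x − x̄) / s can be sketched in a few lines of Python. The function name z_score and the sample values (a mean of 178ºF and a standard deviation of 4ºF) are assumptions used only for this example.

```python
def z_score(x: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (x - mean) / std_dev

# Assumed sample mean of 178F and assumed standard deviation of 4F.
print(z_score(186.0, 178.0, 4.0))  # 2.0  (two standard deviations above the mean)
print(z_score(172.0, 178.0, 4.0))  # -1.5 (one and a half below the mean)
```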
The further the z-score is from zero, in either direction, the less likely the result is to have happened by chance and the more likely it is to be meaningful. As the next step, we select the level of statistical significance (alpha) against which the p-value will be compared.
A p-value is the probability of observing results at least as extreme as those measured, assuming the null hypothesis is true. Whether or not the result can be called statistically significant depends on the significance level (known as alpha) we establish before we begin the experiment. If the observed p-value is less than alpha, then the results are statistically significant.
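The decision rule can be sketched as follows. This is a minimal example assuming a one-sided test (matching the "hotter than" claim) and a standard normal distribution under the null hypothesis; the function names are hypothetical.

```python
from statistics import NormalDist

def one_sided_p_value(z: float) -> float:
    """P(Z >= z) for a standard normal Z, i.e. assuming the null hypothesis is true."""
    return 1.0 - NormalDist().cdf(z)

def is_significant(p_value: float, alpha: float = 0.05) -> bool:
    """Apply the rule from the text: significant when the p-value is below alpha."""
    return p_value < alpha

p = one_sided_p_value(2.0)  # z = 2.0 is an illustrative value, not a measured one
print(f"p = {p:.4f}, significant at alpha = 0.05: {is_significant(p)}")
```

Note that alpha is fixed before looking at the data, while the p-value is computed from it.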
The choice of alpha depends on the situation and the field of study, but the most commonly used value is 0.05, meaning we accept a 5% chance of seeing results this extreme purely by chance when the null hypothesis is true. As an extreme example, the physicists who discovered the Higgs boson used a threshold of about 0.0000003, or a 1 in 3.5 million chance that noise alone would produce such a result. In a business scenario, researchers may choose a higher alpha, that is, a less strict threshold for statistical significance.
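The 1 in 3.5 million figure lines up with the "five sigma" convention used in particle physics, i.e. a z-score of 5. The text above does not name that convention, so treat the link as background; the short sketch below only checks the arithmetic using the standard normal tail.

```python
from statistics import NormalDist

def upper_tail(z: float) -> float:
    """One-sided tail probability P(Z >= z) for a standard normal Z."""
    # By symmetry P(Z >= z) == P(Z <= -z); this form avoids subtracting from 1.
    return NormalDist().cdf(-z)

five_sigma_p = upper_tail(5.0)
print(f"p = {five_sigma_p:.2e}, about 1 in {1 / five_sigma_p:,.0f}")

# For comparison, the one-sided z-score that corresponds to alpha = 0.05.
z_for_alpha_05 = NormalDist().inv_cdf(1 - 0.05)
print(f"alpha = 0.05 corresponds to z of about {z_for_alpha_05:.3f}")
```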
We now run the “z-Test: Two Sample for Means” to get from a z-score on the normal distribution to a p-value. In MS Excel, it is a part of the “Analysis ToolPak” add-in under “Data Analysis”. More details on how to calculate using Excel here. We can also do it manually by using the following formula:
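One common form of the two-sample z statistic is z = (x̄1 − x̄2 − d) / sqrt(σ1²/n1 + σ2²/n2), where d is the hypothesized difference in means (usually 0) and the variances are treated as known. Whether this matches the exact formula originally shown here is an assumption. The Python sketch below implements that form; the function name and all summary statistics other than the sample size of 50 and the 178ºF mean are hypothetical and are not the numbers behind the Excel output.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(mean_a, var_a, n_a, mean_b, var_b, n_b, hypothesized_diff=0.0):
    """Return (z, one-sided p-value) for the alternative mean_a - mean_b > hypothesized_diff.

    Variances are treated as known, which is what a z-test assumes.
    """
    standard_error = sqrt(var_a / n_a + var_b / n_b)
    z = (mean_a - mean_b - hypothesized_diff) / standard_error
    p_one_sided = 1.0 - NormalDist().cdf(z)
    return z, p_one_sided

# HYPOTHETICAL inputs: 50 cups each, means of 178F and 176F, variance of 16 for both.
z, p = two_sample_z_test(178.0, 16.0, 50, 176.0, 16.0, 50)
alpha = 0.05
print(f"z = {z:.2f}, one-sided p = {p:.4f}, reject null: {p < alpha}")
```

With real data, the means and variances would come from the two samples of temperature readings rather than from made-up values.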
From Excel, we get the following output:
The p-value is less than the significance level we had selected a priori. So we can reject the null hypothesis: the samples support the claim that Starbucks coffee is hotter than McDonald’s.