# Statistics Basics for A/B Testing: Part 3

In the last few posts, I introduced the basic statistical concepts you need to know for A/B testing. In this final post of the series, I will cover statistical power, the types of errors, and a brief look at sample size calculation.

In my last post, we rejected the null hypothesis that “Starbucks coffee is not hotter than McDonald’s” because the observed p-value of our z-test was lower than the 0.05 significance level we had set a priori.

We also learned that the p-value is the probability of observing results at least as extreme as those measured, assuming the null hypothesis is true (i.e. assuming the results arose by random chance). If the observed p-value is less than alpha, the results are statistically significant: the probability of seeing these results due to random chance alone is very low.

When we set a 5% significance level (or a 95% confidence level) and the observed p-value is less than that, it means that if we repeated the experiment 100 times under the null hypothesis, we would expect to see results this extreme fewer than 5 times by chance alone. So, the significance level (alpha) essentially sets a threshold for the false positive rate that the experimenter or researcher is willing to accept. A false positive is also called a Type I error.
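To make this concrete, here is a minimal sketch of the kind of two-proportion z-test that produces a p-value to compare against alpha. The conversion numbers below are made up for illustration; they are not from the coffee example.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical data: conversions and visitors for control (A) and variant (B)
x_a, n_a = 200, 10000
x_b, n_b = 250, 10000

p_a, p_b = x_a / n_a, x_b / n_b
p_pool = (x_a + x_b) / (n_a + n_b)          # pooled conversion rate under the null
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))         # two-tailed p-value

alpha = 0.05
print(f"z = {z:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Statistically significant: reject the null hypothesis")
else:
    print("Not significant: fail to reject the null hypothesis")
```

With these particular numbers the p-value comes out below 0.05, so we would reject the null, exactly the decision rule described above.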

Another type of error we could encounter is failing to reject the null hypothesis when we really should, that is, concluding there is no difference when a difference truly exists. This is called a Type II error, and its rate is the false negative rate (Beta).
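As a rough illustration (the traffic and conversion numbers here are hypothetical, not from the post), we can simulate many A/B tests in which the variant truly is better and count how often an underpowered test fails to detect it. That miss rate is an estimate of Beta, the Type II error rate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
alpha = 0.05
n = 1000                     # visitors per variant (deliberately small)
p_a, p_b = 0.020, 0.025      # the variant truly converts better
n_sims = 2000

misses = 0
for _ in range(n_sims):
    x_a = rng.binomial(n, p_a)
    x_b = rng.binomial(n, p_b)
    pool = (x_a + x_b) / (2 * n)
    se = np.sqrt(pool * (1 - pool) * (2 / n))
    z = (x_b / n - x_a / n) / se
    p_val = 2 * (1 - norm.cdf(abs(z)))
    if p_val >= alpha:       # test fails to detect the real effect
        misses += 1

beta = misses / n_sims
print(f"Estimated Type II error rate (Beta): {beta:.2f}")
print(f"Estimated power (1 - Beta): {1 - beta:.2f}")
```

With only 1,000 visitors per variant, most of the simulated tests miss a real 25% relative lift, which is exactly the "letting winners go" problem discussed next.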

The chance of a Type II error can be reduced by increasing the statistical power (1 - Beta) of the experiment or A/B test. Statistical power is the likelihood that a test will detect an effect when there is an effect to be detected. A good way to understand statistical power metaphorically is to think of a fishing net, as explained by Georgi Georgiev (analyticstoolkit.com and instructor at CXL Institute).

A test with low power (high Beta) can detect only large effect sizes (big fish), while a test with high power can detect small effect sizes (small fish) as well. If your test has low power, you are letting lots of winners go.
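To put numbers on the fishing-net metaphor, here is a simplified sketch of the approximate power of a two-sided, two-proportion z-test at two different effect sizes. The function name, baseline rate, and traffic figure are my own illustrative choices:

```python
from math import sqrt
from scipy.stats import norm

def approx_power(p_base, lift, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    p_base: baseline conversion rate
    lift:   relative lift of the variant (e.g. 0.25 for +25%)
    n:      visitors per variant
    """
    p_var = p_base * (1 + lift)
    se = sqrt(p_base * (1 - p_base) / n + p_var * (1 - p_var) / n)
    z_crit = norm.ppf(1 - alpha / 2)
    effect = abs(p_var - p_base)
    return norm.cdf(effect / se - z_crit)

n = 10000
small_fish = approx_power(0.02, 0.05, n)   # 5% relative lift
big_fish = approx_power(0.02, 0.25, n)     # 25% relative lift
print(f"Power to catch a +5% lift:  {small_fish:.0%}")
print(f"Power to catch a +25% lift: {big_fish:.0%}")
```

With the same net (same sample size), the big fish is far more likely to be caught than the small one; catching the small fish too requires a finer net, i.e. more traffic.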

Here are the four different possibilities resulting from conducting an experiment:

| | Variant better in reality | Variant not better in reality |
| --- | --- | --- |
| Test: variant wins | Correct decision | False positive (Type I error) |
| Test: variant loses | False negative (Type II error) | Correct decision |

The two boxes marked “Correct decision” are the cases where the measured results match reality. If the new version outperforms the current version in the test, and that is also true in reality, the test is accurate and drives the right decision. However, when the new version outperforms the current version in the test but not in reality, we get a false positive. And when the new version underperforms the current version in the test but not in reality, we get a false negative.

The two errors can also be displayed visually as follows:

It is hard to completely avoid these errors when you are experimenting. However, you can reduce the risk by testing on an adequately large sample, using at least 80% power and a confidence level of 90–95% as a rule of thumb. Larger sample sizes reduce the variance of your estimates, making real effects easier to detect. So, calculate the required sample size before running the test.

There are lots of free sample size calculators (links below) that make the calculation easy without requiring a statistician (unless you are testing multiple variables or in other special cases). In addition to setting power and confidence level, you will be asked to enter your traffic, baseline conversion rate, number of variants, the minimum effect size or lift you hope to detect, and the direction of the effect (1-tailed vs. 2-tailed).
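For readers curious what such a calculator does under the hood, here is a simplified sketch using the standard normal-approximation formula for comparing two proportions. The function name and the example inputs are illustrative, not taken from any particular calculator:

```python
from math import sqrt, ceil
from scipy.stats import norm

def sample_size_per_variant(p_base, mde, alpha=0.05, power=0.80, two_tailed=True):
    """Approximate visitors needed per variant for a two-proportion z-test.

    p_base: baseline conversion rate
    mde:    relative minimum detectable effect (e.g. 0.10 for a +10% lift)
    """
    p_var = p_base * (1 + mde)
    z_alpha = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2)

# A 2% baseline rate: detecting a small lift takes far more traffic
n_small_lift = sample_size_per_variant(0.02, 0.10)
n_big_lift = sample_size_per_variant(0.02, 0.20)
print(f"Per-variant sample for +10% lift: {n_small_lift:,}")
print(f"Per-variant sample for +20% lift: {n_big_lift:,}")
```

Notice how halving the minimum detectable effect roughly quadruples the required sample, which is why the lift you choose to chase is the single biggest driver of test duration.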