How many changes should you A/B test?

Optimization (CRO) teams are often required to balance the earning and learning potential of changes. This post explores some strategies that teams can use.

Bithika Mehra
5 min read · Sep 13, 2020

The earning vs. learning trade-off (Source: MIT Sloan Management Review)

You have a list of hypotheses to test in your learning plan. As you start prioritizing and building out your roadmap, you realize that running all these tests could take a long time. A really long time.

After all, the duration of every test is driven by the required sample size, and it is additionally recommended that tests run for at least 1–2 business cycles. Also, only a small percentage of tests succeed, so you'll likely test a few iterations of each hypothesis. Add to that the fact that you should A/A test your winning experience to confirm the results and rule out instrumentation effects.
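To put rough numbers on it, here is a quick back-of-envelope sketch in Python (every figure below is hypothetical, purely for illustration):

```python
# Back-of-envelope roadmap math; all numbers are hypothetical.
hypotheses = 20                  # items in the learning plan
iterations_per_hypothesis = 2    # most tests fail, so expect retries
weeks_per_test = 4               # driven by sample size and 1-2 business cycles
winners_to_validate = 3          # confirm winners with an A/A test
aa_validation_weeks = 2

total_weeks = (hypotheses * iterations_per_hypothesis * weeks_per_test
               + winners_to_validate * aa_validation_weeks)

print(f"Roughly {total_weeks} weeks (~{total_weeks / 52:.1f} years) of back-to-back tests")
```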

So, what do you do? You now have a few years' worth of tests on your roadmap!

You test one change at a time.

When you test one change at a time, you are able to attribute the improvement to the change you made. It is great for learning about your users and can inform future decisions. However, too small a change may not move the needle and the test experience may perform on par with the control.

And even if there is a positive or negative lift, it may be small enough that you will need to run the test for a long time to have confidence in the results. To illustrate, for a test page that gets 10,000 unique visitors per week and has a 5% conversion rate, here are the test durations:

Sample sizes and test durations for different levels of lift, assuming a one-tailed test, 80% power and a 95% confidence level (Source: https://abtestguide.com/abtestsize/)

If your test experience sees a lift of 3–5%, it will need to run for 10–27 weeks! So testing small changes such as button colors, font treatments, placement, images, sentences, words, etc. is a luxury afforded only to websites with high traffic.
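For the curious, the durations in the table come from a standard two-proportion sample-size calculation. Here is a minimal Python sketch using the same settings (one-tailed test, 80% power, 95% confidence); exact figures depend on the formula a given calculator uses, so treat these as ballpark numbers rather than a reproduction of the table:

```python
from scipy.stats import norm

def sample_size_and_weeks(baseline_cr, relative_lift, weekly_visitors,
                          alpha=0.05, power=0.80):
    """Visitors needed per variant and test duration in weeks,
    using the textbook two-proportion z-test formula (one-tailed)."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha)   # one-tailed 95% confidence
    z_beta = norm.ppf(power)        # 80% power
    n_per_variant = ((z_alpha + z_beta) ** 2
                     * (p1 * (1 - p1) + p2 * (1 - p2))
                     / (p2 - p1) ** 2)
    # Traffic is split 50/50 between control and variant.
    weeks = n_per_variant / (weekly_visitors / 2)
    return round(n_per_variant), round(weeks, 1)

for lift in (0.03, 0.05, 0.10, 0.20):
    n, weeks = sample_size_and_weeks(0.05, lift, weekly_visitors=10_000)
    print(f"{lift:.0%} relative lift: ~{n:,} visitors per variant, ~{weeks} weeks")
```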

You test a few changes together.

If you test a few changes together, it is hard to attribute the improvement to the change that actually moved the needle. The test experience may perform better than, worse than, or the same as the original experience. In all these cases, however, some underlying changes may have a positive impact while others have a negative impact, canceling each other out. So, in such tests, the learning suffers.

At the same time, when you swing for the fences by making bigger and bolder changes, there is a greater likelihood of seeing a lift. Given the required sample sizes calculated above, you dramatically save time and can move on to other tests on your roadmap.

So what do you do?

As Peep Laja, founder of CXL, says, start by asking

“Are we in the business of science or the business of making money?”

If you get fewer than 100K monthly visitors, you need to swing for the fences and test bigger changes: not in a reckless manner, but in a data-driven one. But how do you continue to learn? Peep recommends the following strategies:

Ensure each change addresses a specific problem

Let me explain. Say you work for a subscription business with a three-step sign-up flow: Personal Details, Plan Selection and Payment. On the Plan Selection page, a user has to choose from 5 different plans.

Three-step sign-up flow

Through your funnel analysis (a quick sketch of which follows the two problems below), you observe a sizable drop-off on the Plan Selection page. You focus your efforts on understanding why. You conduct usability studies, surveys or in-depth interviews with users who abandoned sign-up, and learn that they find the page confusing. On further probing, you uncover two main problems:

Problem #1: Users find it hard to understand the differences between each of the plans.

Problem #2: Users are confused about how the pricing will work.
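As an aside, the funnel analysis that surfaces this drop-off is simple arithmetic. Here is a minimal sketch with made-up step counts, purely for illustration:

```python
# Hypothetical weekly visitor counts at each step of the sign-up flow.
funnel = {
    "Personal Details": 10_000,
    "Plan Selection": 6_500,
    "Payment": 2_100,
    "Completed": 1_700,
}

steps = list(funnel.items())
for (step, entered), (next_step, continued) in zip(steps, steps[1:]):
    rate = continued / entered
    print(f"{step} -> {next_step}: {rate:.0%} continue, {1 - rate:.0%} drop off")
```

Here the Plan Selection step loses the largest share of visitors, which is what points your research at that page.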

In such a case, Peep suggests tackling each problem separately. So, if solving problem #1 is likely to have a higher impact, start with that. Then identify the changes you can make here:

  1. Improve layout and design to enable users to compare more easily
  2. Simplify the copy to explain the plans better
  3. Reduce the number of plans if there are plans that users are clearly not opting for

In this case, depending on the traffic to the page, you can test one, two or all three together as illustrated below:

Control experience with existing plans selection page

Vs.

Test experience with multiple changes made on plans selection page to solve one customer problem

All changes support the same hypothesis

Continuing with the same example, suppose you now decide to tackle customer problem #2. Your hypothesis is that by showing per-shipment pricing, users will understand pricing better and will be more likely to sign up for the service. Pricing details appear on a few different pages: the Pricing page, the Plan Selection page and the Plan Summary on the Payment page.

So you will need to make changes across all three pages. Should you test them separately? No! Lack of consistency can confuse users and hurt more than help. A better approach is to A/B test the new pricing details on all three pages together against the control, as illustrated below:

Control experience with existing pricing, plan selection and payment pages

Vs.

Test experience with new pricing, plan selection and payment pages
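One practical note: testing the changes together means a given user must see the same arm on all three pages. Most testing tools handle this for you, but here is a minimal sketch of the usual approach, deterministic bucketing by user ID (the experiment name and the 50/50 split here are illustrative assumptions):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "per-shipment-pricing") -> str:
    """Deterministically bucket a user so they see the same arm on the
    pricing, plan selection and payment pages."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100               # 0-99
    return "test" if bucket < 50 else "control"  # 50/50 split

# The same user gets the same experience on every page of the flow.
for page in ("pricing", "plan-selection", "payment"):
    print(page, assign_variant("user-42"))
```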

Once you have better-performing versions of the flow, you can then run multivariate tests (MVTs) to identify the impact of each change. Another option is to run an Existence Test, where you remove a set of elements from the page and A/B test against the control. If the test experience performs worse than the control, you can attribute the impact to the removed content.
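For a sense of why MVTs are usually reserved for high-traffic pages: a full-factorial MVT tests every combination of changes, so traffic is split across many cells. A rough sketch, reusing the plan selection changes from earlier (the per-cell sample size is a placeholder, not a computed figure):

```python
from itertools import product

# The three candidate changes from the plan selection example.
changes = {
    "layout": ["old", "new"],
    "copy": ["old", "simplified"],
    "plan_count": [5, 3],
}

cells = list(product(*changes.values()))
print(f"{len(cells)} combinations to fill")  # 2 x 2 x 2 = 8

# With, say, ~25,000 visitors needed per cell (placeholder figure),
# the total sample is several times what a single A/B test would need.
per_cell = 25_000
print(f"~{len(cells) * per_cell:,} visitors in total")
```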

When scoping your tests, it is very important to weigh the earning potential against the learning potential. For websites not endowed with high traffic, inconclusive tests can clog the testing pipeline. So it's usually better to test bigger changes, backed by your data, that can have a bigger impact.

Additional Resources:

Why “Only One Change Per Test” Isn’t Good Advice by Alex Birkett (CXL)

A/B Testing for Big Wins — When You Should Do It & How You Should Do It by Smriti Chawla (VWO)

How to Run A/B Tests by Peep Laja

Sample Size Calculator
