Introduction to A/B Testing: A Practical Guide
This article represents the first third of a recent white paper on A/B testing that Jetlore has published. You can access the entire white paper now for free (email registration required).
Nearly every CRM, Product, and Email Marketing team we know uses A/B testing to assess the impact of potential Martech investments and other significant changes they are contemplating. Given how essential A/B testing is in marketing today, we thought it would be helpful to offer a practical guide to the crucial elements of testing.
In this blog (1 of 3) we will cover:
- A review of the underlying math and statistics
- Methods to ensure that you are performing a statistically significant test
In the second blog (2 of 3), we will:
- Look at the e-commerce variables that have a significant impact (like AOV) and how to manage them
In the third blog (3 of 3), we will:
- Discuss techniques to carefully manage user splits
Throughout, we will be using a practical example that pulls these elements together.
A/B Testing: A Practical Guide (Part 1)
A/B testing is now almost a required skill for marketers who need to evaluate the efficacy of different approaches to business. Whether an e-commerce team is testing consumer response to new functionality or a CRM manager is testing the effect of a new customer program, A/B testing is a foundational element of decision making.
While there is a good amount of published literature on the basics of A/B testing, we thought it would be valuable to offer a practical guide to the fundamental principles that underpin A/B testing and the significance of its results, and to describe methods for executing statistically sound tests.
This article covers topics relevant to A/B testing in e-commerce, email marketing, customer relationship management (CRM) and general consumer responses to offerings.
The basic idea behind A/B testing is to compare two versions that are identical except for one variation that might affect a user's behavior. Version A might be the currently used version (control), while version B is modified in some respect (treatment). The goal is to measure a few key metrics of interest (such as conversion rate or revenue per email) on two groups whose experiences are identical except for that one change, for example in the content or the subject line of an email.
In this guide, we will use the example of whether featuring products in email campaigns yields more revenue per email open. How do you go about testing if this change is desirable?
Basic A/B testing tells us to split users into two groups: the test group will be exposed to products (treatment), whereas the control group won’t. While this sounds fairly simple, it immediately raises questions that are non-trivial to answer:
- Should I do a 50/50 split of users or an 80/20 split?
- If I don’t do a 50/50 split, how big should the groups be?
- Do I run the A/B test across multiple campaigns? If so, how many campaigns should be subject to the A/B test?
- Do I change my split for every email campaign or keep it fixed?
The math behind A/B tests
To use an example, let’s choose our metric of interest to be conversions per opened email (customer purchase transactions divided by the number of opened emails).
How many users to test
How can we calculate how many users should be subject to the treatment, e.g., exposed to products in the email, and how many email campaigns should the A/B test span?
The key to a valid test is achieving “statistical significance”: enough confidence to reject, or fail to reject, the null hypothesis that the treatment does not affect conversions per open. The basic idea comes down to narrowing the confidence intervals for the metric of interest (in our case, conversions per open). If our treatment (placing products into emails) truly improves or degrades the metric, then with enough data the confidence intervals for the two groups should not overlap. The way we make the confidence intervals narrow enough is by collecting more observations. In this case, that means having enough opened emails to achieve a high degree of confidence.
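To make the narrowing concrete, here is a minimal sketch that assumes the normal-approximation interval described in the next section. The helper name `ci_half_width` and the 0.2% rate are illustrative choices, not figures taken from a real campaign.

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval for a rate p seen over n opens."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.002  # assumed 0.2% conversions per open, for illustration only
for n in (10_000, 100_000, 1_000_000):
    hw = ci_half_width(p, n)
    # The interval shrinks in proportion to 1 / sqrt(n).
    print(f"n={n:>9,}: {p - hw:.4%} to {p + hw:.4%}")
```

Because the half-width scales with 1/sqrt(n), halving the interval takes four times as many opened emails.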
How do we compute the confidence interval?
The computation depends on the metric being measured. If our metric of interest is conversions per open and the observed value is:
$$p = \frac{\text{conversions}}{\text{opened emails}}$$
then for a large enough number of opened emails, the confidence interval is defined according to the Normal Distribution:
$$p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}$$
In this formula, n is the number of opened emails and 1.96 is a special constant, often referred to as the z-score, associated with the 95% confidence level. The formula gives the interval that is expected to contain the actual “true” value of the metric with 95% confidence. In other words, if we were to re-run the experiment over and over again and compute this interval each time, about 95% of those intervals would contain the true conversions per open.
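As a rough sketch, the interval can be computed directly from raw counts, assuming the standard normal-approximation interval for a proportion. The function names and the conversion counts below are hypothetical and chosen only to match the 0.2% and 0.22% rates discussed in this guide.

```python
import math
from typing import Tuple

def conversion_ci(conversions: int, opens: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for conversions per open."""
    p = conversions / opens
    margin = z * math.sqrt(p * (1 - p) / opens)
    return p - margin, p + margin

def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts: 0.20% for control and 0.22% for treatment.
control = conversion_ci(conversions=400, opens=200_000)
treatment = conversion_ci(conversions=440, opens=200_000)

print("control:  ", control)
print("treatment:", treatment)
print("overlap:  ", intervals_overlap(control, treatment))
```

With these made-up counts the two intervals still overlap, which shows why the control group's own interval matters when judging an uplift.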
How many email opens do we need to be 95% confident that an uplift is real?
The most important thing to observe about conversions per open is that the observed value p is usually very small: 0.2–0.3% (a click-to-open rate of 10% and a conversion rate of 2% yield 0.2% conversions per open). Suppose we introduce a 10% uplift, increasing conversions per open from 0.2% to 0.22%. To achieve statistical significance, we need the lower bound of the treatment group's confidence interval to be higher than 0.2%, assuming 0.2% is the true conversion rate per email open for the control group (which in practice will have its own associated confidence interval). Writing $p_0$ for the control rate and solving the original confidence interval formula for n, the number of opened emails must satisfy:
$$n > \frac{1.96^2 \, p(1-p)}{(p - p_0)^2}$$
If the observed conversion rate per opened email for the test group is p = 0.22% and $p_0$ = 0.2%, n must be greater than roughly 210,800; evaluating the variance term p(1 − p) at the 0.2% baseline instead gives 191,696. Either way, we need roughly 200K opened emails to be 95% confident that there is a real uplift over the control group!
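The calculation can be sketched in a few lines of Python. It assumes the requirement stated above: the lower end of the treatment interval must exceed the control rate, so n > z² · p(1 − p) / (p − p0)². The function name `min_opens` and the optional `p_for_variance` parameter are illustrative assumptions, not part of the white paper.

```python
import math
from typing import Optional

def min_opens(p_treatment: float, p_control: float, z: float = 1.96,
              p_for_variance: Optional[float] = None) -> float:
    """Lower bound on opened emails so the treatment interval clears p_control."""
    if p_for_variance is None:
        p_for_variance = p_treatment  # default: variance at the observed rate
    gap = p_treatment - p_control
    if gap == 0:
        raise ValueError("treatment and control rates must differ")
    return z ** 2 * p_for_variance * (1 - p_for_variance) / gap ** 2

# Variance evaluated at the observed 0.22% rate.
print(math.ceil(min_opens(0.0022, 0.002)))
# Variance evaluated at the 0.2% baseline, which reproduces the 191,696 figure.
print(math.ceil(min_opens(0.0022, 0.002, p_for_variance=0.002)))
```

Note how sensitive the result is to the size of the uplift: because the gap is squared in the denominator, detecting a 5% uplift instead of a 10% uplift needs roughly four times as many opens.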
Topics to be covered in the next article (2 of 3):
- Understanding the impact on testing of variance in Average Order Value (AOV). Sneak preview: if you have a large variance in order value, your test sample size needs to be considerably larger.
- The nature of the distribution of customers. The Pareto distribution (also known as a power law) of the lifetime value of your customers has important implications for managing user splits in testing. If not carefully managed, you will get skewed results.
Stay tuned for the next article in the series on A/B Testing: A Practical Guide for Marketers.