A/B Testing: A Practical Guide
A/B testing has become an almost mandatory skill for marketers who need to evaluate how well different approaches to the business work. Whether you are an e-commerce manager testing consumer response to new site functionality or a CRM manager testing the effect of a new customer program, A/B testing is a foundational element of decision making.
While there is a good amount of published literature on the basics of A/B testing, we wanted to offer a practical guide to the fundamental principles that underpin A/B testing and the significance of its results, and to describe methods for running statistically sound tests.
This article covers topics relevant to A/B testing in e-commerce, email marketing, customer relationship management (CRM) and general consumer responses to offerings.
The basic idea behind A/B testing is to compare two versions that are identical except for one variation that might affect a user’s behavior. Version A might be the version currently in use (control), while version B is modified in some respect (treatment). The goal is to measure a few key metrics of interest (such as conversion rate or revenue per email) on both groups, which receive emails that are identical except for some change in the content or the subject line.
In this guide we will use the example of whether featuring products in email campaigns yields more revenue per email open. How do you go about testing if this change is desirable?
The basic A/B testing approach tells us to split users into two groups: the test group will be exposed to products (treatment) whereas the control group won’t. While it sounds fairly simple, there are immediate questions that turn out to be non-trivial to answer:
- Should I do a 50/50 split of users or an 80/20 split?
- If I don’t do a 50/50 split, how big should the groups be?
- Do I run the A/B test across multiple campaigns? If so, how many campaigns should be subject to the A/B test?
- Do I change my split for every email campaign or keep it fixed?
The math behind A/B tests
To use an example, let’s choose our metric of interest to be conversions per opened email (customer purchase transactions divided by the number of opened emails).
How many users to test
How can we calculate how many users should be subject to the treatment, e.g., exposed to products in the email, and how many email campaigns should the A/B test span?
The key to a valid test is achieving “statistical significance”, that is, enough confidence to either reject or fail to reject the null hypothesis that the treatment does not affect conversions per open. The basic idea behind statistical significance comes down to narrowing the confidence intervals for the metric of interest (in our case, conversions per open). If our treatment (placing products into emails) either improves or degrades our metric of interest, then, given enough data, the confidence intervals for the two groups should not overlap. The way we make our confidence intervals narrow enough that they do not overlap is by collecting many observations! In this case, it means having enough opened emails to achieve a high degree of confidence.
How do we compute the confidence interval?
The computation depends on the metric being measured. If our metric of interest is conversions per open, the observed value is:

$$p = \frac{\text{conversions}}{\text{opened emails}}$$

Then, for a large enough number of opened emails, the 95% confidence interval follows from the normal distribution:

$$p \pm 1.96 \sqrt{\frac{p(1 - p)}{n}}$$
In this formula, n is the number of opened emails and 1.96 is the z-score associated with a 95% confidence level. The formula gives an interval that is expected to contain the actual “true” value of the metric with 95% confidence. In other words, if we were to re-run the experiment over and over again and compute this interval each time, about 95% of those intervals would contain the true conversions per open.
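As a minimal sketch of the interval described above, the following Python function computes the normal-approximation confidence interval for conversions per open. The function name, parameter names, and sample counts are illustrative assumptions rather than part of the original analysis.

```python
import math


def proportion_confidence_interval(conversions, opens, z=1.96):
    """Normal-approximation confidence interval for conversions per open.

    z=1.96 corresponds to a 95% confidence level.
    """
    if opens <= 0:
        raise ValueError("opens must be positive")
    p = conversions / opens
    half_width = z * math.sqrt(p * (1 - p) / opens)
    return p - half_width, p + half_width


# Hypothetical campaign: 440 conversions out of 200,000 opened emails (p = 0.22%).
low, high = proportion_confidence_interval(conversions=440, opens=200_000)
print(f"95% interval: {low:.4%} to {high:.4%}")
```

The approximation is only reasonable when the number of opens is large, which is the regime this guide is concerned with.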
How many email opens do we need to be 95% confident that this is a true uplift?
The most important thing to observe about conversions per open is that the observed value p is usually very small: 0.2–0.3% (a click-to-open rate of 10% and a conversion rate of 2% yield 0.2% conversions per open). Suppose we introduce a 10% uplift, increasing conversions per open from 0.2% to 0.22%. To achieve statistical significance, we need the lower bound of the confidence interval to be higher than 0.2%, assuming 0.2% is the true conversion rate per email open for the control group (which in practice will have its own associated confidence interval). So the number of opened emails must be greater than the following value (solving for n in the original confidence interval formula, with $p_0 = 0.2\%$ denoting the control group’s rate):

$$n > \frac{1.96^2 \, p(1 - p)}{(p - p_0)^2}$$

If the observed conversion rate per opened email for the test group is p = 0.22%, n must be greater than 210,823, which means we need more than 200K opened emails to be 95% confident that there is a real uplift over the control group!
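The calculation above can be sketched in Python as follows. It requires the lower bound of the test group's interval to clear the control rate and, like the text, treats the control rate as known exactly. The function and parameter names are our own.

```python
def min_opens_for_uplift(p_test, p_control, z=1.96):
    """Opened emails needed so the lower confidence bound for the test group
    exceeds the control rate (control rate treated as known exactly)."""
    if p_test <= p_control:
        raise ValueError("p_test must be greater than p_control")
    return z ** 2 * p_test * (1 - p_test) / (p_test - p_control) ** 2


bound = min_opens_for_uplift(p_test=0.0022, p_control=0.0020)
print(f"n must exceed {bound:,.0f} opened emails")  # about 210,823
```

Because the bound scales with the inverse square of the gap between the two rates, halving the uplift you want to detect roughly quadruples the required number of opens.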
What about the impact of order value?
Now, if we were to measure revenue per email open instead of conversions per open, we would need more opened emails to reach statistical significance. Each user’s order value introduces another variable into the mix (with its own distribution) and creates more variance. Hence, a proper test needs a bigger sample of events to achieve statistical significance. We will not go into depth here about how to model this statistically (it would require a lot more math and assumptions), but keep in mind the increased variance associated with order value when measuring revenue, and allocate more email volume to the test.
Because properly testing the revenue impact of potential improvements is so important, we want to offer a heuristic for roughly estimating how many additional samples (email opens) are needed for an accurate test. With several simplifying assumptions in place, the key variables to examine are your average order value (AOV) and the variance of your order value distribution (the AOV feeds directly into the revenue per email open calculation). The more variance you have in order value, the larger your test size needs to be. As a rule of thumb, for cases where the standard deviation of order value is not much larger than the AOV, we suggest running the test with twice as many opened emails as required for conversion rate. For example, if you have an AOV of $100 and your standard deviation is $50 (meaning that 95% of your orders fall in the range of $0 to $200 for a normally distributed order value), then, using the example from above, you will need to increase the number of email opens tested by 100% (to roughly 400,000 opens).
This rule of thumb gives a rough approximation of how many emails are required for a proper test. There is an important caveat, however: whether you have a long tail of high-value orders. If you do, the variability will be quite high, and so will the required test size. This points to the necessity of ensuring that your highest lifetime value (LTV) customers are split properly (see next section).
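One way to sanity-check the rule of thumb is to model revenue per open as a conversion indicator multiplied by an order value that is independent of whether the user converts. That independence is a simplifying assumption of this sketch, not something established above. Under it, the required number of opens grows by roughly 1 + (standard deviation / AOV)² relative to the conversion-only test. Doubling is therefore a conservative allowance whenever the standard deviation does not exceed the AOV, while a long tail of high-value orders pushes the factor well past 2. The function name and inputs are hypothetical.

```python
def revenue_sample_inflation(aov, order_sd, conversions_per_open):
    """Approximate ratio of opens needed for a revenue-per-open test versus a
    conversions-per-open test, for the same relative uplift.

    Assumes revenue per open = (1 if converted else 0) * order value, with the
    order value independent of conversion (an assumption of this sketch).
    """
    if aov <= 0:
        raise ValueError("aov must be positive")
    if not 0 < conversions_per_open < 1:
        raise ValueError("conversions_per_open must be between 0 and 1")
    p = conversions_per_open
    cv_squared = (order_sd / aov) ** 2  # squared coefficient of variation
    return (1 + cv_squared - p) / (1 - p)


for sd in (50, 100, 300):
    factor = revenue_sample_inflation(aov=100, order_sd=sd, conversions_per_open=0.0022)
    print(f"AOV $100, standard deviation ${sd}: about {factor:.2f}x the opens")
```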
Nature of users: Pareto Distribution
As if things were not complex enough, there is another level of complexity to A/B testing: not all email subscribers are equal. Each user shops with a different frequency and spends a different amount of money per order. Revenue distribution among users follows a classical Pareto distribution (also known as a power law distribution): 20% of all users generate 80% of all revenue, the most active 20% of that top 20% generate 80% of that 80%, and so on (for the more mathematically inclined readers: $100p^n\%$ of all users generate $100(1 - p)^n\%$ of all revenue, where $p$ depends on the alpha parameter of the actual Pareto distribution). This implies that $(20\%)^3 = 0.8\%$ of all users generate $(80\%)^3 = 51.2\%$, that is, more than 50%, of the total revenue. As a result, if you are running an A/B test on 100,000 subscribers, it becomes extremely important to carefully split the 800 highest-LTV users. If, for whatever reason, 600 of the 800 users get allocated to the control group, your control group will get a massive 25% revenue boost, which will completely skew the results!
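The arithmetic behind the 0.8% and 25% figures can be reproduced with a short sketch. It assumes, for illustration only, that the top users are equally valuable among themselves and that all remaining users are split evenly between the two groups. The function names are our own.

```python
def top_segment_shares(levels, user_fraction=0.2, revenue_fraction=0.8):
    """Apply the 80/20 rule `levels` times: returns (share of users, share of revenue)."""
    return user_fraction ** levels, revenue_fraction ** levels


def control_revenue_boost(top_revenue_share, top_share_in_control):
    """Relative revenue gain of the control group compared with an even split,
    assuming everyone outside the top segment is split 50/50."""
    rest = 1 - top_revenue_share
    control_revenue = top_revenue_share * top_share_in_control + rest * 0.5
    return control_revenue / 0.5 - 1


users, revenue = top_segment_shares(3)
boost = control_revenue_boost(revenue, top_share_in_control=600 / 800)
# prints: 0.8% of users, 51.2% of revenue, control boost 25.6%
print(f"{users:.1%} of users, {revenue:.1%} of revenue, control boost {boost:.1%}")
```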
Now that we’ve gone through the basic math behind confidence intervals, statistical significance, and the distribution of revenue over the user population, let’s answer the questions we posed at the beginning of the post: do I need to run A/B tests across multiple campaigns? And if I do, should I change my split for every email campaign or keep it fixed?
As we’ve shown earlier, it takes more than 200,000 opened emails to prove a 10% uplift in conversions per open. It takes significantly more email volume to prove an uplift in revenue per open. Often it’s hard to reach this much volume with a single campaign. Say you have 100K subscribers: 50% are allocated to the control group and 50% to the test group. At a 20% open rate, the test group will generate 10K email open events. This is clearly not enough for statistical significance, so we need to run multiple campaigns to achieve confidence in the results. Even when the volume of a single email campaign is large, in practice it’s a good idea to run A/B tests across multiple campaigns to offset temporal variations and see how results hold up over time.
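To turn the volume estimate into a campaign count, a rough sketch like the one below can help. It assumes every campaign is sent to the full list, the open rate stays constant, and each open counts as a separate observation, as the example above does. The function and parameter names are hypothetical.

```python
import math


def campaigns_needed(required_opens, subscribers, test_share, open_rate):
    """Number of campaigns needed for the test group to accumulate the required opens."""
    opens_per_campaign = subscribers * test_share * open_rate
    if opens_per_campaign <= 0:
        raise ValueError("each campaign must produce at least some opens")
    return math.ceil(required_opens / opens_per_campaign)


# 100K subscribers, 50/50 split, 20% open rate -> 10K test-group opens per campaign.
print(campaigns_needed(required_opens=210_824, subscribers=100_000,
                       test_share=0.5, open_rate=0.2))  # 22
```

With an uneven split such as 80/20, the smaller group accumulates opens more slowly, so it determines how many campaigns are needed.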
Furthermore, whether you run an A/B test across multiple campaigns or a single campaign, you need to think about your user split. As illustrated earlier, with a typical Pareto distribution, over 50% of the revenue will be generated by 0.8% of the user base (the disparity may not be as drastic in your case, but it will certainly be of a similar order of magnitude). As a result, with 100K subscribers, a bad split of your 800 most loyal users will clearly introduce a huge bias into the results.
There are two major approaches that marketers use to address the user split challenges when running A/B tests across multiple campaigns.
Carefully designed split
The best practice is to carefully design a user split. This means that we cannot just take an audience of users (even a pre-segmented audience, such as "everyone who made a transaction in the last year and opened at least one email in the past 3 months") and randomly assign each user to the control or test group. Instead, we have to sub-segment users into cells based on transaction recency and frequency (ignoring order value for now), for example:
- Users that made no transaction over the last year
- Users that made 1–3 transactions over the last year with the last transaction in the last 2 weeks
- Users that made 4–9 transactions over the last year with the last transaction in the last 2 weeks
- Users that made 10+ transactions over the last year with the last transaction in the last 2 weeks
- Users that made 1–3 transactions over the last year with the last transaction in weeks 3-10 (but no transaction in the last 2 weeks)
- Users that made 4–9 transactions over the last year with the last transaction in weeks 3-10 (but no transaction in the last 2 weeks)
- Users that made 10+ transactions over the last year with the last transaction in weeks 3-10 (but no transaction in the last 2 weeks)
- Users that made 1–3 transactions over the last year with the last transaction in the last 6 months (but no transaction in the last 10 weeks)
- Users that made 4–9 transactions over the last year with the last transaction in the last 6 months (but no transaction in the last 10 weeks)
- Users that made 10+ transactions over the last year with the last transaction in the last 6 months (but no transaction in the last 10 weeks)
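Viewed as a grid, this segmentation crosses three frequency bands (1–3, 4–9, and 10+ transactions over the last year) with three recency bands (last transaction in the last 2 weeks, in weeks 3–10, or between 10 weeks and 6 months ago), plus a separate cell for users with no transactions.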
Now within each of these cells, you have to randomly assign users 50/50 into test and control groups. Sub-segmenting users based on transaction frequency over the last year ensures that you’re evenly allocating users based on their loyalty. However, users are not static: there are constantly new users that get activated and there are old loyal users that churn. So in addition to transaction frequency, you need to evenly allocate users based on transaction recency. The sample allocation above ensures an even distribution between the control and test groups with respect to both transaction frequency and recency. However, it completely ignores the order value: if your order value has a high variance (e.g., if you sell furniture, appliances, or travel packages), it’s important to incorporate the user’s average order value as another variable in your sub-segmentation.
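A minimal sketch of such a stratified split is shown below. The user fields (`id`, `transactions_last_year`, `days_since_last_transaction`), the day cut-offs used for the recency bands, and the extra "older" cell for buyers whose last purchase was more than 6 months ago are all assumptions made for illustration. A real implementation would use whatever your customer data actually provides.

```python
import random
from collections import defaultdict


def cell_for(user):
    """Map a user to a (frequency band, recency band) cell."""
    n = user["transactions_last_year"]
    if n == 0:
        return ("none", "none")
    frequency = "1-3" if n <= 3 else "4-9" if n <= 9 else "10+"
    days = user["days_since_last_transaction"]
    if days <= 14:
        recency = "last 2 weeks"
    elif days <= 70:
        recency = "weeks 3-10"
    elif days <= 182:
        recency = "10 weeks to 6 months"
    else:
        recency = "older"  # not in the sample cells above; added so nobody is dropped
    return (frequency, recency)


def stratified_split(users, seed=0):
    """Randomly assign users 50/50 to test and control within each cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for user in users:
        cells[cell_for(user)].append(user["id"])

    assignment = {}
    for cell in sorted(cells):
        ids = cells[cell]
        rng.shuffle(ids)
        # Random offset so odd-sized cells do not always favor the same group.
        offset = rng.randrange(2)
        for i, user_id in enumerate(ids):
            assignment[user_id] = "test" if (i + offset) % 2 == 0 else "control"
    return assignment
```

Shuffling inside each cell and then alternating assignments keeps the two groups within one user of each other in every cell. To account for order value as well, an AOV band could be added as a third element of the cell key.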
If you designed your user split carefully, you can now do a “dry run” for 1–2 weeks without applying any treatment to either group, to confirm that the two groups generate roughly the same conversions per open and revenue per open. Once you have confirmed that the two groups are within 3–5% of each other, you can start the treatment for the test group and keep the user split unchanged for a few weeks (assuming you run multiple email campaigns per week). After a few weeks, you may need to do a new split and again confirm that the groups are equivalent with a new dry run.
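The dry-run comparison can be expressed as a small check. In this sketch the gap is measured relative to the control group's value, which is our assumption about how "within 3–5% of each other" is read. The function names and example counts are hypothetical.

```python
def conversions_per_open(conversions, opens):
    """Conversions divided by opened emails."""
    if opens <= 0:
        raise ValueError("opens must be positive")
    return conversions / opens


def dry_run_passes(control, test, tolerance=0.05):
    """Check whether two untreated groups perform within `tolerance` of each other.

    `control` and `test` are (conversions, opens) tuples from the dry-run period.
    """
    control_rate = conversions_per_open(*control)
    test_rate = conversions_per_open(*test)
    if control_rate == 0:
        raise ValueError("control group has no conversions to compare against")
    relative_gap = abs(test_rate - control_rate) / control_rate
    return relative_gap <= tolerance


print(dry_run_passes(control=(200, 100_000), test=(206, 100_000)))  # True (3% gap)
print(dry_run_passes(control=(200, 100_000), test=(230, 100_000)))  # False (15% gap)
```

With only a few hundred conversions per group, chance alone can move the gap by several percent, so the check is more meaningful on larger dry-run volumes. The same check applies to revenue per open.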
Random reshuffle for each campaign
For marketers with less access to data, there is a second, less precise option for running A/B tests across multiple campaigns. You can create a user split by randomly re-assigning each user to either group before each email campaign. Assuming that you have enough campaigns (the general rule of thumb is that you need at least 30 samples), individual differences between the user splits for each campaign will offset each other and get “washed out”.
While in theory this is a viable option and is often preferred by marketers because of its simplicity, in practice the test violates the fundamental principle of “all things held constant”, which is essential to A/B testing. When you do revenue and transaction attribution, it becomes hard to say with confidence what caused the user to make a purchase. On Monday, the user may be in the test group, while on Wednesday she is in the control group. If the user makes a purchase on Thursday, was this because of Wednesday’s campaign or Monday’s campaign? Last-click attribution says that the purchase was driven by Wednesday’s campaign, but in practice the user may have first seen the product in Monday’s campaign and, after a 2-day consideration period, clicked through Wednesday’s campaign to go straight to that product and buy it.
We have actually observed in client tests how random reshuffling may introduce a pattern where the control group’s results get stronger over time, while the test group maintains a steady performance. This is an indicator that information is leaking between the test group and the control group, and that the control group’s results may be getting biased by the treatment. This “cross-contamination” can reduce the technical validity of a test by affecting the control group’s performance.
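To get a feel for how quickly reshuffling exposes everyone to the treatment, the sketch below computes the share of users who have been in the test group at least once after a number of campaigns. It assumes an independent random assignment before every campaign and that every user receives every campaign. It quantifies exposure only and does not model the purchase behavior described above. All names are illustrative.

```python
import random


def share_ever_in_test(campaigns, test_share=0.5):
    """Expected share of users assigned to the test group at least once."""
    return 1 - (1 - test_share) ** campaigns


def simulate_share_ever_in_test(users, campaigns, test_share=0.5, seed=0):
    """Monte Carlo version of the same quantity, re-randomizing before each campaign."""
    rng = random.Random(seed)
    ever_treated = 0
    for _ in range(users):
        if any(rng.random() < test_share for _ in range(campaigns)):
            ever_treated += 1
    return ever_treated / users


print(f"{share_ever_in_test(5):.1%}")  # 96.9%
print(f"{simulate_share_ever_in_test(users=10_000, campaigns=5):.1%}")
```

After only five reshuffled campaigns, almost every user in the nominal control group has already seen the treatment at some point, which is consistent with the leakage pattern described above.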
While A/B tests sound simple in theory, there is a lot of science behind them. The best practices require:
- A careful split of users between the test group and the control group accounting for each user’s transaction frequency, recency, and order value
- Careful planning of the necessary email volume to achieve statistical significance of the results
- Performing a “dry run” for 1–2 weeks prior to launch of the A/B test to ensure the user split is fair (in case of a high AOV and infrequent purchases, you may need to look at the performance across multiple weeks prior to launch)
- Running A/B tests across multiple campaigns for a longer period of time (at least 2–3 weeks) to both maximize the number of samples (hence, achieve a tighter confidence interval) and offset any temporal biases