A/B Testing: A Practical Guide for Marketers (Part 3 of 3)
In this final installment of our series on practical A/B testing, we cover best practices for designing high-quality tests and carry our promotional-email example through to its conclusion.
We delve into the nuances of designing a high-quality user split and what to do with the split when the A/B test spans multiple email campaigns. We discuss the tradeoffs of different methods and examine how user purchase recency and frequency affect the balance of the splits. Please read on, and if you want to access the entire whitepaper, click here.
User Splits: Why They Are Important
Now that we’ve gone through the basic math behind confidence intervals, statistical significance, and the distribution of revenue over the user population, let’s answer the questions we posed at the beginning of the series: do I need to run A/B tests across multiple campaigns? And if I do, should I change my split for every email campaign or keep it fixed?
As we’ve shown earlier, it takes at least 200,000 opened emails to prove a 10% uplift in conversions per email. It takes significantly more email volume to prove an uplift in revenue per email. Often it’s hard to reach this much volume with a single campaign. Say you have 100K subscribers: 50% get allocated to the control group and 50% to the test group. At a 20% open rate, the test group will generate 10K email open events. This is clearly not enough for statistical significance, so we need to run multiple campaigns to achieve confidence in the results. Even when the volume of a single email campaign is large, in practice it’s a good idea to run A/B tests over multiple campaigns to offset temporal variations and see how results hold up over time.
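For concreteness, here is how the required email volume can be estimated with a standard two-proportion power calculation. The 2% baseline conversions-per-open rate, the 5% significance level, and the 80% power are illustrative assumptions, not figures from the whitepaper:

```python
import math

def opens_needed(p_base, uplift, alpha=0.05, power=0.80):
    """Opens needed per group to detect a relative uplift in
    conversions per open (two-proportion z-test, normal approximation)."""
    p1, p2 = p_base, p_base * (1 + uplift)
    z_a, z_b = 1.959964, 0.841621        # z for two-sided alpha=0.05, power=0.80
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

n = opens_needed(0.02, 0.10)  # 2% baseline (assumed), 10% relative uplift
print(n, "opens per group,", 2 * n, "opens total")
```

With a 2% baseline this comes out to roughly 81K opens per group (about 160K total), which is consistent with the order of magnitude cited above; the exact number depends on your baseline conversion rate.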
Furthermore, whether you run an A/B test across multiple campaigns or a single campaign, you need to think about your user split. As illustrated earlier, with a typical Pareto distribution, over 50% of the revenue will be generated by roughly 0.8% of the user base (the disparity in your case may not be as drastic, but it will exhibit a similar order of magnitude). As a result, with 100K subscribers, a bad split of your 800 most loyal users will clearly introduce a huge bias into the results.
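This concentration figure can be sanity-checked with the closed-form Lorenz share of a Pareto distribution; the shape parameter alpha = 1.16 (the classic "80/20" value) is an illustrative assumption, not a measured fit:

```python
def top_share(q, alpha=1.16):
    """Fraction of total revenue held by the top fraction q of users
    under a Pareto(alpha) revenue distribution (closed-form Lorenz share).
    alpha=1.16 is the classic 80/20 shape, used here for illustration."""
    return q ** ((alpha - 1) / alpha)

print(round(top_share(0.20), 2))   # top 20% of users -> ~80% of revenue
print(round(top_share(0.008), 2))  # top 0.8% of users -> just over half
```

Under this shape, the top 0.8% of users indeed account for slightly over 50% of revenue, matching the claim above.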
There are two major approaches that marketers use to address the user split challenges when running A/B tests across multiple campaigns.
Carefully designed split
The best practice is to carefully design a user split. This means we cannot just take an audience of users (even a pre-segmented audience, such as "everyone who made a transaction in the last year and opened at least one email in the past 3 months") and randomly assign each user to the control or test group. We have to sub-segment users into cells based on transaction recency and frequency (let’s ignore order value for now):
- Users that made no transaction over the last year
- Users that made 1–3 transactions over the last year with the last transaction in the last 2 weeks
- Users that made 4–9 transactions over the last year with the last transaction in the last 2 weeks
- Users that made 10+ transactions over the last year with the last transaction in the last 2 weeks
- Users that made 1–3 transactions over the last year with the last transaction in weeks 3–10 (but no transaction in the last 2 weeks)
- Users that made 4–9 transactions over the last year with the last transaction in weeks 3–10 (but no transaction in the last 2 weeks)
- Users that made 10+ transactions over the last year with the last transaction in weeks 3–10 (but no transaction in the last 2 weeks)
- Users that made 1–3 transactions over the last year with the last transaction in the last 6 months (but no transaction in the last 10 weeks)
- Users that made 4–9 transactions over the last year with the last transaction in the last 6 months (but no transaction in the last 10 weeks)
- Users that made 10+ transactions over the last year with the last transaction in the last 6 months (but no transaction in the last 10 weeks)
Graphically, this segmentation forms a grid, with frequency bands on one axis and recency bands on the other.
Now within each of these cells, you have to randomly assign users 50/50 into test and control groups. Sub-segmenting users based on transaction frequency over the last year ensures that you’re evenly allocating users based on their loyalty. However, users are not static: there are constantly new users that get activated and there are old loyal users that churn. So in addition to transaction frequency, you need to evenly allocate users based on transaction recency. The sample allocation above ensures an even distribution between the control and test groups with respect to both transaction frequency and recency. However, it completely ignores the order value: if your order value has a high variance (e.g., if you sell furniture, appliances, or travel packages), it’s important to incorporate the user’s average order value as another variable in your sub-segmentation.
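A minimal sketch of such a stratified split, using the recency and frequency bands from the list above (the field names, band boundaries, and the extra "older" band are illustrative assumptions):

```python
import random
from collections import defaultdict

def rf_cell(tx_count, days_since_last):
    """Map a user to a recency-frequency cell (bands from the list above)."""
    if tx_count == 0:
        return "no-transactions"
    freq = "1-3" if tx_count <= 3 else "4-9" if tx_count <= 9 else "10+"
    if days_since_last <= 14:
        rec = "last-2-weeks"
    elif days_since_last <= 70:
        rec = "weeks-3-10"
    elif days_since_last <= 182:
        rec = "10-weeks-to-6-months"
    else:
        rec = "older"   # not in the list above; added so every user maps somewhere
    return f"{freq}/{rec}"

def stratified_split(users, seed=42):
    """50/50 test/control assignment within each recency-frequency cell.
    `users` is a list of (user_id, tx_count, days_since_last) tuples."""
    cells = defaultdict(list)
    for uid, txs, days in users:
        cells[rf_cell(txs, days)].append(uid)
    rng = random.Random(seed)
    assignment = {}
    for members in cells.values():
        rng.shuffle(members)  # randomize order, then alternate to balance the cell
        for i, uid in enumerate(members):
            assignment[uid] = "test" if i % 2 == 0 else "control"
    return assignment
```

Because the alternation happens inside each cell, the two groups are balanced on both frequency and recency by construction, up to one user per cell.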
If you have designed your user split carefully, you can now do a “dry run” for 1–2 weeks without any treatment to either group to ensure that the two groups generate roughly the same conversions per open and revenue per open. Once you have verified that the two groups are within 3–5% of each other, you can apply the treatment to the test group and keep the user split unchanged for a few weeks (assuming you run multiple email campaigns per week). After a few weeks, you may then need to do a new split and again ensure that the groups remain equivalent with a new dry run.
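The dry-run equivalence check can be sketched as a simple per-open metric comparison; the 5% tolerance and the metric names are assumptions for illustration:

```python
def per_open_metrics_match(control, test, tol=0.05):
    """Return True when the test group's conversions-per-open and
    revenue-per-open are within `tol` (relative) of the control group's.
    Each argument is a dict with 'opens', 'conversions', 'revenue'."""
    for metric in ("conversions", "revenue"):
        c = control[metric] / control["opens"]
        t = test[metric] / test["opens"]
        if abs(t - c) / c > tol:
            return False
    return True
```

If the check fails, redo the split (and re-run the dry run) before applying any treatment.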
Random reshuffle for each campaign
For marketers with less access to data, there is a less precise second option to run A/B tests across multiple campaigns. You can create a user split by randomly re-assigning each user to either group before each email campaign. Assuming that you have enough campaigns (the general rule of thumb is that you need at least 30 samples), individual differences between the user splits for each campaign will mutually offset each other and get “washed out”.
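A per-campaign reshuffle is often implemented by hashing the user id together with the campaign id, so the assignment is deterministic within a campaign but changes from campaign to campaign. A minimal sketch (the function name and id formats are hypothetical):

```python
import hashlib

def campaign_group(user_id, campaign_id):
    """Deterministic per-campaign assignment: hashing user id together with
    campaign id reshuffles the split on every new campaign."""
    h = hashlib.sha256(f"{campaign_id}:{user_id}".encode()).hexdigest()
    return "test" if int(h, 16) % 2 == 0 else "control"
```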
While this is a viable option in theory, and is often preferred by marketers because of its simplicity, in practice it violates the fundamental “all other things held constant” principle that is essential to A/B testing. When you do revenue and transaction attribution, it becomes hard to say with confidence what caused the user to make a purchase. On Monday, the user may be in the test group, while on Wednesday she is in the control group. If the user then makes a purchase on Thursday, was it because of Wednesday’s campaign or Monday’s? Last-click attribution says the purchase was driven by Wednesday’s campaign, but in practice the user may have first seen the product in Monday’s campaign and, after a two-day consideration period, clicked through Wednesday’s campaign to go straight to that product and buy it.
In client tests, we have actually observed how a random reshuffle can introduce a pattern where the control group’s results grow stronger over time while the test group maintains steady performance. This indicates a leak of information between the test and control groups: the control group’s results are being biased by the treatment. Such “cross-contamination” undermines the validity of a test by corrupting the control group’s performance.
While A/B tests sound simple in theory, there is a lot of science behind them. You need to design a careful split of users between the test and control groups, accounting for each user’s transaction frequency, recency, and order value. It’s recommended to do a “dry run” for 1–2 weeks before launching the A/B test to ensure the user split is fair. Finally, it’s important to run A/B tests across multiple campaigns over a longer period of time (at least 2–3 weeks) to both maximize the number of samples (and hence achieve a tighter confidence interval) and offset any temporal biases.
For a comprehensive review of A/B testing, read the entire whitepaper - click here to access.