A/B testing: A Practical Guide for Marketers (part 2 of 3)
In this second part of our overview of A/B testing for marketers, we delve into the complexities of average order value (AOV) and the distribution of customers by lifetime value.
The bottom line on AOV as it relates to executing good tests is this: AOV is a key contributor to financial metrics such as revenue per email open, so it belongs in your test design. It directly affects the sample size needed for a valid test, particularly if your order values vary widely. Read on to understand the magnitude of the impact.
Paying attention to the distribution of revenue per customer is just as important, because a small number of high-value customers can dramatically skew test results. Identifying your highest-value customers and making sure they are divided evenly across splits is essential for accurate results.
The impact of order value
If we measure revenue per email rather than conversions per email, each user's order value enters the test as an additional variable. That variable has its own distribution, which adds variance, so a proper test needs a larger sample of events to reach significance. We will not go into depth here on how to model this statistically, but keep the added variance of order value in mind when measuring revenue, and allocate more email volume to the test.
Because properly testing the revenue impact of potential improvements is so important, we want to offer a heuristic for roughly estimating how many additional samples (in this case, email opens) are needed for an accurate test.
With several simplifying assumptions in place, the key variables to examine are your average order value (AOV) and the variance of your order values (the AOV feeds directly into the revenue per email open calculation). The more your order values vary, the larger your test needs to be. As a rule of thumb, when the standard deviation of order value is not much larger than the AOV, we suggest running the test with twice as many opened emails as a conversion rate test would require. To get a taste of the magnitude of increase related to this extra complexity, consider this example:
if you have an AOV of $100 and your standard deviation is $50 (meaning that 95% of your orders fall in the value range of $0 to $200 for a normally distributed order value), then, using the example from above, you will need to increase the number of email opens tested by 100% (to roughly 400,000 opens).
This rule of thumb gives a rough approximation of how many emails a proper test requires. There is an important caveat, however: whether you have a long tail of high-value orders. If you do, the variability will be quite high, and so will the required test size. This makes it essential to split your highest-LTV customers evenly across test groups (see the next section; the topic is covered in detail in part 3 of 3, which is also available in the full white paper).
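To make the heuristic concrete, here is a minimal sketch of one way to estimate the multiplier. It is not the model behind the figures above, whose simplifying assumptions are not spelled out. The function name, the 2% conversion rate, and the model itself are our assumptions: each open converts independently with a fixed probability, order values are independent of conversion, and the tested change lifts conversion rate only. Under those assumptions the required sample size scales with the squared coefficient of variation of the metric.

```python
def revenue_sample_multiplier(aov, order_sd, conversion_rate):
    """Rough factor by which to scale email opens when testing revenue
    per open instead of conversion rate.

    Hypothetical simplified model (an assumption, not the article's own):
    - each open converts independently with probability conversion_rate
    - order values have mean aov and standard deviation order_sd
    - the change being tested lifts conversion rate only, not order value
    Sample size then scales with the squared coefficient of variation,
    which gives: 1 + (order_sd / aov)^2 / (1 - conversion_rate).
    """
    if aov <= 0 or order_sd < 0:
        raise ValueError("aov must be positive and order_sd non-negative")
    if not 0 < conversion_rate < 1:
        raise ValueError("conversion_rate must be strictly between 0 and 1")
    cv_order = order_sd / aov  # relative spread of order values
    return 1 + cv_order ** 2 / (1 - conversion_rate)


# Illustrative inputs only: a 2% conversion rate is assumed here.
for sd in (50, 100, 300):
    factor = revenue_sample_multiplier(aov=100, order_sd=sd, conversion_rate=0.02)
    print(f"AOV $100, sd ${sd}: about {factor:.2f}x the opens")
```

With these inputs the sketch gives multipliers of roughly 1.26, 2.02 and 10.18. In this particular model the doubling is reached when the standard deviation equals the AOV, so applying the doubling whenever the standard deviation is not much larger than the AOV sits on the conservative side. The last case shows how quickly a long tail of high-value orders inflates the required test size.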
Nature of users: Pareto Distribution
As if things were not complex enough, there is another level of complexity to A/B testing: not all email subscribers are equal. Each user shops with a different frequency and spends a different amount of money per order. Revenue distribution among users follows a classical Pareto distribution (also known as a power law distribution): 20% of all users generate 80% of all revenue, 20% of that most active 20% generate 80% of that 80%, and so on. For the more mathematically inclined readers: 100·p^n % of all users generate 100·(1 − p)^n % of all revenue, where p depends on the alpha parameter of the actual Pareto distribution. With p = 20%, this implies that (20%)^3 = 0.8% of all users generate (80%)^3 = 51.2% of the total revenue, which is more than half.
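The compounding in the 80/20 rule is easy to tabulate. The short sketch below applies the formula from the paragraph above with p = 0.2; the function name is ours, chosen for illustration.

```python
def top_segment_shares(p=0.2, levels=3):
    """Return (user_share, revenue_share) pairs for nested top segments.

    Level n applies the rule n times: a fraction p**n of users
    generates a fraction (1 - p)**n of revenue.
    """
    return [(p ** n, (1 - p) ** n) for n in range(1, levels + 1)]


for users, revenue in top_segment_shares():
    print(f"top {users:.1%} of users -> {revenue:.1%} of revenue")
# top 20.0% of users -> 80.0% of revenue
# top 4.0% of users -> 64.0% of revenue
# top 0.8% of users -> 51.2% of revenue
```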
As a result, if you are running an A/B test on 100,000 subscribers, it becomes extremely important to carefully split the 800 highest-LTV users (0.8% of the list). If, for whatever reason, 600 of those 800 users are allocated to the control group instead of 400, your control group will collect roughly 25% more revenue than it would under an even split, which completely skews the results!
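The size of that boost can be checked with a few lines of arithmetic. This is a rough sketch under two assumptions of ours: the 800 top users each contribute equally to their roughly 51.2% share of revenue, and everyone else is split evenly between the groups. The function and parameter names are hypothetical.

```python
def control_revenue_boost(top_in_control, top_total=800, top_revenue_share=0.512):
    """Relative revenue gain of the control group versus an even split.

    Assumes the top users contribute equally to top_revenue_share and
    that all remaining users are divided evenly between the two groups.
    """
    top_part = top_revenue_share * top_in_control / top_total
    rest_part = (1 - top_revenue_share) * 0.5
    control_share = top_part + rest_part  # control's share of total revenue
    return control_share / 0.5 - 1        # compare with the fair 50% share


print(f"{control_revenue_boost(600):.1%}")  # lopsided split of the top users
print(f"{control_revenue_boost(400):.1%}")  # even split of the top users
```

With 600 of the 800 top users in control, the control group ends up with about 62.8% of total revenue instead of 50%, a boost of about 25.6% before any real treatment effect is measured.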
The key lesson here: pay attention to your high-lifetime-value users and their purchase recency in order to get good splits for your testing.
Keep in mind - with 100K subscribers, a bad split of your 800 most loyal users will introduce a bias to the results.
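One simple way to avoid such a bad split is to randomize the top users separately from everyone else, so each group receives half of them. The sketch below illustrates that idea only; it is not necessarily the approach described in part 3. The function name, the 0.8% cutoff, and the input format (a dict mapping user IDs to LTV) are our assumptions, and recency is not taken into account.

```python
import random


def stratified_split(ltv_by_user, top_fraction=0.008, seed=0):
    """Split users into (control, treatment), balancing the top-LTV stratum.

    ltv_by_user: dict mapping a user ID to that user's lifetime value.
    The top_fraction of users by LTV is shuffled and halved on its own,
    then the remaining users are shuffled and halved the same way.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    ranked = sorted(ltv_by_user, key=ltv_by_user.get, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    control, treatment = [], []
    for stratum in (ranked[:n_top], ranked[n_top:]):
        rng.shuffle(stratum)
        half = len(stratum) // 2
        control.extend(stratum[:half])
        treatment.extend(stratum[half:])
    return control, treatment
```

For a list of 100,000 subscribers, this places 400 of the 800 highest-LTV users in each group instead of leaving their allocation to chance.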
What we'll cover in the next chapter of A/B testing: A Practical Guide for Marketers (3 of 3):
- Best practice for performing A/B tests that yield accurate results.
- Designing splits while factoring in recency and frequency.
- Alternative methods for designing splits.
Stay tuned for the next article in the series on A/B Testing: A Practical Guide for Marketers.