Online Experiments Tricks – Variance Reduction

Why do we need variance reduction?

When we do online experiments or A/B testing, we need to ensure our test has high statistical power so that we have a high probability to find the experimental effect if it does exist. What are the factors that might affect power? Sample sizes, sampling variance of the experiment metric, significance level alpha, and effect size.

The canonical way to improve power is to increase the sample size. However, the dynamic range is limited since the minimum detectable effect MDE is proportional to 1/sqrt(sample_size). Also, in reality, getting more samples or running an experiment for a longer time to increase the sample size might not always be easy or feasible.

The next best thing we can do is to decrease the sampling variance of the experiment metrics. The easiest thing one could do is to transform metrics. Winsorize, binarize, or other more complicated metrics transformations will help decrease the variance significantly. However, metrics transformation introduces bias. So there is a bias-variance tradeoff with this approach.

Many other variance reduction methods are developed and productionalized in the tech industry to improve the sensitivity/power of experiments. In this article, I will walk through some of the popular variance reduction methods and demonstrate some simple examples in Python:

Stratification and post-stratification
CUPED (controlled-experiment using pre-experiment data)
Variance-Weighted Estimators

ML-based methods:

CUPAC (control using predictions as covariates)
MLRATE (machine learning regression-adjusted treatment effect estimator)

If this in-depth educational content is useful for you, subscribe to our AI research mailing list to be alerted when we release new material.

Stratification

Stratified sampling buckets the population into k strata (e.g., countries), and then the experiment random samples individuals from each stratum independently. Let Y_strat be the treatment effect under the stratified sampling and let p_k indicate the proportion of sample size from strata k. The equations below tell us that the treatment effect is the pooled average of the treatment effect in each stratum and is unbiased. The variance is the weighted average of the within-strata variance and effectively removes the between-strata variance. The variance is smaller than the variance under simple random sampling, which includes both the within-strata and the between-strata variance (more info in this paper).

Pros and Cons

The stratification method provides an unbiased estimate of the treatment effect and effectively removes the between-strata variance. However, in practice, it is usually very hard to implement stratified sampling before experiments.

“In the online world, because we collect data as they arrive over time, we are usually unable to sample from strata formed ahead of time.” (Deng, Xu, Kohavi, & Walker, 2013)
In practice, implementing stratified sampling is complicated and expensive. It “requires a queue system and the use of multiple machines.” (Xie & Aurisset, 2016)

Post-stratification

In practice, it is a lot more common to do post-stratification than stratification. Post-stratification randomly samples the population first and then places individuals into strata. Similar to stratification, post-stratification can achieve similar variance reduction.

Here is a very simple example where we generate data from four different normal distributions (4 strata), randomly assign individuals to the treatment and control group, add a treatment effect to the treatment group, and visualize the treatment effects via bootstrapping. The treatment effect is calculated as the mean difference between treatment and control without stratification and as the average of the mean difference for each stratum with stratification. From our simple example, we do see a variance reduction with stratification. Crucially, the mean is unchanged, so we should be able to see any experimental effect on the means better now that the variance has been decreased.

import pandas as pd
import numpy as np
import hvplot.pandas
from scipy.stats import pearsonr
from scipy.optimize import minimize

def generate_strata_data(treatment_effect, size):
    # for each strata, generate y from a normal distribution
    df1 = pd.DataFrame({'strata': 1, 'y': np.random.normal(loc=10, scale=1, size=size)})
    df2 = pd.DataFrame({'strata': 2, 'y': np.random.normal(loc=15, scale=2, size=size)})
    df3 = pd.DataFrame({'strata': 3, 'y': np.random.normal(loc=20, scale=3, size=size)})
    df4 = pd.DataFrame({'strata': 4, 'y': np.random.normal(loc=25, scale=4, size=size)})
    df = pd.concat([df1, df2, df3, df4])
    # random assign rows to two groups 0 and 1 
    df['group'] = np.random.randint(0,2, df.shape[0])
    # for treatment group add a treatment effect 
    df.loc[df["group"] == 1, 'y'] += treatment_effect
    return df   

def meandiff(df):
    return df[df.group==1].y.mean() - df[df.group==0].y.mean()

def strata_meandiff(df):
    get_sum = 0
    for i in df.strata.unique():
        get_sum += meandiff(df[df.strata==i])
    return get_sum/len(df.strata.unique())


meandiff_lst = []
strata_meandiff_lst = []
for i in range(100):
    df = generate_strata_data(treatment_effect=1, size=100)
    meandiff_lst.append(meandiff(df))
    strata_meandiff_lst.append(strata_meandiff(df))
    
(
    pd.DataFrame(strata_meandiff_lst).hvplot.kde(label='Stratification') 
    * pd.DataFrame(meandiff_lst).hvplot.kde(label='Original')
)

*Treatment effect with and without stratification*

CUPED

CUPED (Controlled-experiment using pre-experiment data) was first introduced by Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker from Microsoft in 2013 and has been widely used in big tech companies such as Netflix, bookings.com, TripAdvisor, and many others. CUPED uses pre-experiment data X (e.g., pre-experiment values of Y) as a control covariate:

In other words, the variance of Y is reduced by (1-Corr(X, Y)). We would need the correlation between X and Y to be high for CUPED to work well. In the original paper, it is recommended to use the pre-experiment value of Y as X.

Here is an example with one sampling. We see that the variance of Y for both the control group and the treatment group has decreased and the ratio of the variance of Y_cuped and Y is 0.2789, which is the same as the (1-Corr(X,Y)) value based on the theoretical equation above. Reducing the variance reduces the overlap between these two distributions and makes it easier to see an experimental effect.

def generate_data(treatment_effect, size):
    # generate y from a normal distribution
    df = pd.DataFrame({'y': np.random.normal(loc=0, scale=1, size=size)})
    # create a covariate that's corrected with y 
    df['x'] = minimize(
        lambda x: 
        abs(0.95 - pearsonr(df.y, x)[0]), 
        np.random.rand(len(df.y))).x
    # random assign rows to two groups 0 and 1 
    df['group'] = np.random.randint(0,2, df.shape[0])
    # for treatment group add a treatment effect 
    df.loc[df["group"] == 1, 'y'] += treatment_effect
    return df    

df = generate_data(treatment_effect=1, size=10000)
theta = df.cov()['x']['y'] / df.cov()['x']['x']
df['y_cuped'] = df.y - theta * df.x

(
    df.hvplot.kde('y', by='group', xlim = [-5,5], color=['#F9a4ba', '#f8e5ad']) 
    + df.hvplot.kde('y_cuped', by='group', xlim = [-5,5], color=['#F9a4ba', '#f8e5ad'])
)

The variance of Y before and after CUPED adjustment

Here we simulate our experiment many times and calculate the mean difference of the control group and the treatment group and get the distribution of the treatment effect. Note that Y_cuped is not an unbiased estimator of Y. But the mean difference of Y_cuped is an unbiased estimator of the mean difference of Y. Based on the figure below, it is clear that in our simple case, CUPED decreased the variance of our treatment.

def generate_data(treatment_effect, size):
    # generate y from a normal distribution
    df = pd.DataFrame({'y': np.random.normal(loc=0, scale=1, size=size)})
    # create a covariate that's corrected with y 
    df['x'] = minimize(
        lambda x: 
        abs(0.95 - pearsonr(df.y, x)[0]), 
        np.random.rand(len(df.y))).x
    # random assign rows to two groups 0 and 1 
    df['group'] = np.random.randint(0,2, df.shape[0])
    # for treatment group add a treatment effect 
    df.loc[df["group"] == 1, 'y'] += treatment_effect
    return df   

def meandiff(df):
    return df[df.group==1].y.mean() - df[df.group==0].y.mean()

def cuped_meandiff(df):
    theta = df.cov()['x']['y'] / df.cov()['x']['x']
    df['y_cuped'] = df.y - theta * df.x
    return df[df.group==1].y_cuped.mean() - df[df.group==0].y_cuped.mean()


meandiff_lst = []
cuped_meandiff_lst = []
for i in range(200):
    df = generate_data(treatment_effect=1, size=100)
    meandiff_lst.append(meandiff(df))
    cuped_meandiff_lst.append(cuped_meandiff(df))
    
pd.DataFrame(cuped_meandiff_lst).hvplot.kde(label='CUPED') * pd.DataFrame(meandiff_lst).hvplot.kde(label='Original')

*Treatment effect with and without CUPED adjustment*

Pros and Cons

CUPED is super easy to use and implement. However, the covariate selection can be tricky, especially when the pre-experiment measure of the target variable is not available. The covariate has to be correlated with the target measure, but not related to the experiment. In addition, the original paper did not go through the scenario where there are multiple covariates. This blog post goes through the algebra extending CUPED from one covariate to multiple covariates. Another solution is to use ML to construct the control variate, which we will talk about later.

Variance-Weighted Estimators

The variance-weighted estimators method was developed by Kevin Liou and Sean Taylor from Facebook and Lyft in 2020. The main idea of this method is to give more weight to users who have lower pre-experiment variance.

This method relaxed the homoscedastic variance assumption and instead assumes that each individual can have its own metric variance. For example, the figure above illustrates two individuals, where one (green line) has a higher variance than the other (blue line).

In the equations below, Y is the metric of interest, δ is the treatment effect, Z is an indicator function of the treatment. To minimize the variance of the treatment effect, we weight each user by the inverse of the variance.

Similar to CUPED, the variance-weighted estimators also use pre-experiment data. The paper mentioned several methods to estimate variance include using the empirical variance of the pre-experiment time series data, building an ML model, and using Empirical Bayes estimators. The easiest way is to just use the empirical variance.

Weighting induced biases. To reduce bias, the paper proposed a method to bucket users based on their pre-experiment variances, calculates the mean of the treatment effect and pre-experience variance within each bucket, and then calculates the weighted treatment effect across strata.

So overall in practice, we would estimate the variance, bucket the variance into k strata, weigh each stratum by the inverse variance, and calculate the weighted treatment effect.

Here is a very simple example demonstrating this method. I start with four strata and we know the metric variance of each stratum. Then we can weigh the treatment effect by the inverse variance for each stratum. For this example, the variance weighted estimator provides additional benefits of variance reduction in addition to the stratification methods.

def generate_strata_data(treatment_effect, size):
    # for each strata, generate y from a normal distribution
    df1 = pd.DataFrame({'strata': 1, 'pre_experient_variance':1, 'y': np.random.normal(loc=10, scale=1, size=size)})
    df2 = pd.DataFrame({'strata': 2, 'pre_experient_variance':2, 'y': np.random.normal(loc=15, scale=2, size=size)})
    df3 = pd.DataFrame({'strata': 3, 'pre_experient_variance':3, 'y': np.random.normal(loc=20, scale=3, size=size)})
    df4 = pd.DataFrame({'strata': 4, 'pre_experient_variance':4, 'y': np.random.normal(loc=25, scale=4, size=size)})
    df = pd.concat([df1, df2, df3, df4])
    # random assign rows to two groups 0 and 1 
    df['group'] = np.random.randint(0,2, df.shape[0])
    # for treatment group add a treatment effect 
    df.loc[df["group"] == 1, 'y'] += treatment_effect
    return df   

def variance_weighted_meandiff(df):
    weighted_effect_sum = 0
    weights_sum = 0
    for i in df.strata.unique():
        #For each strata, we then calculate its average treatment effect
        treatment_effect_strata = meandiff(df[df.strata==i])
        # estimate its weight based on the inverse within-group estimated variance (such as mean)
        weights_strata = 1/df[df.strata==i]['pre_experient_variance'].mean()
        # calculate the sum of weighted treatment effect
        weighted_effect_sum +=  treatment_effect_strata *  weights_strata
        # calculate the sum of weights
        weights_sum += weights_strata    
    return weighted_effect_sum/weights_sum

def strata_meandiff(df):
    get_sum = 0
    for i in df.strata.unique():
        get_sum += meandiff(df[df.strata==i])
    return get_sum/len(df.strata.unique())


meandiff_lst = []
variance_weighted_meandiff_lst = []
strata_meandiff_lst = []
for i in range(200):
    df = generate_strata_data(treatment_effect=1, size=100)
    meandiff_lst.append(meandiff(df))
    variance_weighted_meandiff_lst.append(variance_weighted_meandiff(df))
    strata_meandiff_lst.append(strata_meandiff(df))
   
(
    pd.DataFrame(variance_weighted_meandiff_lst).hvplot.kde(label='Variance Weighted Estimator')
    * pd.DataFrame(meandiff_lst).hvplot.kde(label='Original')
    * pd.DataFrame(strata_meandiff_lst).hvplot.kde(label='Statification')
)

*Treatment effect with and without using the variance weighted estimator*

Pros and Cons

The variance-weighted estimator models individual pre-experiment variance as weight and it can be used as a nice extension to other methods such as CUPED. It works well when there is a highly skewed variance between users and when the pre-treatment variance is a good indicator of the post-treatment variance.
However, when the variance of the pre-treatment variance is low or when the pre- and post-experiment variances are not consistent, the variance-weighted estimator might not work. In addition, the variance-weighted estimator is not unbiased. Managing bias is important for this method.

ML-based methods

There are several ML-based methods for variance reduction that were developed in recent years. I am going to briefly walk through two ML-based methods — CUPAC and MLRATE. Since the original papers did not describe the exact algorithms and the details of their ML models, I will only touch on my understandings of the high-level ideas and not demonstrate examples for these methods.

CUPAC

CUPAC (Control Using Predictions As Covariates) was introduced by Jeff Li, Yixin Tang, and Jared Bauman from Doordash in 2020. It is an extension of CUPED. Instead of selecting a pre-experiment covariate X that’s not correlated with the treatment, CUPAC uses the results of a machine learning model as its control variate. The goal of the machine learning model is to “maximize the partial correlation between the prediction covariate (CUPAC) and the target metric”.

Assuming we have pre-experiment metrics, X1, X2, X3, and X4. Essentially, what this method does is to use some machine learning model to predict Y using X1, X2, X3, and X4. And then, we can use the predicted value as the control covariate in CUPED.

MLRATE

MLRATE (machine learning regression-adjusted treatment effect estimator) was proposed by Guo and others from Princeton and Facebook in 2021.

Similar to CUPAC, MLRATE also uses an ML model to predict Y from X. Let’s call the predicted value g(X). Instead of subtracting g(X) from Y, MLRATE includes g(X) together with the treatment indicator in a regression model and then calculates the regression-adjusted treatment effect. The figure below demonstrated this regression model:

First, we start with a vector of the covariate or a matrix of covariates X. We then learn and apply a cross-fitting supervised learning algorithm. Cross-fitting is used to avoid over-fitting bias. The cross-fitting procedure is as follows: We split our data into k splits. For each split, we train our data on the sample that’s not in the current split and get a function g. Then we use the X in the current split and get the predicted value of g(X) for the current split.
Next, we conduct a regression model where we predict experiment metric Y with the regressors: T — treatment indicator, g(X) — predicted value from the cross-fitting ML, and T(g(x)-g).
The OLS estimator for α1 is the treatment effect we are interested in.

Other ML-based methods

There are otherML-based methods used in the industry. For example, Poyarkov and others from Yandex developed the Boosted decision tree regression adjustment in 2016.

Overall, this article summarizes some of the popular variance reduction methods in the industry — post-stratification, CUPED, Variance-Weighted Estimators, and ML-based methods CUPAC and MLRATE. In practice, CUPED is widely used and productionalized in tech companies and ML-based methods are often used to incorporate multiple covariates. It is also common to combine multiple methods to achieve optimal variance reduction. Hope you find this article helpful. Thanks!

Acknowledgment: Thank you so much Anthony Fu and Jim Bednar for providing guidance and feedback to this article. Thank you Kevin Liou and Sean Taylor for clarifying my questions around the Variance-Weighted Estimators.

References

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.

We’ll let you know when we release more technical education.

Why do we need variance reduction?

Stratification

Pros and Cons

Post-stratification

CUPED

Pros and Cons

Variance-Weighted Estimators

Pros and Cons

ML-based methods

CUPAC

MLRATE

Other ML-based methods

Enjoy this article? Sign up for more AI updates.

Related

Reader Interactions

About Sophia Yang

Leave a Reply

Footer

About TOPBOTS