If you’ve read my previous post, you already know why I think you should move to Bayesian A/B testing. In this post, I give a short overview of the statistical models behind Bayesian A/B tests and present the ways we implemented them at Wix.com, where we run A/B tests at a massive scale.

I wrote some practical examples in Python alongside this post. You can easily replicate them by copy-pasting the code from here and here. While the code is rather simple, the formulas I derive can be a little advanced; however, I don’t think it is critical to understand all the derivations in order to use the code.

If you don’t have a background in Statistics and Bayesian inference, I recommend you read the articles in my references section. Also, a lot of concepts I use have links to Wikipedia, where they are explained in more detail.

## The Bayesian A/B Test Model


Bayesian statistics is built on two main concepts: the prior distribution — what we “know” about the KPI before the test, and the posterior distribution — what we know after gathering data. We find the posterior by “updating” the prior with our data. In an A/B test context, we try to keep it simple and use prior distributions that can be easily updated (conjugate priors), like the Beta-Binomial model.

The Beta-Binomial model is used to model binary data such as conversions or clicks (“did the user convert or not?”). I’ll also review the Normal-Normal model, which is used for continuous data (for example revenue per user).

In the Beta-Binomial model we assume that the conversion rate *Pᴀ* has a Beta distribution with parameters *α* and *β*. A common choice for *α* and *β* is 1, which results in a uniform distribution (sometimes referred to as an “uninformative prior”). I discuss the choice of the prior in more detail in the appendix, but for now let’s continue as if they have already been chosen.

We denote the count of converted users with *Xᴀ* and the count of all users (converted or not) with *nᴀ*, and we model *Xᴀ | Pᴀ ~ Bin(nᴀ, Pᴀ)*. Since *Pᴀ ~ Beta(α, β)*, by using Bayes’ theorem we get:
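The posterior here originally appeared as an image; the standard Beta-Binomial conjugate update it shows is:

```latex
P_A \mid X_A = x_A \;\sim\; \mathrm{Beta}\left(\alpha + x_A,\; \beta + n_A - x_A\right)
```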

Meaning we “update” our prior by adding the number of successes to *α* and the number of failures to *β*. Pretty simple, don’t you think?

In the Normal-Normal model, we assume that the expected revenue per user (or any other continuous metric) *μᴀ* has a Gaussian distribution with parameters *μ₀* and *σ₀²/n₀*. Here we denote the sample average with *X̄ᴀ* and the sample standard deviation with *sᴀ*, and we assume that *X̄ᴀ* is approximately Normally distributed as well. This time, the update rule is a bit trickier:
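The update-rule image is missing from this version; the conjugate Normal posterior it describes, matching the computation in Snippet 2 (with the sample standard deviation *sᴀ* plugged in for the unknown population value), is:

```latex
\mu_A \mid \bar{X}_A \;\sim\; \mathcal{N}\!\left(
  \frac{\frac{n_0}{\sigma_0^2}\,\mu_0 + \frac{n_A}{s_A^2}\,\bar{X}_A}
       {\frac{n_0}{\sigma_0^2} + \frac{n_A}{s_A^2}},\;
  \left(\frac{n_0}{\sigma_0^2} + \frac{n_A}{s_A^2}\right)^{-1}
\right)
```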

The posterior’s expected value is a weighted average of the prior mean and the sample mean, where the weights are the inverses of their variances.

All these formulas may seem a little overwhelming, but the calculations in NumPy are pretty simple:

```python
x_a, n_a = 254, 1283    # converted and total users in A
x_b, n_b = 289, 1321    # converted and total users in B
m_a, s_a = 52.3, 14.1   # average and standard deviation over all users in A
m_b, s_b = 52.8, 13.7   # average and standard deviation over all users in B
alpha_0, beta_0 = 1, 1  # Beta prior
mu0, s0, n0 = 0, 1, 0   # Gaussian prior
```

Snippet 1: Setting up some dummy data

```python
import numpy as np

# updating Beta prior
alpha_a = alpha_0 + x_a
beta_a = beta_0 + n_a - x_a
alpha_b = alpha_0 + x_b
beta_b = beta_0 + n_b - x_b

# updating Gaussian prior is a bit more tricky
inv_vars_a = n0 / np.power(s0, 2), n_a / np.power(s_a, 2)
mu_a = np.average((mu0, m_a), weights=inv_vars_a)
sd_a = 1 / np.sqrt(np.sum(inv_vars_a))
inv_vars_b = n0 / np.power(s0, 2), n_b / np.power(s_b, 2)
mu_b = np.average((mu0, m_b), weights=inv_vars_b)
sd_b = 1 / np.sqrt(np.sum(inv_vars_b))
```

Snippet 2: Updating posteriors

## How To Calculate (Most Metrics)

Now that we’ve found the posterior distributions of *Pᴀ* and *Pʙ*, we want to calculate inferential metrics, such as Credibility Intervals, the Probability B is Better and each version’s Risk. The most common and simple way to do this is by using Monte Carlo simulations. But at Wix.com we have thousands of KPIs in hundreds of A/B tests every day, and using simulations isn’t scalable. Instead, we use two methods: Gaussian Quadratures (more on that later) and… Central Limit Theorem (CLT) approximations.

You might be thinking — isn’t the CLT more of a frequentist kind of thing? And can the CLT even be applied to the Beta distribution? You’ll be right to question that. However, when you have a sample size of thousands in every A/B test, the CLT approximation for the Beta distribution is “good enough”. You can check this yourself with your own simulations (or look at mine).
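As a quick illustration (my own sketch, not from the original post, reusing the dummy data from Snippet 1), we can compare the exact 95% interval of a Beta posterior at this sample size with its CLT (Normal) approximation:

```python
import numpy as np
from scipy.stats import beta, norm

# Beta posterior for group A, using Snippet 1's dummy data and a uniform prior
a, b = 1 + 254, 1 + 1283 - 254

exact = beta.ppf((.025, .975), a, b)  # exact posterior quantiles
approx = norm.ppf((.025, .975), loc=beta.mean(a, b), scale=beta.std(a, b))  # CLT

print(exact)
print(approx)
print(np.abs(exact - approx).max())  # the two intervals agree to ~3 decimals
```

With n in the thousands the Beta posterior’s skewness is tiny, which is why the matched-moments Normal works so well here.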

Here we’re going to approximate the Credibility Intervals and the Probability B is Better with the CLT. Again, while the formulas can be a bit exhausting, the code with SciPy is almost a one-liner. We denote the difference *Pʙ – Pᴀ* with *D₁* and the log-ratio *log(Pʙ/Pᴀ)* with *D₂*. The latter is used when we report the relative uplift instead of the percentage-point difference. We approximate the distributions of *D₁* and *D₂* with:
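The approximation itself originally appeared as an image; consistent with the computation in Snippet 3 (the posteriors are independent, so their variances add), the two Normal approximations are:

```latex
D_1 \;\overset{\text{approx}}{\sim}\; \mathcal{N}\!\big(\mathbb{E}[P_B] - \mathbb{E}[P_A],\; \mathrm{Var}(P_A) + \mathrm{Var}(P_B)\big)

D_2 \;\overset{\text{approx}}{\sim}\; \mathcal{N}\!\big(\mathbb{E}[\ln P_B] - \mathbb{E}[\ln P_A],\; \mathrm{Var}(\ln P_A) + \mathrm{Var}(\ln P_B)\big)
```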

To complete the formulas above, we denote the digamma function and the first polygamma function with *ψ* and *ψ*₁ respectively. You don’t really need to know what they do, just where to find them in SciPy! The equations below are simply copied from Wikipedia:
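The copied equations aren’t reproduced in this version; for *X ~ Beta(α, β)* they give the moments of *ln X*:

```latex
\mathbb{E}[\ln X] = \psi(\alpha) - \psi(\alpha + \beta), \qquad
\mathrm{Var}(\ln X) = \psi_1(\alpha) - \psi_1(\alpha + \beta)
```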

Using these formulas, we can easily calculate the probability B is better with *P(D₁ > 0)*, and the Credibility Interval with the quantiles of *D₁* or *D₂*:

```python
from scipy.stats import norm, beta
from scipy.special import digamma, polygamma

log_beta_mean = lambda a, b: digamma(a) - digamma(a + b)
var_beta_mean = lambda a, b: polygamma(1, a) - polygamma(1, a + b)

d1_beta = norm(loc=beta.mean(alpha_b, beta_b) - beta.mean(alpha_a, beta_a),
               scale=np.sqrt(beta.var(alpha_a, beta_a) + beta.var(alpha_b, beta_b)))
d2_beta = norm(loc=log_beta_mean(alpha_b, beta_b) - log_beta_mean(alpha_a, beta_a),
               scale=np.sqrt(var_beta_mean(alpha_a, beta_a) + var_beta_mean(alpha_b, beta_b)))

print(f'The probability the conversion in B is higher is {d1_beta.sf(0)}')
# The probability the conversion in B is higher is 0.9040503042127781
print(f'The 95% credibility interval of (p_b/p_a-1) is {np.exp(d2_beta.ppf((.025, .975))) - 1}')
# The 95% credibility interval of (p_b/p_a-1) is [-0.0489547  0.28350359]
```

Snippet 3: Calculating (some) inferential metrics for the Beta-Binomial model

For the Normal-Normal case, *D₁* is pretty straightforward. But what if we want to use the relative difference instead? We’ve decided to use the Delta method to find the approximate distribution of *ln μᴀ*. Now you might be thinking — how the heck can I take the logarithm of a Gaussian random variable? And (again) you’ll be completely right — the Gaussian distribution’s support includes negative numbers, so its logarithm isn’t properly defined. But (again) with a sample size of tens of thousands, it’s a “good enough” approximation, as *μᴀ*’s distribution is “pretty far away” from 0.
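The delta-method result also appeared as an image; for a posterior *μᴀ ~ N(m, σ²)* concentrated away from zero, it gives:

```latex
\ln \mu_A \;\overset{\text{approx}}{\sim}\; \mathcal{N}\!\left(\ln m,\; \frac{\sigma^2}{m^2}\right)
```

which is exactly what `d2_norm` computes in Snippet 4.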

```python
d1_norm = norm(loc=mu_b - mu_a, scale=np.sqrt(sd_a ** 2 + sd_b ** 2))
d2_norm = norm(loc=np.log(mu_b) - np.log(mu_a),
               scale=np.sqrt((sd_a / mu_a) ** 2 + (sd_b / mu_b) ** 2))

print(f'The probability the average income in B is 2% lower (or worse) is {d2_norm.cdf(np.log(.98))}')
# The probability the average income in B is 2% lower (or worse) is 0.00011622384023304196
print(f'The 95% credibility interval of (m_b - m_a) is {d1_norm.ppf((.025, .975))}')
# The 95% credibility interval of (m_b - m_a) is [-0.04834271  1.94494332]
```

Snippet 4: Calculating (some) inferential metrics for the Normal-Normal model

## Calculating the Risk

Now we have finally arrived at the important part: the Risk measure is the most important measure in Bayesian A/B testing. It replaces the P-value as a decision rule, but also serves as a stopping rule — since the Bayesian A/B test has a dynamic sample size.

It is interpreted as “When B is worse than A, if I choose B, how many conversions am I expected to lose?”, and formally, it is defined as:
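The formal definition appeared as an image; the Risk of choosing B (A’s Risk is analogous, with the roles swapped) can be written as:

```latex
\mathrm{Risk}_B
  \;=\; \mathbb{E}\big[(P_A - P_B)\,\mathbf{1}\{P_A > P_B\}\big]
  \;=\; \int_0^1\!\!\int_0^1 (p_a - p_b)\,\mathbf{1}\{p_a > p_b\}\, f_A(p_a)\, f_B(p_b)\, dp_a\, dp_b
```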

with **1** as an indicator function. Notice the double integral over the joint posterior — I really hate that integral! It doesn’t have an analytical solution, and **we can’t approximate it with the CLT**. But as I previously mentioned, Monte Carlo simulations are NOT an option for us. So what did we do?

Gaussian quadratures (GQ) are a cool way of approximating integrals with a weighted sum over a small number of nodes. The nodes and weights are calculated by the GQ algorithm. Formally, we find *n* nodes (*x*) and weights (*w*) that best approximate the integral of *g* with a summation:
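The quadrature formula was shown as an image; in the form used here, with a probability density *f* as the weight function, it reads:

```latex
\int g(x)\, f(x)\, dx \;\approx\; \sum_{i=1}^{n} w_i\, g(x_i)
```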

As we saw earlier, the Risk metric is an integral — so we can try to approximate it with a GQ. First, let’s simplify its expression:
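The derivation was shown as an image; a reconstruction consistent with Snippet 5 rewrites both Risks around the shared term *ξ = E[max(Pᴀ, Pʙ)]*:

```latex
\xi \;=\; \mathbb{E}\big[\max(P_A, P_B)\big]
    \;=\; \underbrace{\int_0^1 p\, F_B(p)\, f_A(p)\, dp}_{\xi_A}
    \;+\; \underbrace{\int_0^1 p\, F_A(p)\, f_B(p)\, dp}_{\xi_B}

\mathrm{Risk}_A = \xi - \mathbb{E}[P_A], \qquad \mathrm{Risk}_B = \xi - \mathbb{E}[P_B]
```

Here *Fᴀ* and *Fʙ* are the posterior CDFs, so each *ξ* term is a one-dimensional integral of a smooth function against a Beta (or Normal) density.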

Now *ξᴀ* is an integral we *can* approximate with a Gaussian quadrature! We can accurately calculate the Risk with about 20 nodes, instead of the thousands of samples a Monte Carlo simulation needs:

We implemented this approximation using scipy.special.roots_hermitenorm and a work-around over scipy.special.roots_sh_jacobi (more on the work-around in my notes). This isn’t *very* fast, but it’s the fastest way we found.

```python
from scipy.special import roots_hermitenorm  # , roots_sh_jacobi
from orthogonal import roots_sh_jacobi

# The following throws an integer overflow error
# if using scipy when a + b are too large
# Use the log trick instead (see my PR at scipy / work-around in repo)
def beta_gq(n, a, b):
    x, w, m = roots_sh_jacobi(n, a + b - 1, a, True)
    w /= m
    return x, w

nodes_a, weights_a = beta_gq(24, alpha_a, beta_a)
nodes_b, weights_b = beta_gq(24, alpha_b, beta_b)

gq = sum(nodes_a * beta.cdf(nodes_a, alpha_b, beta_b) * weights_a) + \
     sum(nodes_b * beta.cdf(nodes_b, alpha_a, beta_a) * weights_b)
risk_beta = gq - beta.mean((alpha_a, alpha_b), (beta_a, beta_b))

print(f'The risk of choosing A is losing {risk_beta[0]} conversions per user.\n'
      f'The risk of choosing B is losing {risk_beta[1]} conversions per user.')
# The risk of choosing A is losing 0.021472801833822552 conversions per user.
# The risk of choosing B is losing 0.0007175909729974506 conversions per user.

def norm_gq(n, loc, scale):
    x, w, m = roots_hermitenorm(n, True)
    x = scale * x + loc
    w /= m
    return x, w

nodes_a, weights_a = norm_gq(24, mu_a, sd_a)
nodes_b, weights_b = norm_gq(24, mu_b, sd_b)

gq = sum(nodes_a * norm.cdf(nodes_a, mu_b, sd_b) * weights_a) + \
     sum(nodes_b * norm.cdf(nodes_b, mu_a, sd_a) * weights_b)
risk_norm = gq - norm.mean((mu_a, mu_b))

print(f'The risk of choosing A is losing {risk_norm[0]}$ per user.\n'
      f'The risk of choosing B is losing {risk_norm[1]}$ per user.')
# The risk of choosing A is losing 0.9544550905343314$ per user.
# The risk of choosing B is losing 0.006154785995697409$ per user.
```

Snippet 5: Calculating the Risk metric with SciPy’s Gaussian quadrature
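As a sanity check (my own addition, not part of the original post), the quadrature results for the Beta-Binomial case can be compared against a plain Monte Carlo estimate on the posteriors from Snippets 1–2:

```python
import numpy as np
from scipy.stats import beta

# posteriors from Snippets 1-2 (uniform Beta(1, 1) prior)
alpha_a, beta_a = 1 + 254, 1 + 1283 - 254
alpha_b, beta_b = 1 + 289, 1 + 1321 - 289

rng = np.random.default_rng(42)
p_a = beta.rvs(alpha_a, beta_a, size=1_000_000, random_state=rng)
p_b = beta.rvs(alpha_b, beta_b, size=1_000_000, random_state=rng)

# Risk of keeping A / shipping B: expected conversions-per-user lost
# in the scenarios where that choice turns out to be the wrong one
risk_a_mc = np.maximum(p_b - p_a, 0).mean()
risk_b_mc = np.maximum(p_a - p_b, 0).mean()

print(risk_a_mc, risk_b_mc)  # close to the quadrature values (~0.0215 and ~0.0007)
```

The agreement is good, but note that the simulation needs a million draws per KPI to get there, which is exactly why we prefer the 24-node quadrature at scale.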

## Summary

The mathematics and programming in Bayesian A/B testing are a bit more challenging than in the Frequentist framework. However, as I argued in my previous post, I think it is worth it. Although it is more computationally expensive, the added clarity and interpretability of the A/B test provides enormous value to anyone who uses it.

In this post, I reviewed the basics of the Beta-Binomial and Normal-Normal models for Bayesian A/B testing, and presented some of the approximations we implemented at Wix.com. While these approximations may not be 100% accurate, they are “good enough” when experiments have thousands of users, and they allowed us to support Bayesian A/B testing on a large scale.

This post focused on how to calculate the Bayesian metrics of an A/B test, not on how to analyze it or when to stop it. If you want to read more on those issues, the posts in my references give an excellent overview.

## References

Here are some of the posts that I read when I was first learning about Bayesian A/B testing. The last two posts are a bit more critical of the framework, and I especially recommend reading them.

- The Power of Bayesian A/B Testing by Michael Frasco
- Bayesian A/B testing — a practical exploration with simulations by Blake Arnold
- Bayesian A/B Testing at VWO by Chris Stucchio
- Bayesian vs. Frequentist A/B Testing: What’s the Difference? by Alex Birkett
- Is Bayesian A/B Testing Immune to Peeking? Not Exactly by David Robinson

The last post is probably my favorite; I’ve read it a couple of times. In particular, by recreating its simulations, I learned about the effect of choosing different priors on the error rate when stopping the Bayesian A/B test dynamically (more on that subject in the appendix).

**I’ve also built some cool R Shiny apps related to Bayesian A/B testing, using only the formulas presented in this post:**

- Beta-Binomial A/B Test Calculator
- Normal-Normal A/B Test Calculator
- A cool app comparing Runtime & Accuracy between Bayesian & Frequentist A/B testing

All the code I used in this post can be found in this directory.

## Appendix: On The Choice of the Prior

There is a niche debate about the importance of choosing a prior in Bayesian A/B testing. In my opinion, using an informative and suitable prior is of great importance, and here’s why:

- I view the prior as a sort of “regularization” on the A/B test, and it plays a major role when dealing with the multiple comparisons issue. Yes, it is still an issue, even though we’re not using the Frequentist framework.
- Since we use Bayesian A/B testing sequentially (looking at the results every day and stopping once the Risk is below the threshold), using a “weak” (low) prior increases our error rate. See David Robinson’s post in my references.
- One cannot avoid the dilemma of choosing a prior by using an uninformative prior. **Choosing an uninformative prior is still a choice of a prior**, just a very lousy one.

I think it is important to find and use an appropriate prior in a Bayesian model. At Wix, we did so by automating a “prior builder” process, where we automatically query historical data from the past few weeks and build a prior from that data, as Frasco and Arnold suggest in their posts. Only when there isn’t enough data do we fall back to an “uninformative” prior (*α=1, β=1* for the Beta distribution, *μ₀=0, σ₀=1, n₀=1* for the Gaussian distribution).

## Notes

*Ordered by descending level of importance*

- scipy.special.roots_sh_jacobi encounters an integer overflow error when *α* or *β* are too large. I’ve copied the source code and applied the “log trick” to make it work; you can see it here. I also have a PR (here) for that issue.
- If you’re programming in R, you can approximate the Risk with a Gaussian quadrature using the statmod::gauss.quad.prob function.
- The parameters *μ₀=0, σ₀=1, n₀=1* **aren’t** the Gaussian distribution’s uninformative prior, but they result in a very “weak” prior. The reason we don’t use the actual uninformative prior is that it isn’t a conjugate prior, which makes calculations much more difficult.
- We found that the Normal-Normal model suits most of our continuous KPIs. While there are more sophisticated models (for example a product of an Exponential and a Beta distribution, here), we found that in a large sample the Gaussian approximation yields similar results, and it’s much simpler to calculate (Occam’s razor to the rescue). Once again I invite you to check this with your own simulations, or look at mine here.
- In my previous post I wrote that a lot of metrics aren’t that numerically different between the Bayesian and Frequentist models. Indeed, they closely resemble each other, and in fact, if they didn’t, we would have been worried!

Here’s a nice example: in a large sample, the moments of the Beta distribution in Eq. 5 are almost identical to the CLT approximations of the rate’s estimate and its logarithm in a proportion test. Setting *α, β = 0* for simplicity, one gets:

- Notice that in the Normal-Normal model I presented, the standard deviation *σ₀* doesn’t have a posterior. The reason is that at Wix we don’t have an interest in inferring the standard deviation, and we treat it as a nuisance parameter. If you do need to infer it, I refer you to the Normal-Inverse-Gamma prior.
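Regarding the “nice example” note above: the equation image isn’t reproduced here, but with *α, β = 0* the posterior is *Beta(x, n − x)*, and writing *p̂ = x/n*, its moments (using *ψ(z) ≈ ln z* and *ψ₁(z) ≈ 1/z* for large arguments) reduce to the familiar frequentist quantities:

```latex
\mathbb{E}[P] = \hat{p}, \qquad
\mathrm{Var}(P) = \frac{\hat{p}(1-\hat{p})}{n+1} \approx \frac{\hat{p}(1-\hat{p})}{n}, \qquad
\mathbb{E}[\ln P] \approx \ln \hat{p}, \qquad
\mathrm{Var}(\ln P) \approx \frac{1-\hat{p}}{n\hat{p}}
```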

*This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.*

