Deep neural networks are a flexible family of models with wide applications in AI and other fields. Even though these networks often encompass millions or even billions of parameters, it is still possible to train them effectively using the maximum likelihood principle together with stochastic gradient descent techniques. Unfortunately, this learning procedure only gives us a point estimate of the parameters, and it is hard to endow the model with any sort of prior knowledge about the parameters. Additionally, it is not easy to incorporate stochastic elements into the models, such as samples from a predefined or learned distribution.
Bayesian variational inference provides a natural framework for these issues, since the very idea of Bayesian learning is to infer the shapes of distributions instead of point estimates of parameters. Unfortunately, the added complexity of this approach makes it hard to use in deep neural networks.
In this post, I will try to show how we can overcome these difficulties through an approach known as probabilistic programming. This method allows us to largely automate the process of statistical inference in the models, making it easy to use without having to know all the tricks and intricacies of Bayesian inference in large models.
Introduction to Variational Inference
Variational inference is an essential technique in Bayesian statistics and statistical learning. It was originally developed as an alternative to Monte-Carlo techniques. Like Monte-Carlo, variational inference allows us to sample from and analyze distributions that are too complex to calculate analytically.
In variational inference, we use a distribution which is easy to sample from and adjust its parameters to resemble a target posterior distribution as closely as possible. Surprisingly, we may perform this approximation even though we do not know the target distribution exactly.
More concretely, let us assume that we are given a set of latent (i.e. unobserved) variables Z_1, …, Z_n and some data X. As an example, the data could be images, while the latent variables could represent latent factors in the images, such as whether the image is a portrait, whether it depicts a cat or a dog, or whether it is a photograph or a painting. Since the latent variables are hidden, we don’t know their values in general.
Let us now define a parametrized statistical model including both Z and X as:
P_θ(X, Z) = P_θ(X|Z) P_θ(Z)
In this context, P_θ(Z) is known as the prior, while P_θ(X|Z) is the likelihood of the data given the latents, and θ denotes the model parameters. The usual learning criterion is to maximize the log-evidence:
θ* = argmax_θ log P_θ(X)
Here, P_θ(X) is called the evidence, since it describes the probability of the data (evidence) under the model with parameters θ. It is obtained by marginalizing out the latent variables:
P_θ(X) = ∫ P_θ(X|Z) P_θ(Z) dZ
Unfortunately, this integral is usually intractable even for known values of θ. If we tried to maximize the log-evidence directly, we would have to calculate the integral anew for every value of θ during training.
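To make the evidence integral concrete, here is a minimal numerical sketch on a toy model chosen purely for illustration (an assumption, not a model from this post): a scalar latent Z ~ N(0, 1) and a scalar observation X|Z ~ N(Z, 1). In this special case the evidence has the closed form X ~ N(0, 2), so a naive Monte-Carlo average over prior samples can be checked against it. The function names are hypothetical. For high-dimensional models no such closed form exists and naive sampling from the prior becomes hopelessly inefficient, which is the difficulty described above.

```python
import numpy as np


def log_normal_pdf(x, mu, sigma):
    """Log-density of a univariate normal N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def evidence_monte_carlo(x, num_samples=200_000, seed=0):
    """Estimate P(x) = integral of P(x|z) P(z) dz by averaging P(x|z) over prior samples."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(num_samples)           # z ~ P(Z) = N(0, 1)
    return np.mean(np.exp(log_normal_pdf(x, z, 1.0)))  # average of P(x|z)


def evidence_exact(x):
    """Closed-form evidence of the toy model: X ~ N(0, 2)."""
    return np.exp(log_normal_pdf(x, 0.0, np.sqrt(2.0)))


if __name__ == "__main__":
    x_obs = 1.0
    print("Monte-Carlo estimate:", evidence_monte_carlo(x_obs))
    print("closed form:         ", evidence_exact(x_obs))
```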
During training, we are also interested in calculating the probability of the latent variables given the data, which is given by Bayes' theorem as:
P_θ(Z|X) = P_θ(X|Z) P_θ(Z) / P_θ(X)
This probability is called the posterior in Bayesian literature and the procedure of calculating it is often referred to as inference. Note that this quantity is intractable due to the evidence term in the denominator.
In variational inference, we now define a variational distribution Q(Z) to approximate the posterior for the given data:
Q_φ(Z) ≈ P_θ(Z|X)
We define Q such that we can easily sample from it. In principle, we are free to take any distribution we like; however, the more closely Q can resemble the posterior P(Z|X), the tighter the approximation will be. Notice that Q comes with its own set of parameters φ. During training, we will try to optimize φ and θ to achieve two goals:
- maximize the log-evidence,
- make Q approximate P as closely as possible.
Although this task now seems even more difficult than the original one, I will show you how to efficiently solve it using gradient descent by defining an appropriate loss function.
As mentioned above, our goal is to make Q as close to P as possible while maximizing the log-evidence. We want to perform this optimization iteratively using gradient descent. However, we have yet to find a suitable loss function for our needs. Intuitively, we would like the loss function to include two terms:
- a term that maximizes the log-evidence, even though indirectly,
- some measure of closeness of Q and P.
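It turns out that the Kullback-Leibler divergence (KL-divergence) is a convenient measure of how far apart two distributions are. Note that it is not a true distance, since it is not symmetric in its arguments. For Q and the posterior, it is defined as
KL(Q_φ(Z) || P_θ(Z|X)) = E_{Z~Q_φ}[log Q_φ(Z) − log P_θ(Z|X)]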
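As a small illustration of the KL-divergence as an expectation under Q, the sketch below compares a Monte-Carlo estimate with the known closed form for two univariate normal distributions. The two Gaussians stand in for Q and P only for the sake of the example (in real models the posterior is not available in closed form), and the function names are hypothetical.

```python
import numpy as np


def log_normal_pdf(z, mu, sigma):
    """Log-density of a univariate normal N(mu, sigma^2) evaluated at z."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2


def kl_gaussians_closed_form(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(Q || P) for Q = N(mu_q, sigma_q^2) and P = N(mu_p, sigma_p^2)."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def kl_monte_carlo(mu_q, sigma_q, mu_p, sigma_p, num_samples=200_000, seed=0):
    """Estimate KL(Q || P) = E_{z~Q}[log Q(z) - log P(z)] from samples of Q."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu_q, sigma_q, size=num_samples)
    return np.mean(log_normal_pdf(z, mu_q, sigma_q) - log_normal_pdf(z, mu_p, sigma_p))


if __name__ == "__main__":
    print("closed form KL(Q||P):", kl_gaussians_closed_form(0.0, 1.0, 1.0, 2.0))
    print("Monte-Carlo KL(Q||P):", kl_monte_carlo(0.0, 1.0, 1.0, 2.0))
    # Swapping the arguments gives a different value: the KL-divergence is asymmetric.
    print("closed form KL(P||Q):", kl_gaussians_closed_form(1.0, 2.0, 0.0, 1.0))
```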
Evidence Lower BOund (ELBO)
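By shuffling the terms of this equation around (you can find the precise derivation on Wikipedia), we arrive at the equation
log P_θ(X) = KL(Q_φ(Z) || P_θ(Z|X)) + ELBO(θ, φ)
This equation is significant since it tells us that we can write the intractable log-evidence as the sum of the KL-divergence between Q and the posterior and a term we will call the Evidence Lower BOund (ELBO). Since the KL-divergence is non-negative, the ELBO never exceeds the log-evidence. Maximizing the ELBO with respect to θ therefore raises a lower bound on the log-evidence, while maximizing it with respect to φ shrinks the KL-divergence, because the log-evidence does not depend on φ. Our objective function is thus:
ELBO(θ, φ) = E_{Z~Q_φ}[log P_θ(X, Z) − log Q_φ(Z)]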
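The intermediate steps of that rearrangement are short enough to write out. The following is a sketch in LaTeX, using the subscripts θ and φ for the model and variational parameters (a notational assumption). It starts from the definition of the KL-divergence to the posterior and substitutes Bayes' theorem, log P(Z|X) = log P(X, Z) − log P(X):

```latex
\begin{aligned}
\mathrm{KL}\left(Q_\phi(Z)\,\|\,P_\theta(Z \mid X)\right)
  &= \mathbb{E}_{Z \sim Q_\phi}\left[\log Q_\phi(Z) - \log P_\theta(Z \mid X)\right] \\
  &= \mathbb{E}_{Z \sim Q_\phi}\left[\log Q_\phi(Z) - \log P_\theta(X, Z) + \log P_\theta(X)\right] \\
  &= \log P_\theta(X)
     - \underbrace{\mathbb{E}_{Z \sim Q_\phi}\left[\log P_\theta(X, Z) - \log Q_\phi(Z)\right]}_{\mathrm{ELBO}(\theta,\phi)} \\[4pt]
\Longrightarrow\quad
\log P_\theta(X)
  &= \mathrm{KL}\left(Q_\phi(Z)\,\|\,P_\theta(Z \mid X)\right) + \mathrm{ELBO}(\theta,\phi)
  \;\ge\; \mathrm{ELBO}(\theta,\phi)
\end{aligned}
```

The third line uses the fact that log P(X) does not depend on Z and can therefore be pulled out of the expectation.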
The terms inside the expectation are easy to calculate since we have everything we need: the log-joint is simply the sum of the log-prior and the log-likelihood, and log-Q is tractable by definition. We will look at how to optimize the ELBO in the next section.
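The following sketch estimates the ELBO by Monte-Carlo sampling on a toy conjugate model that is assumed here only for illustration: Z ~ N(0, 1), X|Z ~ N(Z, 1), with a normal variational distribution Q = N(m, s²). For this model the exact log-evidence (X ~ N(0, 2)) and the exact posterior (N(x/2, 1/2)) are known, so we can observe that the ELBO stays below the log-evidence and meets it when Q equals the posterior. All function names are hypothetical.

```python
import numpy as np


def log_normal_pdf(x, mu, sigma):
    """Log-density of a univariate normal N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def elbo_estimate(x, m, s, num_samples=100_000, seed=0):
    """Monte-Carlo estimate of E_{z~Q}[log P(x, z) - log Q(z)] with Q = N(m, s^2)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(m, s, size=num_samples)
    # Log-joint = log-prior + log-likelihood of the toy model.
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
    log_q = log_normal_pdf(z, m, s)
    return np.mean(log_joint - log_q)


def log_evidence_exact(x):
    """Exact log P(x) of the toy model, where X ~ N(0, 2)."""
    return log_normal_pdf(x, 0.0, np.sqrt(2.0))


if __name__ == "__main__":
    x_obs = 1.0
    print("log-evidence:            ", log_evidence_exact(x_obs))
    print("ELBO, Q = N(0, 1):       ", elbo_estimate(x_obs, 0.0, 1.0))
    print("ELBO, Q = exact posterior:", elbo_estimate(x_obs, 0.5, np.sqrt(0.5)))
```

When Q is the exact posterior, every sample yields the same value log P(x), so the estimate has no Monte-Carlo noise; for any other Q the gap to the log-evidence is the KL-divergence between Q and the posterior.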
Side note: Importance-Weighted Lower Bound
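Recently, a new class of log-evidence bounds emerged called the Importance-Weighted Lower Bound (IWLB), defined as
IWLB_K(θ, φ) = E_{Z_1, …, Z_K ~ Q_φ}[log (1/K) Σ_{k=1}^{K} P_θ(X, Z_k) / Q_φ(Z_k)]
This bound is equal to the ELBO for K=1 and is at least as tight as the ELBO for K>1. For more information, please refer to Burda et al. (2015).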
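A minimal sketch of the importance-weighted bound, again on an assumed toy model (Z ~ N(0, 1), X|Z ~ N(Z, 1), Q = N(m, s²)) whose exact log-evidence is known. It averages K importance weights inside the logarithm, using the usual max-subtraction trick for numerical stability. The function names are hypothetical.

```python
import numpy as np


def log_normal_pdf(x, mu, sigma):
    """Log-density of a univariate normal N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def iwlb_estimate(x, m, s, k, num_outer=20_000, seed=0):
    """Estimate E[log (1/K) sum_k P(x, z_k) / Q(z_k)] with z_k ~ Q = N(m, s^2)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(m, s, size=(num_outer, k))
    # Log importance weights: log P(x, z) - log Q(z).
    log_w = (log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
             - log_normal_pdf(z, m, s))
    # Numerically stable log-mean-exp over the K samples of each row.
    max_log_w = log_w.max(axis=1, keepdims=True)
    log_mean_w = max_log_w[:, 0] + np.log(np.mean(np.exp(log_w - max_log_w), axis=1))
    return np.mean(log_mean_w)


def log_evidence_exact(x):
    """Exact log P(x) of the toy model, where X ~ N(0, 2)."""
    return log_normal_pdf(x, 0.0, np.sqrt(2.0))


if __name__ == "__main__":
    x_obs = 1.0
    print("log-evidence:", log_evidence_exact(x_obs))
    for k in (1, 10, 100):
        print(f"IWLB with K={k}:", iwlb_estimate(x_obs, 0.0, 1.0, k))
```

With K=1 the inner average contains a single weight, so the estimator coincides with the plain ELBO estimator; larger K moves the bound closer to the log-evidence.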
Let us now figure out how to estimate the gradient of the ELBO for arbitrary stochastic functions P and Q. In particular, we want to obtain unbiased Monte-Carlo estimates of
∇_{θ,φ} E_{Z~Q_φ}[log P_θ(X, Z) − log Q_φ(Z)]
In other words, we have to get the gradient computation inside the expectation. How do we do this in general, if the distribution over which we take the expectation itself depends on the parameters we differentiate with respect to?
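If Q has a particular form, it turns out that we may circumvent the problem by reparametrizing the distribution, i.e. by writing a sample Z as a deterministic, differentiable function g_φ of a noise variable ε whose distribution p(ε) is fixed. Writing f for the term inside the expectation, we get:
E_{Z~Q_φ}[f(Z)] = E_{ε~p(ε)}[f(g_φ(ε))]
As you can see, the second expectation is taken over a distribution that does not depend on φ anymore. Therefore, we may pull the gradient computation inside the expectation. As an example, consider the reparameterization of the normal distribution:
Z ~ N(μ, σ²)  ⇔  Z = μ + σ·ε with ε ~ N(0, 1)
Since N(0,1) does not depend on any parameters, we may freely differentiate through this operation with respect to μ and σ.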
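The normal-distribution case can be sketched numerically as follows. The example assumes a simple test function f(z) = z², chosen because E[f(Z)] = μ² + σ² has the known gradients 2μ and 2σ; the chain rule is applied by hand instead of using an automatic differentiation library. The function name is hypothetical.

```python
import numpy as np


def reparam_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Reparameterized gradient estimates of E_{z~N(mu, sigma^2)}[z^2] w.r.t. mu and sigma."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(num_samples)  # parameter-free noise, eps ~ N(0, 1)
    z = mu + sigma * eps                    # z = g(eps; mu, sigma)
    df_dz = 2.0 * z                         # derivative of f(z) = z^2
    # Chain rule: dz/dmu = 1 and dz/dsigma = eps.
    grad_mu = np.mean(df_dz * 1.0)
    grad_sigma = np.mean(df_dz * eps)
    return grad_mu, grad_sigma


if __name__ == "__main__":
    g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
    print("estimated gradients:", g_mu, g_sigma)
    print("exact gradients:    ", 2 * 1.5, 2 * 0.5)
```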
Estimating the Gradient of Discrete Distributions
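Unfortunately, this trick does not work for all distributions. In particular, it fails with discrete distributions, since a discrete sample cannot be written as a differentiable function of the parameters. In this case, a general-purpose fallback is the so-called REINFORCE estimator (also known as the score-function estimator). This estimator uses the following identity:
∇_φ Q_φ(Z) = Q_φ(Z) ∇_φ log Q_φ(Z)
Therefore, for a function f that does not itself depend on φ, we may rewrite the expectation gradient as
∇_φ E_{Z~Q_φ}[f(Z)] = E_{Z~Q_φ}[f(Z) ∇_φ log Q_φ(Z)]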
This solves our issue of differentiability and provides us with a Monte-Carlo estimator. Unfortunately, this estimator tends to have a large variance. In some cases, it is not even possible to efficiently calculate the gradient at all. Fortunately, it is often possible to reduce the variance of the estimator, using e.g. the structure of the model or a baseline reduction similar to the one used in policy gradient methods in reinforcement learning. For more details, please refer to .
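A minimal sketch of the REINFORCE estimator and of a constant baseline, under assumptions made only for this example: Q is a Bernoulli distribution with success probability sigmoid(φ), and the function inside the expectation is a hypothetical `reward(z)`. For this choice the score is ∇_φ log Q_φ(z) = z − sigmoid(φ), and the exact gradient is available for comparison. Subtracting a constant baseline leaves the estimator unbiased because the score has zero mean under Q.

```python
import numpy as np


def sigmoid(phi):
    return 1.0 / (1.0 + np.exp(-phi))


def reward(z):
    """Hypothetical function f(z) inside the expectation; z is 0.0 or 1.0."""
    return (z - 0.3) ** 2 + 5.0


def reinforce_samples(phi, baseline=0.0, num_samples=200_000, seed=0):
    """Per-sample REINFORCE estimates of d/dphi E_{z~Bernoulli(sigmoid(phi))}[reward(z)]."""
    rng = np.random.default_rng(seed)
    p = sigmoid(phi)
    z = (rng.random(num_samples) < p).astype(float)  # z ~ Bernoulli(p)
    score = z - p                                    # d/dphi log Q_phi(z)
    return (reward(z) - baseline) * score


def exact_gradient(phi):
    """Exact gradient: p (1 - p) (reward(1) - reward(0))."""
    p = sigmoid(phi)
    return p * (1.0 - p) * (reward(1.0) - reward(0.0))


if __name__ == "__main__":
    plain = reinforce_samples(0.0, baseline=0.0)
    with_baseline = reinforce_samples(0.0, baseline=5.0)
    print("exact gradient:      ", exact_gradient(0.0))
    print("plain estimate:      ", plain.mean(), "per-sample variance:", plain.var())
    print("estimate w/ baseline:", with_baseline.mean(), "per-sample variance:", with_baseline.var())
```

Both variants estimate the same gradient, but the per-sample variance with the baseline is far smaller, which is the effect that baseline-style variance reduction aims for.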
Example: Variational Autoencoder
I will now present an application of the above variational framework: the variational autoencoder (Kingma and Welling, 2013). The variational autoencoder is a directed probabilistic generative model. If you are unfamiliar with the basics of variational autoencoders, have a look at this great tutorial.
At runtime, the variational autoencoder takes a random value sampled from a prior P(Z) and passes it through a neural network called the decoder P(X|Z) with parameters θ to obtain the result, e.g. an image. The number of dimensions in the latent (prior) space as well as the underlying distribution can be varied to suit our dataset.
We may write the joint probability density of the variational autoencoder as
P_θ(X, Z) = ∏_{i=1}^{N} P_θ(X_i|Z_i) P(Z_i)
This means that for every data point X_i we have a point in latent space Z_i. In plain variational inference, we would also require the variational parameters to be different for every data point. For instance, if we used a normal distribution as the variational distribution (as we often do), its mean and variance would be the variational parameters, and they would have a different value for every input data point.
In order to avoid this and learn a single set of parameters, we introduce another neural network, the encoder Q(Z|X), to represent a variational estimate of the posterior P(Z|X). We use the encoder during the training phase to learn how to map the input X to the variational parameters. Thus, we will use a single set of neural network parameters φ to parametrize Q. This is usually referred to as amortized inference, since we amortize the inference cost across the entire dataset.
We learn both the latent space and the reconstruction by maximizing the ELBO in this model. Since we are usually working with a continuous distribution, we can use the reparametrization trick in order to compute the derivatives. The ELBO can be written as
ELBO(θ, φ) = Σ_{i=1}^{N} ( E_{Z_i~Q_φ(Z_i|X_i)}[log P_θ(X_i|Z_i)] − KL(Q_φ(Z_i|X_i) || P(Z_i)) )
The first term corresponds to the negative reconstruction error; for a Gaussian likelihood with fixed variance, it is the negative squared error up to a scale factor and an additive constant. The second term penalizes the distance between the encoder distribution for each data point and the prior.
Furthermore, since the joint distribution P(X,Z) factors over all data points, we may freely use mini-batch sampling as we usually do for feed-forward neural networks. You may also use your favorite optimizer (SGD, Adam etc.) for the optimization procedure. One implementation detail: please be aware that most implementations of optimizers only allow you to minimize a value instead of maximizing as we are doing here. In this case, you can just minimize the negative of the ELBO and you’re good to go.
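The negative ELBO for a mini-batch can be sketched in a few lines. This is only an illustration under several assumptions that are not taken from the post: a standard normal prior, a diagonal Gaussian encoder, a unit-variance Gaussian likelihood (so the reconstruction term is half the squared error, with the additive constant dropped), and hypothetical linear maps `enc_w` and `dec_w` standing in for the encoder and decoder networks. In practice one would compute this inside a framework with automatic differentiation and hand it to an optimizer.

```python
import numpy as np


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def negative_elbo(x, enc_w, dec_w, rng):
    """Single-sample estimate of the negative ELBO, averaged over a mini-batch.

    enc_w: (data_dim, 2 * latent_dim) hypothetical linear encoder weights
    dec_w: (latent_dim, data_dim) hypothetical linear decoder weights
    """
    latent_dim = dec_w.shape[0]
    h = x @ enc_w
    mu, log_var = h[:, :latent_dim], h[:, latent_dim:]  # variational parameters per data point
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps                # reparametrization trick
    x_recon = z @ dec_w                                 # mean of P(x | z)
    # -log P(x|z) for a unit-variance Gaussian, without the additive constant.
    reconstruction = 0.5 * np.sum((x - x_recon) ** 2, axis=-1)
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    return np.mean(reconstruction + kl)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_batch = rng.standard_normal((8, 5))               # 8 data points with 5 features
    enc_w = 0.1 * rng.standard_normal((5, 2 * 2))       # latent dimension 2
    dec_w = 0.1 * rng.standard_normal((2, 5))
    print("negative ELBO:", negative_elbo(x_batch, enc_w, dec_w, rng))
```

Because the loss is an average over independent data points, evaluating it on a random mini-batch gives an unbiased estimate of the average over the full dataset.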
This blog post focused on the applications of variational inference in deep learning. As you hopefully saw, variational inference can be automated to a large degree by solving gradient estimation for a wide variety of distributions. However, one always has to be mindful of the estimator variance, especially when using the REINFORCE estimator. Fortunately, frameworks such as Pyro make variational inference and probabilistic reasoning simple to use and often also take care of variance reduction and other tricks.
Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. “Importance weighted autoencoders.” arXiv preprint arXiv:1509.00519 (2015).
Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.