I want to talk about technical approaches to mitigating algorithmic bias.

It’s 2019, and the majority of the ML community is finally publicly acknowledging the prevalence and consequences of bias in ML models. For years, dozens of reports by organizations such as ProPublica and the New York Times have been exposing the scale of algorithmic discrimination in criminal risk assessment, predictive policing, credit lending, hiring, and more. Knowingly or not, we as ML researchers and engineers have not only become complicit in a broader sociopolitical project that perpetuates hierarchy and exacerbates inequality, but are now actively responsible for disproportionate prison sentencing of black people and housing discrimination against communities of color.

Acknowledgement of this bias cannot be where the conversation ends. I have argued, and continue to argue, that **even the individual ML engineer has direct agency in shaping fairness in these automated systems**. Bias may be a human problem, but *amplification* of bias is a technical problem — a mathematically explainable and controllable byproduct of the way models are trained. Consequently, mitigation of existing bias is also a technical problem: **how, algorithmically, can we ensure that the models we build are not reflecting and magnifying human biases in data?**

Unfortunately, it is not always immediately possible to collect “better training data.” In this post, I give an overview of the following algorithmic approaches to mitigating bias, which I hope can be useful for the individual practitioner who wants to take action:

**adversarial de-biasing**of models through the protection of sensitive attributes,- encoding invariant representations with semi-supervised,
**variational “fair” autoencoders**, **dynamic upsampling of training data**based on learned latent representations, and- preventing disparity amplification through
**distributionally robust optimization.**

I do my best to link to libraries/code/tutorials/resources throughout, but if you’re looking to dive right in with code, the AI 360 toolkit looks like a decent place to start. Meanwhile, let’s get started with the math :).

**If these in-depth educational content is useful for you, you can subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new research updates.**

## I. Adversarial De-biasing

The technique of **adversarial de-biasing** is currently one of the most popular techniques to combat bias. It relies on adversarial training to remove bias from latent representations learned by the model.

Let *Z* be some sensitive attribute that we want to prevent our algorithm from discriminating on, e.g. age or race. It is typically insufficient to simply remove *Z* from our training data, because it is often highly correlated with other features. What we really want is to prevent our model from learning a representation of the input that relies on *Z* in any substantial way. To this end, we train our model to **simultaneously predict the label Y and prevent a jointly-trained adversary from predicting Z**.

The intuition is as follows: if our original model produces a representation of *X* that primarily encodes information about *Z* (e.g. race), an adversarial model could easily recover and predict *Z* using that representation. By the contrapositive, if the adversary fails to recover any information about *Z*, then we must have successfully learned a representation of the input that is *not substantially dependent on our protected attribute.*

We can think of our model as a multi-head deep neural net with one head for predicting Y and another head for predicting Z. We backpropagate normally, except we send back a *negative* signal on the head that predicts Z by using the negative gradient.

Formally, consider *g*(X) to be the shared learned embedding of our input. We let *f* be our prediction function where Y=*f*(*g*(X)), and *a* be our adversary where Z=*a*(*g*(X)). The ultimate goal is for our neural net to **learn a representation g(X) that makes it easy for f to predict Y, but difficult for a to predict Z**. In optimization terms, we want to minimize our prediction loss L_y(

*f*(

*g*(X)), Y) and maximize the adversarial loss L_z(

*a*(

*g*(X)), Z).

To unify these optimizations and control the trade-off between the two, we introduce J_λ, which is the identity function with a negative gradient, that is:

- J(
*g*(X)) =*g*(X), and - dJ/dX = -λ d
*g*(X)/dX.

Using J_λ, where λ is a hyperparameter that determines the tradeoff between model accuracy and removing sensitive information, we can train our neural net with the overall optimization objective:

In practice, it can be easier and more effective to treat the adversary as a separate neural network with its own weights. In this case, we feed the output of the final layer of the predictor network into the input of the adversarial network.

Since the adversary is just trying to minimize its own loss, we update its weights U according to the gradient ∇_uL_A as usual. However, the update expression for the predictor’s weights W gets a little more complicated:

The first term in the expression is the gradient of the predictor’s loss, and is included to minimize prediction loss. The third term is the negative gradient of the adversary *with respect to the predictor’s weights*, and is included to maximize the adversary’s loss. (Note that in this case, α is the hyperparameter that controls the accuracy/de-biasing tradeoff). The novelty of this formulation lies in the projection term in the middle. This term says: remove *all* components of the predictor’s gradient update that would contribute to the adversary. Without this term, it is still possible for the predictor to make updates that still benefit the adversary, as shown in the image below:

Many variants and improvements of the adversarial de-biasing approach exist beyond what I have described here. Just a few days ago, I was chatting to a researcher at CVPR whose proposed model maximizes the *entropy* of the adversary’s predictions as opposed to the loss, i.e. the network seeks to maximally *confuse* the adversary instead of playing a zero-sum game trading off prediction accuracy and bias removal. Depending on your use case, variations such as this may be worth experimenting with as well.

#### Wow, adversarial debiasing is so cool! How can I use it/learn more?

**Code**: Colab tutorial; library.**Papers**: Beutel, 2017; Zhang, 2018.

## II. Variational “Fair” Autoencoders

Another technique that learns a “fair” representation of data is the **Variational Fair Autoencoder (VFAE). **VFAE characterizes “fairness” as a representation that is *explicitly **invariant** with respect to some known aspect of the dataset*. Conceptually, this technique is one of my favorites, because it very neatly ties together ideas in ML bias and the broader field of unsupervised and semi-supervised representation learning. Practically, because it is a semi-supervised approach, it can be especially useful in taking advantage of unlabelled data.

#### Quick VAE refresher

If you are unfamiliar with the variational autoencoder (VAE), this blog post provides an great intuition and Stefano Ermon’s class notes do a thorough job building up the math. However, since the VAE is integral to both this approach and the next one, I want to provide a quick summary here as well.

The VAE is a generative model that learns a *parameterization* of some latent representation of an input. We assume that there is some set of latent variables z that explain x; we have an “encoder” network q(z|x) that produces the μ and σ of a normal distribution from which z can be sampled, and a “decoder” network p(x|z) that attempts to *reconstruct* x from the encoding z. The key here is that we want to learn parameters for the encoder and decoder that maximize the likelihood of our data, but it is often the case that computing the posterior p(z|x) is intractable. As a result, we train our neural nets to maximize a lower bound of p(x) instead:

The first term of this lower bound is our “reconstruction loss,” which evaluates: *how well can our decoder recover the input from the encoding we sampled*? The second term is a KL divergence regularizer that encourages *q* to produce encodings that are similar to the standard normal defined by p(z), which evaluates: *does our encoder produce meaningfully clustered encodings? *Training on both objectives together encourages the model to discover latent spaces that both **accurately capture the semantics of our data** and **generalizes/interpolates beyond our data**.

#### Back to algorithmic fairness: putting the F in VFAE

We begin formulating the VFAE as an unsupervised model which assumes that each datapoint x can be probabilistically generated from two “sources”: an observed s, which represents the variables we want to remove, and a continuous latent variable z that encodes the remaining information.

Intuitively, we can think of the problem of learning an “unbiased representation” as recovering the *underlying, probabilistic sources of information behind this input*, in a way that explicitly separates out the sensitive sources (s) from the invariant ones (z). Note that this method, like the adversarial debiasing method discussed earlier, requires you to *know* and specify what the sensitive sources or features are.

Using the lower bound of the VAE described above, we can optimize the decoder p_θ and encoder q_φ to obtain z, an encoding that captures the most salient, non-sensitive information about x :

Unfortunately, the unsupervised model has a glaring problem: if the sensitive variable *s* and the label *y* are actually correlated, we may end up with *degenerate* representations of z with respect to y —that is, representations that do not actually encode salient information about the label. To solve this problem, we introduce a second layer of variables to try to correlate z with y, consisting of the label y itself and another latent variable z2 that encodes the remaining variation in z that is *not* explained by y. Note that in the following figures and expressions, z is now denoted as z1.

Intuitively, z2 will capture and “explain away” the x-dependent noise, leaving our desired representation z1 solely responsible for *finding an invariant representation of x that can explain the label y.*

The neat thing about “injecting” information about the label during the feature extraction phase is that even if the label isn’t present, we can use the decoder q_φ(y|z1) learned from the data where y *is* present to impute missing data. In other words, this technique extends easily to semi-supervised learning. The full mathematical derivation is a little too long to include here, but can be found in section 2.2 of the original paper.

The final contribution of this approach is using **Maximum Mean Discrepancy** as a further regularizer to ensure “fairness” of the model. While injecting labels is useful for preventing degenerate representations, if the label y happens to be correlated with the sensitive variable s, it is possible for information about that sensitive variable to still “leak” into our representation through y. To address this problem, **we introduce one final penalization term on the lower bound to explicitly encourage z1 to be invariant with respect to s**; specifically, we encourage the distributions q_φ(z1|s = 0) and q_φ(z1|s = 1) to be similar.

This penalization term is the max mean discrepancy (MMD), which measures the squared distance between the *mean embeddings* of two distributions (or their respective feature mappings). The VFAE’s final loss term simply compares the empirical statistics ψ(·) of z1 when s=0 and when s=1.

Formally, our final VFAE model extends the variational autoencoder architecture over both labelled data (x_n, s_n, y_n) and unlabelled data (x_m, s_m), and uses MMD to ensure invariance of our learned representations Z with respect to s:

#### Ok, I’m convinced. How can I dive in with VFAE’s?

**Code**: research code (disclaimer: this is in Theano, which at this point just hurts); Jupyter notebook (tensorflow).**Paper**: Louizos, 2017.

## III. Dynamic Upsampling of Training Data

Our first two techniques both modify the *learned representation *of models. They use outcomes — the ability to either predict or correlate with sensitive attributes — to regularize the learned representation of the data. This next approach, known as the **Debiasing Variational Autoencoder, **actually *uses learned* representations to rebalance the training data. Its premise is quite intuitive: since many modern ML systems fail on certain demographics due to a lack of appropriate representation in training data, let the model **learn which inputs come from under-represented groups and sample those inputs more frequently during training.**

The greatest advantage of this approach is that unlike the first two, it does not require you to know or specify the sensitive attributes in your data; *the model will learn them automatically as it trains*. As a result, the model will also be free to learn more complex and nuanced sources of “under-representation” than a human annotator could easily specify. In facial recognition, for example, it may be easy for humans to identify which segments of the population are under-represented in training data, but it is a lot harder to specify which poses or facial expressions are featured too infrequently to predict with good accuracy.

The Debiasing Variational Autoencoder proposes a very simple fix: **if a datapoint is in an under-represented region of the latent space, upsample it in training**. There are two components to this approach we need to understand: 1) the modified VAE model architecture and 2) the data re-weighting algorithm.

The VAE has a standard encoder q_φ(z|x) to approximate the latent variables, except instead of producing the standard *2k* activations for *k* latent variables (i.e. a μ and σ for each latent variable), it produces *2k+d* outputs where *d* is the dimensionality of the label y. In other words, **the VAE encoder simultaneously performs the original prediction task and learns the latent representations of the training data**. We also learn a standard decoder p_θ(x|z) that reconstructs the input from the learned latent variables.

Altogether, our network is trained with 3 losses: a supervised loss for the prediction task, a reconstruction loss for the reconstructed input, and a KL divergence loss for the latent variables.

The overall architecture looks something like this:

Now, to upweight training points that are underrepresented in the latent space, we need to figure out which data points have uncommon or infrequently occurring latent representations. To this end, we estimate the *frequency distribution* of variables in the latent space (z). Notice that given a dataset X, we can approximate that distribution with a histogram Q_hat(z|X). This histogram can become very large in dimension as the number of latent variables *k* increases, so we further approximate this joint distribution with independent histograms for each latent z_i:

For an intuitive interpretation of the histogram Q_i(z_i|X), recall that each latent variable z_i is parameterized by a μ_i and σ_i. Further recall that since z_i is conditioned on x, every input x has its own mean and standard deviation for that latent variable. The histogram Q_i counts up the frequencies of those means and standard deviations *across the entire dataset*. Given any input x, if Q_i(z_i(x)|X) is small, we know it usually takes on uncommon values of z_i. If Q_j(z_j(x)|X) is small across all (or many) latent variables z_j, we know x has an uncommon overall latent representation*.*

Using these histograms, we can make the probability of selecting a datapoint x for training **inversely proportional to the frequency of its representation in the latent space:**

The model will eventually learn to assign a small sample probability W(z(x)|X) to standard images, and assign a larger sample probability to non-standard ones:

#### Great, I’m ready to dynamically re-balance all my ML datasets.

**Code**: unfortunately I could not find the code for this anywhere :(, but I have emailed the authors of the paper to ask. Meanwhile, this algorithm is another example of data re-weighting, except it is done in the pre-processing stage using class labels only (as opposed to dynamically at train-time using latent representations).**Paper:**Amini, 2019.

## IV. Distributionally Robust Optimization

Our final bias mitigation strategy directly modifies the optimization objective. It proposes the use of **distributionally robust optimization (DRO)**, which essentially **minimizes the worst-case loss of each group in the dataset**, in place of the status quo of empirical risk minimization (ERM), which minimizes average loss in the dataset. It is probably the most general of all approaches described: it is model-agnostic, and does not require us to know the identity of the protected group(s) nor their proportions in the dataset.

Before we begin, we need to clarify a few terms that drive the motivation behind DRO. *Representation disparity* refers to the phenomenon in which a model achieves high overall accuracy but low minority accuracy. In our running example of facial recognition, this occurs when the recognition engine performs well overall but performs poorly for certain minority groups. *Disparity amplification*, on the other hand, refers to the phenomena in which representation disparity is somehow amplified over time through a positive feedback loop in the ML model. For example, minority groups might stop using the facial recognition engine due to poor performance, and thus stop contributing data. Similarly, officers using predictive policing algorithms likely disproportionately target black neighborhoods from the start and thus catch more crime in those neighborhoods, reinforcing the model’s initial (historical and human) bias.

The argument behind this approach is that representation disparity occurs as a result of optimizing for average loss. Instead, by optimizing for a worst-case group, DRO can succeed in examples when ERM becomes unfair and even prevent disparity amplification in such models.

Formally, we consider a population with K latent groups, where each group k makes up a proportion ⍺_k of the population and has an underlying distribution P_k. We assume we know neither the proportion nor underlying distribution of any demographic. Each input Z *~ *P_k is queried with some loss, and the “risk” of each group — denoted R_k(θ) — is the expected loss over those inputs Z.

The goal of DRO is to control the worst-case risk over all K groups, which is:

Notice, however, that we cannot directly optimize using R_max, because we do not know the identity — and thus the underlying distribution P_k — of any group in our dataset. This leaves us with an obvious question: how do we minimize the worst-case risk over each group when we do not know what the distributions of these groups look like?

The answer is that we simply consider the worst-case loss over *all* possible distributions that are reasonably close to our observed, overall population distribution. Specifically, we consider all distributions *Q* which are within some chi-squared radius *r* of the empirical distribution P:

Since we also do not know the proportion of the worst-off group, we choose this radius *r* based on the smallest minority proportion ⍺_*min* that we are willing to consider. It can be shown that the maximum robustness radius we need to consider would be r*_max* = (1/⍺_*min –*1)².

Intuitively, we are now optimizing for the **worst-case underlying distribution(s)** that make up at least **⍺_ min percent of the known overall distribution P.**

But what does any of this look like in terms of behavior of the optimization process? It can be shown that under certain constraints, R_dro(θ;r*_max*) is equivalent to the following (far more useful) expression:

where C=(2*r_*max*+1)^(1/2), and η is a hyperparameter. This expression makes the behavior of DRO far more apparent: **all data points that suffer less than η levels of loss are ignored, while data points with a loss much larger than η are upweighted due to the squared term.** The following figure illustrates this behavior with using DRO in practice:

From panel (a) of this figure, we see our dataset is a mixture of two underlying distributions, one of which heavily outnumbers the other. The ERM estimate, which optimizes for average loss, will naturally just optimize for the majority group and produce a θ_erm that is completely unsuitable for the other distribution. Meanwhile, if we *knew* which data points belonged to which groups and could optimize for the worst-case risk in each group using R_*max*, we would end up with a θ_fair that is equidistant to both groups. Without any knowledge of group membership, **DRO simply ignores all examples with a loss smaller than η* and upweights high-loss examples on the tail end of all distributions**. As a result, DRO finds an optimal θ_dro* that is near θ_fair and is only slightly skewed towards the majority group.

Tangibly, to train any neural net using DRO, we simply choose a value of η and compute the optimal θ:

Since η is a hyperparameter, we can perform a binary search over possible values to find the global optimum.

#### I’m ready to (empirically) champion the underdog with DRO!

**Code**: Codalab Worksheet (shout-out to Percy for making his grad students stay on top of reproducible research).**Paper**: Hashimoto, 2018.

## Conclusion

I hope these algorithms I’ve described can be useful, but I also want to emphasize that **algorithmic fairness cannot be the end of the story**. Even the best technology serves as a means to an end, and in a world that is the product of centuries of structural inequality, oppression, and exploitation, the question of “*whose *end” tends not to yield surprising answers. **Just because an algorithm is fair does not mean it is used fairly**; it is our responsibility to critically interrogate who benefits from these technologies at whose costs, and to be vocal about how our models should or should not be used. To this end, I highly encourage everyone to check out the following:

- “Model Cards for Model Reporting” (2019), a proposal for a standardized short document to accompany models detailing their intended usage, evaluation results, ethical considerations and caveats, etc.,
- the AI Now Institute’s proposed Algorithmic Impact Assessment framework (2018),
- a New Yorker analysis of the nation’s very first bill regulating “automated decision systems” in NYC (2017), and the legal, political, and corporate stumbling blocks it ran into, and
- the “Algorithmic Accountability Act” proposed to Senate in April 2019, and a corresponding opinion article discussing potential shortcomings.

*This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.*

## Enjoy this article? Sign up for more AI research updates.

We’ll let you know when we release more summary articles like this one.

## Leave a Reply

You must be logged in to post a comment.