The post AAAI 2021: Top Research Papers With Business Applications appeared first on TOPBOTS.

To help you stay aware of the prominent AI research breakthroughs, we’ve summarized some of the most interesting AAAI 2021 research papers introduced by Google, Alibaba, Baidu, and other leading research teams.

If you’d like to skip around, here are the papers we featured:

- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
- TabNet: Attentive Interpretable Tabular Learning
- Train a One-Million-Way Instance Classifier for Unsupervised Visual Representation Learning
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs
- Reinforced Imitative Graph Representation Learning for Mobile User Profiling: An Adversarial Training Perspective

**If this in-depth educational content is useful for you, subscribe to our AI research mailing list to be alerted when we release new material.**

Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, such as quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a *ProbSparse* Self-attention mechanism, which achieves *O(L log L)* in time complexity and memory usage, and has comparable performance on sequences’ dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.

The current Transformer architectures are inefficient for long sequence time-series forecasting (LSTF), where a model needs to learn long-range input-output dependencies and also to offer inference speeds feasible for predicting more steps into the future (e.g., 480 points for hourly temperature records over 20 days). To make the architecture feasible for long sequential inputs, the authors proposed the **ProbSparse Self-attention** mechanism with *O(L log L)* complexity rather than *O(L²)* complexity, where *L* is the length of the sequence. A self-distillation method is proposed to scale the network efficiently for better accuracy, with *O((2 − ε)L log L)* complexity in place of the *O(J · L²)* complexity of a regular Transformer, where *J* is the number of transformer layers. A generative-style decoder is adapted to increase inference speeds compared to a step-by-step prediction of every point in the output. The proposed method is shown to perform better than existing methods in five real-world datasets for tasks including predicting ETT (Electricity Transformer Temperature), ECL (Electricity Consuming Load), and Weather.

- **ProbSparse Self-attention** is proposed to take advantage of the sparsity, or long-tail distribution, of self-attention probabilities, where only a few key-query attention weights drive the majority of the computation. ProbSparse achieves *O(L log L)* complexity, improving on *O(L²)*.
- To scale the model by stacking transformer layers, the authors proposed a self-distillation technique using convolution and max-pooling operations so that the output size of the current layer (i.e., the input size of the next layer) is less than the input size of the current layer. This achieves *O((2 − ε)L log L)* complexity, compared to *O(J · L²)* for a general Transformer.
- Finally, to make the inference speeds scalable, a generative-style decoder is proposed to predict multiple points into the future in a single forward pass.
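As a rough illustration of the first point, here is a minimal NumPy sketch of the query-selection idea (not the authors' implementation, which also subsamples keys to reach the stated complexity): each query is scored by the max-minus-mean of its attention logits, only the top-*u* queries attend normally, and the remaining "lazy" queries fall back to the mean of the values.

```python
import numpy as np

def probsparse_attention(Q, K, V, u):
    """Toy sketch of ProbSparse self-attention (full logits, for clarity).

    Queries are ranked by the sparsity measurement M(q, K) = max(logits) - mean(logits);
    only the top-u queries attend normally, the rest output the mean of V.
    The real Informer also subsamples keys to reach O(L log L); this sketch does not.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                   # (L_q, L_k)
    M = logits.max(axis=1) - logits.mean(axis=1)    # sparsity score per query
    top = np.argsort(M)[-u:]                        # indices of "active" queries
    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))  # lazy queries -> mean of values
    w = np.exp(logits[top] - logits[top].max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # softmax over keys
    out[top] = w @ V
    return out

rng = np.random.default_rng(0)
L, d = 8, 4
Q, K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d)), rng.normal(size=(L, d))
out = probsparse_attention(Q, K, V, u=3)
```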

- The proposed method achieves superior performance on five real-world datasets for both univariate and multivariate long sequence time-series forecasting for tasks such as predicting ETT (Electricity Transformer Temperature), ECL (Electricity Consuming Load), and Weather.

- The paper received the Outstanding Paper Award at AAAI 2021.

- The proposed approach can be used to predict long sequences, including energy consumption, weather indicators, stock prices, etc.

- The original PyTorch implementation of this paper is available on GitHub.

We propose a novel high-performance and interpretable canonical deep tabular data learning architecture, TabNet. TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning as the learning capacity is used for the most salient features. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of non-performance-saturated tabular datasets and yields interpretable feature attributions plus insights into the global model behavior. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, significantly improving performance with unsupervised representation learning when unlabeled data is abundant.

The Google Cloud AI team addresses the problem of applying deep neural networks for tabular data. While deep neural networks shine at automatic extraction of features and end-to-end learning, the lack of inductive bias for modeling the output decision boundaries that are prevalent in tabular data and the lack of interpretability limit the widespread adoption of deep neural networks for tabular data. The authors devise a sequential attention mechanism to select a subset of features to process at each step. This improves learning efficiency and interpretability by demonstrating the reasoning at each step, similarly to a decision tree. The feature selection is done for each instance to increase the model performance with more data. Unsupervised pre-training is also used to increase the performance with a task of predicting the masked values at different rows of different columns. The proposed TabNet model performs better than or on par with the standard methods for tabular data while eliminating the feature selection and feature engineering steps.

- Devising a sequential attention mechanism that attends to only a subset of features while masking the others at each step before processing. This helps in efficient learning, as the model processes only salient features, and also with interpretability, as the reasoning steps could be analyzed based on the selected features.
- Unsupervised pre-training is shown to be useful in increasing the performance of the model by predicting masked values. This increased performance is out of reach for traditional ML models, as they cannot be pre-trained in an unsupervised way.
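A hedged sketch of the masked-cell pretraining idea (function and variable names here are hypothetical, and NumPy stands in for the paper's actual implementation): hide random cells of a tabular batch and keep a mask telling the reconstruction loss which cells to score.

```python
import numpy as np

def mask_cells(X, p_mask, rng):
    """Sketch of TabNet-style self-supervised pretraining targets:
    randomly hide individual cells; the model is then trained to
    reconstruct the hidden values from the visible ones."""
    mask = rng.random(X.shape) < p_mask   # True where a cell is hidden
    X_visible = np.where(mask, 0.0, X)    # masked input fed to the encoder
    return X_visible, mask                # a reconstruction loss scores X[mask]

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 5))              # a toy tabular batch
X_visible, mask = mask_cells(X, p_mask=0.25, rng=rng)
```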

- Experiments show that the proposed method, TabNet, performs as well as or better than established tabular data models on five real-world datasets while dealing with the interpretability concerns.

- The approach can be useful for any applications working with tabular data, which is likely the most common data type in real-world machine learning applications.

- The PyTorch implementation of this paper is available on GitHub.

This paper presents a simple unsupervised visual representation learning method with a pretext task of discriminating all images in a dataset using a parametric, instance-level classifier. The overall framework is a replica of a supervised classification model, where semantic classes (e.g., dog, bird, and ship) are replaced by instance IDs. However, scaling up the classification task from thousands of semantic labels to millions of instance labels brings specific challenges including 1) the large-scale softmax computation; 2) the slow convergence due to the infrequent visiting of instance samples; and 3) the massive number of negative classes that can be noisy. This work presents several novel techniques to handle these difficulties. First, we introduce a hybrid parallel training framework to make large-scale training feasible. Second, we present a raw-feature initialization mechanism for classification weights, which we assume offers a contrastive prior for instance discrimination and can clearly speed up convergence in our experiments. Finally, we propose to smooth the labels of a few hardest classes to avoid optimizing over very similar negative pairs. While being conceptually simple, our framework achieves competitive or superior performance compared to state-of-the-art unsupervised approaches, i.e., SimCLR, MoCoV2, and PIC under ImageNet linear evaluation protocol and on several downstream visual tasks, verifying that full instance classification is a strong pretraining technique for many semantic visual tasks.

Unsupervised representation learning has proved beneficial when we have a lot of data but few labels or when the task is not fully defined yet. The Alibaba research team addresses the problem of seamless unsupervised representation learning without the need to create negative pairs or new objective functions. The proposed method treats unsupervised representation learning as an instance-level supervised classification task, implying that all the images are assigned a unique class and an *n*-way classification model is trained, where *n* is the total number of images in the dataset. The authors also proposed novel techniques to deal with this large-scale classification task, including model parallel techniques for softmax computation, a technique to induce a contrastive prior, and a technique to smooth the ground truths of very similar negative classes. The method outperforms previous state-of-the-art models for unsupervised representation learning such as SimCLR and PIC.

- Treating unsupervised representation learning as a large-scale instance-level classification task.
- Proposing novel techniques to handle large-scale classification tasks:
- introducing a hybrid parallel training framework for computing softmax operation on different devices;
- inducing a contrastive prior by presenting a raw-feature initialization mechanism for classification weights (i.e., the weights are initialized with the instance features that were extracted by running an inference epoch, where the model is a fixed random neural network with only batch-normalization layers being trained);
- smoothing the ground truths of very similar negative classes.
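The last technique can be sketched as follows (a toy NumPy version with made-up numbers, not the paper's code): spread a small amount of target mass over the *k* negative classes with the highest logits, i.e., the ones most easily confused with the true instance.

```python
import numpy as np

def smoothed_instance_targets(logits, true_idx, k, eps):
    """Sketch of instance-level targets with smoothing over the k hardest
    (most similar) negative classes: the true instance keeps 1 - eps of the
    target mass, and eps is spread over the k negatives with the highest logits."""
    n = logits.shape[0]
    target = np.zeros(n)
    target[true_idx] = 1.0 - eps
    neg = logits.copy()
    neg[true_idx] = -np.inf               # exclude the positive itself
    hardest = np.argsort(neg)[-k:]        # k most confusable negatives
    target[hardest] += eps / k
    return target

logits = np.array([2.0, 9.0, 8.5, 0.1, 7.9])  # instance 1 is the true class
t = smoothed_instance_targets(logits, true_idx=1, k=2, eps=0.1)
```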

- This work devised a new, simple and efficient method for unsupervised representation learning without the use of negative pairs in class-level contrastive learning or large batch sizes to mitigate data leakage in instance-level contrastive learning.

- This method could be used to cluster unlabeled images, which in turn facilitates similar image search and image tagging for image archival systems.

We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured knowledge obtained from scene graphs to learn joint representations of vision-language. ERNIE-ViL tries to build the detailed semantic connections (objects, attributes of objects and relationships between objects) across vision and language, which are essential to vision-language cross-modal tasks. Utilizing scene graphs of visual scenes, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction and Relationship Prediction tasks in the pre-training phase. Specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can learn the joint representations characterizing the alignments of the detailed semantics across vision and language. After pre-training on large-scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performances on all these tasks and ranks first on the VCR leaderboard with an absolute improvement of 3.7%.

In this work, the Baidu research team tried to solve the alignment of semantic concepts in visual and linguistic space so that the models perform better at multi-modal tasks that require common sense physical reasoning (e.g., visual commonsense reasoning and visual question answering). The authors aimed at giving the models more structured knowledge about the scenes by pre-training the models to explicitly predict objects, their attributes, and object-object relations. With an image and a corresponding text, instead of masking and predicting random tokens in the text, the authors used scene-graph parsing and masked tokens that specifically represented objects, their attributes, and object-object relations. The model was pre-trained to predict the masked tokens in the text given an image. The introduced approach achieved state-of-the-art results in multi-modal datasets for text retrieval and image retrieval and also ranked first place on the VCR task leaderboard with an improvement of 3.7% compared to the next best solution.

- Similar to BERT-like masked language modeling, image captioning models are trained to predict the masked tokens in an image caption given the image and other tokens. The core idea of this paper is to selectively mask the tokens rather than masking them randomly.
- In this approach, only the tokens that represent semantically rich entities such as objects, attributes of an object, and object-object relations are masked. This achieves better semantic alignment between text and images as all the learning focuses on semantically rich tokens in an image caption.
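A minimal sketch of the selective masking step, assuming a scene-graph parser has already produced the node tokens (the caption and node set below are hypothetical examples, not from the paper):

```python
def selective_mask(tokens, scene_graph_nodes, mask_token="[MASK]"):
    """Sketch of ERNIE-ViL-style masking: instead of masking random tokens,
    mask only the tokens that appear as nodes in the caption's scene graph
    (objects, attributes, relationships)."""
    return [mask_token if tok in scene_graph_nodes else tok for tok in tokens]

caption = "a black cat sits on the mat".split()
# Hypothetical scene-graph nodes parsed from the caption:
nodes = {"cat", "mat", "black", "sits"}  # objects, attribute, relationship
masked = selective_mask(caption, nodes)
# -> ['a', '[MASK]', '[MASK]', '[MASK]', 'on', 'the', '[MASK]']
```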

- Getting better grounding of the semantic textual entities in visual space.
- Achieving state-of-the-art results in image/text retrieval and visual common sense reasoning tasks.

- Incorporating scene graphs extracted from images into cross-modal pretraining.
- Using graph neural networks for representing images and text.

- Better alignment of semantic concepts would give better results for image retrieval with text, image captioning, visual question answering, and forecasting future actions.

In this paper, we study the problem of mobile user profiling, which is a critical component for quantifying users’ characteristics in the human mobility modeling pipeline. Human mobility is a sequential decision-making process dependent on the users’ dynamic interests. With accurate user profiles, the predictive model can perfectly reproduce users’ mobility trajectories. In the reverse direction, once the predictive model can imitate users’ mobility patterns, the learned user profiles are also optimal. Such intuition motivates us to propose an imitation-based mobile user profiling framework by exploiting reinforcement learning, in which the agent is trained to precisely imitate users’ mobility patterns for optimal user profiles. Specifically, the proposed framework includes two modules: (1) a representation module, which produces a state combining user profiles and spatio-temporal context in real time; (2) an imitation module, where a Deep Q-network (DQN) imitates the user behavior (action) based on the state produced by the representation module. However, there are two challenges in running the framework effectively. First, the epsilon-greedy strategy in DQN makes use of the exploration-exploitation trade-off by randomly picking actions with probability epsilon. Such randomness feeds back into the representation module, making the learned user profiles unstable. To solve the problem, we propose an adversarial training strategy to guarantee the robustness of the representation module. Second, the representation module updates users’ profiles in an incremental manner, requiring integration of the temporal effects of user profiles. Inspired by Long Short-Term Memory (LSTM), we introduce a gated mechanism to incorporate new and old user characteristics into the user profile.

Better mobile user profiling that could accurately predict where a user goes next would help in better personalization of virtual assistant features and ads for relevant services, among other use cases. Modeling user behavior from past data and achieving mobile user profiling presents a lot of challenges, including the dynamic interests of users changing over time and the difficulty of modeling the spatio-temporal context of mobility in real time. This work addresses the problem of mobile user profiling by building a reinforcement learning (RL)-powered agent that imitates a user’s decisions, i.e., predicts their next steps accurately. The key intuition is that accurate user profiles enable accurate prediction of a user’s future behavior, and, conversely, an agent that accurately imitates a user’s behavior must have learned an accurate profile of that user. The proposed method achieves superior results compared to existing methods on two large-scale real-world datasets collected from New York and Beijing.

- To predict future user behavior, the authors introduce an RL-powered imitation learning method termed **Reinforced Imitative Graph Representation Learning (RIRL)**. Imitation learning is achieved using adversarial training, where a generator (the imitating agent) predicts the user behavior and a discriminator tries to learn to distinguish which behavior is predicted by the generator and which comes from real-world data. Once the generator and discriminator are trained, the imitating agent predicts future user behavior accurately.
- Graph neural networks are used to represent the spatio-temporal nature of mobile user behavior, which is better than encoding it as a sequence or just a list of visited places.
- A Long Short-Term Memory (LSTM)-inspired RNN variant is devised and used to model the dynamic nature of user interests with a gating mechanism to retain only the relevant information from the past. The state vector for the RL imitation agent is generated with representations from this RNN variant and graph neural networks.
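The gated update can be sketched roughly as follows (a toy NumPy version; the gate parameterization below is an assumption for illustration, not the paper's exact formulation): a sigmoid gate decides, per dimension, how much of the old profile to keep versus how much new behavioral evidence to absorb.

```python
import numpy as np

def gated_profile_update(old_profile, new_evidence, W_g, b_g):
    """Sketch of an LSTM-inspired gated update for user profiles:
    a sigmoid gate blends the old profile with new behavioral evidence,
    retaining only the relevant information from the past."""
    z = W_g @ np.concatenate([old_profile, new_evidence]) + b_g
    g = 1.0 / (1.0 + np.exp(-z))                 # gate in (0, 1) per dimension
    return g * old_profile + (1.0 - g) * new_evidence

rng = np.random.default_rng(0)
d = 4
old = rng.normal(size=d)
new = rng.normal(size=d)
W_g, b_g = rng.normal(size=(d, 2 * d)), np.zeros(d)
profile = gated_profile_update(old, new, W_g, b_g)
```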

- Better mobile user profiling by predicting future user behavior using an RL-powered imitation agent trained adversarially.
- Better results than existing methods on multiple real-world datasets.

- Accurately predicting where a person will go next opens up an interesting set of business applications like:
- recommending offers, restaurants, or services based on the location;
- better personalized virtual assistant features;
- automation of useful tasks through IoT devices at home just before the user comes home.

We’ll let you know when we release more summary articles like this one.


The post A Comprehensive Introduction to Bayesian Deep Learning appeared first on TOPBOTS.

- Preamble
- Neural Network Generalization
- Back to Basics: The Bayesian Approach
- How to Use a Posterior in Practice?
- Bayesian Deep Learning
- Back to the Paper
- Final Words

Bayesian (deep) learning has always intrigued and intimidated me. Perhaps because it leans heavily on probabilistic theory, which can be daunting. I noticed that even though I knew basic probability theory, I had a hard time understanding and connecting that to modern Bayesian deep learning research. The aim of this blogpost is to bridge that gap and provide a comprehensive introduction.

Instead of starting with the basics, I will start with an incredible NeurIPS 2020 paper on Bayesian deep learning and generalization by Andrew Wilson and Pavel Izmailov (NYU) called Bayesian Deep Learning and a Probabilistic Perspective of Generalization. This paper serves as a tangible starting point in which we naturally encounter Bayesian concepts in the wild. I hope this makes the Bayesian perspective more concrete and speaks to its relevance.


I will start with the paper abstract and introduction to set the stage. As we encounter Bayesian concepts, I will zoom out to give a comprehensive overview with plenty of intuition, both from a probabilistic as well as ML/function approximation perspective. Finally, and throughout this entire post, I’ll circle back to and connect with the paper.

I hope you will walk away not only feeling at least slightly Bayesian, but also with an understanding of the paper’s numerous contributions, and generalization in general.

If your Bayesian theory is a bit rusty, the abstract might seem rather cryptic. The first two sentences are of particular importance to our general understanding of Bayesian DL. The middle part presents three technical contributions. The last two highlighted sentences provide a primer on new insights into mysterious neural network phenomena. I’ll cover everything, but first things first: the paper’s introduction.

An important question in the introduction is how and why neural networks generalize. The authors argue that

“From a probabilistic perspective, generalization depends largely on two properties, the support and the inductive biases of a model.”

Support is the *range of dataset classes* that a model can support. In other words: the range of functions a model can represent, where a function tries to represent the data-generating process. The inductive bias defines how **good** a model class is at fitting a specific dataset class (e.g., images, text, numerical features). The authors call this, quite nicely, the “distribution of support”. In other words, model class performance (~inductive bias) distributed over the range of all possible datasets (support).

Let’s look at the examples the authors provide. A linear function has truncated support as it cannot even represent a quadratic function. An MLP is highly flexible but distributes its support across datasets too evenly to be interesting for many image datasets. A convolutional neural network exhibits a good balance between support and inductive bias for image recognition. Figure 2a illustrates this nicely.

The vertical axis represents what I naively explained as “how good a model is at fitting a specific dataset”. It actually is *Bayesian evidence*, or *marginal likelihood*; our first Bayesian concept! We’ll dive into it in the next section. Let’s first finish our line of thought.

A good model not only needs a large support to be **able** to represent the true solution, but also the right inductive bias to actually **arrive** at that solution. The *Bayesian posterior*, think of it as our model for now, should contract to the right solution due to the right inductive bias. However, the prior hypothesis space should be broad enough such that the true model is functionally possible (broad support). The illustrations below demonstrate this for the three example models. From left to right we see a CNN in green, linear function in purple, and MLP in pink.

At this point in the introduction, similar to the first sentence of the abstract, the authors stress that

”The key distinguishing property of a Bayesian approach is marginalization instead of optimization, where we represent solutions given by all settings of parameters weighted by their posterior probabilities, rather than bet everything on a single setting of parameters.”

The time is ripe to dig into marginalization vs. optimization, and broaden our general understanding of the Bayesian approach. We’ll touch on terms like the posterior, prior, and predictive distribution, the marginal likelihood and Bayesian evidence, Bayesian model averaging, Bayesian inference, and more.

We can find claims about marginalization being at the core of Bayesian statistics everywhere, even in Bishop’s ML bible, Pattern Recognition and Machine Learning. The opposite of the Bayesian perspective is the frequentist perspective. This is what you encounter in most machine learning literature. It’s also easier to grasp. Let’s start there.

The frequentist approach to machine learning is to *optimize* a loss function to obtain an optimal setting of the model parameters. An example loss function is cross-entropy, used for classification tasks such as object detection or machine translation. The most commonly used optimization techniques are variations on (stochastic) gradient descent. In SGD the model parameters are iteratively updated in the direction of the steepest descent in loss space. This direction is determined by the gradient of the loss with respect to the parameters. The desired result is that for the same or similar inputs, this new parameter setting causes the output to more closely represent the target value. In the case of neural networks, gradients are often computed using a computational trick called backpropagation.
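As a toy example of this recipe, here is plain gradient descent minimizing the quadratic loss L(w) = (w − 3)², where the gradient has a closed form:

```python
# Minimal gradient descent on L(w) = (w - 3)^2, illustrating the frequentist
# recipe: follow the negative gradient to a single optimal parameter setting.
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2.0 * (w - 3.0)  # dL/dw
    w -= lr * grad          # steepest-descent update
# w ends up very close to the minimizer, 3.0
```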

From a probabilistic perspective, frequentists are trying to **maximize** the *likelihood* *p(D | w, M)*. In plain English: to pick our parameters *w* such that they maximize the probability of the observed dataset *D* given our choice of model *M* (Bishop, Chapter 1.2.3). The model *M* is often left out for simplicity. From a probabilistic perspective, a (statistical) model is simply a probability distribution over data (Bishop, Chapter 3.4). For example: a language model outputs a distribution over a vocabulary, indicating how likely each word is to be the next word. It turns out this frequentist way of **maximum likelihood estimation** (**MLE**) to obtain, or “train”, predictive models can be viewed from a larger Bayesian context. In fact, MLE can be considered a special case of maximum a posteriori estimation (MAP, which I’ll discuss shortly) using a uniform prior.
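A tiny worked example: for coin flips under a Bernoulli model, maximizing the log-likelihood over a grid of parameter values recovers the empirical frequency of heads, which is the well-known closed-form MLE.

```python
import numpy as np

# For a Bernoulli model p(x | w) = w^x (1 - w)^(1 - x), the maximum
# likelihood estimate of w is simply the empirical frequency of 1s.
data = np.array([1, 0, 1, 1, 0, 1, 1, 0])

w_grid = np.linspace(0.001, 0.999, 999)
log_lik = data.sum() * np.log(w_grid) + (len(data) - data.sum()) * np.log(1 - w_grid)
w_mle = w_grid[np.argmax(log_lik)]  # matches data.mean()
```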

A crucial property of the Bayesian approach is to realistically **quantify** **uncertainty**. This is vital in real world applications that require us to trust model predictions. So, instead of a parameter **point estimate,** a Bayesian approach defines a full probability distribution over parameters. We call this the *posterior distribution*. The posterior represents our **belief/hypothesis/uncertainty** about the value of each parameter (setting). We use **Bayes’ Theorem** to compute the posterior. This theorem lies at the heart of Bayesian ML — hence the name — and can be derived using simple rules of probability.

We start with specifying a *prior distribution* *p(w)* over the parameters to capture our **belief** about what our model parameters should look like **prior to** observing any data.

Then, using our dataset *D*, we can **update** (multiply) our prior belief with the *likelihood* *p(D | w)*. This likelihood is the same quantity we saw in the frequentist approach. It tells us how well the observed data is explained by a specific parameter setting *w*. In other words: how good our model is at *fitting* or *generating* that dataset. The likelihood is a function of our parameters *w*.

To obtain a valid posterior probability **distribution**, however, the product between the likelihood and the prior must be evaluated for each parameter setting and normalized. This means **marginalizing** (summing or integrating) over **all** parameter settings. The normalizing constant is called the *Bayesian (model) evidence* or *marginal likelihood* *p(D)*.
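Written out, Bayes’ Theorem for the parameters *w* given a dataset *D* reads, with the evidence defined by marginalization over all parameter settings:

```latex
p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})},
\qquad
p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw
```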

These names are quite intuitive, as *p(D)* provides **evidence** for how good our model (i.e., how likely the data) is *as a whole*. With “model as a whole” I mean taking into account **all possible** parameter settings. In other words: marginalizing over them. We sometimes explicitly include the model choice in the evidence as *p(D | M)*. This enables us to compare different models with different parameter spaces. In fact, this comparison is exactly what happens in the paper when comparing the support and inductive bias of a CNN, MLP and linear model!

We’ve now arrived at the core of the matter. *Bayesian inference* is the learning process of finding (inferring) the posterior **distribution** over *w*. This contrasts with trying to find the **optimal** *w* using optimization through differentiation, the learning process for frequentists.

As we now know, to compute the full posterior we must **marginalize** over the whole parameter space. In practice this is often impossible (intractable) as we can have infinitely many such settings. *This is why a Bayesian approach is fundamentally about marginalization instead of optimization*.
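For a one-dimensional parameter, we can still carry out the marginalization numerically. A minimal NumPy sketch for a Bernoulli coin-flip model: evaluate likelihood × prior on a grid, sum to get the evidence, and normalize to get a proper posterior.

```python
import numpy as np

# Grid-based Bayesian inference for a Bernoulli parameter w:
# posterior = likelihood x prior / evidence, with the evidence
# obtained by marginalizing (summing) over the parameter grid.
data = np.array([1, 0, 1, 1, 0, 1, 1, 0])
w = np.linspace(0.001, 0.999, 999)        # discretized parameter space
prior = np.ones_like(w) / len(w)          # uniform prior over the grid
lik = w ** data.sum() * (1 - w) ** (len(data) - data.sum())
evidence = np.sum(lik * prior)            # the marginal likelihood p(D)
posterior = lik * prior / evidence        # a proper distribution over w
```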

The intractable integral in the posterior leads to a different family of methods to learn parameter values with. Instead of gradient descent, Bayesians often use **sampling** methods such as Markov Chain Monte Carlo (MCMC), or **variational inference**: techniques that try to mimic the posterior using a simpler, tractable family of distributions. Similar techniques are often used for generative models such as VAEs. A relatively new method to approximate a complex distribution is **normalizing flows**.
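A minimal Metropolis sampler (the simplest MCMC method) illustrates why sampling sidesteps the evidence: it only ever needs posterior *ratios*, so an unnormalized log-posterior suffices. This toy version, a sketch rather than production code, targets a standard normal.

```python
import numpy as np

def metropolis(log_post, x0, n_steps, step, rng):
    """Minimal Metropolis sampler: propose a Gaussian step, accept with
    probability min(1, posterior ratio). Needs only an *unnormalized*
    log-posterior, so the intractable evidence p(D) never appears."""
    samples, x = [], x0
    lp = log_post(x)
    for _ in range(n_steps):
        prop = x + step * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:  # accept/reject
            x, lp = prop, lp_prop
        samples.append(x)
    return np.array(samples)

rng = np.random.default_rng(0)
# Unnormalized log-posterior of a standard normal:
samples = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_steps=5000, step=1.0, rng=rng)
```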

Now that we understand the Bayesian posterior distribution, how do we actually use it in practice? What if we want to predict, say, the next word, let’s call it *y*, given an unseen sentence *x*?

Well, we could simply take the posterior distribution over our parameters for our model and pick the parameter setting that has the highest probability assigned to it (the distribution’s mode). This method is called **Maximum A Posteriori** or **MAP** estimation. But… It would be quite a waste to go through all this effort of computing a proper probability distribution over our parameters only to settle for another point estimate, right? (Except when nearly all of the posterior’s mass is centered around one point in parameter space). Because MAP provides a point estimate, it is not considered a full Bayesian treatment.

The full-fledged Bayesian approach is to specify a **predictive distribution** *p(y | x, D)*.

This defines the probability for class label *y* given new input *x* and dataset *D*. To compute the predictive distribution we need to marginalize over our parameter settings again! We multiply the posterior probability of each parameter setting *w* with the probability of label *y* given input *x* using that setting. This is called **Bayesian Model Averaging**, or **BMA**: we take a weighted average over all possible models (parameter settings in this case). The predictive distribution is the **second** important place for marginalization in Bayesian ML, the first being the posterior computation itself. An intuitive way to visualize a predictive distribution is with a simple regression task, like in the Figure below. For a concrete example check out these slides (slide 9–21).

As we know by now, the integral in the predictive distribution is often intractable and at the very least extremely computationally expensive. A third way to use the posterior is to sample a few parameter settings and combine the resulting models (i.e., approximate BMA). This is actually called a **Monte Carlo** approximation of the predictive distribution!
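A toy Monte Carlo approximation of the predictive distribution for a hypothetical one-parameter logistic model; for illustration, the "posterior samples" are simply faked with a Gaussian rather than obtained by real inference.

```python
import numpy as np

def mc_predictive(param_samples, predict, x):
    """Monte Carlo approximation of the predictive distribution:
    average the model's output distribution over parameter settings
    drawn from (an approximation of) the posterior."""
    preds = np.stack([predict(w, x) for w in param_samples])
    return preds.mean(axis=0)  # approximate Bayesian model average

# Toy logistic model: p(y=1 | x, w) = sigmoid(w * x)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
predict = lambda w, x: np.array([1.0 - sigmoid(w * x), sigmoid(w * x)])

rng = np.random.default_rng(0)
posterior_samples = rng.normal(loc=1.0, scale=0.5, size=50)  # stand-in posterior
p = mc_predictive(posterior_samples, predict, x=2.0)
```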

This last method is vaguely reminiscent of something perhaps more familiar to a humble frequentist: deep ensembles. Deep ensembles are formed by combining neural networks that are architecturally identical, but trained with different parameter initializations. This beautifully ties in with where we left off in the paper! Remember the abstract?

“We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization, and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction”.

Reading the abstract for the second time, the contributions should make a lot more sense. Also, we are now finally steering into Bayesian Deep Learning territory!

A Bayesian Neural Network (BNN) is simply posterior inference applied to a neural network architecture. To be precise, a prior distribution is specified for each weight and bias. Because of their huge parameter space, however, inferring the posterior is even more difficult than usual.

So why do Bayesian DL at all?

The classic answer is to obtain a **realistic expression of uncertainty**, or *calibration*. A classifier is considered calibrated if the probability (confidence) of a class prediction aligns with its misclassification rate. As said before, this is crucial in real-world applications.

”Neural networks are often miscalibrated in the sense that their predictions are typically overconfident.”

However, the authors of our running paper, Wilson and Izmailov, argue that Bayesian model averaging increases **accuracy** as well. According to Section 3.1, the Bayesian perspective is in fact **especially** compelling for neural networks! Because of their large parameter space, neural networks can represent many different solutions, i.e. they are underspecified by the data. This means a Bayesian model average is extremely useful, because it combines a diverse range of functional forms, or “perspectives”, into one.

”A neural network can represent many models that are consistent with our observations. By selecting only one, in a classical procedure, we lose uncertainty when the models disagree for a test point.”

A number of people have recently been trying to combine the advantages of a traditional neural network (e.g. computationally efficient training using SGD & back propagation) with the advantages of a Bayesian approach (e.g. calibration).

One popular and conceptually easy approach is Monte Carlo dropout. Recall that dropout is traditionally used as regularization; it provides stochasticity or variation in a neural network by randomly shutting down weights **during training**. It turns out dropout can be reinterpreted as approximate Bayesian inference and applied during testing, which leads to multiple different parameter settings. Sounds a little similar to sampling parameters from a posterior to approximate the predictive distribution, hmm?
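A minimal NumPy sketch of the idea (the tiny network, its weights, and the dropout rate are all invented for illustration; a real implementation would simply keep the framework's dropout layers active at test time):

```python
import numpy as np

rng = np.random.default_rng(42)

# Fixed, pretend-trained weights of a tiny two-layer network.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 1))

def forward_with_dropout(x, p_drop=0.5):
    h = np.maximum(0, x @ W1)               # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop     # dropout stays ON at test time
    h = h * mask / (1.0 - p_drop)           # inverted-dropout scaling
    return 1.0 / (1.0 + np.exp(-(h @ W2)))  # sigmoid output

x = np.array([0.5, -1.0, 2.0, 0.3])

# Each stochastic forward pass acts like one "sampled" network; the mean and
# spread over passes give a prediction with an uncertainty estimate.
preds = np.array([forward_with_dropout(x) for _ in range(200)])
mean_pred, std_pred = preds.mean(), preds.std()
```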

Another line of work follows from Stochastic Weight Averaging (SWA), an elegant approximation to ensembling that intelligently combines weights of the same network at different stages of training (check out this or this blogpost if you want to know more). SWA-Gaussian (SWAG) builds on it by approximating the shape (local geometry) of the posterior distribution using simple information provided by SGD. Recall that SGD “moves” through the parameter space looking for a (local) optimum in loss space. To approximate the local geometry of the posterior, they fit a Gaussian distribution to the first and second moments of the SGD iterates. Moments describe the shape of a function or distribution: the zeroth moment is the sum, the first moment is the mean, and the second moment is the variance. These fitted Gaussian distributions can then be used for BMA.
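In its simplest (diagonal) form, the moment-fitting step can be sketched as follows — note that the "SGD iterates" below are fabricated stand-ins for weight snapshots collected during training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are weight vectors saved at different SGD steps near an optimum.
sgd_iterates = rng.normal(loc=1.0, scale=0.05, size=(50, 3))

# First and second moments of the iterates ...
first_moment = sgd_iterates.mean(axis=0)           # mean of the weights
second_moment = (sgd_iterates ** 2).mean(axis=0)   # mean of the squared weights

# ... define a diagonal Gaussian approximating the local posterior shape.
variance = np.maximum(second_moment - first_moment ** 2, 1e-12)

# Sampling from this Gaussian yields parameter settings for model averaging.
w_sample = rng.normal(first_moment, np.sqrt(variance))
```

The full SWAG method also captures covariance structure with a low-rank term; this diagonal version only conveys the intuition.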

I have obviously failed to mention at least 99% of the field here (e.g. KFAC Laplace and temperature scaling for improved calibration), and picked the examples above in part because they are related to our running paper. I’ll finish with one last example of a recent **frequentist** (or is it…) alternative to uncertainty approximation. This is a popular method showing that one can train a deep ensemble and use it to form a predictive distribution, resulting in a well-calibrated model. They use a few bells and whistles that I won’t go into, such as adversarial training to smooth the predictive distribution. Check out the paper here.

By now we are more than ready to circle back to the paper and go over its contributions! They should be easier to grasp now.

Contrary to how recent literature (myself included) has framed it, Wilson and Izmailov argue that deep ensembles are **not** a frequentist alternative to obtain Bayesian advantages. In fact, they are a very good **approximation** of the posterior distribution. Because deep ensembles are formed by MAP or MLE retraining, they can land in different *basins of attraction*. A basin of attraction is a “basin” or valley in the loss landscape that leads to some (locally) optimal solution. But there might be, and usually are, multiple optimal solutions, or valleys, in the loss landscape. The use of multiple basins of attraction, found by different members of an ensemble, results in more functional diversity than Bayesian approaches that focus on approximating the posterior within a single basin of attraction.

This idea of using multiple basins of attraction is important for the next contribution as well: an improved method for approximating predictive distributions. By combining the multiple basins of attraction property that deep ensembles have with the Bayesian treatment in SWAG, the authors propose a best-of-both-worlds solution: **Multi**ple basins of attraction **S**tochastic **W**eight **A**veraging **G**aussian or **MultiSWAG**:

”MultiSWAG combines multiple independently trained SWAG approximations, to create a mixture of Gaussians approximation to the posterior, with each Gaussian centred on a different basin. We note that MultiSWAG does not require any additional training time over standard deep ensembles.”
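A toy sketch of the mixture idea (the basins and their Gaussians below are invented; the real method fits a SWAG posterior in each basin from an independently trained network):

```python
import numpy as np

rng = np.random.default_rng(1)

def predict(w, x):
    # Toy model standing in for a neural network's forward pass.
    return 1.0 / (1.0 + np.exp(-x @ w))

# Each (mean, std) pair stands in for one SWAG Gaussian fitted in one basin.
basins = [(np.array([1.0, -1.0]), 0.05),
          (np.array([-0.5, 2.0]), 0.05),
          (np.array([0.3, 0.8]), 0.05)]

x = np.array([1.0, 0.5])

# Mixture of Gaussians: sample equally from each basin's Gaussian and average
# the predictions, i.e. Bayesian model averaging across multiple basins.
preds = [predict(rng.normal(mean, std), x)
         for mean, std in basins
         for _ in range(30)]
p_multiswag = np.mean(preds)
```

Compared with a single-basin SWAG, the mixture averages over functionally diverse solutions, which is exactly the property deep ensembles exploit.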

Have a look at the paper if you’re interested in the nitty-gritty details.

How can we ever specify a meaningful prior over millions of parameters, I hear you ask? It turns out this is a pretty valid question. In fact, the Bayesian approach is sometimes criticised because of it.

However, in Section 5 of the paper, Wilson and Izmailov provide evidence that specifying a vague prior, such as a simple Gaussian, might actually not be such a bad idea.

”Vague Gaussian priors over parameters, when combined with a neural network architecture, induce a distribution over functions with useful inductive biases.” …

… ”The distribution over functions controls the generalization properties of the model; the prior over parameters, in isolation, has no meaning.”

A vague prior combined with the functional form of a neural network results in a meaningful distribution in function space. The prior itself doesn’t matter, but its effect on the resulting predictive distribution does.

We have now arrived at the strange neural network phenomena I highlighted in the abstract. According to Section 6, the surprising fact that neural networks can fit random labels is actually not surprising at all. Not if you look at it from the perspective of support and inductive bias. Broad support, the range of datasets for which the model assigns non-zero probability, is important for generalization. In fact, the ability to fit random labels is perfectly fine as long as we have the right inductive bias to steer the model towards a good solution. Wilson and Izmailov also show that this phenomenon is not mysteriously specific to neural networks, and that Gaussian Processes exhibit the same ability.

The second phenomenon is double descent, a recently observed phenomenon where test performance unexpectedly gets worse, and then better again, as models grow bigger or more data is added.

Wilson and Izmailov find that models trained with SGD suffer from double descent, but that SWAG reduces it. More importantly, both MultiSWAG and deep ensembles completely mitigate the double descent phenomenon! This is in line with their previously discussed claim that

”Deep ensembles provide a better approximation to the Bayesian predictive distribution than conventional single-basin Bayesian marginalization procedures.”

and highlights the importance of marginalization over multiple modes of the posterior.

You made it! Thank you for reading all the way through. This post became quite lengthy but I hope you learned a lot about Bayesian DL. I sure did.

Note that I am not affiliated with Wilson, Izmailov, or their group at NYU. This post reflects my own interpretation of their work, except for the quote blocks taken directly from the paper.

Please feel free to ask any question or point out mistakes that I’ve undoubtedly made. I would also love to know whether you liked this post. You can find my contact details on my website, message me on Twitter or connect on LinkedIn.

*This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.*

We’ll let you know when we release more technical education.

The post A Comprehensive Introduction to Bayesian Deep Learning appeared first on TOPBOTS.

]]>The post Extractive Text Summarization Using Contextual Embeddings appeared first on TOPBOTS.

]]>**If this in-depth educational content is useful for you, subscribe to our AI research mailing list to be alerted when we release new material. **

The approach to building the summarizer can be divided into the following steps:

- Convert the article/passage to a list of sentences using nltk’s sentence tokenizer.
- For each sentence, extract contextual embedding using Sentence Transformer.
- Apply K-means clustering on the embeddings. The idea is to cluster the sentences that are contextually similar to each other and pick from each cluster the one sentence that is closest to the mean (centroid).
- For each sentence embedding, calculate the distance from the centroid. Sometimes, the centroids are the actual sentence embeddings and in that case, the distance would be zero.
- For each cluster, select the embedding (sentence) with the lowest distance from the centroid and return the summary based on the order in which the sentences appeared in the original text.

Sentence Transformer is a Python package that enables you to represent your sentences and paragraphs as dense vectors. The package is compatible with state-of-the-art models like BERT, RoBERTa, XLM-RoBERTa, etc. Choosing the correct model is very important, as models perform better on tasks they were designed to address. For our use case, we will be using STS (Semantic Textual Similarity) based models. To read more about Sentence Transformer, check out the following links:

- Source code: UKPLab / sentence-transformers
- List of pre-trained models: Pretrained Models
- List of STS models: SentenceTransformer Pretrained Models

Now that we have an understanding of the use case and approach, let’s dive into the implementation of the summarizer. We will be using the following python packages:

- Pandas
- Numpy
- Sentence Transformer
- NLTK’s KMeanClusterer

So let’s import the above-mentioned packages:

```
import nltk
import pandas as pd
from sentence_transformers import SentenceTransformer
from nltk.cluster import KMeansClusterer
import numpy as np
```

The next step is to initialize the SentenceTransformer with the appropriate model. As mentioned above, I am going to use STS based model i.e. stsb-roberta-base. Feel free to experiment with other models.

model = SentenceTransformer('stsb-roberta-base')

We are going to perform a summarization of the latest new article describing the new Aston Martin Formula 1 car.

```
article='''
This week's launch of Aston Martin’s new Formula 1 car was one of the most hyped events of the pre-season so far, as fans were intrigued by how the new-look AMR21 would be painted. Unlike the car launches that came before it, Aston Martin left very little to the imagination, releasing detailed shots of the entire car. The first thing to note is that the team spent both of its development tokens on redesigning the chassis, in order that it could unlock aerodynamic performance from the central portion of the car. This is, in part, a legacy of the team’s approach for 2020, having assimilated the overall design package of the previous year’s championship winning Mercedes including a more conventional position for the side-impact protection spars (SIPS). The low-slung arrangement, as introduced by Ferrari in 2017, is now considered critical from an aerodynamic perspective, with the sidepod inlet positioned much like a periscope. This is typically above the fairing that surrounds the SIPS, which is used to inhibit the turbulence created by the front tyre and therefore also aids the transit of cool air that’s supplied to the radiators within the sidepods. This image of the car depicts how the bargeboards are used to filter the turbulence created by the front tyre and convert it into something more usable. Meanwhile, the airflow fed from the front of the car, including the cape, is forced around the underside of the sidepod whilst the fairing around the SIPS shields the airflow entering the sidepod inlet. This should result in a much cleaner flow arriving at the radiators, with the air having not been worked too hard by numerous surfaces en route. The inlet itself is extremely narrow with the team recovering some of that with the sculpting on the sides of the chassis. The narrowness of the inlet also draws your attention to the substantial fin that grows out of the sidepod’s shoulder and helps to divert airflow down over the revamped sidepod packaging behind. 
This is an area where the team has clearly focused its resources, knowing that getting this right will reap aerodynamic rewards for other areas of the car. The sidepod design draws inspiration from the new bodywork that the team installed in Mugello last season (below) but falls short of having the full ramp to floor transition, instead favouring the dipped midriff like we’ve seen adopted elsewhere. The rear portion of the sidepods and the engine cover have extremely tight packaging, with the AMR21 akin to the W12 with the bodywork almost shrink wrapped to the componentry inside. And, much like the W12, it also features a bodywork blister around the inlet plenum, a feature of the power unit which is believed to be bigger this season as a result of some of the performance and durability updates introduced by HPP. The AMR12 also features a very small rear cooling outlet that not only shows how efficient they expect the Mercedes-AMG F1 M12 E Performance power unit to be, but also how much they have focused on producing a car that recovers the downforce lost by the introduction of the new regulations. The extremely tight packaging creates a sizable undercut beneath the cooling outlet too, which buys back some of the floor that has been lost to the new regulations and drives home the performance of the coke bottle region. This is aided further by the token-free adoption of the Mercedes gearbox carrier and rear suspension from last season, an arrangement that Mercedes was particularly proud of because of the aerodynamic gains that it facilitates. The new arrangement sees the suspension elements lifted clear of the diffuser ceiling, which has become more prominent as the teams push the boundaries of the regulations, while the rear leg of the lower wishbone being positioned so far back also results in the ability to extract more performance from the diffuser. 
Aston Martin is the first team to unmask all the aerodynamic tricks it will use to make up the difference on the edge of the diagonal floor cut-out. The first of these tricks shares a similarity to the design shown by AlphaTauri, with a trio of outwardly directed fins installed just behind the point where the floor starts to taper in. The airflow structures emitted from these fins will undoubtedly interact with the AlphaTauri-esque floor scroll and floor notch just ahead of them and help to mitigate some of the losses that have been created due to fully enclosed holes being outlawed and the reduced floor width ahead of the rear tyre. It’s here where we find a solution akin to the one that Ferrari tested at the end of 2020 too, as a series of fins form an arc. This should help influence the airflow ahead of the rear tyre and reduce the impact that tyre squirt has on the diffuser. Interestingly, it has also added two offset floor strakes inboard of this where teams normally only opt for one strake, with Mercedes in the pre-hybrid era being an advocate of such designs. A new solution appears on the rear wing too, as the thickness of the upper front corner of the endplate has been altered to allow for another upwash strike. Teams had already started to look for ways to redesign this region last year, with the removal of the louvres in 2019 resulting in an increase in drag. The upwash strike is positioned in order that it can affect the tip vortex that’s generated by the top flap and endplate juncture and will undoubtedly be a design aspect that the rest of the field will take note of. While Aston Martin did show us a lot of its new car, it did keep one element secret for now – the rear brake ducts (not pictured, above). It does seem like a strange omission given it has shown us so much around the rest of the car but we must remember that this is one aspect of the 2021 cars that’s affected by the new regulations. 
Perhaps the team feels it has found a small pocket of performance in that regard and doesn’t want to unnecessarily hand its rivals a chance to see it ahead of testing.
'''
```

Now, we convert the above article to a list of sentences. We will use nltk’s **sent_tokenize()** method.

```
sentences = nltk.sent_tokenize(article)
# strip leading and trailing spaces
sentences = [sentence.strip() for sentence in sentences]
```

Output:

```
["\nThis week's launch of Aston Martin’s new Formula 1 car was one of the most hyped events of the pre-season so far, as fans were intrigued by how the new-look AMR21 would be painted.",
'Unlike the car launches that came before it, Aston Martin left very little to the imagination, releasing detailed shots of the entire car.',
'The first thing to note is that the team spent both of its development tokens on redesigning the chassis, in order that it could unlock aerodynamic performance from the central portion of the car.',
'This is, in part, a legacy of the team’s approach for 2020, having assimilated the overall design package of the previous year’s championship winning Mercedes including a more conventional position for the side-impact protection spars (SIPS).',
'The low-slung arrangement, as introduced by Ferrari in 2017, is now considered critical from an aerodynamic perspective, with the sidepod inlet positioned much like a periscope.',
'This is typically above the fairing that surrounds the SIPS, which is used to inhibit the turbulence created by the front tyre and therefore also aids the transit of cool air that’s supplied to the radiators within the sidepods.',
'This image of the car depicts how the bargeboards are used to filter the turbulence created by the front tyre and convert it into something more usable.',
'Meanwhile, the airflow fed from the front of the car, including the cape, is forced around the underside of the sidepod whilst the fairing around the SIPS shields the airflow entering the sidepod inlet.',
'This should result in a much cleaner flow arriving at the radiators, with the air having not been worked too hard by numerous surfaces en route.',
'The inlet itself is extremely narrow with the team recovering some of that with the sculpting on the sides of the chassis.',
'The narrowness of the inlet also draws your attention to the substantial fin that grows out of the sidepod’s shoulder and helps to divert airflow down over the revamped sidepod packaging behind.',
'This is an area where the team has clearly focused its resources, knowing that getting this right will reap aerodynamic rewards for other areas of the car.',
'The sidepod design draws inspiration from the new bodywork that the team installed in Mugello last season (below) but falls short of having the full ramp to floor transition, instead favouring the dipped midriff like we’ve seen adopted elsewhere.',
'The rear portion of the sidepods and the engine cover have extremely tight packaging, with the AMR21 akin to the W12 with the bodywork almost shrink wrapped to the componentry inside.',
'And, much like the W12, it also features a bodywork blister around the inlet plenum, a feature of the power unit which is believed to be bigger this season as a result of some of the performance and durability updates introduced by HPP.',
'The AMR12 also features a very small rear cooling outlet that not only shows how efficient they expect the Mercedes-AMG F1 M12 E Performance power unit to be, but also how much they have focused on producing a car that recovers the downforce lost by the introduction of the new regulations.',
'The extremely tight packaging creates a sizable undercut beneath the cooling outlet too, which buys back some of the floor that has been lost to the new regulations and drives home the performance of the coke bottle region.',
'This is aided further by the token-free adoption of the Mercedes gearbox carrier and rear suspension from last season, an arrangement that Mercedes was particularly proud of because of the aerodynamic gains that it facilitates.',
'The new arrangement sees the suspension elements lifted clear of the diffuser ceiling, which has become more prominent as the teams push the boundaries of the regulations, while the rear leg of the lower wishbone being positioned so far back also results in the ability to extract more performance from the diffuser.',
'Aston Martin is the first team to unmask all the aerodynamic tricks it will use to make up the difference on the edge of the diagonal floor cut-out.',
'The first of these tricks shares a similarity to the design shown by AlphaTauri, with a trio of outwardly directed fins installed just behind the point where the floor starts to taper in.',
'The airflow structures emitted from these fins will undoubtedly interact with the AlphaTauri-esque floor scroll and floor notch just ahead of them and help to mitigate some of the losses that have been created due to fully enclosed holes being outlawed and the reduced floor width ahead of the rear tyre.',
'It’s here where we find a solution akin to the one that Ferrari tested at the end of 2020 too, as a series of fins form an arc.',
'This should help influence the airflow ahead of the rear tyre and reduce the impact that tyre squirt has on the diffuser.',
'Interestingly, it has also added two offset floor strakes inboard of this where teams normally only opt for one strake, with Mercedes in the pre-hybrid era being an advocate of such designs.',
'A new solution appears on the rear wing too, as the thickness of the upper front corner of the endplate has been altered to allow for another upwash strike.',
'Teams had already started to look for ways to redesign this region last year, with the removal of the louvres in 2019 resulting in an increase in drag.',
'The upwash strike is positioned in order that it can affect the tip vortex that’s generated by the top flap and endplate juncture and will undoubtedly be a design aspect that the rest of the field will take note of.',
'While Aston Martin did show us a lot of its new car, it did keep one element secret for now – the rear brake ducts (not pictured, above).',
'It does seem like a strange omission given it has shown us so much around the rest of the car but we must remember that this is one aspect of the 2021 cars that’s affected by the new regulations.',
'Perhaps the team feels it has found a small pocket of performance in that regard and doesn’t want to unnecessarily hand its rivals a chance to see it ahead of testing.']
```

**Note:** As we are using a transformer model, it’s recommended not to remove stopwords, punctuation, etc., as they help capture more context compared to preprocessed text.

For applying different transformations of the data efficiently, we will use Pandas DataFrame. Let’s convert the above list to a pandas data frame;

```
data = pd.DataFrame(sentences)
data.columns = ['sentence']
```

Output:

The next step is to represent the sentences as dense vectors. We will create a small UDF that returns a vector given a sentence. We have already created an instance of the Sentence Transformer above.

```
def get_sentence_embeddings(sentence):
    embedding = model.encode([sentence])
    return embedding[0]
```

Create a new column ‘embeddings’ using the above UDF.

data['embeddings']=data['sentence'].apply(get_sentence_embeddings)

Output:

Now that we have the text embeddings, let’s cluster them using NLTK’s KMeansClusterer.

```
NUM_CLUSTERS = 10
iterations = 25
X = np.array(data['embeddings'].tolist())
kclusterer = KMeansClusterer(
    NUM_CLUSTERS,
    distance=nltk.cluster.util.cosine_distance,
    repeats=iterations,
    avoid_empty_clusters=True)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
```

*Note: The intuition for the NUM_CLUSTERS parameter is the number of sentences the end-user expects in the summary.*

As you can observe, in NLTK’s KMeansClusterer, we can use cosine_distance as a measure to determine the distance/similarity between two vectors.

Output:

`[9,0,2,9,4,4,2,4,8,5,6,0,1,5,3,5,5,5,8,5,6,6,9,8,9,2,9,4,5,6,7]`

Finally, we compute the distance between the sentence vector and the centroid (also called mean) vector. To achieve this, we need to assign the centroid to each row based on the cluster number.

```
data['cluster'] = pd.Series(assigned_clusters, index=data.index)
data['centroid'] = data['cluster'].apply(lambda x: kclusterer.means()[x])
```

Output:

To compute the distance, we will use scipy’s **distance_matrix** function.

```
from scipy.spatial import distance_matrix

def distance_from_centroid(row):
    # the embedding and the centroid have different types, hence tolist() below
    return distance_matrix([row['embeddings']], [row['centroid'].tolist()])[0][0]

data['distance_from_centroid'] = data.apply(distance_from_centroid, axis=1)
```

Output:

The final step is to generate a summary. To do this, we will use the following steps:

- Group sentences based on the cluster column.
- Sort each group in ascending order based on the distance_from_centroid column and select the first row (the sentence with the least distance from the mean).
- Sort the sentences based on their sequence in the original text.

The above-mentioned steps can be implemented using one line of code:

```
summary = ' '.join(data
                   .sort_values('distance_from_centroid', ascending=True)
                   .groupby('cluster').head(1)
                   .sort_index()['sentence']
                   .tolist())
```

Extracted Summary:

`The first thing to note is that the team spent both of its development tokens on redesigning the chassis, in order that it could unlock aerodynamic performance from the central portion of the car. This is, in part, a legacy of the team’s approach for 2020, having assimilated the overall design package of the previous year’s championship winning Mercedes including a more conventional position for the side-impact protection spars (SIPS). This is typically above the fairing that surrounds the SIPS, which is used to inhibit the turbulence created by the front tyre and therefore also aids the transit of cool air that’s supplied to the radiators within the sidepods. This is an area where the team has clearly focused its resources, knowing that getting this right will reap aerodynamic rewards for other areas of the car. The sidepod design draws inspiration from the new bodywork that the team installed in Mugello last season (below) but falls short of having the full ramp to floor transition, instead favouring the dipped midriff like we’ve seen adopted elsewhere. And, much like the W12, it also features a bodywork blister around the inlet plenum, a feature of the power unit which is believed to be bigger this season as a result of some of the performance and durability updates introduced by HPP. The AMR12 also features a very small rear cooling outlet that not only shows how efficient they expect the Mercedes-AMG F1 M12 E Performance power unit to be, but also how much they have focused on producing a car that recovers the downforce lost by the introduction of the new regulations. The airflow structures emitted from these fins will undoubtedly interact with the AlphaTauri-esque floor scroll and floor notch just ahead of them and help to mitigate some of the losses that have been created due to fully enclosed holes being outlawed and the reduced floor width ahead of the rear tyre. 
This should help influence the airflow ahead of the rear tyre and reduce the impact that tyre squirt has on the diffuser. Perhaps the team feels it has found a small pocket of performance in that regard and doesn’t want to unnecessarily hand its rivals a chance to see it ahead of testing.`

As you can see, the approach we followed does a decent job of generating a summary that describes the car. If you read carefully, the sentences flow quite well, given that we picked the top sentence from each cluster. For example, the initial part of the summary talks about the SIPS (side-impact protection spars), i.e. sentences 2, 3, and 4, while the following sentences talk about the bodywork, i.e. sentences 5 and 6, and so on. So we successfully describe each aspect of the car in a couple of sentences by implementing extractive text summarization.

The code explained here is also available on my **GitHub**. Feel free to fork it and play with it.

Thank you for reading! Like and leave a comment if you found it interesting and useful.

Sentence Embedding Based Semantic Clustering Approach for Discussion Thread Summarization

nltk.cluster.kmeans – NLTK 3.5 documentation

*This article was originally published on Medium and re-published to TOPBOTS with permission from the author.*

We’ll let you know when we release more technical education.

The post Extractive Text Summarization Using Contextual Embeddings appeared first on TOPBOTS.

]]>The post Graph Neural Networks for Multi-Relational Data appeared first on TOPBOTS.

]]>The article includes 4 main sections:

- an introduction to the key idea of multi-relational data, which describes the peculiarity of KGs;
- a summary of the standard components included in a GNN architecture;
- a description of the simplest formulation of GNNs, known as Graph Convolutional Networks (GCNs);
- a discussion on how to extend the GCN layer in the form of a Relational Graph Convolutional Network (R-GCN) to encode multi-relational data.

A basic graph structure includes undirected, untyped, and single edges for connecting nodes. For instance, in the philosophical domain, we can define a link between two nodes represented by the “Socrates” and “Plato” entities. In this specific case, we do not provide any information on the nature of the relationship between these philosophers.

On the other hand, KGs include directed, typed, and *multiple* edges for connecting nodes. Considering our running example, the connection from “Socrates” to “Plato” can be labeled with “influenced”. In the opposite direction, another connection can be established from “Plato” to “Socrates”, which can be labeled with “influenced by”.

In other words, KGs are graph-based structures whose nodes represent real-world entities, while edges define multiple relations between these entities. I explained further details on KGs in the following article: Knowledge Graphs at a Glance.

I summarized the main building blocks of a GNN architecture in the following article: Understanding the Building Blocks of Graph Neural Networks (Intro).

In particular, I presented an overview of the main components needed to set up a GNN, including (i) the input layer, (ii) the GNN layer(s), and (iii) the Multilayer Perceptron (MLP) prediction layer(s).

In this architecture, the GNN layer is the key component for encoding the local graph structure, which is employed to update node representation. Different GNN layers employ different types of aggregation of the local graph structure. To illustrate the GNN behavior using NumPy, we require 3 main ingredients:

- a matrix of one-hot vectors (no features) representing nodes;
- a weight matrix describing the hidden features of the nodes;
- an adjacency matrix defining undirected edges between nodes.

```
import numpy as np

### One-hot vector representation of nodes (5,5)
X = np.eye(5, 5)
n = X.shape[0]
np.random.shuffle(X)
print(X)
```

```
-----
[[0. 0. 1. 0. 0.]  # Node 1
 [0. 1. 0. 0. 0.]  # Node 2
 [0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]] # Node 5

### Weight matrix (5,3)
# Dimension of the hidden features
h = 3
# Random initialization with Glorot and Bengio
W = np.random.uniform(-np.sqrt(1./h), np.sqrt(1./h), (n,h))
print(W)
-----
[[-0.4294049   0.57624235 -0.3047382 ]
 [-0.11941829 -0.12942953  0.19600584]
 [ 0.5029172   0.3998854  -0.21561317]
 [ 0.02834577 -0.06529497 -0.31225734]
 [ 0.03973776  0.47800217 -0.04941563]]

### Adjacency matrix of an undirected graph (5,5)
A = np.random.randint(2, size=(n, n))
# Include the self loop
np.fill_diagonal(A, 1)
# Symmetric adjacency matrix (undirected graph)
A_und = (A + A.T)
A_und[A_und > 1] = 1
print(A_und)
-----
[[1 1 1 0 1]  # Connections to Node 1
 [1 1 1 1 1]
 [1 1 1 1 0]
 [0 1 1 1 1]
 [1 1 0 1 1]]
```

Considering these ingredients, a “recursive neighborhood diffusion” (Dwivedi et al., 2020) is performed through the so-called “message passing framework” (Gilmer et al., 2017) for the update process. The neighbors’ features are passed to the target node as messages through the edges. Concretely, the required operations are the following (see the NumPy code for further details):

- a linear transformation (or projection) involving the initial representation of the nodes and the weight matrix including their hidden features.
- a neighborhood diffusion to update the representation of a node, aggregating the hidden features of its neighbors. This operation is computed in parallel for each node.

```
### Linear transformation
L_0 = X.dot(W)
print(L_0)
-----
[[ 0.5029172   0.3998854  -0.21561317]  # Node 1 (3rd row of W)
 [-0.11941829 -0.12942953  0.19600584]  # Node 2 (2nd row of W)
 [ 0.03973776  0.47800217 -0.04941563]  # Node 3 (5th row of W)
 [-0.4294049   0.57624235 -0.3047382 ]
 [ 0.02834577 -0.06529497 -0.31225734]] # Node 5 (4th row of W)

### GNN - Neighborhood diffusion
ND_GNN = A_und.dot(L_0)
print(ND_GNN)
-----
[[ 0.45158244  0.68316307 -0.3812803 ]  # Updated Node 1
 [ 0.02217754  1.25940542 -0.6860185 ]
 [-0.00616823  1.3247004  -0.37376116]
 [-0.48073966  0.85952002 -0.47040533]
 [-0.01756022  0.78140325 -0.63660287]]

### Test on the aggregation
assert(ND_GNN[0,0] == L_0[0,0] + L_0[1,0] + L_0[2,0] + L_0[4,0])
```

Observing the result of the neighborhood diffusion, you can notice that the updated representation of Node 1

[ 0.45158244  0.68316307 -0.3812803 ] # Updated Node 1

is the sum of the vectors representing Node 1 itself (self-loop), Node 2, Node 3, and Node 5. These are the nodes connected to Node 1, according to the adjacency matrix defined previously.

```
[ 0.5029172   0.3998854  -0.21561317] # Node 1
[-0.11941829 -0.12942953  0.19600584] # Node 2
[ 0.03973776  0.47800217 -0.04941563] # Node 3
[ 0.02834577 -0.06529497 -0.31225734] # Node 5
```

The code described in this section can be mathematically formalized as follows:
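In matrix form, the projection and diffusion steps can be reconstructed from the code as follows (with X the one-hot node features, W the weight matrix, and Ã = `A_und` the adjacency matrix including self-loops):

```
H = \tilde{A} \, X \, W
\qquad \Longleftrightarrow \qquad
h_i = \sum_{j \in \mathcal{N}(i) \,\cup\, \{i\}} x_j \, W
```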

In the simplest formulation of GNNs, known as Vanilla Graph Convolutional Networks (GCNs), the node update is performed via an “isotropic averaging operation over the neighborhood features” (Dwivedi et al., 2020). In other words, neighbor nodes equally contribute to updating the central node’s representation. More precisely, in the specific case of Vanilla GCNs, an *isotropic average computation* is performed. This computation requires a new ingredient represented by each node’s degree, consisting of the number of its connected edges.

```
### Degree vector (degree for each node)
D = A_und.sum(axis=1)
print(D)
-----
[4 5 4 4 4]  # Degree of Node 1 is 4

### Reciprocal of the degree (diagonal matrix)
D_rec = np.diag(np.reciprocal(D.astype(np.float32)))
print(D_rec)
-----
[[0.25 0.   0.   0.   0.  ]  # Reciprocal value of Node 1 degree
 [0.   0.2  0.   0.   0.  ]
 [0.   0.   0.25 0.   0.  ]
 [0.   0.   0.   0.25 0.  ]
 [0.   0.   0.   0.   0.25]]

### GCN - Isotropic average computation
ND_GCN = D_rec.dot(ND_GNN)
print(ND_GCN)
-----
[[ 0.11289561  0.17079077 -0.09532007]  # Updated Node 1 (with deg)
 [ 0.00443551  0.25188109 -0.1372037 ]
 [-0.00154206  0.3311751  -0.09344029]
 [-0.12018491  0.21488001 -0.11760133]
 [-0.00439005  0.19535081 -0.15915072]]

### Test on the isotropic average computation
assert(ND_GCN[0,0] == ND_GNN[0,0] * D_rec[0,0])
```

Each element of the degree vector represents the degree of the *i*-th node. Indeed, the vector’s first element is equal to 4 because the adjacency matrix shows 4 nodes connected to Node 1. The reciprocals of the degrees are then computed to enable the average contribution of the edges connected to each node. Finally, the isotropic average computation is performed according to the GCN formulation. The updated representation of Node 1, averaged using the Node 1 degree, is equal to:

[ 0.11289561 0.17079077 -0.09532007]

This vector is obtained by multiplying the following vector (representing the aggregated representation of Node 1) with the reciprocal of its degree (0.25):

[ 0.45158244 0.68316307 -0.3812803 ]

The code described in this section can be mathematically formalized as follows:
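Reconstructed from the code, the isotropic average computation adds the inverse degree matrix D⁻¹ (`D_rec`) to the previous formulation:

```
H = D^{-1} \, \tilde{A} \, X \, W
\qquad \Longleftrightarrow \qquad
h_i = \frac{1}{\deg(i)} \sum_{j \in \mathcal{N}(i) \,\cup\, \{i\}} x_j \, W
```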

The previous example describes the behavior of GCNs on an undirected and untyped graph. As mentioned before, the update process is based on the following steps (in the following explanation, the node degree is not considered for the sake of simplicity).

A *projection step* (or linear transformation) is achieved by multiplying (i) the one-hot feature matrix with (ii) the weight matrix.

- (i) 2D Matrix (n, n) defining the one-hot vectors to represent the nodes.
- (ii) 2D Matrix (n, h) defining the hidden features. The current matrix encodes only one type of relation.

An *aggregation step* is achieved by multiplying (i) the adjacency matrix with (ii) the matrix resulting from the projection step.

- (i) 2D Symmetric matrix (n, n) describing undirected and untyped edges.
- (ii) 2D Matrix (n, h) resulting from the projection step.

To extend the GCN layer to encode a KG structure, we need to represent our data as a directed and multi-typed graph. The update/aggregation process is similar to the previous one, but the ingredients are a bit more complex. Details on the steps to perform are available below.

A *projection step* is achieved by multiplying (i) the one-hot feature matrix with (ii) the weight *tensor*.

- (i) 2D Matrix (n, n) defining the initial features of the nodes.
- (ii) 3D Tensor (r, n, h) describing the node hidden features. This tensor encodes different relations by stacking *r* batches of matrices of size *(n, h)*. Each of these batches encodes a single typed relation.

The projection step will no longer be a simple multiplication of matrices, but it will be a *batch matrix multiplication*, in which (i) is multiplied with each batch of (ii).

An *aggregation step* is achieved by multiplying (i) the *(directed) adjacency tensor* with (ii) the *tensor* resulting from the projection step.

- (i) 3D Tensor (r, n, n) describing directed and *r*-typed edges. This tensor is composed of *r* batches of adjacency matrices of size *(n, n)*. In detail, each of these matrices describes the edges between nodes according to a specific type of relation. Moreover, compared to the adjacency matrix of an undirected graph, each of these adjacency matrices is not symmetric, because it encodes a specific edge direction.
- (ii) 3D Tensor (r, n, h) resulting from the projection step described above.

As happened for the projection step, the aggregation phase consists of a *batch matrix multiplication*. Each batch of (i) is multiplied with each batch of (ii). This aggregation defines the GCN transformation for each batch. At the end of the process, the batches have to be added together (R-GCN) to obtain a node representation that incorporates the neighborhood aggregation according to different relations types.

The following code example shows an R-GCN layer’s behavior encoding a directed and multi-typed graph, or a KG, with 2 types of edges (or relations).

```
### Recall: One-hot vector representation of nodes (n,n)
print(X)
-----
[[0. 0. 1. 0. 0.]  # Node 1
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]]

### Number of relation types (r)
num_rels = 2
print(num_rels)
-----
2

### Weight matrix of relation number 1 (n,h)
## Initialization according to Glorot and Bengio (2010)
W_rel1 = np.random.uniform(-np.sqrt(1./h), np.sqrt(1./h), (n,h))
print(W_rel1)
-----
[[-0.46378913 -0.09109707  0.52872529]
 [ 0.03829597  0.22156061 -0.2130242 ]
 [ 0.21535272  0.38639244 -0.55623279]
 [ 0.28884178  0.56448816  0.28655701]
 [-0.25352144  0.334031   -0.45815514]]

### Weight matrix of relation number 2 (n,h)
## Random initialization with uniform distribution
W_rel2 = np.random.uniform(1/100, 0.5, (n,h))
print(W_rel2)
-----
[[0.22946783 0.4552118  0.15387093]
 [0.15100992 0.073714   0.01948981]
 [0.34262941 0.11369778 0.14011786]
 [0.25087085 0.03614765 0.29131763]
 [0.081897   0.29875971 0.3528816 ]]

### Tensor including both weight matrices (r,n,h)
W_rels = np.concatenate((W_rel1, W_rel2))
W_rels = np.reshape(W_rels, (num_rels, n, h))
print(W_rels)
-----
[[[-0.46378913 -0.09109707  0.52872529]
  [ 0.03829597  0.22156061 -0.2130242 ]
  [ 0.21535272  0.38639244 -0.55623279]
  [ 0.28884178  0.56448816  0.28655701]
  [-0.25352144  0.334031   -0.45815514]]

 [[ 0.22946783  0.4552118   0.15387093]
  [ 0.15100992  0.073714    0.01948981]
  [ 0.34262941  0.11369778  0.14011786]
  [ 0.25087085  0.03614765  0.29131763]
  [ 0.081897    0.29875971  0.3528816 ]]]

### Linear transformation with batch matrix multiplication (r,n,h)
L_0_rels = np.matmul(X, W_rels)
print(L_0_rels)
-----
[[[ 0.21535272  0.38639244 -0.55623279]  # Node 1 (3rd row of W_rel1)
  [ 0.03829597  0.22156061 -0.2130242 ]
  [-0.25352144  0.334031   -0.45815514]
  [-0.46378913 -0.09109707  0.52872529]
  [ 0.28884178  0.56448816  0.28655701]]

 [[ 0.34262941  0.11369778  0.14011786]  # Node 1 (3rd row of W_rel2)
  [ 0.15100992  0.073714    0.01948981]
  [ 0.081897    0.29875971  0.3528816 ]
  [ 0.22946783  0.4552118   0.15387093]
  [ 0.25087085  0.03614765  0.29131763]]]

### Adjacency matrix of relation number 1 (n,n)
A_rel1 = np.random.randint(2, size=(n, n))
np.fill_diagonal(A, 0)  # No self_loop
print(A_rel1)
-----
[[0 1 1 1 1]  # Connections to Node 1 with Rel 1
 [1 1 0 0 1]  # Connections to Node 2 with Rel 1
 [1 0 0 1 0]
 [0 0 1 1 1]
 [1 1 0 1 0]]

### Adjacency matrix of relation number 2 (n,n)
A_rel2 = np.random.randint(3, size=(n, n))
np.fill_diagonal(A_rel2, 0)  # No self loop
A_rel2[A_rel2 > 1] = 0
print(A_rel2)
-----
[[0 0 0 1 0]  # Connections to Node 1 with Rel 2
 [1 0 0 0 0]  # Connections to Node 2 with Rel 2
 [1 0 0 1 1]
 [0 0 0 0 0]
 [0 1 0 0 0]]

### Tensor including both adjacency matrices (r,n,n)
A_rels = np.concatenate((A_rel1, A_rel2))
A_rels = np.reshape(A_rels, (num_rels, n, n))
print(A_rels)
-----
[[[0 1 1 1 1]  # Connections to Node 1 with Rel 1
  [1 1 0 0 1]
  [1 0 0 1 0]
  [0 0 1 1 1]
  [1 1 0 1 0]]

 [[0 0 0 1 0]  # Connections to Node 1 with Rel 2
  [1 0 0 0 0]
  [1 0 0 1 1]
  [0 0 0 0 0]
  [0 1 0 0 0]]]

### (GCN) Neighborhood diffusion for each typed edge (r,n,h)
ND_GCN = np.matmul(A_rels, L_0_rels)
print(ND_GCN)
-----
[[[-0.39017282  1.0289827   0.14410296]  # Updated Node 1 with Rel 1
  [ 0.54249047  1.17244121 -0.48269997]
  [-0.24843641  0.29529538 -0.0275075 ]
  [-0.42846879  0.80742209  0.35712716]
  [-0.21014043  0.51685598 -0.2405317 ]]

 [[ 0.22946783  0.4552118   0.15387093]  # Updated Node 1 with Rel 2
  [ 0.34262941  0.11369778  0.14011786]
  [ 0.82296809  0.60505722  0.58530642]
  [ 0.          0.          0.        ]
  [ 0.15100992  0.073714    0.01948981]]]

### (R-GCN) Aggregation of GCN (n,h)
RGCN = np.sum(ND_GCN, axis=0)
print(RGCN)
-----
[[-0.16070499  1.48419449  0.29797389]  # Updated Node 1 (Rel 1 + Rel 2)
 [ 0.88511988  1.28613899 -0.34258211]
 [ 0.57453168  0.9003526   0.55779892]
 [-0.42846879  0.80742209  0.35712716]
 [-0.05913052  0.59056998 -0.22104189]]

### Test of the aggregation
assert(RGCN[0,0] == L_0_rels[0,1,0] + L_0_rels[0,2,0] + L_0_rels[0,3,0] + L_0_rels[0,4,0] + L_0_rels[1,3,0])
```

As you can notice from this example, the result of the neighborhood diffusion (GCN) is a *3D tensor* of size *(r, n, h)* instead of a *2D matrix* of size *(n, h)*. The reason is that the neighborhood diffusion is performed separately for each type of relation. The R-GCN layer then aggregates the node representations achieved by the GCN for each type of relation. To clarify this aspect, consider the aggregated representation of Node 1:

[-0.16070499 1.48419449 0.29797389]

This vector is obtained by summing the updated representation of Node 1 with Relation 1

[-0.39017282 1.0289827 0.14410296]

and the updated representation of Node 1 with Relation 2

[ 0.22946783 0.4552118 0.15387093]

The code described in this section can be mathematically formalized as follows:
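Reconstructed from the code (with A_r the directed adjacency matrix of relation r and W_r its weight matrix; self-loops and per-relation normalization are omitted here, as in the example):

```
H = \sum_{r=1}^{R} A_r \, X \, W_r
\qquad \Longleftrightarrow \qquad
h_i = \sum_{r=1}^{R} \sum_{j \in \mathcal{N}_r(i)} x_j \, W_r
```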

R-GCNs represent a powerful graph neural architecture to encode multi-relational data, such as KGs. In a future article, I will show you how this encoding power can be exploited to perform specific tasks within KGs, including node classification and link prediction.

Much more details on the R-GCN architecture are available in the following research paper: Modeling Relational Data with Graph Convolutional Networks.

*This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.*


The post Graph Neural Networks for Multi-Relational Data appeared first on TOPBOTS.

The post Graph Attention Networks Under the Hood appeared first on TOPBOTS.

[..] these methods are based on some form of message passing on the graph allowing different nodes to exchange information.

For accomplishing specific tasks on graphs (node classification, link prediction, etc.), a GNN layer computes the node and the edge representations through the so-called *recursive neighborhood diffusion* (or *message passing*). According to this principle, each graph node receives and aggregates features from its neighbors in order to represent the local graph structure: different types of GNN layers perform diverse aggregation strategies.

The simplest formulations of the GNN layer, such as Graph Convolutional Networks (GCNs) or GraphSage, execute an isotropic aggregation, where each neighbor contributes equally to updating the representation of the central node. This blog post introduces a mini-series (2 articles) dedicated to the analysis of Graph Attention Networks (GATs), which define an anisotropic operation in the recursive neighborhood diffusion. Exploiting this anisotropic paradigm, the learning capacity is improved by the attention mechanism, which assigns different importance to each neighbor’s contribution.

If you are completely new to GNNs and related concepts, I invite you to read the following introductory article: Understanding the Building Blocks of Graph Neural Networks (Intro).

*This warm-up is based on the GAT details reported on the Deep Graph Library website.*

Before understanding the GAT layer’s behavior, let’s recap the math behind the aggregation performed by the GCN layer.
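Following the DGL formulation, the update performed by a GCN layer for node *i* at layer *l* can be written as:

```
h_i^{(l+1)} = \sigma\left( \sum_{j \in \mathcal{N}(i)} \frac{1}{c_{ij}} \, W^{(l)} \, h_j^{(l)} \right)
```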

- *N* is the set of the one-hop neighbors of node *i*. This node could also be included among the neighbors by adding a self-loop.
- *c* is a normalization constant based on the graph structure, which defines an isotropic average computation.
- *σ* is an activation function, which introduces non-linearity in the transformation.
- *W* is the weight matrix of learnable parameters adopted for feature transformation.

The GAT layer expands the basic aggregation function of the GCN layer, assigning different importance to each edge through the attention coefficients.

- **Equation (1)** is a linear transformation of the lower-layer embedding *h_i*, and *W* is its learnable weight matrix. This transformation helps achieve sufficient expressive power to transform the input features into high-level and dense features.
- **Equation (2)** computes a pair-wise un-normalized attention score between two neighbors. Here, it first concatenates the *z* embeddings of the two nodes, where || denotes concatenation. Then, it takes the dot product of this concatenation and a learnable weight vector *a*. In the end, a LeakyReLU is applied to the result of the dot product. The attention score indicates the importance of a neighbor node in the message passing framework.
- **Equation (3)** applies a softmax to normalize the attention scores on each node’s incoming edges. The softmax encodes the output of the previous step into a probability distribution. As a consequence, the attention scores become much more comparable across different nodes.
- **Equation (4)** is similar to the GCN aggregation (see the equation at the beginning of the section). The embeddings from the neighbors are aggregated together, scaled by the attention scores. The main consequence of this scaling process is to learn a different contribution from each neighbor node.
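Written out explicitly (in the DGL notation the descriptions above follow):

```
\begin{aligned}
z_i^{(l)} &= W^{(l)} \, h_i^{(l)} && (1)\\
e_{ij}^{(l)} &= \mathrm{LeakyReLU}\!\left( \vec{a}^{(l)\top} \left( z_i^{(l)} \,\|\, z_j^{(l)} \right) \right) && (2)\\
\alpha_{ij}^{(l)} &= \frac{\exp\!\big(e_{ij}^{(l)}\big)}{\sum_{k \in \mathcal{N}(i)} \exp\!\big(e_{ik}^{(l)}\big)} && (3)\\
h_i^{(l+1)} &= \sigma\!\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} \, z_j^{(l)} \right) && (4)
\end{aligned}
```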

The first step is to prepare the ingredients (matrices) to represent a simple graph and perform the linear transformation.

```
import numpy as np

print('\n\n----- One-hot vector representation of nodes. Shape(n,n)\n')
X = np.eye(5, 5)
n = X.shape[0]
np.random.shuffle(X)
print(X)

print('\n\n----- Embedding dimension\n')
emb = 3
print(emb)

print('\n\n----- Weight Matrix. Shape(emb, n)\n')
W = np.random.uniform(-np.sqrt(1. / emb), np.sqrt(1. / emb), (emb, n))
print(W)

print('\n\n----- Adjacency Matrix (undirected graph). Shape(n,n)\n')
A = np.random.randint(2, size=(n, n))
np.fill_diagonal(A, 1)
A = (A + A.T)
A[A > 1] = 1
print(A)
```

```
----- One-hot vector representation of nodes. Shape(n,n)

[[**0 0 1 0 0**] # node 1
 [0 1 0 0 0] # node 2
 [0 0 0 0 1]
 [1 0 0 0 0]
 [0 0 0 1 0]]

----- Embedding dimension

3

----- Weight Matrix. Shape(emb, n)

[[-0.4294049   0.57624235 **-0.3047382** -0.11941829 -0.12942953]
 [ 0.19600584  0.5029172  **0.3998854** -0.21561317  0.02834577]
 [-0.06529497 -0.31225734 **0.03973776**  0.47800217 -0.04941563]]

----- Adjacency Matrix (undirected graph). Shape(n,n)

[[**1 1 1 0 1**]
 [**1** 1 1 1 1]
 [**1** 1 1 1 0]
 [**0** 1 1 1 1]
 [**1** 1 0 1 1]]
```

The first matrix defines a one-hot encoded representation of the nodes (node 1 is shown in bold). Then, we define a weight matrix, exploiting the defined embedding dimension. I have highlighted the 3rd column vector of *W* because, as you will see shortly, this vector defines the updated representation of node 1 (whose one-hot vector has a 1 in the 3rd position). Starting from these ingredients, we can perform the linear transformation to achieve sufficient expressive power for the node features. This step aims to transform the (one-hot encoded) input features into a low-dimensional, dense representation.

```
# equation (1)
print('\n\n----- Linear Transformation. Shape(n, emb)\n')
z1 = X.dot(W.T)
print(z1)
```

```
----- Linear Transformation. Shape(n, emb)

[[**-0.3047382   0.3998854   0.03973776**]
 [ 0.57624235  0.5029172  -0.31225734]
 [-0.12942953  0.02834577 -0.04941563]
 [-0.4294049   0.19600584 -0.06529497]
 [-0.11941829 -0.21561317  0.47800217]]
```

The next operation is to introduce the self-attention coefficients for each edge. We concatenate the representation of the source node and the destination node’s representation for representing edges. This concatenation process is enabled by the adjacency matrix *A*, which defines the relations between all the nodes in the graph.

```
# equation (2) - First part
print('\n\n----- Concat hidden features to represent edges. Shape(len(emb.concat(emb)), number of edges)\n')
edge_coords = np.where(A==1)
h_src_nodes = z1[edge_coords[0]]
h_dst_nodes = z1[edge_coords[1]]
z2 = np.concatenate((h_src_nodes, h_dst_nodes), axis=1)
print(z2)
```

```
----- Concat hidden features to represent edges. Shape(len(emb.concat(emb)), number of edges)

[[**-0.3047382   0.3998854   0.03973776 -0.3047382   0.3998854   0.03973776**]
 [-0.3047382   0.3998854   0.03973776  0.57624235  0.5029172  -0.31225734]
 [-0.3047382   0.3998854   0.03973776 -0.12942953  0.02834577 -0.04941563]
 [-0.3047382   0.3998854   0.03973776 -0.11941829 -0.21561317  0.47800217]
 [**0.57624235  0.5029172  -0.31225734 -0.3047382   0.3998854   0.03973776**]
 [ 0.57624235  0.5029172  -0.31225734  0.57624235  0.5029172  -0.31225734]
 [ 0.57624235  0.5029172  -0.31225734 -0.12942953  0.02834577 -0.04941563]
 [ 0.57624235  0.5029172  -0.31225734 -0.4294049   0.19600584 -0.06529497]
 [ 0.57624235  0.5029172  -0.31225734 -0.11941829 -0.21561317  0.47800217]
 [**-0.12942953  0.02834577 -0.04941563 -0.3047382   0.3998854   0.03973776**]
 [-0.12942953  0.02834577 -0.04941563  0.57624235  0.5029172  -0.31225734]
 [-0.12942953  0.02834577 -0.04941563 -0.12942953  0.02834577 -0.04941563]
 [-0.12942953  0.02834577 -0.04941563 -0.4294049   0.19600584 -0.06529497]
 [-0.4294049   0.19600584 -0.06529497  0.57624235  0.5029172  -0.31225734]
 [-0.4294049   0.19600584 -0.06529497 -0.12942953  0.02834577 -0.04941563]
 [-0.4294049   0.19600584 -0.06529497 -0.4294049   0.19600584 -0.06529497]
 [-0.4294049   0.19600584 -0.06529497 -0.11941829 -0.21561317  0.47800217]
 [**-0.11941829 -0.21561317  0.47800217 -0.3047382   0.3998854   0.03973776**]
 [-0.11941829 -0.21561317  0.47800217  0.57624235  0.5029172  -0.31225734]
 [-0.11941829 -0.21561317  0.47800217 -0.4294049   0.19600584 -0.06529497]
 [-0.11941829 -0.21561317  0.47800217 -0.11941829 -0.21561317  0.47800217]]
```

In the previous block, I have highlighted the 4 rows representing the 4 in-edges connected to node 1. The first 3 elements of each row define the embedding representation of node 1 neighbors, while the other 3 elements of each row define the embeddings of node 1 itself (as you can notice, the first row encodes a self-loop).

After this operation, we introduce a learnable attention weight vector and take its dot product with each edge representation resulting from the concatenation process. Finally, the LeakyReLU function is applied to the output of this product.

```
# LeakyReLU with negative slope 0.01 (matches the outputs below)
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# equation (2) - Second part
print('\n\n----- Attention coefficients. Shape(1, len(emb.concat(emb)))\n')
att = np.random.rand(1, z2.shape[1])
print(att)

print('\n\n----- Edge representations combined with the attention coefficients. Shape(1, number of edges)\n')
z2_att = z2.dot(att.T)
print(z2_att)

print('\n\n----- Leaky Relu. Shape(1, number of edges)')
e = leaky_relu(z2_att)
print(e)
```

```
----- Attention coefficients. Shape(1, len(emb.concat(emb)))

[[**0.09834683 0.42110763 0.95788953 0.53316528 0.69187711 0.31551563**]]

----- Edge representations combined with the attention coefficients. Shape(1, number of edges)

[[**0.30322275**]
 [ 0.73315639]
 [ 0.11150219]
 [ 0.11445879]
 [ 0.09607946]
 [ 0.52601309]
 [-0.0956411 ]
 [-0.14458757]
 [-0.0926845 ]
 [ 0.07860653]
 [ 0.50854017]
 [-0.11311402]
 [-0.16206049]
 [ 0.53443082]
 [-0.08722337]
 [-0.13616985]
 [-0.08426678]
 [ 0.48206613]
 [ 0.91199976]
 [ 0.2413991 ]
 [ 0.29330217]]

----- Leaky Relu. Shape(1, number of edges)

[[**3.03222751e-01**]
 [ 7.33156386e-01]
 [ 1.11502195e-01]
 [ 1.14458791e-01]
 [ 9.60794571e-02]
 [ 5.26013092e-01]
 [-9.56410988e-04]
 [-1.44587571e-03]
 [-9.26845030e-04]
 [ 7.86065337e-02]
 [ 5.08540169e-01]
 [-1.13114022e-03]
 [-1.62060495e-03]
 [ 5.34430817e-01]
 [-8.72233739e-04]
 [-1.36169846e-03]
 [-8.42667781e-04]
 [ 4.82066128e-01]
 [ 9.11999763e-01]
 [ 2.41399100e-01]
 [ 2.93302168e-01]]
```

At the end of this process, we achieved a different score for each edge of the graph. In the upper block, I have highlighted the evolution of the coefficient associated with the first edge. Then, to make the coefficient easily comparable across different nodes, a softmax function is applied to all neighbors’ contributions for every destination node.

```
# softmax over a 1-D array of scores
def softmax(x):
    exp = np.exp(x - x.max())
    return exp / exp.sum()

# equation (3)
print('\n\n----- Edge scores as matrix. Shape(n,n)\n')
e_matr = np.zeros(A.shape)
e_matr[edge_coords[0], edge_coords[1]] = e.reshape(-1,)
print(e_matr)

print('\n\n----- For each node, normalize the edge (or neighbor) contributions using softmax\n')
alpha0 = softmax(e_matr[:,0][e_matr[:,0] != 0])
alpha1 = softmax(e_matr[:,1][e_matr[:,1] != 0])
alpha2 = softmax(e_matr[:,2][e_matr[:,2] != 0])
alpha3 = softmax(e_matr[:,3][e_matr[:,3] != 0])
alpha4 = softmax(e_matr[:,4][e_matr[:,4] != 0])
alpha = np.concatenate((alpha0, alpha1, alpha2, alpha3, alpha4))
print(alpha)

print('\n\n----- Normalized edge score matrix. Shape(n,n)\n')
A_scaled = np.zeros(A.shape)
A_scaled[edge_coords[0], edge_coords[1]] = alpha.reshape(-1,)
print(A_scaled)
```

```
----- Edge scores as matrix. Shape(n,n)

[[ 3.03222751e-01  7.33156386e-01  1.11502195e-01  0.00000000e+00  1.14458791e-01]
 [ 9.60794571e-02  5.26013092e-01 -9.56410988e-04 -1.44587571e-03 -9.26845030e-04]
 [ 7.86065337e-02  5.08540169e-01 -1.13114022e-03 -1.62060495e-03  0.00000000e+00]
 [ 0.00000000e+00  5.34430817e-01 -8.72233739e-04 -1.36169846e-03 -8.42667781e-04]
 [ 4.82066128e-01  9.11999763e-01  0.00000000e+00  2.41399100e-01  2.93302168e-01]]

----- For each node, normalize the edge (or neighbor) contributions using softmax

[0.26263543 0.21349717 0.20979916 0.31406823 0.21610715 0.17567419
 0.1726313  0.1771592  0.25842816 0.27167844 0.24278118 0.24273876
 0.24280162 0.23393014 0.23388927 0.23394984 0.29823075 0.25138555
 0.22399017 0.22400903 0.30061525]

----- Normalized edge score matrix. Shape(n,n)

[[**0.26263543 0.21349717 0.20979916 0.         0.31406823**]
 [0.21610715 0.17567419 0.1726313  0.1771592  0.25842816]
 [0.27167844 0.24278118 0.24273876 0.24280162 0.        ]
 [0.         0.23393014 0.23388927 0.23394984 0.29823075]
 [0.25138555 0.22399017 0.         0.22400903 0.30061525]]
```

To interpret the meaning of the last matrix defining the normalized edge scores, let’s recap the adjacency matrix’s content.

```
----- Adjacency Matrix (undirected graph). Shape(n,n)

[[**1 1 1 0 1**]
 [1 1 1 1 1]
 [1 1 1 1 0]
 [0 1 1 1 1]
 [1 1 0 1 1]]
```

As you can see, instead of having 1 values to define edges, we rescaled the contribution of each neighbor. The final step is to compute the neighborhood aggregation: the embeddings from neighbors are incorporated into the destination node, scaled by the attention scores.

```
# equation (4)
print('\n\nNeighborhood aggregation (GCN) scaled with attention scores (GAT). Shape(n, emb)\n')
ND_GAT = A_scaled.dot(z1)
print(ND_GAT)
```

```
Neighborhood aggregation (GCN) scaled with attention scores (GAT). Shape(n, emb)

[[**-0.02166863  0.15062515  0.08352843**]
 [-0.09390287  0.15866476  0.05716299]
 [-0.07856777  0.28521023 -0.09286313]
 [-0.03154513  0.10583032  0.04267501]
 [-0.07962369  0.19226439  0.069115  ]]
```

In a future article, I will describe the mechanisms behind the Multi-Head GAT Layer, and we will see some applications for the link prediction task.

- The running version of the code is available in the following notebook. You will also find a DGL implementation, which is useful to check the correctness of the implementation.
- The original paper on Graph Attention Networks from Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio is available on arXiv.
- For a deep explanation of the topic, I also suggest the video from Aleksa Gordić.

*This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.*


The post Graph Attention Networks Under the Hood appeared first on TOPBOTS.

The post Graph Transformer: A Generalization of Transformers to Graphs appeared first on TOPBOTS.

We present Graph Transformer, a transformer neural network that can operate on arbitrary graphs.

- Background
- Objective
- Key Design Aspects for Graph Transformer
- Proposed Graph Transformer Architecture
- Remarks From the Experiments

Let’s start with the two keywords, **Transformers** and **Graphs**, for background.

Transformer-based neural networks [1] are the most successful architectures for representation learning in Natural Language Processing (NLP), overcoming the bottlenecks caused by the sequential processing of Recurrent Neural Networks (RNNs). At the core of Transformers is the multi-head attention mechanism, represented by this formula:
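The formula in question is the scaled dot-product attention of Vaswani et al. [1]; multi-head attention concatenates *H* independent heads of it:

```
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
\mathrm{MultiHead}(Q, K, V) = \big[\,\mathrm{head}_1 \,\|\, \cdots \,\|\, \mathrm{head}_H\,\big]\, W^{O}
```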

Using multi-head attention, a word attends to each other word in a sentence and combines the received information to generate its abstract feature representations.

Graphs are ubiquitous data structures. There is a wide range of application domains where datasets can be represented as graphs. For example, molecular graphs in chemistry, interactions among particles in physics, drug-protein interactions in medicine, users and their social and commercial connections in social media, problems in combinatorial optimization, etc.

For learning on graphs, graph neural networks (GNNs) have emerged as the most powerful tool in deep learning.

In short, GNNs consist of several parameterized layers, with each layer taking in a graph with node (and edge) features and building abstract feature representations of nodes (and edges) by taking the available explicit connectivity structure (*i.e.,* graph structure) into account. The so-generated features are then passed to downstream classification layers, usually MLPs, to predict the target property.

Now, the target is to generalize transformer neural networks to graphs so that they can learn on graphs and datasets with arbitrary structure, rather than only sequential structure (which is how NLP Transformers can be interpreted to operate).

To proceed with the objective, we focus on extending the key design principles of Transformers from NLP to graphs in general.

We find that attention using graph sparsity and positional encodings are two key design aspects for the generalization of transformers to arbitrary graphs.

Now, we discuss these from the contexts of both NLP and graphs to make the proposed extensions clear for Graph Transformer.

In NLP, Transformers consider full attention while building feature representations for words. That is, a transformer treats a sentence as a fully connected graph of words. This choice of full attention can be justified for *two* reasons:

First, it is difficult to find meaningful sparse interactions or connections among the words in a sentence. For instance, the dependency of a word in a sentence on another word can vary with context, user’s perspective, and application at hand. There can be numerous plausible ground truth connections among words in a sentence and therefore, text datasets of sentences do not often have explicit word interactions available. It thereby makes sense to perform full attention and let the model decide how the words depend on others.

Second, the so-interpreted fully connected graph in NLP often has fewer than tens or hundreds of nodes (*i.e.,* sentences are often less than tens or hundreds of words). On graphs of this size, attention over every node is computationally feasible in memory and time.

For these two reasons, full attention can be performed in NLP transformers and subsequent works [2,3,4] have shown it to be fruitful in language modeling and several NLP tasks.

However, actual graph datasets have arbitrary connectivity structure, determined by the application domain, and node counts ranging up to millions or even billions. The available structure presents a rich source of information to exploit while learning, whereas the node counts make it practically impossible to use a fully connected graph for such datasets.

On these accounts, it is practical (for feasibility) and advantageous (for utilizing sparse structure information) to have a Graph Transformer where a node attends to its local node neighbors, similar to Graph Attention Networks (GATs) [5].

In fact, the local information aggregation is at the core of GNNs, indicative of the fact that sparsity is a good inductive bias for generalization.

The attention mechanism in the Transformer is invariant to the ordering of the nodes: it has no notion of where in the sequence (or the sentence) a word is located. This means that the Transformer regards a sentence as a multi-set of words rather than a sequence of words, as illustrated by the following comparison:

That would mean losing some information about the ordering of the words, wouldn’t it?

To avoid this and make the Transformer aware of the sequential information, some kind of positional encoding is necessary. The original transformer by Vaswani et al. [1] uses a sinusoidal positional encoding that is added to each word’s feature vector at the input. This helps encode the prevalent (sequential) relationships among the words into the model.

We extend this critical design block of positional information encoding for Graph Transformer. In fact, a line of research in GNNs [6,7,8] has recently shown that positional information improves GNNs and overcomes the failure of GNNs for several fundamental tasks.

We therefore leverage the success of the recent works on positional information in GNNs and use Laplacian Positional Encodings [8] in Graph Transformer. We use precomputed Laplacian eigenvectors [9] to add into the node features before the first layer, similar to how positional encodings are added in the original Transformer [1].

Laplacian PEs are a natural generalization of the sinusoidal PEs used in the original transformer, as the sinusoidal PEs can be interpreted as eigenvectors of the line graph, which is the sentence in NLP.
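As a concrete illustration (our own NumPy sketch, not the paper's code), Laplacian PEs can be computed from a graph's adjacency matrix; the helper `laplacian_pe` and the cycle-graph example below are illustrative assumptions:

```python
import numpy as np

def laplacian_pe(A, k):
    """k-dimensional Laplacian PE from an adjacency matrix A, using the
    symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
    Assumes every node has at least one edge."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt
    # eigh returns eigenvalues of a symmetric matrix in ascending order
    _, eigvecs = np.linalg.eigh(L)
    # drop the trivial eigenvector (eigenvalue 0), keep the k smallest non-trivial ones
    return eigvecs[:, 1:k + 1]

# a 6-node cycle graph: its Laplacian eigenvectors are discrete sinusoids,
# mirroring the sinusoidal PEs of a sentence seen as a line graph
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

pe = laplacian_pe(A, k=2)  # each node receives a 2-dimensional PE
```

These per-node vectors would then simply be added to the node features before the first layer.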

Hence, sparse graph structure during attention and positional encodings at the inputs are the two important things we consider while generalizing transformers to arbitrary graphs.

We now present the proposed architecture — the Graph Transformer Layer and the Graph Transformer Layer with edge features. The schematic diagram of a layer, shown below, consists of the major components — the input with PEs, the multi-head attention mechanism with attention restricted to local neighbors, and the feed-forward module.

Compared to the standard transformer of Vaswani et al. [1], the key differences (or extensions) that generalize the transformer to graphs, resulting in a Graph Transformer, are:

i) The attention mechanism is a function of the *neighborhood connectivity* for each node, as shown by the formula:

ii) Positional Encoding is represented by *Laplacian PEs*. In particular, the eigenvectors of the graph Laplacian are precomputed for every graph before training, and each node's entries in the *k* smallest non-trivial eigenvectors are assigned as that node's PE.

iii) The feed-forward module employs *batch normalization* [10] instead of the layer normalization [11] used in the original transformer [1]. This is supported by our empirical evidence that batch normalization leads to better performance than layer normalization.

iv) Graph Transformer is extended to have an *edge representation* (see the *Graph Transformer Layer with edge features* at the right of the architecture diagram). This architecture can be critical for datasets with rich information along the edges, for instance bond information in molecular graphs or relationship types in knowledge graphs. There are two things to note in this edge-extended architecture: the edge features are fused with the corresponding pairwise implicit attention scores, and there is a designated edge feature pipeline at every layer, as shown by the following layer update equations:
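To make the normalization choice in iii) concrete, here is a NumPy sketch (our own illustration, not the paper's code) of the axis each scheme normalizes over for a batch of node features; the learnable affine parameters and running statistics are omitted:

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    # batch normalization: normalize each feature channel over the batch of nodes
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def layer_norm(h, eps=1e-5):
    # layer normalization: normalize each node's feature vector over its channels
    return (h - h.mean(axis=1, keepdims=True)) / np.sqrt(h.var(axis=1, keepdims=True) + eps)

h = np.random.RandomState(0).randn(100, 8)  # 100 nodes, 8 features
hb, hl = batch_norm(h), layer_norm(h)
```

Batch normalization shares statistics across the nodes of a batch, while layer normalization treats each node independently; the paper's finding is that the former works better here.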

This concludes the description of the proposed Graph Transformer. We refer to the paper for the complete layer update equations.

We evaluate Graph Transformer on benchmark datasets and validate our design choices, alongside an attempt to answer some open research questions: i) whether to use local or full attention when generalizing transformers to graphs, ii) how to encode sparse structure information, and iii) which positional encoding candidates to use.

We highlight the following results:

a) As already discussed, graph sparsity is critical, and its use in the attention mechanism consistently gives better performance than full attention. The result that *sparsity* is a good inductive bias for graph datasets has already been shown by the success of GNNs in several application areas.

b) Among the several combinations of design choices on attention, use of PEs, normalization candidates, etc., the architecture *using* i) attention to local neighbors, ii) Laplacian PEs, and iii) batch normalization in the feed-forward module has the best performance across all datasets used for evaluation. This empirically validates the choice of these components for the targeted generalization of transformers.

c) Since Laplacian PEs have the desirable properties of i) carrying distance-aware information, ii) making node features distinguishable, and iii) generalizing the original transformer's PEs to arbitrary graphs, they fare well, even empirically, as the PE candidate for Graph Transformer compared to the PE candidates used in the literature on Graph Transformers (the related works are discussed in detail in the paper).

d) Overall, Graph Transformer achieves competitive performance among the GNNs it is compared against on the evaluated datasets. The proposed architecture performs significantly better than baseline GNNs (GCNs [12] and GATs [5]) and helps close the gap between the original transformer and a transformer for graphs. Thus, Graph Transformer emerges as a fresh, powerful attention-based GNN baseline that we hope can easily be extended in future research, given its simplicity and straightforward generalization from transformers.

The tables of the numerical experiments are in the paper, the code implementation is open-sourced on GitHub, and an accompanying video presentation is on YouTube; the corresponding links are as follows:

Paper: https://arxiv.org/abs/2012.09699

GitHub: https://github.com/graphdeeplearning/graphtransformer

Video: https://youtu.be/h-_HNeBmaaU?t=240

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need.

[2] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., (2018). Bert: Pre-training of deep bidirectional transformers for language understanding.

[3] Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., (2018). Improving language understanding by generative pre-training.

[4] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., (2020). Language models are few-shot learners.

[5] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P. and Bengio, Y. (2018). Graph attention networks.

[6] Srinivasan, B. and Ribeiro, B., (2019). On the equivalence between positional node embeddings and structural graph representations.

[7] You, J., Ying, R. and Leskovec, J., (2019). Position-aware graph neural networks.

[8] Dwivedi, V. P., Joshi, C. K., Laurent, T., Bengio, Y., and Bresson, X. (2020). Benchmarking graph neural networks.

[9] Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation.

[10] Ioffe, S. and Szegedy, C., (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift.

[11] Ba, J.L., Kiros, J.R. and Hinton, G.E., (2016). Layer normalization.

[12] Kipf, T. N., and Welling, M. (2016). Semi-supervised classification with graph convolutional networks.

We’ll let you know when we release more technical education.

The post Graph Transformer: A Generalization of Transformers to Graphs appeared first on TOPBOTS.

The post Autoencoders: Overview of Research and Applications appeared first on TOPBOTS.

Most early representation learning ideas revolve around linear models such as factor analysis, Principal Components Analysis (PCA) or sparse coding. **Since these approaches are linear, they may not be able to find disentangled representations of complex data such as images or text**. Especially in the context of images, simple transformations such as change of lighting may have very complex relationships to the pixel intensities. Therefore, there is a need for deep non-linear encoders and decoders, transforming data into its hidden (hopefully disentangled) representation and back.

**Autoencoders are neural network models designed to learn complex non-linear relationships between data points.** Usually, autoencoders consist of multiple neural network layers and are trained to reconstruct the input at the output (hence the name *auto*encoder). In this post, I will try to give an overview of the various types of autoencoders developed over the years and their applications.

In general, the assumption of using autoencoders is that the highly complex input data can be described much more succinctly if we correctly take into account the geometry of the data points. Consider, for instance, the so-called “swiss roll” manifold depicted in Figure 1. Although the data originally lies in 3-D space, it can be more briefly described by “unrolling” the roll and laying it out on the floor (2-D). Note that a linear transformation of the swiss roll is not able to unroll the manifold. However, **autoencoders are able to learn the (possibly very complicated) non-linear transformation function.**

A simple way to make the autoencoder learn a low-dimensional representation of the input is to **constrain the number of nodes in the hidden layer**. Since the autoencoder now has to reconstruct the input using a restricted number of nodes, it will try to learn the most important aspects of the input and ignore the slight variations (i.e. noise) in the data.

In order to implement an undercomplete autoencoder, at least one hidden fully-connected layer is required. Most autoencoder architectures nowadays actually employ multiple hidden layers in order to make the architecture deeper. Empirically, **deeper architectures are able to learn better representations and achieve better generalization.** It is also customary to have the same number and size of layers in the encoder and decoder, making the architecture symmetric.

Undercomplete autoencoders do not necessarily need to use any explicit regularization term, since the network architecture already provides such regularization. However, we should nevertheless be careful about the actual capacity of the model in order to prevent it from simply memorizing the input data. One regularization option is to bind the parameters of the encoder and decoder together by simply using the transpose of the encoder weight matrix in the corresponding layer in the decoder.
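As a minimal sketch of this weight-tying idea (illustrative NumPy with made-up sizes, not a full training loop), the decoder simply reuses the transpose of the encoder weight matrix:

```python
import numpy as np

rng = np.random.RandomState(0)
d_in, d_hidden = 64, 16           # undercomplete: hidden layer smaller than input
W = rng.randn(d_hidden, d_in) * 0.1
b_enc = np.zeros(d_hidden)
b_dec = np.zeros(d_in)

def encode(x):
    return np.tanh(x @ W.T + b_enc)

def decode(h):
    # tied weights: the decoder reuses the transpose of the encoder matrix,
    # halving the number of parameters to learn
    return h @ W + b_dec

x = rng.randn(5, d_in)            # a batch of 5 inputs
x_hat = decode(encode(x))
```

Training would then minimize the reconstruction error between `x` and `x_hat` with respect to `W` and the biases.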

Applications of undercomplete autoencoders include **compression, recommendation systems as well as outlier detection**. Outlier detection works by checking the reconstruction error of the autoencoder: if the autoencoder is able to reconstruct the test input well, it is likely drawn from the same distribution as the training data. If the reconstruction is bad, however, the data point is likely an outlier, since the autoencoder didn’t learn to reconstruct it properly.
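The outlier-detection logic can be sketched as follows; the reconstructions here are a stand-in for a trained autoencoder's output, and the 95% quantile threshold is an arbitrary assumption to be tuned per application:

```python
import numpy as np

def outlier_scores(x, x_hat):
    """Per-sample reconstruction error (mean squared error)."""
    return ((x - x_hat) ** 2).mean(axis=1)

def flag_outliers(scores, quantile=0.95):
    # flag samples whose reconstruction error exceeds a chosen quantile
    return scores > np.quantile(scores, quantile)

rng = np.random.RandomState(0)
x = rng.randn(100, 8)
x_hat = x + 0.01 * rng.randn(100, 8)   # stand-in for a trained autoencoder's output
x_hat[7] = 0.0                          # sample 7 reconstructs badly: likely an outlier

scores = outlier_scores(x, x_hat)
flags = flag_outliers(scores)
```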

Since convolutional neural networks (CNN) perform well at many computer vision tasks, it is natural to consider convolutional layers for an image autoencoder. Usually, pooling layers are used in convolutional autoencoders alongside convolutional layers to reduce the size of the hidden representation layer. The hidden layer is often preceded by a fully-connected layer in the encoder and it is reshaped to a proper size before the decoding step. Since the output of the convolutional autoencoder has to have the same size as the input, we have to resize the hidden layers. In principle, we can do this in two ways:

- **Upsampling the hidden layer** before every convolutional layer, e.g. with bilinear interpolation, or
- Using specialized **transposed convolution layers** to perform a trainable form of upsampling.

The second option is more principled and usually provides better results; however, it also increases the number of parameters of the network and may not be suitable for all kinds of problems, especially if there is not enough training data available.
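The shape arithmetic behind the two options can be sketched with the standard convolution size formulas (the helper names are ours; this is a sketch, not library code):

```python
def conv_out(n, k, s=1, p=0):
    """Output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def tconv_out(n, k, s=1, p=0):
    """Output size of a transposed convolution: (n - 1) * s - 2p + k."""
    return (n - 1) * s - 2 * p + k

# a stride-2 convolution halves a 32x32 feature map;
# the matching transposed convolution restores the original size
h = conv_out(32, k=4, s=2, p=1)      # 16
h_up = tconv_out(h, k=4, s=2, p=1)   # 32
```

This inverse relationship is exactly why transposed convolutions are the natural decoder counterpart to strided convolutions in the encoder.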

Convolutional autoencoders are frequently used in **image compression and denoising**. In the case of denoising, the network is called a **denoising autoencoder** and it is trained differently from the standard autoencoder: instead of trying to reconstruct the input at the output, the input is corrupted by an appropriate noise signal (e.g. Gaussian noise) and the autoencoder tries to predict the denoised output.

Convolutional autoencoders may also be used in **image search applications**, since the hidden representation often carries semantic meaning. Therefore, similarity search on the hidden representations yields better results than similarity search on the raw image pixels. It is also significantly faster, since the hidden representation is usually much smaller.

As I already mentioned, undercomplete autoencoders use an implicit regularization by constricting the size of the hidden layers compared to the input and output. Sparse autoencoders now introduce an explicit regularization term for the hidden layer. Therefore, the restriction that the hidden layer must be smaller than the input is lifted and we may even think of overcomplete autoencoders with hidden layer sizes that are larger than the input, but optimal in some other sense.

For example, we might introduce an L1 penalty on the hidden layer to obtain a sparse distributed representation of the data distribution. This will force the autoencoder to select only a few nodes in the hidden layer to represent the input data. Note that this penalty is qualitatively different from the usual L2 or L1 penalties introduced on the weights of neural networks during training. In this case **we restrict the hidden layer values instead of the weights**. In contrast to weight decay, this procedure is not quite as theoretically founded, with no clear underlying probabilistic description. However, it is an intuitive idea and it works very well in practice.

Another penalty we might use is the KL-divergence. In this case, we introduce a sparsity parameter ρ (typically something like 0.005 or another very small value) that will denote the average activation of a neuron over a collection of samples. In our case, ρ will be assumed to be the parameter of a Bernoulli distribution describing the average activation. We will also calculate ρ_hat, the true average activation of all examples during training. The KL-divergence between the two Bernoulli distributions is given by:

Here, s₂ is the number of neurons in the hidden layer. This is a differentiable function and may be added to the loss function as a penalty.
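Since the formula itself was rendered as an image in the original post, here is a NumPy sketch of the standard sparse-autoencoder penalty, KL(Bernoulli(ρ) ‖ Bernoulli(ρ̂_j)) summed over the hidden units (the example activations below are made up):

```python
import numpy as np

def kl_sparsity_penalty(rho, rho_hat, eps=1e-12):
    """Sum over the s2 hidden units of
    rho * log(rho / rho_hat_j) + (1 - rho) * log((1 - rho) / (1 - rho_hat_j))."""
    rho_hat = np.clip(rho_hat, eps, 1 - eps)  # numerical safety
    return float(np.sum(rho * np.log(rho / rho_hat)
                        + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))))

rho = 0.005                            # target average activation
rho_hat = np.array([0.005, 0.2, 0.5])  # observed average activation per hidden unit
penalty = kl_sparsity_penalty(rho, rho_hat)
```

The penalty is zero exactly when every unit's average activation matches ρ, and grows as units become more active than the target.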

An interesting approach to regularizing autoencoders is given by the assumption that for very similar inputs, the outputs will also be similar. We can **enforce this assumption by requiring that the derivative of the hidden layer activations is small with respect to the input**. This will make sure that small variations of the input will be mapped to small variations in the hidden layer. The name contractive autoencoder comes from the fact that we are trying to **contract** a small cluster of inputs to a small cluster of hidden representations.

Specifically, we include a term in the loss function which penalizes the Frobenius norm (matrix L2-norm) of the Jacobian of the hidden activations w.r.t. the inputs:

Here, *h_j* denotes the hidden activations, *x_i* the inputs, and ||·||_F the Frobenius norm.
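For a single-layer sigmoid encoder h = σ(Wx), the Jacobian is diag(h(1−h))·W, so the penalty has a simple closed form; the following NumPy sketch (our own illustration, not the post's code) computes it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(W, x):
    """Squared Frobenius norm of the Jacobian dh/dx for h = sigmoid(W x).
    Since J = diag(h * (1 - h)) @ W, the norm factorizes per hidden unit."""
    h = sigmoid(W @ x)
    return float(np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1)))

rng = np.random.RandomState(0)
W = rng.randn(4, 6)   # 6 inputs, 4 hidden units
x = rng.randn(6)
penalty = contractive_penalty(W, x)
```

In training, this term would be scaled by a hyperparameter and added to the reconstruction loss.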

The crucial difference between variational autoencoders and other types of autoencoders is that **VAEs view the hidden representation as a latent variable with its own prior distribution**. This gives them a proper Bayesian interpretation. Variational autoencoders are **generative models with properly defined prior and posterior data distributions.**

More specifically, the variational autoencoder models the joint probability of the input data and the latent representation as *p(x, z) = p(x|z) p(z)*. The generative process is defined by drawing a latent variable from *p(z)* and passing it through the decoder given by *p(x|z)*. As with the other autoencoder types, the decoder is a learned parametric function.

In order to find the optimal hidden representation of the input (the encoder), we have to calculate *p(z|x) = p(x|z) p(z) / p(x) *according to Bayes’ Theorem. The issue with applying this formula directly is that the denominator requires us to marginalize over the latent variables. In other words, we have to compute the integral over all possible latent variable configurations. This is usually intractable. Instead, we turn to variational inference.

**In variational inference, we use an approximation q(z|x) of the true posterior p(z|x). q(z|x) is explicitly designed to be tractable.** In our case,

We may recognize the first term as the maximum likelihood of the decoder with *n* samples drawn from the prior (encoder). The second term is new for variational autoencoders: it pushes the variational posterior *q* toward the prior *p*, using the KL-divergence as a measure. Furthermore, *q* is chosen such that it factorizes over the *m* training samples, which makes it possible to train using stochastic gradient descent. While this is intuitively understandable, you may also derive this loss function rigorously. If you are familiar with Bayesian inference, you may also recognize the loss function as maximizing the Evidence Lower BOund (ELBO).

We usually choose a simple distribution as the prior *p(z)*. In many cases, it is simply the univariate Gaussian distribution with mean 0 and variance 1 for all hidden units, leading to a particularly simple form of the KL-divergence (please have a look here for the exact formulas). *q* is also usually chosen as a Gaussian distribution, univariate or multivariate.
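With a standard normal prior and a diagonal Gaussian *q*, the KL term takes the standard closed form −½ Σ (1 + log σ² − μ² − σ²); here is a small sketch of that formula (our own illustration, not the post's code):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) )
    = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))."""
    return float(-0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var)))

# the KL vanishes when q equals the standard normal prior
kl_zero = kl_to_standard_normal(np.zeros(3), np.zeros(3))
# shifting one mean away from zero makes the KL positive
kl_pos = kl_to_standard_normal(np.array([1.0, 0.0, 0.0]), np.zeros(3))
```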

The only thing remaining to discuss now is how to train the variational autoencoder, since the loss function involves sampling from *q*. The sampling operation is not differentiable. Luckily, the distribution we are trying to sample from is continuous. This allows us to use a trick: **instead of backpropagating through the sampling process, we let the encoder generate the parameters of the distribution** (in the case of the Gaussian, simply the mean μ and the standard deviation σ). Then we generate a sample ε from the unit Gaussian and rescale it with the generated parameters:

Since we do not need to calculate gradients w.r.t. ε and all other derivatives are well-defined, we are done. This is called the **reparametrization trick**. Note that the reparameterization trick works for many continuous distributions, not just for Gaussians. Unfortunately, though, **it doesn't work for discrete distributions** such as the Bernoulli distribution.
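A minimal NumPy sketch of the reparametrization trick (illustrative; in a real VAE the rescaling happens inside the computation graph so that gradients flow to μ and σ):

```python
import numpy as np

rng = np.random.RandomState(0)

def sample_reparameterized(mu, sigma, n):
    # z = mu + sigma * eps with eps ~ N(0, 1): the randomness is isolated
    # in eps, so z is a deterministic, differentiable function of mu and sigma
    eps = rng.randn(n)
    return mu + sigma * eps

z = sample_reparameterized(mu=2.0, sigma=0.5, n=100_000)
```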

After training, we have two options: (i) forget about the encoder and only **use the latent representations to generate new samples from the data distribution** by sampling and running the samples through the trained decoder, or (ii) **run an input sample through the encoder, the sampling stage, and the decoder**. If we choose the first option, we will get unconditioned samples from the latent space prior. With the second option, we will get posterior samples conditioned on the input.

This already motivates the main application of VAEs: generating new images or sounds similar to the training data. When generating images, one usually uses a convolutional encoder and decoder and a dense latent vector representation. Multiple different versions of variational autoencoders appeared over the years, including Beta-VAEs which aim to generate a particularly disentangled representations, VQ-VAEs to overcome the limitation of not being able to use discrete distributions as well as conditional VAEs to generate outputs conditioned on a certain label (such as faces with a moustache or glasses). See Figure 3 for an example output of a recent variational autoencoder incarnation.

Autoencoders form a very interesting group of neural network architectures with many applications in computer vision, natural language processing and other fields. Although there are certainly other classes of models used for representation learning nowadays, such as siamese networks, autoencoders remain a good option for a variety of problems, and I still expect a lot of improvements in this field in the near future.


The post Autoencoders: Overview of Research and Applications appeared first on TOPBOTS.

The post Variational Methods in Deep Learning appeared first on TOPBOTS.

Bayesian variational inference provides a natural framework for these issues, since the very idea of Bayesian learning is to infer the shapes of distributions instead of point estimates of parameters. Unfortunately, the added complexity of this approach makes it hard to use in deep neural networks.

In this post, I will try to show how we can overcome these difficulties through an approach known as **probabilistic programming**. This method allows us to largely automatize the process of statistical inference in the models, making it easy to use without having to know all the tricks and intricacies of Bayesian inference in large models.

Variational inference is an essential technique in Bayesian statistics and statistical learning. It was originally developed as an alternative to Monte-Carlo techniques. Like Monte-Carlo, variational inference allows us to sample from and analyze distributions that are too complex to calculate analytically.

In variational inference, we use a distribution which is easy to sample from and adjust its parameters to resemble a target posterior distribution as closely as possible. Surprisingly, we may perform this approximation even though we do not know the target distribution exactly.

More concretely, let us assume that we are given a set of latent (i.e. unobserved) variables ** Z_1, …, Z_n** and some data

Let us now define a parametrized statistical model including both ** Z** and

In this context, P(Z) is known as the **prior**, while P(X|Z) is the **likelihood** of the data given the latents. The learning criterion is usually taken to be maximizing the log-evidence:

Hereby, P(X) is called the evidence, since it describes the probability of the data (evidence) with parameters θ. P(X) is defined as:

Unfortunately, this integral is usually intractable even for known values of θ. If we tried to maximize the log-evidence directly, we would have to calculate the integral anew for every value of θ during training.

During training, we are also interested in calculating the probability of the latent variables given the data, which is given by Bayes' theorem as:

This probability is called the **posterior** in Bayesian literature and the procedure of calculating it is often referred to as inference. Note that this quantity is intractable due to the evidence term in the denominator.

In variational inference, we now define a **variational distribution** ** Q(Z)** to approximate the posterior for the given data:

We define ** Q** such that we can easily sample from it. In principle, we are free to take any distribution we like, however if

- maximize the log-evidence,
- make *Q* approximate *P* as closely as possible.

Although this task now seems even more difficult than the original one, I will show you how to efficiently solve it using gradient descent by defining an appropriate loss function.

As mentioned above, our goal is to make ** Q** as close to

- a term that maximizes the log-evidence, even though indirectly,
- some measure of the closeness of *Q* and *P*.

It turns out that the **Kullback-Leibler divergence** (KL-divergence) is a good measure for the distance between distributions. It is defined as

By shuffling the terms of this equation around (you can find the precise derivation on Wikipedia) we arrive at the equation

This equation is significant since it tells us that we can write the intractable log-evidence as the sum of a KL-divergence term and a term we will call the **Evidence Lower BOund (ELBO)**. Since the KL-divergence is non-negative, the ELBO is a lower bound on the log-evidence, and maximizing the ELBO therefore also pushes up the evidence. Our loss function is thus:

The terms inside the expectation are easy to calculate since we have everything we need: the log-joint is simply the sum of the log-prior and the log-likelihood, and log-Q is tractable by definition. We will look at how to optimize the ELBO in the next section.
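As a sanity check (a toy conjugate model of our own choosing, not from the text), we can estimate the ELBO by Monte-Carlo in a Gaussian model where the log-evidence is known exactly, and observe that the ELBO indeed lower-bounds it:

```python
import numpy as np

rng = np.random.RandomState(0)

def log_normal(v, mean, var):
    # log-density of a univariate Gaussian
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

# toy model: prior z ~ N(0,1), likelihood x|z ~ N(z,1) => evidence x ~ N(0,2)
def elbo_estimate(x, q_mean, q_var, n_samples=200_000):
    # Monte-Carlo estimate of E_q[ log P(x, z) - log Q(z) ]
    z = q_mean + np.sqrt(q_var) * rng.randn(n_samples)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    log_q = log_normal(z, q_mean, q_var)
    return float(np.mean(log_joint - log_q))

x = 1.0
log_evidence = float(log_normal(x, 0.0, 2.0))     # exact
elbo = elbo_estimate(x, q_mean=0.0, q_var=1.0)    # Q is not the true posterior
```

The gap between the two quantities is exactly the KL-divergence between *Q* and the true posterior, which vanishes only when *Q* matches it.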

Recently, a new class of log-evidence bounds emerged called Importance-Weighted Lower Bound (IWLB) defined as

This bound is equal to the ELBO for K=1 and is otherwise tighter than the ELBO. For more information, please refer to [1].

Let us now figure out how to estimate the gradient of ELBO using arbitrary stochastic functions ** P** and

In other words, we have to get the gradient computation inside the expectation. How do we do this in general, if the expectation depends on the gradient parameters?

If ** Q** has a particular form, it turns out that we may circumvent the problem by

As you can see, the second expectation does not depend on φ anymore. Therefore, we may pull the gradient computation inside the expectation. As an example, consider the reparameterization of the normal distribution:

Since N(0,1) does not depend on any parameters, we may freely differentiate the transformation w.r.t. its parameters.

Unfortunately, this trick does not work for all distributions. In particular, it fails with discrete distributions. In this case, our only hope is a so-called REINFORCE estimator. This estimator uses the following equation:

Therefore, we may rewrite the expectation gradient as

This solves our issue of differentiability and provides us with a Monte-Carlo estimator. Unfortunately, this estimator tends to have a large variance. In some cases, it is not even possible to efficiently calculate the gradient at all. Fortunately, it is often possible to reduce the variance of the estimator, using e.g. the structure of the model or a baseline reduction similar to the one used in policy gradient methods in reinforcement learning. For more details, please refer to [2].
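A minimal NumPy sketch of the REINFORCE estimator for a Bernoulli latent (our own toy example); note how many samples are needed for a stable estimate, which reflects the variance issue just mentioned:

```python
import numpy as np

rng = np.random.RandomState(0)

def reinforce_grad(f, p, n_samples=500_000):
    """Estimate d/dp E_{z ~ Bernoulli(p)}[f(z)] as E[f(z) * d/dp log q(z)],
    where the score function is d/dp log q(z) = z/p - (1 - z)/(1 - p)."""
    z = (rng.rand(n_samples) < p).astype(float)
    score = z / p - (1 - z) / (1 - p)
    return float(np.mean(f(z) * score))

f = lambda z: 3.0 * z + 1.0   # E[f(z)] = 3p + 1, so the exact gradient is 3
grad = reinforce_grad(f, p=0.3)
```

A baseline subtracted from `f(z)` would reduce the estimator's variance without changing its expectation, which is the trick borrowed from policy gradient methods.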

I will now present an application of the above variational framework: the variational autoencoder [3]. The variational autoencoder is a directed probabilistic generative model. If you are unfamiliar with the basics of variational autoencoders, have a look at this great tutorial.

At runtime, the variational autoencoder takes a random value sampled from a prior ** P(Z)** and passes it through a neural network called the decoder

We may write the joint probability density of the variational autoencoder as

This means that for every data point ** Xi** we have a point in latent space

In order to avoid this and learn a single set of parameters, we introduce another neural network, the encoder ** Q(Z|X)**, to represent a variational estimate of the posterior

We learn the optimal latent space and minimize the reconstruction error of the autoencoder simultaneously by maximizing the ELBO of this model. Since we are usually working with a continuous distribution, we can use the reparametrization trick in order to compute the derivatives. The ELBO can be written as

The first term actually corresponds to the reconstruction error, so it could be the mean squared error. The second term minimizes the distance between the encoder and the prior given the data point.

Furthermore, since the joint distribution ** P(X,Z)** factors over all data points, we may freely use mini-batch sampling as we usually do for feed-forward neural networks. You may also use your favorite optimizer (SGD, Adam etc.) for the optimization procedure. One implementation detail: please be aware that most implementations of optimizers only allow you to minimize a value instead of maximizing as we are doing here. In this case, you can just minimize the negative of the ELBO and you’re good to go.

This blog post focused on the applications of variational inference in deep learning. As you hopefully saw, variational inference can be automatized to a large degree by solving gradient estimation for a wide variety of distributions. However, one always has to be mindful of the estimator variance, especially when using the REINFORCE estimator. Fortunately, frameworks such as Pyro make variational inference and probabilistic reasoning simple to use and often also take care of variance reduction and other tricks.

[1] Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. “Importance weighted autoencoders.” *arXiv preprint arXiv:1509.00519* (2015).

[2] http://pyro.ai/examples/svi_part_iii.html

[3] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” *arXiv preprint arXiv:1312.6114* (2013).


The post Variational Methods in Deep Learning appeared first on TOPBOTS.

The post Step-By-Step Implementation of GANs on Custom Image Data in PyTorch: Part 2 appeared first on TOPBOTS.

In case you would like to follow along, here is the **Github Notebook** containing the source code for training GANs using the PyTorch framework.

The whole idea behind training a GAN network is to obtain a Generator network (with the most optimal model weights, layers, etc.) that is excellent at spewing out fakes that look real!

*Note: I would like to take a moment to truly appreciate Nathan Inkawhich for writing a **superb article** explaining the inner workings of DCGANs and the official **Github** repository for Pytorch that helped me with the code implementations, especially the network architectures for both Generator and Discriminator. Hopefully, the explanations I have presented in this article help you gain even further clarity (than already present in the aforementioned blogs) regarding GANs and implement them even better for your own use-case!*


One of the main reasons I started writing this article was because I wanted to try coding GANs on a custom image dataset. Most tutorials I came across were using one of the popular datasets (such as MNIST, CIFAR-10, Celeb-A, etc) that come pre-installed into the framework and ready to be used out-of-the-box. Instead, we will be working with the Human Faces dataset available on Kaggle, containing approximately 7k images — with a wide variety of side/frontal poses, age groups, gender, etc — that were scraped from the web.

After unzipping and loading the image folder (called `Humans`) into your current working directory, let's start writing our code in the Jupyter Notebook:

```
# path to the image directory
dir_data = "Humans"

# setting image shape to 32x32
img_shape = (32, 32, 3)

# listing out all file names
nm_imgs = np.sort(os.listdir(dir_data))
```

We have downsized our high-def images to a smaller resolution, i.e. 32×32, for ease of processing. Once you are through with this tutorial, you are free to re-run the code after changing the `img_shape` parameter and a few other things (which we will discuss towards the end of the article).

Next, we will be converting all our images into NumPy arrays and storing them collectively in `X_train`. Also, it is always a good idea to explicitly convert images into RGB format (just in case some image *looks* grayscale but in reality isn't!).

```
X_train = []
for file in nm_imgs:
    try:
        img = Image.open(dir_data + '/' + file)
        img = img.convert('RGB')
        img = img.resize((32, 32))
        img = np.asarray(img) / 255
        X_train.append(img)
    except:
        print("something went wrong")

X_train = np.array(X_train)
X_train.shape
```

```
************* OUTPUT ***********
(7218, 32, 32, 3)
```

Beware: there are some filenames that do not contain any image (or must be corrupted) and that’s why I tend to use try-except blocks while coding.

*Note: The process of converting images to their respective NumPy format might take a while. Hence, it is a good idea to store **X_train** locally as an **.npz** file for future use. To do so:*

```
from numpy import asarray
from numpy import savez_compressed

# save to npz file
savez_compressed('kaggle_images_32x32.npz', X_train)
```

To re-load the file at a later time:

```
# load dict of arrays
dict_data = np.load('kaggle_images_32x32.npz')

# extract the first array
data = dict_data['arr_0']

# print the array
print(data)
```

```
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
from torch.nn import Module, Sequential, Conv2d, ConvTranspose2d, LeakyReLU, BatchNorm2d, ReLU, Tanh, Sigmoid, BCELoss

%matplotlib inline
```

While the code we are working on today will run on both CPU and GPU, it is always advisable to check availability and use the latter, when possible.

```
# Always good to check whether GPU support is available
dev = 'cuda:0' if torch.cuda.is_available() else 'cpu'
device = torch.device(dev)
```

Using `torch.cuda.is_available()`, we check whether a GPU is available and, if so, set it as the device via the `torch.device` function.

We mainly require one helper function, `plot_images()`, which takes as input a NumPy array of images and displays them in a 5×5 grid.

```
# plot images in an nxn grid
def plot_images(imgs, grid_size=5):
    """
    imgs: numpy array containing all the images
    grid_size: e.g. 5 for a 5x5 grid of images
    """
    fig = plt.figure(figsize=(8, 8))
    columns = rows = grid_size
    plt.title("Training Images")
    for i in range(1, columns * rows + 1):
        fig.add_subplot(rows, columns, i)  # create the ith subplot first...
        plt.axis("off")                    # ...so axis("off") applies to it
        plt.imshow(imgs[i - 1])
    plt.show()
```

To lay out images in the form of a grid, we add a subplot to our plotting area for each image we want to display. Using `fig.add_subplot`, each subplot takes the *i*th position on a grid with `r` rows and `c` columns. Finally, the entire grid is displayed using `plt.show()`.

To see if our function works as intended, we can display a few images from our training set:

```
# load the numpy array containing the image representations
imgs = np.load('kaggle_images_32x32.npz')

# pls ignore the poor quality of the images, as we are working with 32x32 sized images
plot_images(imgs['arr_0'], 3)
```

*Note: Kindly ignore the distorted quality (as compared to the original data we saw on Kaggle) since we are working with 32×32 images instead of superior resolution!*

I know what you’re thinking — *why do I need to create a special class for my dataset? What’s wrong with using my dataset as is?*

Well, the simple answer is: that's just how PyTorch likes it! For a detailed answer, you can read this article here, which nicely explains how to use the `torch.utils.data.Dataset` class in PyTorch to create a custom `Dataset` object for any dataset.

At a very basic level, the `Dataset` class you extend for your own dataset should have `__init__`, `__len__`, and `__getitem__` methods.

In case you need further help creating the dataset class, do check out the PyTorch documentation here.

```
class HumanFacesDataset(Dataset):
    """Human Faces dataset."""

    def __init__(self, npz_imgs):
        """
        Args:
            npz_imgs (numpy array): array with all the images (created in gan.ipynb)
        """
        self.imgs = npz_imgs

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        image = self.imgs[idx]
        return image
```

Again: *why the heck do I need a DataLoader?*

Among the many reasons listed on the documentation page, such as customizing the data loading order and automatic memory pinning, the DataLoader is chiefly useful for creating batches (for both the train and test sets) to be sent as input to the model.

This is how a DataLoader is defined in PyTorch:

```
dataloader = DataLoader(dataset = dset, batch_size = batch_size, shuffle = shuffle)
```

`dset` is basically an object of the `HumanFacesDataset` class we created earlier. Let's define it, along with our `batch_size` and `shuffle` variables.

```
# Preparing dataloader for training
transpose_imgs = np.transpose(    # converts shape from (7218, 32, 32, 3) to (7218, 3, 32, 32)
    np.float32(imgs['arr_0']),    # converts double -> float (by default NumPy uses double as the data type)
    (0, 3, 1, 2)                  # tuple describing how to rearrange the dimensions
)

dset = HumanFacesDataset(transpose_imgs)  # passing the array to the constructor

batch_size = 32
shuffle = True

dataloader = DataLoader(dataset = dset, batch_size = batch_size, shuffle = shuffle)
```

An important thing to note is that while creating the dataset object, we do not simply pass the original image array to the constructor. Instead, we pass `transpose_imgs`, which has had some computations performed on it. Mainly, we are going to:

- (a) explicitly set the image representations to float using `np.float32()`. This is because NumPy uses double as its default data type (you can verify this for a single image with `imgs['arr_0'][0].dtype`, whose output is `float64`), whereas the model we will be creating has weights of type `float32`; and
- (b) reorder the dimensions of each image from (32 × 32 × 3) to (3 × 32 × 32) using `np.transpose()`, because that is how the layers in PyTorch models expect them to be.
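As a quick sanity check of both conversions, here is a throwaway snippet (it uses a dummy array in place of `imgs['arr_0']` so that it runs without the dataset):

```python
import numpy as np

# dummy stand-in for imgs['arr_0']: 4 RGB images of size 32x32
arr = np.zeros((4, 32, 32, 3))
print(arr.dtype)  # float64 - NumPy's default

# same two steps as above: cast to float32, then move channels first
transpose_imgs = np.transpose(np.float32(arr), (0, 3, 1, 2))
print(transpose_imgs.dtype)   # float32
print(transpose_imgs.shape)   # (4, 3, 32, 32)
```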

Finally, we opt for a small `batch_size` and set `shuffle` to `True` to rule out the possibility of any bias at the time of data collection.

As discussed in Part 1, the Generator is a neural network that is trying to produce (hopefully) realistic-looking images. To do so, it takes as input a random noise vector *z* (say a vector of size 100; where the choice of 100 is arbitrary), passes it through several hidden layers in the network, and finally outputs an RGB image with the same size as the training images i.e. a tensor of shape (3, 32, 32).

Frankly speaking, it took me a while to grasp the precise permutation and combination of layers (and associated parameter values) that should go into my Generator class. To lay it down in simpler terms for you, we will be working with the following layers only:

- ConvTranspose2d is the MVP that is going to help you upsample your random noise to create an image i.e. going from 100x1x1 to 3x32x32.

From the documentation, one can see that it takes the form:

```
ConvTranspose2d(in_channels, out_channels, kernel_size, stride, padding, bias)
```

- BatchNorm2d layer, as the name suggests, applies batch normalization over the input. Going by the documentation, it takes as input the number of features, or `num_features`, which can easily be read off the shape of the output of the preceding layer. Mainly, **its value is *C* if the expected input is of size (N, C, H, W)**. For instance, if the shape of the output from the previous layer is `(batch_size, 512, 4, 4)`, then `num_features = 512`.
- ReLU, or the Rectified Linear Unit, is the activation function used in the Generator network. In simpler terms, this layer outputs the input directly if it is positive; otherwise, it outputs zero.
- Tanh is another activation function that is applied at the very end of the Generator network to transform the input into the [-1, 1] range.

Finally, the Generator class with all the aforementioned layers would look something like this. In case you need a beginner’s guide on how to create networks in Pytorch, check out this article here.

At a very basic level, the `Module` class you extend for your own network model should have `__init__` and `forward` methods.

```
class Generator(Module):
    def __init__(self):
        # calling constructor of parent class
        super().__init__()
        self.gen = Sequential(
            ConvTranspose2d(in_channels = 100, out_channels = 512, kernel_size = 4, stride = 1, padding = 0, bias = False),
            # the output from the above will be b_size, 512, 4, 4
            BatchNorm2d(num_features = 512),  # from an input of size (b_size, C, H, W), pick num_features = C
            ReLU(inplace = True),

            ConvTranspose2d(in_channels = 512, out_channels = 256, kernel_size = 4, stride = 2, padding = 1, bias = False),
            # the output from the above will be b_size, 256, 8, 8
            BatchNorm2d(num_features = 256),
            ReLU(inplace = True),

            ConvTranspose2d(in_channels = 256, out_channels = 128, kernel_size = 4, stride = 2, padding = 1, bias = False),
            # the output from the above will be b_size, 128, 16, 16
            BatchNorm2d(num_features = 128),
            ReLU(inplace = True),

            ConvTranspose2d(in_channels = 128, out_channels = 3, kernel_size = 4, stride = 2, padding = 1, bias = False),
            # the output from the above will be b_size, 3, 32, 32
            Tanh()
        )

    def forward(self, input):
        return self.gen(input)
```

*Note: It is very important to understand what is going on under the hood of this network, because if you want to work with images of, say, size 64×64 or 128×128, this architecture (mainly the associated parameters) must be updated!*

To begin with, we are creating a `Sequential` model, which is a linear stack of layers. That is, the output from each layer acts as the input for the next layer.

In the first convolution, we begin with a `ConvTranspose2d` layer that takes 100 `in_channels`. Why 100, you might ask? This is because the input to the Generator network is going to be of shape `(batch_size, 100, 1, 1)`, which according to PyTorch roughly translates to a 1×1 image with 100 channels. Consequently, that many channels go into the `ConvTranspose2d` layer, and so `in_channels = 100`.

The logic behind setting `out_channels` to 512 is completely arbitrary, something I picked up from the tutorials/blogs mentioned in the beginning. The idea is to pick a large number for `out_channels` at the start and subsequently reduce it (by a factor of 2) at each `ConvTranspose2d` layer, until you reach the very last layer, where you can set `out_channels = 3`, which is precisely the number of channels we require to generate an RGB image of size 32×32.

Now, the output from this layer will have a spatial size of `(b_size, out_channels, height, width)`, where the height and width can be calculated according to the formula given on the documentation page.

Plugging in the respective values in the formula above we get:

*H_out = (1 − 1) × 1 − 2 × 0 + 1 × (4 − 1) + 0 + 1*; and
*W_out = (1 − 1) × 1 − 2 × 0 + 1 × (4 − 1) + 0 + 1*

Or, *H_out = 4* and *W_out = 4*.

And that is what you see as the spatial size (written as comments in the code above):

```
ConvTranspose2d(in_channels = 100, out_channels = 512, kernel_size = 4, stride = 1, padding = 0, bias = False),
# the output from the above will be b_size, 512, 4, 4
```
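If you would rather not do this arithmetic by hand every time, the documentation's formula can be wrapped in a tiny helper. Note that `convtranspose2d_out` is a name I made up for illustration; it is not a PyTorch function, nor is it part of the original notebook:

```python
def convtranspose2d_out(h_in, kernel_size, stride, padding, dilation=1, output_padding=0):
    # output-size formula from the PyTorch ConvTranspose2d documentation
    return (h_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1

# tracing the spatial size through the Generator's four layers
h = 1                                # the noise input is a 1x1 "image" with 100 channels
h = convtranspose2d_out(h, 4, 1, 0)  # -> 4
h = convtranspose2d_out(h, 4, 2, 1)  # -> 8
h = convtranspose2d_out(h, 4, 2, 1)  # -> 16
h = convtranspose2d_out(h, 4, 2, 1)  # -> 32
print(h)  # 32
```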

Now if you are not a mathematics wizard or feeling a little lazy to do the above calculations, you can even check the output from a layer by simply creating a dummy Generator network and passing it any random input.

For instance, we will be creating a dummy network with just the first convolutional layer:

```
class ExampleGenerator(Module):
    def __init__(self):
        # calling constructor of parent class
        super().__init__()
        self.gen = Sequential(
            ConvTranspose2d(in_channels = 100, out_channels = 512, kernel_size = 4, stride = 1, padding = 0, bias = False)
        )

    def forward(self, input):
        return self.gen(input)

# defining class object
ex_gen = ExampleGenerator()

# defining random input for the model: b_size = 2 here
t = torch.randn(2, 100, 1, 1)

# checking the shape of the output from the model
ex_gen(t).shape
```

```
************* OUTPUT ***********
torch.Size([2, 512, 4, 4])
```

The next layer we come across is `BatchNorm2d`. If you have been following the tutorial carefully, it should be amply clear why `num_features` is set to 512. To recap, it's because the output from the previous layer has a spatial size of `(b_size, 512, 4, 4)`.

Finally, we conclude the first (of the four) convolution with a ReLU activation.

The remaining three convolutions follow more or less the same pattern. I strongly encourage you to work through the calculations by hand to see how the spatial size of the input changes as it passes through each layer. This will help you set the values correctly for `in_channels`, `out_channels`, `stride`, `kernel_size`, etc. in your Generator and Discriminator networks when you are working with a differently sized image dataset (i.e. something other than 32×32×3).

An important thing to note here is that `BatchNorm2d`, `ReLU`, and `Tanh` layers do not alter the spatial size of the input, which is why the second `ConvTranspose2d` layer in the network begins with `in_channels = 512`.

As discussed in Part 1, the Discriminator is essentially a binary classification network that takes an image as input and returns the scalar probability that the image is real (as opposed to fake).

The main layers involved in a Discriminator network are as follows:

- Conv2d: as opposed to a `ConvTranspose2d` layer, which helps in upscaling an image, a `Conv2d` layer helps in downscaling an image, i.e. reducing an image of size 32×32 to 16×16 to 8×8 and so on, all the way until we are left with 1×1, i.e. a scalar value.
- LeakyReLU: a major advantage of using a `LeakyReLU` layer over `ReLU` is that it mitigates the vanishing gradient problem. In simpler terms, when the input is negative, a `ReLU` layer outputs 0, whereas `LeakyReLU` outputs a small non-zero value. Consequently, `LeakyReLU` contributes a small gradient update (instead of a zero gradient) when the input is negative, so the model can continue to learn and be updated.

- Sigmoid: This is another activation layer through which we pass our inputs to transform our data in the [0,1] range.
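To make the ReLU-vs-LeakyReLU difference concrete, here is a throwaway comparison on a toy tensor (this snippet is mine, not part of the original notebook):

```python
import torch
from torch.nn import ReLU, LeakyReLU

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])

relu_out = ReLU()(x)           # negative inputs become exactly 0
leaky_out = LeakyReLU(0.2)(x)  # negative inputs are scaled by 0.2 instead

print(relu_out)   # tensor([0., 0., 0., 1.])
print(leaky_out)  # tensor([-0.4000, -0.1000, 0.0000, 1.0000])
```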

Finally, this is what the Discriminator class looks like:

```
# Defining the Discriminator class
class Discriminator(Module):
    def __init__(self):
        super().__init__()
        self.dis = Sequential(
            # input is (3, 32, 32)
            Conv2d(in_channels = 3, out_channels = 32, kernel_size = 4, stride = 2, padding = 1, bias = False),
            # output from the above layer is b_size, 32, 16, 16
            LeakyReLU(0.2, inplace = True),

            Conv2d(in_channels = 32, out_channels = 32*2, kernel_size = 4, stride = 2, padding = 1, bias = False),
            # output from the above layer is b_size, 32*2, 8, 8
            BatchNorm2d(32 * 2),
            LeakyReLU(0.2, inplace = True),

            Conv2d(in_channels = 32*2, out_channels = 32*4, kernel_size = 4, stride = 2, padding = 1, bias = False),
            # output from the above layer is b_size, 32*4, 4, 4
            BatchNorm2d(32 * 4),
            LeakyReLU(0.2, inplace = True),

            Conv2d(in_channels = 32*4, out_channels = 32*8, kernel_size = 4, stride = 2, padding = 1, bias = False),
            # output from the above layer is b_size, 256, 2, 2
            # NOTE: the spatial size here is 2x2, hence in the final layer the kernel size must be 2 (or smaller), not 4
            BatchNorm2d(32 * 8),
            LeakyReLU(0.2, inplace = True),

            Conv2d(in_channels = 32*8, out_channels = 1, kernel_size = 2, stride = 2, padding = 0, bias = False),
            # output from the above layer is b_size, 1, 1, 1
            Sigmoid()
        )

    def forward(self, input):
        return self.dis(input)
```

It might look very similar to the Generator network, and in some sense it is: we are again working with a `Sequential` network built out of strided convolutions. However, we are now dealing with `Conv2d` layers instead of `ConvTranspose2d` layers. Within them, `out_channels` starts at a small value and is gradually increased by a factor of 2 until we reach the desired 1×1 image (i.e. *H_out* = *W_out* = 1), at which point we set `out_channels = 1`.

An important thing to note here is that the input to the last convolution layer has shape `(b_size, 256, 2, 2)`. Because the `kernel_size` can never exceed the spatial size of the input (in this case 2×2), we must set `kernel_size = 2` for the last layer (as opposed to `kernel_size = 4` in the previous layers). Failure to do so will result in a runtime error!
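You can verify the whole chain of spatial sizes with the output-size formula from the `Conv2d` documentation. As before, `conv2d_out` is a made-up helper name for illustration, not a PyTorch function:

```python
def conv2d_out(h_in, kernel_size, stride, padding, dilation=1):
    # output-size formula from the PyTorch Conv2d documentation
    return (h_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# tracing the spatial size through the Discriminator's five layers
h = 32
for k, s, p in [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (2, 2, 0)]:
    h = conv2d_out(h, k, s, p)
    print(h)  # prints 16, 8, 4, 2, 1 in turn
```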

You might be wondering what happened to the scalar value that was promised as the output from the Discriminator, given that what we have here is a tensor of shape `(b_size, 1, 1, 1)` (i.e. the output from the final layer). The good news is that we can easily convert this to a single vector containing `b_size` values using `view(-1)`. For instance, `t.view(-1)` reshapes a 4-d tensor `t` of shape `(2, 1, 1, 1)` to a 1-d tensor with 2 values only. We will see this in action in the later sections!
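Here is a standalone two-line demonstration of that reshaping (not from the original notebook):

```python
import torch

t = torch.zeros(2, 1, 1, 1)  # e.g. discriminator output for a batch of 2 images
flat = t.view(-1)

print(t.shape)     # torch.Size([2, 1, 1, 1])
print(flat.shape)  # torch.Size([2])
```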

Now that we have defined the classes for both networks, we can initialize an object for each of them.

```
# creating gen and disc
netG = Generator().to(device)
netD = Discriminator().to(device)
```

It is essential to initialize a neural network with random weights rather than letting them all be 0. That's because neurons with the same initial weight will learn the same features during training, i.e. their weights will remain identical across iterations. In short, no improvement to the model whatsoever!

Based on the several blog posts I came across, the weights for the `ConvTranspose2d` layers are randomly initialized from a normal distribution with mean = 0 and standard deviation = 0.02. For the `BatchNorm2d` layers, the mean and standard deviation of the distribution are 1 and 0.02, respectively. This applies to both the Generator and Discriminator networks.

To simultaneously initialize all the different layers in a network, we need to:

(a) define a function that takes as input a model

```
def init_weights(m):
    if type(m) == ConvTranspose2d:
        nn.init.normal_(m.weight, 0.0, 0.02)
    elif type(m) == BatchNorm2d:
        nn.init.normal_(m.weight, 1.0, 0.02)
        nn.init.constant_(m.bias, 0)
```

(b) then use `.apply()`

to recursively initialize all layers

```
# initializing the weights
netD.apply(init_weights)
netG.apply(init_weights)
```

Optimizers perform the parameter updates in a network via the `optimizer.step()` method.

```
# Setting up optimizers for both Generator and Discriminator
opt_D = optim.Adam(netD.parameters(), lr = 0.0002, betas = (0.5, 0.999))
opt_G = optim.Adam(netG.parameters(), lr = 0.0002, betas = (0.5, 0.999))
```

To check how far the predicted label for an image is from the real label, we will use `BCELoss`.

```
# Setting up the loss function - BCELoss (to check how far the predicted value is from the real value)
loss = BCELoss()
```

In Part 1, we discussed the main steps involved in training a GAN. To refresh our memory, here is the **pseudocode** (generated using the open-source code made available by PyTorch):

```
for each epoch:
    for each batch b of input images:

        ######################################
        ## Part 1: Update Discriminator - D ##
        ######################################

        # loss on real images
        clear gradients of D
        pred_labels_real = pass b through D to compute outputs
        true_labels_real = [1, 1, 1, ..., 1]
        calculate loss(pred_labels_real, true_labels_real)
        calculate gradients using this loss

        # loss on fake images
        generate batch of size b of fake images (b_fake) using G
        pred_labels_fake = pass b_fake through D
        true_labels_fake = [0, 0, ..., 0]
        calculate loss(pred_labels_fake, true_labels_fake)
        calculate gradients using this loss

        update weights of D

        ######################################
        #### Part 2: Update Generator - G ####
        ######################################

        clear gradients of G
        pred_labels = pass b_fake through D
        true_labels = [1, 1, ..., 1]
        calculate loss(pred_labels, true_labels)
        calculate gradients using this loss
        update weights of G

        ################################################
        ### Part 3: Plot a batch of Generator images ###
        ################################################
```

Now keeping this in mind, let’s start building our training function step-by-step. The coding will be divided into three parts — Part 1 dedicated to updating the discriminator, Part 2 for updating the generator, and (an optional) Part 3 for plotting a batch of generator images using the helper function we defined at the beginning of the article.

The process includes calculating loss on real and fake images.

**Code for calculating loss on real images**

```
# Loss on real images

# clear the gradient
opt_D.zero_grad()  # set the gradients to 0 at the start of each loop, because gradients accumulate on subsequent backward passes

# compute the D model output
yhat = netD(b.to(device)).view(-1)  # view(-1) reshapes a 4-d tensor of shape (b_size, 1, 1, 1) to a 1-d tensor with b_size values

# specify target labels or true labels
target = torch.ones(len(b), dtype = torch.float, device = device)

# calculate loss
loss_real = loss(yhat, target)

# calculate gradients - or rather, accumulate gradients on the loss tensor
loss_real.backward()
```

We begin by clearing the gradients for the discriminator using `zero_grad()`. It is necessary to set the gradients to 0 at the start of each loop because gradients accumulate on subsequent backward passes (i.e. when `loss.backward()` is called). Next, we store the output from the discriminator model when it is fed a batch `b` of real images (i.e. images from our training set). Remember, the shape of `b` is (32, 3, 32, 32).

It is important to note that rather than simply passing the images to the network as `netD(b)`, we first call `b.to(device)` on the batch. This is because we must put the image tensor on the same device as the model. While it may not matter much if you are running your code on a CPU, not doing so may throw a runtime error on a GPU.

Finally, as previously stated, we use `view(-1)` on the output of the model to reshape the 4-d tensor into a 1-d tensor containing the likelihood of each image being real.

We define the true labels for the real images, or `target`, as a tensor of length `len(b)` containing all 1s. We explicitly set its dtype to `float32` so that it matches the type of the images in the batch `b`. Finally, we ensure the target labels are also on the same device as the model.

Next, `BCELoss` is calculated from the predicted values and the target labels, and gradients are computed and accumulated using `backward()`.

**Code for calculating loss on fake images**

```
# Loss on fake images

# generate batch of fake images using G
# Step 1: creating noise to be fed as input to G
noise = torch.randn(len(b), 100, 1, 1, device = device)

# Step 2: feed noise to G to create a fake img (this will be reused when updating G)
fake_img = netG(noise)

# compute D model output on fake images
yhat = netD.cuda()(fake_img.detach()).view(-1)
# .cuda() is essential because our input, i.e. fake_img, is on the gpu but the model isn't (RuntimeError thrown otherwise)
# detach is imp: only track steps on your generator optimizer when training the generator, NOT the discriminator

# specify target labels
target = torch.zeros(len(b), dtype = torch.float, device = device)

# calculate loss
loss_fake = loss(yhat, target)

# calculate gradients
loss_fake.backward()

# total error on D
loss_disc = loss_real + loss_fake

# Update weights of D
opt_D.step()
```

To generate a batch of fake images, we first need a batch of random noise vectors, `noise`, which is fed to the Generator to create `fake_img`. Next, we calculate the output of the Discriminator on these fake images and store it in `yhat`. `cuda()` is essential in case our input, i.e. `fake_img`, is on the GPU but the model is not, in which case a runtime error is thrown.

An important thing to note is that we used `detach()` on the batch of fake images. The reason is that while we want to *use* the services of the Generator, we do not want to *update* it just yet (we will, once we are done updating the Discriminator). Why use `detach()`? Basically, we must track steps on our generator optimizer **only** when training the generator, **NOT** the discriminator.
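A minimal illustration of what `detach()` buys us (here `w` is just a stand-in for the Generator's parameters; this snippet is mine, not from the notebook):

```python
import torch

w = torch.randn(3, requires_grad=True)  # stand-in for the Generator's weights
fake = w * 2                            # stand-in for a generated image

# fake is attached to the graph, so gradients would flow back into w...
print(fake.requires_grad)           # True
# ...but the detached copy is cut off from the graph entirely
print(fake.detach().requires_grad)  # False
```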

Based on the explanation in Part 1, the target labels in this case form a tensor of length `len(b)` containing all zeros. The remaining steps are the same as in the previous code snippet.

The steps are roughly the same as in the case of Discriminator. The main difference is that now target labels are set to ones (instead of zeros), even though they are fake images. A detailed explanation of why we are doing so has been provided in Part 1. In short:

the Generator wants the Discriminator *to think* it is churning out real images, and so it uses 1 as the true label during training.

```
##########################
#### Update Generator ####
##########################

# clear gradient
opt_G.zero_grad()

# pass fake image through D
yhat = netD.cuda()(fake_img).view(-1)

# specify target variables - remember G wants D *to think* these are real images, so the label is 1
target = torch.ones(len(b), dtype = torch.float, device = device)

# calculate loss
loss_gen = loss(yhat, target)

# calculate gradients
loss_gen.backward()

# update weights on G
opt_G.step()
```

To see how well our Generator is doing with each passing epoch, we will plot a batch of images at every 10th iteration using the helper function `plot_images()`.

```
####################################
#### Plot some Generator images ####
####################################

# during every epoch, print images at every 10th iteration
if i % 10 == 0:
    # convert the fake images from (b_size, 3, 32, 32) to (b_size, 32, 32, 3) for plotting
    img_plot = np.transpose(fake_img.detach().cpu(), (0, 2, 3, 1))  # .detach().cpu() copies the fake_img tensor to host memory first
    plot_images(img_plot)
    print("********************")
    print(" Epoch %d and iteration %d " % (e, i))
```

Now you may notice that the dimensions in `fake_img` are reordered using `np.transpose()` before being passed to the plotting function `plot_images()`. This is because the `plt.imshow()` method (used in `plot_images()`) requires the images passed to it to be in the form `(height, width, channels)`, whereas the images output by the Generator take the form `(channels, height, width)`, which is standard in PyTorch. To fix this, we must transpose the dimensions of the fake images so that they have shape `(b_size, 32, 32, 3)`.
.

Another thing to keep in mind is that calling `.detach().cpu()` is important for copying the `fake_img` tensor to host memory before passing it to the plotting function.

This is what the final block of code for training a GAN (including Parts 1, 2, and 3) looks like:

```
# TRAINING GANS
epochs = 1000  # going over the entire dataset 1000 times

for e in range(epochs):
    # pick each batch b of input images: shape of each batch is (32, 3, 32, 32)
    for i, b in enumerate(dataloader):

        ##########################
        ## Update Discriminator ##
        ##########################

        # Loss on real images
        # clear the gradient
        opt_D.zero_grad()  # set the gradients to 0 at the start of each loop, because gradients accumulate on subsequent backward passes
        # compute the D model output
        yhat = netD(b.to(device)).view(-1)
        # specify target labels or true labels
        target = torch.ones(len(b), dtype = torch.float, device = device)
        # calculate loss
        loss_real = loss(yhat, target)
        # calculate gradients
        loss_real.backward()

        # Loss on fake images
        # Step 1: creating noise to be fed as input to G
        noise = torch.randn(len(b), 100, 1, 1, device = device)
        # Step 2: feed noise to G to create a fake img (this will be reused when updating G)
        fake_img = netG(noise)
        # compute D model output on fake images; detach so that only the discriminator is updated here
        yhat = netD.cuda()(fake_img.detach()).view(-1)
        # specify target labels
        target = torch.zeros(len(b), dtype = torch.float, device = device)
        # calculate loss
        loss_fake = loss(yhat, target)
        # calculate gradients
        loss_fake.backward()

        # total error on D
        loss_disc = loss_real + loss_fake
        # Update weights of D
        opt_D.step()

        ##########################
        #### Update Generator ####
        ##########################

        # clear gradient
        opt_G.zero_grad()
        # pass fake image through D
        yhat = netD.cuda()(fake_img).view(-1)
        # specify target variables - remember G wants D *to think* these are real images, so the label is 1
        target = torch.ones(len(b), dtype = torch.float, device = device)
        # calculate loss
        loss_gen = loss(yhat, target)
        # calculate gradients
        loss_gen.backward()
        # update weights on G
        opt_G.step()

        ####################################
        #### Plot some Generator images ####
        ####################################

        # during every epoch, print images at every 10th iteration
        if i % 10 == 0:
            # convert the fake images from (b_size, 3, 32, 32) to (b_size, 32, 32, 3) for plotting
            img_plot = np.transpose(fake_img.detach().cpu(), (0, 2, 3, 1))
            plot_images(img_plot)
            print("********************")
            print(" Epoch %d and iteration %d " % (e, i))
```

And there we have it, we have implemented a vanilla GAN from scratch using our custom image dataset! Wohoooo…

To give you a rough estimate of the quality of the images generated by our GAN:

Epoch 0 (160th iteration): Nice to see that the generator is picking up the fact that faces exist in the center of the image.

While the code I have shared with you in the Github Notebook is error-free, I would like to take a moment and discuss a few runtime errors that I encountered while learning to train GANs from scratch.

*Input type (torch.cuda.DoubleTensor) and weight type (torch.cuda.FloatTensor) should be the same*

Here, weight type refers to the weights in your model, which we explicitly set to type `float32`, if you recall. The reason you might be seeing this error is that you are feeding something to your model that is `float64` instead of `float32`, i.e. a type-mismatch problem.

In my case, I came across this error when I tried to pass a batch of images via the dataloader to the Discriminator model *without* first explicitly converting them to float using `np.float32`.

*Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same*

The error itself is self-explanatory if you carefully observe the input type (i.e. *torch.FloatTensor*) and the weight type (i.e. *torch.cuda.FloatTensor*): only one of them contains the word '*cuda*'. What it means is that your model is on the GPU whereas the input data is still on the CPU. To rectify this error, simply send your input data tensor to the GPU using `.to(device)`.

I encountered this error during Part 1 of GAN training when computing the model outputs for a batch of input images using `yhat = netD(b).view(-1)` (for calculating the discriminator loss on real images). The fix is simple: `yhat = netD(b.to(device)).view(-1)`.

Congrats on making it this far. Hopefully, this tutorial (along with Part 1) was a warm intro to a super useful yet super complex deep learning concept that GANs are known to be.

Until next time

**Podurama** is the best podcast player to stream more than a million shows and 30 million episodes. It provides the best recommendations based on your interests and listening history. Available for iOS, Android, macOS, Windows 10, and the web. Early users get free lifetime sync between unlimited devices.

*This article was originally published on Towards AI and re-published to TOPBOTS with permission from the author.*


The post Step-By-Step Implementation of GANs on Custom Image Data in PyTorch: Part 2 appeared first on TOPBOTS.

The post How I Would Explain GANs From Scratch to a 5-Year Old: Part 1 appeared first on TOPBOTS.

To put it simply, GANs let us generate incredibly realistic data (based on some existing data). Be it human faces, songs, Simpsons characters, textual descriptions, essay summaries, movie posters — GANs got it all covered! At Podurama, we are currently using them for high-resolution thumbnail synthesis.


To generate realistic images, GANs must know (or more specifically *learn*) the underlying distribution of data.

**What does that even mean?**

It means we must feed it samples (images) from a specific distribution (of all cats or all human faces or all digits) such that if it is asked to generate images of cats, it must somehow be aware that a cat has four legs, a tail, and some whiskers. Or, if it is asked to generate digits, it must know what each digit looks like. Or, if it is asked to generate human faces, it must know that a face must contain two eyes, two ears, and a nose.

**But how does probability distribution fit in with images?**

While it is certainly easy to visualize a 1-dimensional distribution curve, say a normal histogram to plot ‘height’ distribution, or even a 2d distribution curve using a contour plot of height and weight distribution (if you are feeling smug); it is not so easy with image data.

On the left, we have a 1d probability distribution curve, and on the right, a 2d normal distribution curve. [Source]

In the case of images, we are dealing with a **high-dimensional probability distribution**.

**How “high” a dimension are we talking about here?**

Generally speaking, we use 32×32 images and each of them is colored, meaning three additional channels to capture the RGB component. So, our probability distribution has 32 * 32 * 3 ≈ 3k dimensions. That is, the probability distribution will go over each pixel in all of the images. Finally, the **distribution that emerges will determine whether an image is normal or not** (more on this in about 45 seconds).
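That back-of-the-envelope dimension count is easy to verify:

```python
# Each image is 32 pixels high, 32 pixels wide, with 3 color channels (RGB).
height, width, channels = 32, 32, 3

# Flattening an image gives one coordinate per pixel value, so the
# probability distribution lives in a space with this many dimensions:
num_dims = height * width * channels
print(num_dims)  # 3072, i.e. roughly 3k dimensions
```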

**Can I still visualize it somehow?**

To put this in some perspective (and coming back to the original topic of probability distributions for images), let’s take an example of handwritten digits represented by two features, `x1` and `x2`. That is, we have used some sort of dimensionality reduction technique to bring the 3k dimensions down to just two, such that we can still represent all ten digits (0–9) along these two dimensions.

As you can see, the probability distribution, in this case, has many peaks, ten to be precise, one corresponding to each of the digits. These peaks are nothing but the modes (a mode in a distribution of data is just an area with a high concentration of observations), i.e. our distribution of digits is a **multimodal distribution.**

Different images of the digit 7 are represented by similar `x1` and `x2` pairs, where `x1` usually tends to be on the higher side compared to `x2`. Similarly, for the digit 5, both the `x1` and `x2` dimensions have lower values compared to those of digit 7.

Now, if we have done a great job at training our GAN (in other words, it has learned the probability distribution correctly), then we won’t have one of the GAN’s output images ending up in the space between 5 and 7, i.e. in areas of very low density. If not, we can be quite certain that the digit produced would look like a love-child of digits 5 and 7 (in short, random noise) and thus not one of the ten digits that we care about!

To ensure GANs are able to replicate a probability distribution nicely, their architecture is essentially composed of two neural networks competing against one another — a **Discriminator** and a **Generator.**

A Generator’s job is to create fake images that look real.

A Discriminator’s job is to correctly guess whether an image is fake (i.e. generated by the Generator) or real (i.e. coming directly from the input source).

Once the Generator becomes good enough at creating (fake) images that the Discriminator perceives as real (i.e. we have deceived the Discriminator), our job is done and the GAN is trained.

In terms of coding these two neural networks,

- Discriminators can be thought of as simple binary image classifiers, which take an image as input and spit out whether the image is real (output = 1) or fake (output = 0).
- Generators are somewhat more complex, in that they take as input some random numbers *or noise* (say a vector of size 100, where the choice of 100 is arbitrary) and perform some computations on it in the hidden layers such that the final output is an image (or more specifically a vector of size *h* × *w* × *c*, where *h* is the image height, *w* is the image width, and *c* is the number of channels, i.e. *c* = 3 for a colored RGB image). *Note: This image, although fake, must have the same dimensions as the real images, i.e. if the size of real images in our source data is 32×32×3, then the output from the generator should also be an image of size 32×32×3.*
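To make that input/output contract concrete, here is a minimal PyTorch sketch using toy one-layer stand-ins (these layer choices are placeholders for illustration, not the convolutional networks built in Part 2):

```python
import torch
from torch import nn

# Illustrative stand-ins for the two networks -- just to show the
# input/output shapes, not the real models from Part 2.
noise_dim = 100          # arbitrary size of the noise vector
img_dim = 3 * 32 * 32    # 3072 values per flattened 32x32x3 image

# Generator: noise vector in, image-sized vector out.
G = nn.Sequential(nn.Linear(noise_dim, img_dim), nn.Tanh())

# Discriminator: image-sized vector in, one real/fake probability out.
D = nn.Sequential(nn.Linear(img_dim, 1), nn.Sigmoid())

z = torch.randn(64, noise_dim)   # a batch of 64 noise vectors
fake_images = G(z)               # shape (64, 3072)
scores = D(fake_images)          # shape (64, 1), each value in (0, 1)
```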

Generally speaking (and this will come in handy when we are coding our GANs from scratch in Part 2), the aim of the discriminator is to be the best at telling fake images from real ones. As a result, when calculating the amount of error the discriminator makes during the training phase, we must include two terms:

- Real error (or positive error): the amount of error made when discriminator is passed real images.
- Fake error (or negative error): the amount of error made when discriminator is passed fake images (created by generator).

The sum of the positive and negative error is what will be optimized during the training process.

Mathematically speaking, the discriminator’s objective is to:

*max { log D(x) + log (1- D(G(z))) }*

D: Discriminator

G: Generator

x: real image

z: noise vector

As you can see, the objective function has two parts, both of which need to be maximized in order to train the discriminator. Maximizing log (D(x)) takes care of the positive (or real) error, whereas maximizing log (1 - D(G(z))) takes care of the negative (or fake) error. Let’s see how…

**Why should we maximize log (D(x)) in the above equation?**

As we mentioned earlier, a discriminator is essentially a binary classifier and thus, D(x) will generate a value between 0 and 1, establishing how real (or fake) it thinks the input image is.

Since x is a real image,

- in an ideal world (one where D is trained perfectly to recognize fakes from real), D(x) output should be ≈ 1. That means log (D(x)) will be roughly equal to 0.
- in a not-so-ideal world (where D is still learning), D(x) would output, say 0.2, meaning it is only 20% confident that the image is real. That means log (D(x)) will be -0.69.

In short, **when passed real images,** the **discriminator’s objective is to maximize log (D(x))**, increasing it from a meager -0.69 to 0 (maximum achievable value).

**Why should we maximize log (1-D(G(z))) in the above equation?**

Since z is a noise vector, passing it to the Generator G outputs an image. In other words, G(z) will be an image; let’s call it *fake_image*:

- in an ideal world (one where D is trained perfectly to recognize fakes from real), passing *fake_image* to D will result in D(G(z)) ≈ 0. Consequently, log (1 – **a-very-small-value**) will be roughly equal to 0.
- in a not-so-ideal world, passing *fake_image* to D will result in D(G(z)) ≈ 0.99, as D is not well trained and it *thinks* the fake image is real. Consequently, log (1 – 0.99) will be roughly equal to -2.

In short, **when passed fake images,** the **discriminator’s objective is to maximize log (1-D(G(z)))**, increasing it from a meager -2 to 0 (the maximum achievable value).
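The little log computations above are easy to check directly; a sketch using base-10 logs (which is what the -0.69 and -2 figures correspond to):

```python
import math

def d_objective(d_real, d_fake):
    """Discriminator objective log D(x) + log(1 - D(G(z))), base-10 logs."""
    return math.log10(d_real) + math.log10(1 - d_fake)

# Well-trained D: confident real images are real (0.99) and fakes are fake (0.01).
print(d_objective(d_real=0.99, d_fake=0.01))  # close to 0, the maximum

# Poorly trained D: only 20% sure the real image is real, and 99% fooled
# by the fake -- the two penalties (-0.69 and -2) add up.
print(d_objective(d_real=0.2, d_fake=0.99))   # about -2.7, far from 0
```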

For a generator, the biggest challenge is to produce an image that is realistic enough to fool the Discriminator. This also means the amount of error the Generator makes during the training phase can be figured out with the help of the Discriminator.

Speaking strictly from the Generator’s point of view, it would like the Discriminator to churn out `output = 1` (or a very high number close to 1) when one of its fake images is given as input.

Mathematically speaking, this is precisely what a Generator’s objective is:

*max { log D(G(z)) }*

Now one might wonder, *why does the Generator’s objective function have only one term to maximize whereas the Discriminator’s had two?* That’s because a Discriminator must deal with both fake and real images as inputs, and so we must calculate the loss separately for each. However, a Generator never has to deal with real images since it never sees them (remember: a Generator’s input is some random noise, not a real image), and so there is no need for an additional loss term.
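The same arithmetic works for the Generator’s single-term objective (again with base-10 logs; the 0.99 and 0.01 confidence values are just illustrative):

```python
import math

def g_objective(d_fake):
    """Generator objective log D(G(z)), base-10 log; maximized at 0."""
    return math.log10(d_fake)

# D is fooled (thinks the fake is 99% real): objective is near its max of 0.
print(g_objective(0.99))  # about -0.004

# D is not fooled (thinks the fake is only 1% real): objective is very negative.
print(g_objective(0.01))  # -2.0
```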

*Note: While I will be discussing how to build GANs (DCGANs to be more specific) from scratch in **Part 2**, right now we are going to look at which hidden layers are at play in the Discriminator and Generator neural networks.*

As mentioned, the discriminator, D, is a binary classification network that takes an image as input and outputs a scalar probability that the input image is real (as opposed to fake).

Here, D takes a 32*32*3 input image, processes it through a series of Conv2d, BatchNorm2d, Dropout, and LeakyReLU layers, and outputs the final probability through a Sigmoid activation function.
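As a rough sketch of what such a discriminator might look like in PyTorch (the filter counts of 64 and 128 and the dropout rate are illustrative guesses, not the exact configuration from Part 2):

```python
import torch
from torch import nn

# A sketch of D for 3x32x32 inputs, using the layer types named above.
D = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),    # -> 64x16x16
    nn.LeakyReLU(0.2),
    nn.Dropout(0.3),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # -> 128x8x8
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.3),
    nn.Flatten(),
    nn.Linear(128 * 8 * 8, 1),
    nn.Sigmoid(),                                            # probability in (0, 1)
)

imgs = torch.randn(16, 3, 32, 32)  # a batch of 16 random "images"
probs = D(imgs)                    # shape (16, 1), each value in (0, 1)
```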

*P.S. Do not worry if you don’t understand what each layer does; that’s what we will be covering in **Part 2**!*

As mentioned, Generator G is a neural network that tries to produce (hopefully) realistic-looking images. To do so, it takes as input a random noise vector *z* and tries to create an RGB image with the same size as the training images, i.e. 32×32×3 (see image below). It does this by processing *z* through a series of strided Conv2d transpose layers, each paired with a 2d BatchNorm layer and a ReLU activation.

*Note: The spatial size of the images used here for training is 32*32*3. In case you are working with another size, the structure of both D and G must be updated.*
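A corresponding sketch of G in PyTorch, following the same layer recipe (the channel counts are again illustrative):

```python
import torch
from torch import nn

# A sketch of G: noise vector z -> 3x32x32 image, via strided
# ConvTranspose2d + BatchNorm2d + ReLU blocks as described above.
noise_dim = 100
G = nn.Sequential(
    nn.ConvTranspose2d(noise_dim, 128, kernel_size=4, stride=1, padding=0),  # -> 128x4x4
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),         # -> 64x8x8
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),          # -> 32x16x16
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),           # -> 3x32x32
    nn.Tanh(),  # pixel values in [-1, 1]
)

z = torch.randn(16, noise_dim, 1, 1)  # noise enters as a 100x1x1 "image"
fake_imgs = G(z)                      # shape (16, 3, 32, 32)
```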

Regardless of whichever framework you choose to code your GAN, the steps more or less remain the same.

```
for each epoch:
    for each batch b of input images:

        ##############################
        ## Update Discriminator - D ##
        ##############################

        # loss on real images
        clear gradients of D
        pred_labels_real = pass b through D to compute outputs
        true_labels_real = [1,1,1....1]
        calculate loss(pred_labels_real, true_labels_real)
        calculate gradients using this loss

        # loss on fake images
        generate batch of size b of fake images (b_fake) using G
        pred_labels_fake = pass b_fake through D
        true_labels_fake = [0,0,....0]
        calculate loss(pred_labels_fake, true_labels_fake)
        calculate gradients using this loss

        update weights of D

        ##############################
        #### Update Generator - G ####
        ##############################

        clear gradients of G
        pred_labels = pass b_fake through D
        true_labels = [1,1,....1]
        calculate loss(pred_labels, true_labels)
        calculate gradient using this loss
        update weights of G

        ################################################
        ## Optional: Plot a batch of Generator images ##
        ################################################
```

We begin with an outer loop stating how many epochs we want our code to run for. Setting `epochs = 10` means the model will train on *all of the data* 10 times. Next, instead of working with all the images in our training set at once, we draw out small batches (say of size 64) in each iteration.

Refrain from using an exceptionally large value for batch size since we do not want the Discriminator getting too good too soon (as a result of having access to too much training data in initial iterations) and overpowering the Generator.

The training process is further split up into two parts — updating Discriminator and updating Generator (and an optional third part where you throw in some random noise into the Generator (say, every 50th iteration) and check the output images to see how good it is doing).

For updating the Discriminator (or rather updating Discriminator’s weights with each epoch to minimize the loss), we pass a batch of real images to the Discriminator and generate the output. The output vector will contain values between 0 and 1. Next, we compare these predicted values to their true labels i.e. 1 (by convention, real images are labeled as 1 and fake images are labeled as 0). Once the discriminator’s loss over real images is calculated, we calculate the gradient i.e. take the derivative of the loss function with respect to the weights in the model.

Next, we pass some random noise input to the Generator and produce a batch of fake images. These images are then passed to the Discriminator, which generates predictions (values between 0 and 1) for these fakes. Next, we compare these predicted values to their true labels, i.e. 0, and compute the loss, i.e. how far the predicted labels are from the true labels. Once the loss over fake images has been calculated, the derivative of the loss function is used to calculate gradients, just like in the case of real images. Finally, the weights are updated based on the gradient *(w = w - learning_rate * w.gradient)* to minimize the overall Discriminator loss.
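The Discriminator update described above can be sketched in PyTorch as follows, with toy one-layer stand-ins for D and G so the snippet runs on its own (the real architectures come in Part 2):

```python
import torch
from torch import nn

# Minimal stand-in networks so the update step below is runnable.
D = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
G = nn.Sequential(nn.Linear(4, 8))
opt_D = torch.optim.SGD(D.parameters(), lr=0.01)
criterion = nn.BCELoss()

real_batch = torch.randn(32, 8)        # stand-in for a batch of real images

# --- Discriminator update ---
opt_D.zero_grad()                      # clear gradients of D

# loss on real images (true label = 1)
pred_real = D(real_batch)
loss_real = criterion(pred_real, torch.ones(32, 1))
loss_real.backward()

# loss on fake images (true label = 0)
fake_batch = G(torch.randn(32, 4))     # noise -> fake batch
pred_fake = D(fake_batch.detach())     # detach: don't touch G's gradients here
loss_fake = criterion(pred_fake, torch.zeros(32, 1))
loss_fake.backward()

opt_D.step()                           # update weights of D
```

Note the `.detach()` on the fake batch: during the Discriminator step, we only want gradients flowing into D, not back into G.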

A very similar sequence of steps is used when updating the Generator. There, we start by passing a batch of fake images (generated during Discriminator training) to the Discriminator. Now one might wonder — *why did we pass the fake batch through the Discriminator a second time? Didn’t we just do that during Discriminator training?* The reason is that the Discriminator D was updated before we started updating the Generator, and so a fresh forward pass of the fake batch is essential.

Next, the loss is calculated using the output from the Discriminator and the true label of the images. One important thing to note is that, **even though these images are fake, we set their true label as 1** during loss calculations.

*But why, we thought 1 was reserved as a label for real images only!*

To answer this, I am going to re-iterate a line from this article itself:

Speaking strictly from the Generator’s point of view, it would like the Discriminator to churn out `output = 1` (or a very high number close to 1) when one of its fake images is given as input.

Because the Generator wants the Discriminator to *think* it is churning out real images, it uses the true labels as 1. This way, the loss function translates to minimizing how far D’s output for fake images is from D’s output for real images (i.e. 1).

Finally, the weights are updated based on the gradient to minimize the overall Generator loss.
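The Generator update looks almost identical in code, except for the all-ones labels (again with toy stand-in networks so the sketch runs on its own):

```python
import torch
from torch import nn

# Toy stand-ins again, just to make the Generator step runnable.
D = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
G = nn.Sequential(nn.Linear(4, 8))
opt_G = torch.optim.SGD(G.parameters(), lr=0.01)
criterion = nn.BCELoss()

# --- Generator update ---
opt_G.zero_grad()                   # clear gradients of G

fake_batch = G(torch.randn(32, 4))  # fresh forward pass through (the updated) D
pred = D(fake_batch)

# Fake images, but labeled 1: G wants D's output pushed toward "real".
loss_G = criterion(pred, torch.ones(32, 1))
loss_G.backward()
opt_G.step()                        # update weights of G
```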

The optional code for generating images when noise is fed into the Generator will be discussed in Part 2.

In a nutshell, the whole point of training a GAN is to obtain a Generator network (with the most optimal model weights, layers, etc.) that is excellent at spewing out fakes that look real. After we do so, we can feed a point from the latent space (say a 100-dimensional vector drawn from a Gaussian distribution) into the Generator, and it is only *our* Generator that knows how to convert that random noise vector into a *realistic-enough* image that looks like it could belong to our training set!
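Once training is done, using the Generator really is that simple; a sketch with a toy stand-in generator:

```python
import torch
from torch import nn

# Toy trained-generator stand-in: 100-d Gaussian noise in, image out.
noise_dim = 100
G = nn.Sequential(nn.Linear(noise_dim, 3 * 32 * 32), nn.Tanh())

z = torch.randn(1, noise_dim)     # a point sampled from the latent space
img = G(z).reshape(3, 32, 32)     # one brand-new "image"
```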

Until then

**Podurama** is the best podcast player to stream more than a million shows and 30 million episodes. It provides the best recommendations based on your interests and listening history. Available for iOS, Android, macOS, Windows 10, and web. Early users get free lifetime sync between unlimited devices.

*This article was originally published on Medium and re-published to TOPBOTS with permission from the author.*
