Transformers are a powerful deep learning model that has become a standard for many Natural Language Processing tasks and is poised to revolutionize Computer Vision as well.
It all began in 2017, when Google Brain published the paper destined to change everything, “Attention Is All You Need”. Researchers applied the new architecture to several Natural Language Processing problems, and it was immediately evident how well it could overcome some of the limitations that plague RNNs, which were traditionally used for tasks such as translating from one language to another.
Over the years, Transformers became an institution in Natural Language Processing, and in 2020 Google Brain asked: will they be as effective on images? The answer was yes. Vision Transformers were born and, with some preliminary modifications to the images, they managed to exploit the classic Transformer architecture and soon reached the state of the art on many problems in this field too.
The excitement was great, and a few months later, at the beginning of 2021, Facebook researchers published a new version of the Transformer, this time specifically for video: the TimeSformer. Once again, with some minor structural changes, the architecture soon proved a winner on video, and in February 2021 Facebook announced that it would use it on the videos of its social network to create new models for a wide variety of purposes.
Why do we need transformers?
But let’s take a step back and explore the motivations that drove Google researchers to search for a new alternative architecture to solve natural language processing tasks.
Traditionally, a task such as translation was performed using Recurrent Neural Networks, which are known to have a number of problems. One of the main ones is their sequential operation. For example, to translate a sentence from English to Italian with this type of network, the first word of the sentence is passed into an encoder together with an initial state; the resulting state is then passed into a second encoder step together with the second word, and so on until the last word. The state produced by the last encoder step is then passed to a decoder, which returns both the first translated word and a new state; that state is passed to the next decoder step, and so on.
The problem here is quite obvious: to complete the next step, I must have the result of the previous one. This is a big flaw, because it does not take advantage of the parallelization capability of modern GPUs and so loses out in performance. There are other problems too, such as exploding and vanishing gradients and the difficulty of detecting dependencies between distant words in the same sentence.
Attention is all you need?
The question then arose, is there a mechanism that we can compute in a parallelized manner that allows us to extract the information we need from the sentence? The answer is yes, and that mechanism is attention.
If we were to define attention, setting aside for a moment any technical and implementation aspects, how would we go about it?
Let’s take an example sentence and ask ourselves, focusing on the word “gave”: which other words in the sentence should I pay attention to in order to add meaning to this word? I might ask myself a series of questions. For example, who gave? In this case I would focus on the word “I”. Then I might ask, gave to whom? This places my attention on the word “Charlie”. Finally, I might ask, gave what? This focuses my attention on the word “food”.
By asking myself these questions, and perhaps doing so for each of the words in the sentence, I might be able to understand the sentence’s meaning and its facets. The problem at this point is, how do I implement this concept in practice?
To understand the computation of attention we can draw parallels to the world of databases. When we do a search in the database we submit a query (Q) and we search among the available data for one or more keys that satisfy the query. The output is the value associated with the key most relevant to the query.
What happens in the case of attention computation is extremely similar.
We begin by looking at the sentence on which to compute attention as a set of vectors. Each word, via a word embedding mechanism, is encoded into a vector. We consider these vectors as the keys to search among, with respect to a query, which could be a word from the same sentence (self-attention) or from another sentence. At this point, we need to calculate the similarity between the query and each of the available keys, which is done mathematically via the scaled dot product. This process returns a series of real values, perhaps very different from each other, but since we want weights between 0 and 1 whose sum is 1, we apply a softmax to the results. Once we have the weights, we multiply the weight of each word, that is, its relevance to the query, by the vector that represents it. Finally, we return the sum of these products as the attention vector.
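The steps just described can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `softmax` and `scaled_dot_product_attention` are our own, and the word embeddings are random stand-ins for real ones.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_queries, d), k: (n_keys, d), v: (n_keys, d_v)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row lies in [0, 1] and sums to 1
    return weights @ v, weights         # weighted sum of the value vectors

rng = np.random.default_rng(0)
words = rng.normal(size=(4, 8))  # 4 made-up word embeddings of dimension 8

# Self-attention: queries, keys and values all come from the same sentence.
output, weights = scaled_dot_product_attention(words, words, words)
```

Here `weights[i, j]` says how much word `i` attends to word `j`, and `output[i]` is the resulting attention vector for word `i`.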
To build this mechanism we use linear layers that, starting from the input vectors, generate keys, queries and values by matrix multiplication. Combining keys and queries yields the matching between these two sets, and the result is then combined with the values to obtain the most relevant combination.
This mechanism would be sufficient if we wanted to look at the sentence from a single point of view, but what if we wanted to look at it from several points of view and therefore calculate the attention several times, in parallel? We use so-called multi-head attention: several attention blocks with the same structure, whose results are simply combined at the end to return a single vector that summarizes all the calculated attention.
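As a rough sketch of this step, the three linear layers can be written as three matrix multiplications. The sizes `d_model` and `d_k` are arbitrary, the layers are assumed to have no bias, and the weight matrices are random stand-ins for weights that would normally be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
x = rng.normal(size=(5, d_model))  # 5 input vectors, one per word

# Three bias-free linear layers (random stand-ins for learned weights).
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))

q, k, v = x @ w_q, x @ w_k, x @ w_v  # queries, keys and values

# Match queries against keys, then normalize each row with a softmax.
scores = q @ k.T / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

# Combine the matching with the values.
attended = weights @ v  # shape (5, d_k)
```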
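A compact way to picture multi-head attention is to split the projected vectors into several smaller heads, run attention on each head independently, and merge the results. The sketch below assumes the heads are concatenated and mixed by one final linear layer (`w_o`); all weights are random stand-ins and the function name is our own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (n, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    n, d_model = x.shape
    d_head = d_model // n_heads  # assumes d_model is divisible by n_heads

    def split_heads(t):
        # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    heads = softmax(scores) @ v                          # (n_heads, n, d_head)
    # Concatenate the heads and mix them with a final linear layer.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)  # (6, 16)
```

Because every head is computed by the same batched matrix products, all heads run in parallel.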
Now that we have understood which mechanism to use and made sure of its parallelizability, let’s analyze the structure within which the multi-head attention is embedded and which constitutes the transformer.
Still considering a translation task, let’s focus initially on the left part of the image, the encoder, which takes as input the entire sentence to be translated from English to Italian. Already here we see a huge revolution compared to the RNN approach, because instead of being processed word by word, the sentence is submitted in its entirety. Before the attention computation, the vectors representing the words are combined with a positional encoding, based on sine and cosine functions, which embeds in the vectors information about the position of the words in the sentence. This is very important because in any language the position of the words in the sentence is more than relevant, and it is information that we absolutely cannot lose if we want to make a correct evaluation. All this information passes into a multi-head attention mechanism, whose result is normalized and passed to a feed-forward network. The encoder block can be repeated N times to extract more meaningful information.
But the sentence to be translated is not the only input to the transformer. There is a second block, the decoder, which takes as input the output produced by the previous executions of the transformer. If we assume, for example, that we have already translated the first two words and want to predict the third word of the Italian sentence, we pass the first two translated words into the decoder. Positional encoding and multi-head attention are applied to these words, and the result is combined with the encoder output. Attention is then recalculated on this combination, and the result, passed through a linear layer and a softmax, is a vector of candidate words for the next translated word, each with an associated probability. In the next iteration, the decoder also takes in this new word in addition to the previous ones.
This structure has proved to be incredibly effective and performant because it processes the sentence in its entirety rather than word by word, retains information about the position of words in the sentence, and exploits attention, a mechanism capable of effectively expressing the content of the sentence.
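The sine-and-cosine positional encoding can be sketched as follows. The constant 10000 and the even/odd split between sine and cosine follow the formulation in “Attention Is All You Need”; the function name is our own, the embeddings are random stand-ins, and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(n_positions)[:, None]    # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# 10 made-up word embeddings of dimension 16.
embeddings = np.random.default_rng(0).normal(size=(10, 16))

# The encoding is simply added to the word vectors before attention.
encoder_input = embeddings + positional_encoding(10, 16)
```

Every position gets a distinct pattern of values, so two identical words at different positions no longer produce identical input vectors.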
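The word-by-word generation loop can be illustrated with a toy example. Several things here are assumptions made only for the sketch: `toy_logits` is a hypothetical stand-in for a trained decoder, the start and end token ids (`bos_id`, `eos_id`) are invented, and the most probable word is always picked (greedy decoding), which is only one of several possible strategies.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy_decode(next_token_logits, encoder_output, bos_id, eos_id, max_len=20):
    """Generate one token at a time, feeding previous outputs back in."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = next_token_logits(encoder_output, tokens)  # (vocab_size,)
        probs = softmax(logits)          # one probability per candidate word
        next_id = int(np.argmax(probs))  # greedy choice: most probable word
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Hypothetical stand-in for a trained decoder: it always prefers the token
# whose id is one higher than the last generated token.
def toy_logits(encoder_output, tokens, vocab_size=6):
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + 1) % vocab_size] = 5.0
    return logits

result = greedy_decode(toy_logits, encoder_output=None, bos_id=0, eos_id=4)
print(result)  # [0, 1, 2, 3, 4]
```

The important point is the loop itself: each new word depends on all the words generated so far, so the decoder runs once per output word at inference time.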
After all this nice explanation you might think that transformers are perfect and without any kind of defect. Obviously, this is not so, and one of their strengths is also their weakness: the calculation of attention!
To calculate the attention of each word with respect to all the others, I have to perform N² calculations, where N is the number of words. Even if these are partially parallelizable, they are still very expensive. With such a complexity, imagine what it means to calculate the attention, many times, on a paragraph of hundreds and hundreds of words.
Graphically, you can imagine a matrix that has to be filled with the attention values of each word with respect to every other word, and this clearly has a significant cost. It is important to point out that, optionally and usually in the decoder, it is possible to calculate masked attention, in which you avoid calculating the attention between the query word and all subsequent ones.
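Masked attention can be sketched by setting the scores of all later positions to minus infinity before the softmax, so that their weights become exactly zero. This is a simplified illustration that uses the input vectors directly as queries, keys and values (no learned projections); the function name is our own.

```python
import numpy as np

def masked_self_attention(x):
    """Self-attention where position i may only attend to positions <= i."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # block later positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                            # exp(-inf) is exactly 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

x = np.random.default_rng(0).normal(size=(5, 8))
out, weights = masked_self_attention(x)
```

The resulting `weights` matrix is lower triangular: the first word can only look at itself, the second at the first two, and so on.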
Some might then ask: do we really need all the structure seen above, if many of the benefits brought by transformers come from the attention mechanism? Didn’t the first Google Brain paper from 2017 say “Attention Is All You Need”? The objection is certainly legitimate, but in March 2021 Google researchers published a paper titled “Attention Is Not All You Need”. What does that mean? The researchers ran experiments analyzing the behaviour of the self-attention mechanism on its own, without any of the other components of the transformer, and found that its output converges to a rank-1 matrix at a doubly exponential rate with depth. This means that this mechanism, by itself, is practically useless. So why are transformers so powerful? It is due to a tug of war between the self-attention mechanism, which tends to reduce the rank of the matrix, and two other components of transformers: skip connections and the MLP.
Skip connections diversify the distribution of paths through the network, so that the paths are not all the same, and this drastically reduces the probability that the matrix collapses to rank 1. The MLP, instead, manages to increase the rank of the resulting matrix thanks to its non-linearity. In contrast, normalization has been shown to play no role in avoiding this behaviour of the self-attention mechanism. Therefore, attention is not all you need, but the transformer architecture manages to use it to its advantage to achieve impressive results.
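The collapse can be observed in a small numerical toy. This is not the experiment from the paper: it simply stacks attention layers with random weights, without skip connections or MLPs, and measures how far the output is from a matrix whose rows are all identical (such a matrix has rank 1). The sizes, the random initialization and the helper `relative_residual` are all assumptions of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_residual(x):
    """Relative distance from the nearest matrix with identical rows."""
    residual = x - x.mean(axis=0, keepdims=True)
    return np.linalg.norm(residual) / np.linalg.norm(x)

rng = np.random.default_rng(0)
n, d, n_layers = 8, 16, 6
x = rng.normal(size=(n, d))

residuals = [relative_residual(x)]
for _ in range(n_layers):
    # Fresh random projections for every layer (stand-ins for learned ones).
    w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    weights = softmax((x @ w_q) @ (x @ w_k).T / np.sqrt(d))
    x = weights @ (x @ w_v)  # attention only: no skip connection, no MLP
    residuals.append(relative_residual(x))

print(residuals)
```

In this toy setting the residual shrinks quickly from layer to layer, meaning that all rows of the output become almost the same vector. Replacing the update with `x = x + weights @ (x @ w_v)` is a simple way to experiment with the effect of a skip connection.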
Going back to 2020, Google researchers wondered: if Transformers have been found to be so effective in Natural Language Processing, how will they perform with images? Much as was done with NLP, we start from the concept of attention, but this time applied to images. Let’s try to understand it through an example.
If we consider a picture of a dog standing in front of a wall, any of us would say that it is a “picture of a dog” and not a “picture of a wall”. This is because we focus our attention on the dominant and discriminating subject of the image, and this is exactly what the attention mechanism does when applied to images.
Now that we understand that the concept of attention can be extended to images as well, we just have to find a way to input images to a classic transformer.
We know that the transformer takes vectors as input, namely those of the words, so how can we convert an image into vectors? A first solution would be to take all the pixels of the image and put them “inline” to obtain a sequence of vectors. But let’s stop for a moment and see what would happen if we chose this option.
We previously said that the calculation of attention has a complexity of O(N²). This means that if we had to calculate the attention of each pixel with respect to all the others, even a low-resolution image of 256×256 pixels, which contains 65,536 pixels, would require more than four billion pixel pairs for every attention computation, an amount of calculation that is impractical with today’s resources. So this approach is certainly not viable.
The solution is quite simple: the paper “An Image is Worth 16×16 Words” proposes dividing the image into patches and then converting each patch into a vector using a linear projection that maps the patches into a vector space.
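To make the order of magnitude concrete, here is the arithmetic for a 256×256 image in which every pixel is treated as one input vector. It counts one attention score per pair of pixels for a single attention computation and ignores everything else.

```python
height = width = 256
n_pixels = height * width  # one input vector per pixel
n_pairs = n_pixels ** 2    # one attention score per (query, key) pair

print(n_pixels)  # 65536
print(n_pairs)   # 4294967296

# For comparison, the same count for a paragraph of 500 words.
print(500 ** 2)  # 250000
```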
Now we just have to go and see the architecture of the Vision Transformer.
The image is divided into patches, which pass through a linear projection to obtain vectors. These are coupled with information about the position of each patch within the image and submitted to a classic transformer. Adding information about the original position of the patch inside the image is fundamental, because this information would otherwise be lost during the linear projection, even though it is very important for fully understanding the content of the image. A further vector is inserted, which is independent of the image being analyzed and is used to obtain global information about the entire image. In fact, the output corresponding to this vector is the only one that is considered, and it is passed into an MLP that returns the predicted class.
However, there is a point in this process where there is a very significant loss of information. In the transition from patch to vector, any kind of information about the position of pixels within the patch is lost. This is a serious issue, the authors of Transformer in Transformer (TnT) point out, because the arrangement of pixels within a portion of the image is information we would not want to lose if we want to make a quality prediction.
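The preparation of the input for a Vision Transformer can be sketched as follows. The image is random, the patch size of 16 and embedding size of 64 are arbitrary choices, and the projection matrix, the extra class vector and the position embeddings are random stand-ins for parameters that would be learned in practice. The helper name `image_to_patches` is our own, and the image sides are assumed to be divisible by the patch size.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch_size, w // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, cols, p, p, C)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))
patch_size, d_model = 16, 64

patches = image_to_patches(image, patch_size)           # (256, 768)
w_proj = rng.normal(size=(patches.shape[1], d_model))   # linear projection
tokens = patches @ w_proj                               # (256, 64)

class_token = rng.normal(size=(1, d_model))             # extra, image-independent vector
tokens = np.concatenate([class_token, tokens], axis=0)  # (257, 64)

position_embeddings = rng.normal(size=tokens.shape)     # one per input position
transformer_input = tokens + position_embeddings
```

With 16×16 patches the 256×256 image becomes 256 patch vectors plus one class vector, so attention works on 257 inputs instead of 65,536 pixels.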
The authors of TnT then asked themselves, is it possible to find a better way to get the vectors to submit to the transformer?
Their proposal is to take each individual patch (p×p) of the image, which is itself an image with 3 RGB channels, and transform it into a c-channel tensor. This tensor is then divided into p’ parts with p’<p (in the example, p’=4). This yields p’ vectors of c dimensions. These vectors now contain information about the arrangement of pixels within the patch.
They are then concatenated and linearly projected in order to make them the same size as the vector obtained from the linear projection of the original patch and combined with it.
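The following sketch illustrates the idea for a single patch, and it rests on several assumptions, because the description above leaves the details open. It assumes that the c-channel tensor is produced by a per-pixel linear map, that the p’ parts are horizontal strips each averaged into one vector, and that “combined” means addition. Any further processing that the paper applies to these vectors is left out, all weights are random stand-ins, and `n_parts` plays the role of p’.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_parts, c, d_model = 16, 4, 12, 64  # n_parts plays the role of p'

patch = rng.random((p, p, 3))  # one RGB patch

# Assumption: a per-pixel linear map lifts the 3 colour channels to c channels.
w_channels = rng.normal(size=(3, c))
patch_c = patch @ w_channels   # (p, p, c)

# Assumption: the tensor is cut into n_parts horizontal strips and each strip
# is averaged into a single c-dimensional vector.
strips = patch_c.reshape(n_parts, p // n_parts, p, c)
part_vectors = strips.mean(axis=(1, 2))  # (n_parts, c)

# Concatenate the vectors and project them to the patch-embedding size.
w_inner = rng.normal(size=(n_parts * c, d_model))
inner_embedding = part_vectors.reshape(-1) @ w_inner  # (d_model,)

# The usual linear projection of the whole flattened patch.
w_patch = rng.normal(size=(p * p * 3, d_model))
patch_embedding = patch.reshape(-1) @ w_patch         # (d_model,)

# Assumption: the two vectors are combined by addition.
combined = patch_embedding + inner_embedding
```

The point of the sketch is only the data flow: the vector sent to the transformer now depends on how pixels are arranged inside the patch, not just on the flattened patch as a whole.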
In this way the input vectors to the transformer are also affected by the arrangement of pixels within the patches, and the authors have managed to further improve performance on various computer vision tasks.
Given the great successes of transformers first in NLP and then in their application to images, in 2021 Facebook researchers tried to apply this architecture to video as well.
Intuitively, it is clear that it is possible to do this since we all know that a video is nothing more than a set of frames one after the other and frames are nothing more than images.
There is only one small detail that makes TimeSformers different from Vision Transformers: you have to take into account not only space but also time. In this case, when we calculate the attention, we cannot look at the frames as isolated images; we need some form of attention that takes into account the variation between consecutive frames, as it is central to the evaluation of a video.
To try to solve this problem, the authors suggested several new attention mechanisms, from those that focus exclusively on space, used primarily as a reference point, to those that compute attention axially, sparsely, or jointly between space and time.
However, the method that achieved the best results is Divided Space-Time Attention. Given a frame at instant t and one of its patches as a query, it computes the spatial attention over the whole frame and then the temporal attention on the same patch as the query, but in the previous and next frames.
But why does this approach work so well? The reason is that it learns more separable features than the other approaches and is, therefore, better able to distinguish videos from different categories. We can see this in the following visualization, where each video is represented by a point in space and its colour represents the category it belongs to.
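A rough sketch of the divided scheme is shown below. It is a simplification and not the authors’ implementation: it has no learned projections, the temporal step attends to the same patch position in all frames of the clip rather than only the neighbouring ones, and the order of the two steps and the residual additions are choices made for this illustration. The clip size of 8 frames with 196 patches each is arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Batched self-attention over the second-to-last axis of x: (..., n, d)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def divided_space_time_attention(tokens):
    """tokens: (n_frames, n_patches, d)."""
    # Spatial step: each patch attends to all patches of its own frame.
    tokens = tokens + self_attention(tokens)
    # Temporal step: each patch attends to the same position in other frames.
    per_position = np.swapaxes(tokens, 0, 1)  # (n_patches, n_frames, d)
    per_position = per_position + self_attention(per_position)
    return np.swapaxes(per_position, 0, 1)    # back to (n_frames, n_patches, d)

n_frames, n_patches, d = 8, 196, 32
video_tokens = np.random.default_rng(0).normal(size=(n_frames, n_patches, d))
out = divided_space_time_attention(video_tokens)

# Number of attention scores per clip: joint attention vs. the divided scheme.
joint_scores = (n_frames * n_patches) ** 2
divided_scores = n_frames * n_patches * (n_patches + n_frames)
print(joint_scores, divided_scores)
```

Besides the accuracy results reported by the authors, the count at the end shows a practical benefit: splitting attention into a spatial and a temporal step needs far fewer scores than comparing every patch with every other patch in the clip.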
The authors also examined the effect of the resolution of the videos and of the number of frames in them. They found that the higher the resolution, the better the accuracy of the model, up to a point. Accuracy also increases as the number of frames increases. Interestingly, it was not possible to run tests with more frames than those shown in the graph, so the accuracy could potentially still improve: the upper limit of this improvement has not yet been found.
In Vision Transformers it is known that a larger training dataset often results in better accuracy. This was also checked by the authors on TimeSformers and again, as the number of training videos considered increases, the accuracy also increases.
What is left to do now? Transformers have just landed in the world of computer vision and seem more than determined to replace traditional convolutional networks, or at least carve out an important role for themselves in this area. The scientific community is therefore in turmoil, trying to further improve Transformers, combine them with various techniques and apply them to real problems, finally being able to do things that were not possible until recently. Giants like Facebook and Google are actively working to develop and apply Transformers, and we have probably only scratched the surface so far.
References and insights
Gedas Bertasius, Heng Wang, and Lorenzo Torresani. “Is Space-Time Attention All You Need for Video Understanding?”
Alexey Dosovitskiy et al. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”
Kai Han et al. “Transformer in Transformer”
Ashish Vaswani et al. “Attention Is All You Need”
Qizhe Xie et al. “Self-training with Noisy Student improves ImageNet classification”
Yihe Dong et al. “Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth”
Nicola Messina et al. “Transformer Reasoning Network for Image-Text Matching and Retrieval”
Nicola Messina et al. “Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders”
Davide Coccomini. “TimeSformer for video classification with training code”
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.