The field of Computer Vision has for years been dominated by Convolutional Neural Networks (CNNs). Through the use of filters, these networks are able to generate simplified versions of the input image by creating feature maps that highlight the most relevant parts. These features are then used by a multi-layer perceptron to perform the desired classification.
Recently, however, the field has been revolutionized by Vision Transformers (ViT), which through the mechanism of self-attention have obtained excellent results on many tasks.
In this article some basic aspects of Vision Transformers will be taken for granted. If you want to go deeper into the subject, I suggest you read my previous overview of the architecture.
Although Transformers have proven to be excellent replacements for CNNs, one important constraint makes their application rather challenging: the need for large datasets. CNNs, in contrast, are able to learn even from a reasonably small amount of data, mainly thanks to their inductive biases [1, 8]. These act like hints that allow models to learn more quickly and generalize better. In particular, CNNs have two biases that are intrinsic to the way the architecture works, namely:
- The neighboring pixels in the image are related to each other;
- Different parts of the image must be processed in the same way regardless of their absolute position.
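To make these two biases concrete, here is a minimal NumPy sketch (not taken from any of the cited papers). The helper `conv2d_valid` and the toy image are assumptions made purely for illustration: each output value depends only on a small neighbourhood, and because the same filter is reused everywhere, moving the object in the input simply moves the response in the output.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D convolution (cross-correlation), no padding, stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Locality: each output value only sees a kh x kw neighbourhood.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))

# A small 3x3 object on an otherwise empty 12x12 image.
image = np.zeros((12, 12))
image[2:5, 2:5] = rng.normal(size=(3, 3))

# The same object moved 4 pixels down and 3 pixels right.
shifted = np.roll(image, shift=(4, 3), axis=(0, 1))

response = conv2d_valid(image, kernel)
response_shifted = conv2d_valid(shifted, kernel)

# Weight sharing: the response to the shifted image is the shifted response.
equivariant = np.allclose(
    np.roll(response, shift=(4, 3), axis=(0, 1)), response_shifted
)
print(equivariant)  # True
```

A plain self-attention layer has neither property built in: it has to discover from data that nearby patches matter and that position should not change how content is processed.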
However, these biases are not present in the Transformer architecture, so Transformers need more data to fully understand the problem; at the same time, they are free to learn it in a less constrained way. It could thus be said that Transformers are able to learn more but require more data, while Convolutional Neural Networks achieve a lower understanding of the task addressed but do so with smaller amounts of data.
But isn’t there a way to get the best out of both architectures? Lucky for us, these two architectures, which are based on two very different concepts, can be combined in many different ways to obtain something capable of exploiting the positive sides of both!
Using CNNs as patch extractors
A first possible approach consists of changing the way patches are extracted before being passed as input to the Vision Transformer. These patches are normally obtained by separating the input image into many small parts.
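As a reference point, the standard patch extraction can be sketched in a few lines of NumPy. The function name `image_to_patches`, the 224x224 input and the 16-pixel patch size are illustrative assumptions, not values prescribed by this article.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each patch flattened into a vector.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

image = np.random.default_rng(0).random((224, 224, 3))
patches = image_to_patches(image, patch_size=16)
print(patches.shape)  # (196, 768)
```

Each of the 196 rows is then linearly projected and fed to the Transformer as one token.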
To understand how to go from the image to the patches via a convolutional network, it is sufficient to observe its internal functioning:
When a large image is given as input to a CNN, through the convolution layers, it is transformed from being a three-channel RGB image to an N-channel image. At the same time, its size is drastically reduced and the content of the image itself is transformed.
If at the end of the convolutional process, the N-channel image is considered as a set of N small images, we have obtained the necessary patches for the Vision Transformer. A possible Convolutional Vision Transformer is therefore composed of a convolutional network that produces these patches, followed by a Vision Transformer that takes them as input.
This technique has proved particularly effective in many cases and can also be applied using pre-trained convolutional networks, such as EfficientNet, as the patch extractor. This approach has been applied by myself and researchers at the CNR in Pisa to perform video deepfake detection [2].
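The idea can be sketched as follows. This is only a toy illustration under stated assumptions: the two-layer convolutional stem with random, untrained weights stands in for a real backbone such as EfficientNet, and the names `conv2d`, `cnn_patch_extractor` and `w_embed`, as well as all sizes, are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, weights, stride):
    """Strided convolution without padding.

    x: (C_in, H, W), weights: (C_out, C_in, k, k) -> (C_out, H_out, W_out)
    """
    k = weights.shape[-1]
    # windows: (C_in, H - k + 1, W - k + 1, k, k); keep every stride-th window.
    windows = sliding_window_view(x, (k, k), axis=(1, 2))[:, ::stride, ::stride]
    return np.einsum("chwij,ocij->ohw", windows, weights)

def cnn_patch_extractor(image, rng, channels=(16, 64)):
    """Toy stand-in for a CNN backbone: strided 3x3 convolutions with ReLU."""
    x = image
    for c_out in channels:
        w = rng.normal(scale=0.1, size=(c_out, x.shape[0], 3, 3))
        x = np.maximum(conv2d(x, w, stride=2), 0.0)
    return x

rng = np.random.default_rng(0)
image = rng.random((3, 64, 64))                  # three-channel RGB input

feature_map = cnn_patch_extractor(image, rng)    # N-channel, much smaller image
print(feature_map.shape)                         # (64, 15, 15)

# Treat each of the N channels as one small image, i.e. one patch.
tokens = feature_map.reshape(feature_map.shape[0], -1)
print(tokens.shape)                              # (64, 225)

# Assumed linear embedding into the Transformer's token dimension.
w_embed = rng.normal(scale=0.02, size=(tokens.shape[1], 128))
embeddings = tokens @ w_embed
print(embeddings.shape)                          # (64, 128)
```

In practice the random stem would be replaced by a trained or pre-trained network, and the resulting tokens would be passed to the Transformer encoder exactly like ordinary image patches.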
From Self-Attention to Gated Positional Self-Attention (GPSA)
Another way to exploit convolutions within Transformers starts from the intuition that self-attention layers can behave like convolutional layers. We have previously pointed out that Vision Transformers do not have inductive biases. The aim of the Facebook researchers was therefore to modify the architecture so as to introduce a soft convolutional inductive bias: the new network must be able to act as a convolutional network if necessary.
To achieve this goal, gated positional self-attention (GPSA) [1] was introduced: a form of positional self-attention with an additional parameter, lambda. This parameter balances how much the layer behaves as a convolutional layer or as a classical self-attention layer. During training the network calibrates this parameter and, if necessary, at the end of the process some of these layers will act as convolutional layers.
In addition to the GPSA layers, which are used where needed to capture local information in the input, there are also classical self-attention layers forming the non-local part of the network. This architecture is called Convolutional Vision Transformer (ConViT).
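The gating mechanism can be sketched for a single head as below. This is a simplified illustration, not the ConViT implementation: the content term is ordinary scaled dot-product attention, while the positional term is replaced by a hypothetical hand-written score (`local_position_scores`) that simply favours nearby patches, whereas the paper learns it from relative positional encodings. All names and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_position_scores(grid, strength=2.0):
    """Hypothetical position-only scores: closer patches get higher scores."""
    ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1)
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return -strength * sq_dist

def gpsa_attention(x, w_q, w_k, pos_scores, lam):
    """Attention map of one gated positional self-attention head.

    x: (n, d) patch embeddings, pos_scores: (n, n), lam: scalar gate parameter.
    """
    q, k = x @ w_q, x @ w_k
    content = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # classical self-attention
    positional = softmax(pos_scores)                    # convolution-like, local
    gate = sigmoid(lam)
    return (1.0 - gate) * content + gate * positional

rng = np.random.default_rng(0)
grid, d = 4, 8                          # 4x4 = 16 patches, 8-dim embeddings
x = rng.normal(size=(grid * grid, d))
w_q, w_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
pos_scores = local_position_scores(grid)

attn_conv = gpsa_attention(x, w_q, w_k, pos_scores, lam=10.0)    # gate close to 1
attn_free = gpsa_attention(x, w_q, w_k, pos_scores, lam=-10.0)   # gate close to 0
```

With a large positive lambda the head attends almost only by position, like a local filter; with a large negative lambda it behaves like a classical self-attention head. Since lambda is learned, each layer can settle anywhere between the two.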
CMT: Convolutional Neural Networks Meet Vision Transformers
Another recent proposal comes from Huawei’s laboratories, which introduce an even more advanced architecture than those seen so far, built around what they call the CMT block [3]. Many of these blocks are stacked within the new architecture; they mix the mechanism of self-attention with that of convolution and also introduce some performance optimizations.
Each CMT block consists of three basic parts:
- Local Perception Unit: Used to overcome the limitations introduced by classical positional embedding and the inability of classical Vision Transformers to capture local relationships and structured information within individual patches. The Local Perception Unit (LPU) extracts local information through a simple depth-wise convolution.
- Lightweight Multi-head Self-attention: To reduce the cost of computing attention, this component reduces the spatial size of the matrices K and V using a k x k depth-wise convolution with stride k. Because the resulting matrices are smaller, fewer attention scores have to be computed.
- Inverted Residual Feed-forward Network: This is the final layer of each block and replaces the classic Multi-Layer Perceptron of the Vision Transformers with an expansion layer, followed by a depth-wise convolution and a projection layer.
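The saving obtained by the lightweight attention can be illustrated with a single-head NumPy sketch. It is a simplification under stated assumptions, not the CMT code: the depth-wise convolution is applied to the input before the key and value projections, other details of the block are omitted, and every name and size below is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_conv_downsample(x, kernel, k):
    """k x k depth-wise convolution with stride k.

    x: (H, W, C), kernel: (k, k, C) -> (H // k, W // k, C)
    """
    h, w, c = x.shape
    assert h % k == 0 and w % k == 0, "spatial size must be divisible by k"
    blocks = x.reshape(h // k, k, w // k, k, c)
    # Depth-wise: every channel is filtered by its own k x k kernel.
    return np.einsum("hiwjc,ijc->hwc", blocks, kernel)

def lightweight_attention(x, w_q, w_k, w_v, kernel_k, kernel_v, k):
    """Single-head attention with spatially reduced keys and values."""
    h, w, c = x.shape
    q = x.reshape(h * w, c) @ w_q                                        # (n, d)
    key = depthwise_conv_downsample(x, kernel_k, k).reshape(-1, c) @ w_k    # (n / k^2, d)
    value = depthwise_conv_downsample(x, kernel_v, k).reshape(-1, c) @ w_v  # (n / k^2, d)
    scores = q @ key.T / np.sqrt(q.shape[-1])                            # (n, n / k^2)
    return softmax(scores) @ value, scores.shape

rng = np.random.default_rng(0)
h, w, c, k = 8, 8, 16, 2
x = rng.normal(size=(h, w, c))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(c, c)) for _ in range(3))
kernel_k, kernel_v = rng.normal(size=(k, k, c)), rng.normal(size=(k, k, c))

out, scores_shape = lightweight_attention(x, w_q, w_k, w_v, kernel_k, kernel_v, k)
print(out.shape)      # (64, 16)
print(scores_shape)   # (64, 16), instead of (64, 64) for full self-attention
```

With k = 2 the score matrix shrinks from 64 x 64 to 64 x 16 entries, a fourfold reduction, while every one of the 64 query positions still receives an output.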
The resulting architecture is therefore able to take advantage of the best of both networks, and does so efficiently thanks to the optimizations introduced in each of these components.
The idea of combining convolutional networks and Vision Transformers seems not only feasible in many ways, but also very effective. To date, these variants have achieved excellent results on key datasets such as ImageNet, and at the time of writing CMT is reported as the state-of-the-art network in terms of accuracy on that dataset [3]. In addition, the experiments carried out show that these networks are also considerably lighter and smaller than both classical approaches based exclusively on convolutional networks and those based on Vision Transformers.
Many have looked to Vision Transformers as the successor to Convolutional Neural Networks, but today it seems that great potential lies in the combination of these two approaches.
We can definitely say: “Unity is Strength!”
References and Insights
[1] D’Ascoli et al. “ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases”
[2] Coccomini et al. “Combining EfficientNet and Vision Transformers for Video Deepfake Detection”
[3] Guo et al. “CMT: Convolutional Neural Networks Meet Vision Transformers”
[4] Davide Coccomini. “On Transformers, Timesformers and Attention”
[5] Davide Coccomini. “On DINO, Self-Distillation with no labels”
[6] Davide Coccomini. “Is Attention what you really need in Transformers?”
[7] Louis Bouchard. “Will Transformers Replace CNNs in Computer Vision?”
[8] Victor Perez. “Transformers in Computer Vision: Farewell Convolutions!”
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.