The world of Machine Learning is undoubtedly fascinating, constantly growing, and capable of touching the most diverse sectors, from medicine to space racing, from catering to big manufacturing. There are countless fields of application for this technology and just as many techniques that have been developed over the decades, but they all have one thing in common: data.
Every Machine Learning model exists and works thanks to what it has been able, in one way or another, to learn from data. This data, however, can take very different forms: large amounts of text to train language models that generate sentences, understand context or irony, or identify anomalies; millions of images of objects, people and animals to build classification or object detection models; and even audio tracks for tasks such as identifying a song or its style.
All of this brings one big problem: dealing with such different data requires different techniques, so whole separate branches of Machine Learning have sprung up, each focusing on one of these data types: Natural Language Processing (NLP) for text, Computer Vision (CV) for images and videos, and Audio Signal Processing (ASP) for audio tracks.
The difficulty becomes even more pronounced for tasks that require mixing different types of data, such as figuring out which text description best fits an image, or using both the audio and the video of a recording to identify anomalies within it.
But let’s go into more detail and try to trace the evolution of the situation from the beginning.
The Advent of Transformers
In the early days of Deep Learning, among the dozens and dozens of architectures, two stood out: Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs).
A first approach to analysing different types of data with the same architecture came with text and audio, thanks to LSTMs. These networks are designed to analyse data in the form of sequences, so it was quite natural to treat sentences as sequences of words, and equally natural to treat audio tracks as sequences.
At the same time, Convolutional Neural Networks were gaining ground in the Computer Vision field. Unlike LSTMs, they were better able to capture spatial correlations and were therefore more suitable for processing images, which they scan with moving windows.
Based on very different concepts, the NLP/ASP and CV worlds evolved largely independently for several years, on the accepted assumption that the vision and the text/audio fields could not share a common architecture because of the different nature of their data.
After years in which little changed on this front, the crucial turning point came from the NLP field, where the Transformer architecture was first presented in 2017.
This architecture was also designed to analyze data in the form of sequences but, unlike LSTMs, it was able to overcome some important limitations:
- Transformers are better able to capture dependencies between very distant portions of the input sequence;
- They exploit the attention mechanism, which allows greater parallelization of calculations;
- They are capable of analyzing even very long sequences.
Looking at an example in the field of Natural Language Processing, Transformers analyze a sentence as a sequence of words by exploiting the attention mechanism, which calculates a kind of relational relevance between every pair of words in the sentence. Thus, as shown in the figure, attention is calculated between the first word of the sentence and all the others, between the second and all the others, and so on.
In doing so, each part of the sequence is analyzed with respect to all the others and, since these calculations are independent of one another, they can also be parallelized!
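The pairwise calculation described above can be sketched in a few lines. This is a minimal illustration of scaled dot-product attention in plain Python, not a full Transformer layer: it assumes the token embeddings are used directly as queries, keys and values (a real model would first apply learned projections), and the toy embeddings are made-up numbers.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a sequence of token vectors.

    Returns (outputs, weights), where weights[i][j] says how much
    token i attends to token j.
    """
    d = len(keys[0])
    weights = []
    for q in queries:  # each row depends only on q, so rows can run in parallel
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    # Each output is a weighted average of the value vectors.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, values)) for c in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights

# Toy "sentence" of three tokens with 2-dimensional embeddings (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and is computed without looking at any other row, which is what makes the computation easy to parallelize.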
If you want to go deeper into the Transformers architecture I suggest you read my previous overview.
Thanks to these characteristics, Transformers quickly became the reference architecture in the field of Natural Language Processing, replacing LSTMs almost entirely. As could be expected, Transformers also began to be used more and more in the field of Audio Signal Processing, but hardly anyone would have expected this new architecture to catch the attention of researchers in the field of Computer Vision too.
If we could transform images into sequences, would Transformers be able to analyze them and capture enough spatial information to compete with traditional Convolutional Neural Networks?
The answer is yes! The idea behind the so-called Vision Transformers is to divide an image into many parts, called patches, and then project them linearly into tokens. These tokens are exactly analogous to those obtained from words and therefore, the entire remaining architecture of the Transformers can remain unchanged.
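The patch-and-project step can be sketched as follows. This is a simplified illustration in plain Python under stated assumptions: the image is a tiny single-channel grid, the function names are hypothetical, and the projection matrix is filled with arbitrary numbers, whereas in a trained model those weights are learned. Details such as position embeddings are left out.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened patches,
    in row-major order. H and W must be multiples of patch_size."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size) for c in range(patch_size)]
            patches.append(patch)
    return patches

def linear_projection(patches, weight):
    """Project each flattened patch to a token: token = patch @ weight.
    weight has shape (patch_size * patch_size, embed_dim)."""
    embed_dim = len(weight[0])
    return [[sum(p * weight[i][j] for i, p in enumerate(patch))
             for j in range(embed_dim)]
            for patch in patches]

# 4x4 single-channel toy image with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)  # 4 patches of 4 pixels each

# Hypothetical projection matrix (4 inputs -> 3-dimensional tokens).
weight = [[0.1 * (i + j) for j in range(3)] for i in range(4)]
tokens = linear_projection(patches, weight)      # 4 tokens of dimension 3
```

After this step the image is just a sequence of four vectors, so it can be fed to the same attention computation used for words.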
As shown in a previous article, Transformers in the field of Computer Vision are extremely powerful thanks to architectural details that allow them, compared to Convolutional Neural Networks, to better capture global relationships as well as local patterns.
It’s done: Transformers are officially the common architecture we needed. They can manipulate text, images, video, audio and any type of data that can be turned into tokens!
Multimodal Machine Learning
Now having a single architecture capable of working with different types of data represents a major advance in the so-called Multimodal Machine Learning field.
This discipline starts from the observation of human behaviour. People are able to combine information from several sources to draw their own inferences. They simultaneously receive data by observing the world around them with their eyes, but also by smelling its scents, listening to its sounds or touching its shapes. It’s totally natural for us to combine stimuli of different types, but it has always been very difficult to get a neural network to do the same.
The problem lies in treating all the different inputs in the same way without losing information, and thanks to Transformers, we can now build a universal architecture that can handle any kind of data!
VATT: Transformers for Multimodal Self-Supervised Learning
One of the most important applications of Transformers in the field of Multimodal Machine Learning is certainly VATT.
This study seeks to exploit the ability of Transformers to handle different types of data to create a single model that can learn simultaneously from video, audio and text.
To do this, the proposed architecture is composed of a single Transformer Encoder on which three distinct forward calls are made, one for each type of input data. In every case the input is first transformed into a sequence of tokens. The Transformer takes these sequences as input and returns three distinct sets of features. The features are then passed to a contrastive estimation block that calculates a single loss, on which the backward pass is performed.
In this way the loss reflects the error committed on all three types of data considered, and therefore the model, over the epochs, will learn to reduce it by better handling the information coming from all three sources.
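The training scheme just described can be sketched as below. This is an illustrative toy, not the VATT implementation: `shared_encoder` is a hypothetical stand-in that mean-pools tokens where the real model uses a Transformer with learned weights, the loss is a generic InfoNCE-style contrastive loss chosen for simplicity and is not claimed to be the exact objective of the paper, and all inputs are made-up numbers.

```python
import math

def shared_encoder(tokens):
    """Stand-in for the single shared encoder: mean-pool the token
    vectors and L2-normalize the result (no learned weights here)."""
    dim = len(tokens[0])
    pooled = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: anchors[i] should be closer to positives[i]
    than to the other positives in the batch."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature
                  for p in positives]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum_exp - logits[i]
    return loss / len(anchors)

# Batch of two clips, each already tokenized per modality (assumed values).
batch = [
    {"video": [[1.0, 0.0], [0.9, 0.1]], "audio": [[0.8, 0.2]], "text": [[1.0, 0.1]]},
    {"video": [[0.0, 1.0], [0.1, 0.9]], "audio": [[0.2, 0.8]], "text": [[0.1, 1.0]]},
]

# Three forward calls on the same encoder, one per modality.
video = [shared_encoder(sample["video"]) for sample in batch]
audio = [shared_encoder(sample["audio"]) for sample in batch]
text = [shared_encoder(sample["text"]) for sample in batch]

# A single scalar loss combines the errors across all three modalities.
total_loss = contrastive_loss(video, audio) + contrastive_loss(video, text)
```

In a real training loop the gradient of `total_loss` would flow back through the one shared encoder, so every update is informed by video, audio and text at once.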
VATT thus represents the culmination of what Multimodal Machine Learning had been trying to achieve for years, a single model that handles completely different types of data together.
GATO: A Generalist Agent
But what impressive results can Multimodal Machine Learning research lead to? Is it possible to realize a neural network capable of receiving inputs of different types, processing them and perhaps even performing many tasks of a different nature?
What would you think if I told you that the same network, with the exact same internal weights, could receive input data from very different sources and be able to play Atari, chat like a real person, caption images, stack blocks with a real robot arm and much more?
It is now possible thanks to GATO, a multi-modal, multi-task, multi-embodiment generalist that represents one of the most impressive achievements in this field today.
But how does Gato do all this? Internally, once again, there is a Transformer that takes as input data of different types, all transformed into a single sequence of tokens.
Thanks to this unification of inputs and to the Transformer architecture, the model will be able to acquire information from even very different sources, achieving an unprecedented level of generalisation.
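To make the idea of unifying inputs concrete, here is a minimal sketch of how text and continuous values could share one token sequence. It is a hypothetical scheme for illustration only and is not claimed to be Gato's actual tokenization: the vocabulary, the constants `TEXT_VOCAB_SIZE` and `NUM_BINS`, and both function names are assumptions.

```python
TEXT_VOCAB_SIZE = 1000  # hypothetical size of the text vocabulary
NUM_BINS = 256          # hypothetical number of bins for continuous values

def tokenize_text(words, vocabulary):
    """Map words to integer IDs in [0, TEXT_VOCAB_SIZE)."""
    return [vocabulary[word] for word in words]

def tokenize_continuous(values, low=-1.0, high=1.0):
    """Discretize floats in [low, high] into NUM_BINS uniform bins, then
    shift the bin index past the text IDs so the ranges never collide."""
    tokens = []
    for value in values:
        clipped = min(max(value, low), high)
        bin_index = int((clipped - low) / (high - low) * NUM_BINS)
        bin_index = min(bin_index, NUM_BINS - 1)  # keep `high` in the last bin
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens

# Toy vocabulary and robot joint readings (assumed values).
vocabulary = {"stack": 0, "the": 1, "red": 2, "block": 3}
instruction = tokenize_text(["stack", "the", "red", "block"], vocabulary)
joint_angles = tokenize_continuous([-1.0, 0.0, 0.5, 1.0])

# One flat sequence of integers, whatever the original modality was.
sequence = instruction + joint_angles
```

Once everything is an integer in a shared ID space, the Transformer no longer needs to know which modality a token came from.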
We took a look at one of the new frontiers of artificial intelligence, Multimodal Machine Learning, and analysed the role of Transformers in this revolution. Thanks to this new architecture capable of working with different types of input in an efficient manner, the road to a more generalist neural network is more concrete than ever. There are still many steps forward to be taken, but when work such as that discussed in this article is presented, progress is undeniable.
Could these be the first signals of a General Artificial Intelligence? We will find out!
References and Insights
Tadas Baltrusaitis et al. “Multimodal Machine Learning: A Survey and Taxonomy”
Ashish Vaswani et al. “Attention Is All You Need”
Hassan Akbari et al. “VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text”
Alexey Dosovitskiy et al. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”
Scott Reed et al. “GATO: A Generalist Agent”
Davide Coccomini. “On Transformers, Timesformers and Attention”
Davide Coccomini. “Self-Supervised Learning in Vision Transformers”
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.