Transformers in Computer Vision

The Transformer architecture has achieved state-of-the-art results in many NLP (Natural Language Processing) tasks. One of the most visible breakthroughs built on the Transformer is the powerful GPT-3, released in mid-2020, whose paper received a Best Paper award at NeurIPS 2020.

In Computer Vision, CNNs have been the dominant models for vision tasks since 2012. Computer vision and NLP are now increasingly converging on a shared, more efficient class of architectures.

Using Transformers for vision tasks has become a new research direction, aimed at reducing architecture complexity and exploring scalability and training efficiency.

The following are some well-known projects in this research area:

End-to-End Object Detection with Transformers (DETR), uses transformer for object detection and segmentation
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Vision Transformer), uses transformer for image classification
Generative Pretraining from Pixels (Image GPT), uses transformer for pixel level image completion, just like other GPT for text completion
End-to-end Lane Shape Prediction with Transformers, uses transformer for lane marking detection in autonomous driving


Architecture

Overall, the work on adopting Transformers in CV falls into 2 major model architectures. One is the pure Transformer architecture; the other is the hybrid architecture, which combines a CNN backbone with the Transformer.

Pure Transformer
Hybrid (CNNs + Transformer)

Vision Transformer is the full self attention based Transformer architecture without CNNs and can be used out of the box, while DETR is an example of using the hybrid model architecture, which combines the convolutional neural network (CNNs) with Transformer.

Questions:

Why use Transformer in CV? And how?
How are the benchmark results?
What are the constraints and challenges of using Transformer in CV?
Which architecture is more efficient and flexible? And why?

You will find the experiments and answers in the following deep dive of ViT (vision transformer), DETR (detection transformer), and Image GPT.

If you are new to the Transformer architecture, we recommend checking out the Illustrated Transformer and the Attention Is All You Need paper.

Vision Transformer

Vision Transformer (ViT) can achieve excellent results with pure transformer architecture applied directly to a sequence of image patches for classification tasks.

It also outperforms the state-of-the-art convolutional networks on many image classification tasks while requiring substantially fewer computational resources (at least 4 times fewer than SOTA CNN) to pre-train.

Vision Transformer model architecture in action, gif from Google AI blog

Sequence of image patches

An image is fed into the Transformer by splitting it into fixed-size patches and passing the linear projections of these patches, along with their positions in the image, to the model. The rest of the pipeline is a clean, standard Transformer encoder (ViT does not use the decoder), followed by a classification head.

Position embeddings are added to the image patch embeddings to retain spatial/positional information in a global scope with different strategies. In the paper, they tried different ways to encode the spatial information, including no positional information, 1D/2D positional embeddings, and relative positional embeddings.
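To make the patch pipeline above more concrete, here is a minimal NumPy sketch of splitting an image into patches, projecting them linearly, and adding position embeddings. It is only an illustration, not the authors' implementation: the function names, the embedding width of 64, and the random weights are all assumptions made for this example, and the extra classification token is left out.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (N, patch_size * patch_size * C),
    where N = (H // patch_size) * (W // patch_size).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (N, P*P*C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(patches, projection, position_embeddings):
    """Linearly project each patch and add one position embedding per patch."""
    return patches @ projection + position_embeddings

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # toy input with random pixels
patches = patchify(image, patch_size=16)   # shape (196, 768)

d_model = 64  # hypothetical embedding width, chosen only for this sketch
projection = rng.normal(size=(patches.shape[1], d_model)) * 0.02
# 1D position embeddings: one vector per patch index (random here, learned in practice)
position_embeddings = rng.normal(size=(patches.shape[0], d_model)) * 0.02

tokens = embed_patches(patches, projection, position_embeddings)
print(patches.shape, tokens.shape)  # (196, 768) (196, 64)
```

The resulting tokens array is the sequence that would be handed to the Transformer, one row per image patch.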

Comparison of different positional embeddings strategies

One of the interesting findings is that 2D positional embeddings did not bring significant performance gains when compared with 1D positional embeddings.

Dataset

The model is pre-trained on multiple large-scale datasets (with deduplication) and then fine-tuned on smaller downstream tasks.

ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images
ImageNet-21k with 21k classes and 14M images
JFT with 18k classes and 303M high-resolution images

Model Variants:

Like other popular Transformer models (GPT, BERT, RoBERTa), ViT (Vision Transformer) comes in different model sizes (Base, Large, and Huge) with different numbers of transformer layers and heads. For example, ViT-L/16 can be interpreted as a Large (24 layers) ViT model with a 16 x 16 input image patch size.
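The naming convention can be decoded with a small helper like the one below. This is just a sketch: the function and dictionary names are made up for illustration, and the layer, hidden size, and head counts are the values reported in the ViT paper rather than anything stated in this article (apart from the 24 layers of the Large model).

```python
# Variant settings as reported in the ViT paper (assumption: not taken from this article).
VIT_VARIANTS = {
    "B": {"name": "Base", "layers": 12, "hidden_size": 768, "heads": 12},
    "L": {"name": "Large", "layers": 24, "hidden_size": 1024, "heads": 16},
    "H": {"name": "Huge", "layers": 32, "hidden_size": 1280, "heads": 16},
}

def parse_vit_name(model_name):
    """Decode a name such as 'ViT-L/16' into its size settings and patch size."""
    prefix, patch = model_name.split("/")   # 'ViT-L', '16'
    size_key = prefix.split("-")[1]         # 'L'
    config = dict(VIT_VARIANTS[size_key])   # copy so the table is not mutated
    config["patch_size"] = int(patch)
    return config

print(parse_vit_name("ViT-L/16"))
```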

Note that a smaller input patch size yields a more computationally expensive model, because the number of input patches is N = HW/P^2, where (H, W) is the resolution of the original image and P is the side length of each square patch. This means a 14 x 14 patch size is more computationally expensive than a 16 x 16 patch size.
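As a quick worked example of the patch count formula, the sketch below compares the two patch sizes. The 224 x 224 input resolution is an assumption picked for illustration, and counting N * N token pairs is only a rough proxy for the cost of self attention, not a full compute estimate.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping square patches: N = H * W / P^2."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)

for patch_size in (16, 14):
    n = num_patches(224, 224, patch_size)
    # Self attention compares every patch with every other patch: N * N pairs.
    print(f"patch {patch_size}x{patch_size}: N = {n}, attention pairs = {n * n}")
# patch 16x16: N = 196, attention pairs = 38416
# patch 14x14: N = 256, attention pairs = 65536
```

Under these assumptions, moving from 16 x 16 to 14 x 14 patches increases the sequence length from 196 to 256 tokens.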

Benchmark Results:

The above results show the large Vision Transformer model beats previous SOTA on multiple popular benchmark datasets.

The Vision Transformer (ViT-H/14, ViT-L/16) pre-trained on the JFT-300M dataset outperforms the ResNet model (ResNet152x4, which is pre-trained on the same JFT-300M dataset) on all test datasets while taking substantially fewer computational resources (TPUv3 core days) during pre-training. Even the ViT pre-trained on ImageNet-21k outperforms the baseline.

Performance vs Dataset Size

Pre-training dataset size VS model performance

The above graph shows the impact of dataset size on model performance. ViT did not perform well when the pre-training dataset was small, but it outperforms the previous SOTA when given sufficient training data.

Which architecture is more efficient?

As mentioned in the beginning, there are different architecture designs for using transformers in computer vision: some totally replace CNNs with a transformer (ViT), some partially replace them, and some combine both CNNs and a transformer (DETR). The following results show the performance of each model architecture under the same computational budget.

Performance VS computational cost for different model architecture

The above experiment revealed that:

The pure transformer architecture (ViT) is more efficient and scalable than traditional CNNs (ResNet BiT) at both smaller and larger compute scales.
The hybrid architecture (CNNs + Transformer) performs better than the pure transformer at smaller model sizes, and the two get very close when the model is bigger.

Highlights of ViT (vision transformer):

Uses transformer architecture (pure or hybrid)
Input images are split into patches, which are flattened into a sequence
Beats state of the art on multiple image recognition benchmarks
Much cheaper to pre-train on large datasets
More scalable and computationally efficient

DETR

Detection Transformer (DETR) is the first object detection framework that successfully used the Transformer as the main building block in the pipeline.

It matches the performance of the previous SOTA methods (highly optimized Faster R-CNN) with a much simpler and more flexible pipeline.

DETR combines CNN and Transformer in the pipeline for object detection, image from Facebook AI blog

The above shows DETR, a hybrid pipeline that uses a CNN and a Transformer as its main building blocks. Here is the flow:

A CNN is used to learn a 2D representation of an image and extract the features.
The output of the CNN is flattened and supplemented with positional encodings before being fed into a standard Transformer encoder.
The Transformer’s decoder passes the output embeddings to a feed forward network (FFN) for predicting the class and bounding box.
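The second step of the flow above, turning a CNN feature map into a sequence for the encoder, can be sketched as follows. This is a simplified illustration rather than DETR's actual code: the function names are hypothetical, the feature map is random, and a plain 1D sinusoidal encoding over the flattened index is used as a stand-in for whatever positional encoding the real model applies.

```python
import numpy as np

def sinusoidal_positions(length, dim):
    """Fixed 1D sine/cosine encodings of shape (length, dim); dim must be even."""
    positions = np.arange(length)[:, None]                         # (length, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)  # (dim / 2,)
    encodings = np.zeros((length, dim))
    encodings[:, 0::2] = np.sin(positions * freqs)
    encodings[:, 1::2] = np.cos(positions * freqs)
    return encodings

def features_to_sequence(feature_map):
    """Flatten a (C, H, W) feature map to (H * W, C) and add positional encodings."""
    c, h, w = feature_map.shape
    sequence = feature_map.reshape(c, h * w).T  # one C-dimensional token per location
    return sequence + sinusoidal_positions(h * w, c)

rng = np.random.default_rng(0)
feature_map = rng.random((8, 4, 5))   # toy stand-in for a CNN backbone output
sequence = features_to_sequence(feature_map)
print(sequence.shape)  # (20, 8)
```

Each row of the sequence corresponds to one spatial location of the feature map, which is what lets self attention relate any two locations in the image.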

Simpler Pipeline

Traditional object detection pipeline compared with DETR, image from Facebook AI blog

Traditional object detection methods, like Faster R-CNN, involve multiple steps, such as anchor generation and the removal of duplicate predictions with non-maximum suppression (NMS). DETR drops these hand-designed components to significantly streamline the object detection pipeline.
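For readers unfamiliar with NMS, here is a minimal pure-Python sketch of the kind of hand-designed post-processing step that DETR removes. It is a generic greedy NMS for illustration, not the code of Faster R-CNN or any specific library, and the 0.5 IoU threshold and toy boxes are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring boxes, drop boxes that overlap a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box duplicates the first
```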

Astonishing results when extended for Panoptic Segmentation

In the paper, the authors further extended the DETR pipeline to the panoptic segmentation task, a recently popular and challenging pixel-level recognition task.

Put simply, panoptic segmentation unifies 2 distinct tasks: one is traditional semantic segmentation (assigning a class label to each pixel), the other is instance segmentation (detecting and segmenting each object instance). It is a smart idea to use one model architecture to solve both tasks.

Pixel level panoptic segmentation, image from DETR paper

The above graph shows an example of the panoptic segmentation. With the unified pipeline of DETR, it outperformed the competitive baselines.

Visualizing Attention

The following graph visualizes the Transformer’s decoder attention for the predicted objects. Attention scores are represented with different colors for different objects.

By looking at the colors/attention, you will be amazed by the model’s capability of resolving overlapping bounding boxes with a global understanding of the image through self attention. This is especially clear in the example of the zebra legs shown in orange, which are attributed to the correct zebra even though they heavily overlap with the blue and green ones locally.

Visualizing decoder attention for predicted object, image from DETR paper

Highlights of DETR:

Much simpler and more flexible pipeline with Transformer
Matches previous state of the art on the object detection task
More efficient by directly outputting the final set of predictions in parallel
Unified architecture for object detection and segmentation
Performs significantly better on large object detection, but worse on small objects

Image GPT

Image GPT is a GPT-2 transformer based model that has been trained on pixel sequences to generate image completions and samples. Like a general pre-trained language model, it is designed to learn high-quality unsupervised image representations. It can predict the next pixel auto-regressively without any knowledge of the 2D structure of the input image.

Features from the pre-trained Image GPT achieved state-of-the-art performance on a number of classification benchmarks and near state-of-the-art unsupervised accuracy on ImageNet.
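The idea of treating an image as a 1D pixel sequence for next-pixel prediction can be sketched as below. This is a toy illustration under stated assumptions, not Image GPT's actual preprocessing: the helper names are hypothetical, the image is a tiny grayscale grid, and raster (row by row) ordering is assumed.

```python
import numpy as np

def to_pixel_sequence(image):
    """Flatten an (H, W) image of integer pixel values in raster order."""
    return image.reshape(-1)

def next_pixel_pairs(sequence):
    """Inputs are pixels 0..n-2; each target is the pixel that follows its input."""
    return sequence[:-1], sequence[1:]

def causal_mask(length):
    """Boolean mask where position i may attend only to positions <= i."""
    return np.tril(np.ones((length, length), dtype=bool))

image = np.arange(6).reshape(2, 3)       # toy 2 x 3 "image"
sequence = to_pixel_sequence(image)      # [0 1 2 3 4 5]
inputs, targets = next_pixel_pairs(sequence)
mask = causal_mask(len(inputs))

print(inputs, targets)
print(mask.astype(int))
```

Once flattened, the model sees no explicit 2D layout: the second row of the image simply continues the sequence after the first.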

The following image shows model-generated completions: a human-provided half image is given as input, followed by the creative completions from the model.

Highlights of Image GPT:

Uses the same transformer architecture as GPT-2 in natural language text
Unsupervised learning without human labeling
Needs more compute to generate competitive representations
Learned features achieved SOTA performance on classification benchmarks with low-resolution datasets

Summary:

The Transformer’s great success in NLP is now being explored in the computer vision domain, and this has become a new research direction.

The Transformer has proved to be a simple and scalable framework for computer vision tasks like image recognition, classification, and segmentation, or just for learning global image representations.
It demonstrated a significant advantage in training efficiency when compared with traditional methods.
In terms of architecture, it can be used in a pure Transformer manner or in a hybrid manner by combining it with CNNs.
It also faces challenges, like low performance on detecting small objects in DETR and poor performance when the pre-training dataset is small in Vision Transformer (ViT).
The Transformer is becoming a more general framework for learning sequential data, including text, image, and time-series data.

This is just an early glimpse. I am looking forward to seeing what emerges from the increasing convergence of NLP and CV.

Reference:

Vision Transformer (code and models)
DETR (code and models)
Image Transformer
Image GPT
GPT-3, and OpenAI API
Illustrated Transformer (good materials to learn transformer)
Attention is all you need (origination of transformer, must read)
Transformer-XL: Unleashing the Potential of Attention Models
Rethinking Attention with Performers
Longformer: The Long-Document Transformer
Reformer: The Efficient Transformer

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.
