If We Want Machines to Think, We Need to Teach Them to See — Fei-Fei Li
Throughout this article, I will discuss some of the more complex aspects of convolutional neural networks and how they relate to specific tasks such as object detection and facial recognition.
The topics that will be discussed in this tutorial are:
- CNN review
- Receptive Fields and Dilated Convolutions
- Saliency Maps
- Transposed Convolutions
- Classic Networks
- Residual networks
- Transfer Learning
This article is a natural extension to my article titled: Simple Introductions to Neural Networks. I recommend looking at this before tackling the rest of this article if you are not well-versed in the idea and function of convolutional neural networks.
Due to the excessive length of the original article, I have decided to leave out several topics related to object detection and facial recognition systems, as well as some of the more esoteric network architectures and practices currently being trialed in the research literature. I will likely discuss these in a future article related more specifically to the application of deep learning for computer vision.
The code associated with this article can be found on my convolutional neural network GitHub repository.
In my original article, I discussed why fully connected networks are insufficient for the task of image analysis. The unique aspects of CNNs are as follows:
- Fewer parameters (weights and biases) than a fully connected network.
- Invariant to object translation — they do not depend on where the feature occurs in the image.
- Can tolerate some distortion in the images.
- Capable of generalizing and learning features.
- Requires grid input.
Convolutional layers are formed by filters, feature maps, and activation functions. The convolutions can use full, same, or valid padding.
We can determine the number of output layers of a given convolutional block if we know the number of layers in the input, nᵢ, the number of filters in that stage, f, the size of the stride, s, and the pixel dimension of the image, p (assuming it is square).
Pooling layers downsample the feature maps, which reduces computation and helps limit overfitting. Fully connected layers are used to mix spatial and channel features together. Each feature map is the result of sliding one filter across the image, which is how features are extracted.
It is important to know the number of input and output layers, as this determines the number of weights and biases that make up the parameters of the neural network. The more parameters in the network, the more parameters need to be trained, which results in longer training time. Training time is very important for deep learning, as it is a limiting factor unless you have access to powerful computing resources such as a computing cluster.
Below is an example network for which we will calculate the total number of parameters.
In this network, we have 250 weights on the convolutional filter and 10 bias terms. We have no weights on the max-pooling layer. We have 13 × 13 × 10 = 1,690 output elements after the max-pooling layer. We have a 200-node fully connected layer, which results in a total of 1,690 × 200 = 338,000 weights and 200 bias terms in the fully connected layer. Thus, we have a total of 338,460 parameters to be trained in the network. We can see that the majority of the trained parameters occur at the fully connected output layer.
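The arithmetic above can be checked with a few lines of Python. This is only a sketch: the kernel size (5 × 5) and the single input channel are my assumptions, chosen because they reproduce the 250 convolutional weights quoted in the text; the helper names are hypothetical.

```python
def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Return (weights, biases) of a convolutional layer."""
    weights = kernel_h * kernel_w * in_channels * n_filters
    biases = n_filters  # one bias per filter
    return weights, biases


def dense_params(n_inputs, n_units):
    """Return (weights, biases) of a fully connected layer."""
    return n_inputs * n_units, n_units


# Assumed shapes (not stated in the text): 5 x 5 kernels, 1 input channel.
conv_w, conv_b = conv_params(5, 5, 1, 10)   # 250 weights, 10 biases
flat = 13 * 13 * 10                         # 1,690 elements after max-pooling
fc_w, fc_b = dense_params(flat, 200)        # 338,000 weights, 200 biases

total = conv_w + conv_b + fc_w + fc_b       # 338,460 trainable parameters
print(conv_w, conv_b, flat, fc_w, fc_b, total)
```

The max-pooling layer contributes nothing to the total because it has no trainable parameters.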
Each CNN layer learns filters of increasing complexity. The first layers learn basic feature detection filters such as edges and corners. The middle layers learn filters that detect parts of objects — for faces, they might learn to respond to eyes and noses. The last layers have higher representations: they learn to recognize full objects, in different shapes and positions.
For those of you who need a more visceral feel to understand the convolutional neural network before continuing, it may be helpful to look at this three-dimensional representation:
In the next section, we will discuss the concept of receptive fields of a convolutional layer in more detail.
Receptive Field and Dilated Convolutions
The receptive field is defined as the region in the input space that a particular CNN feature is looking at (i.e. is affected by). Applying a convolution C with kernel size k = 3 × 3, padding size p = 1 × 1, and stride s = 2 × 2 on a 5 × 5 input map, we will get a 3 × 3 output feature map (green map).
Applying the same convolution on top of the 3 × 3 feature map, we will get a 2 × 2 feature map (orange map).
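The two output sizes above follow from the usual output-size formula, floor((n + 2p − k) / s) + 1, and the receptive field can be tracked layer by layer in the same way. Here is a minimal sketch of both calculations; the function names are mine, and the receptive field is measured in padded input coordinates, so near the border it can extend into the padding.

```python
def conv_output_size(n, k, p, s):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


def receptive_field(layers):
    """Receptive field of one output unit after a stack of (k, s) layers.

    `jump` is the distance, in input pixels, between neighbouring units
    of the current feature map.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r


first = conv_output_size(5, k=3, p=1, s=2)        # 3 (green map)
second = conv_output_size(first, k=3, p=1, s=2)   # 2 (orange map)
rf_one = receptive_field([(3, 2)])                # 3
rf_two = receptive_field([(3, 2), (3, 2)])        # 7
```

Each unit of the orange map therefore depends on a 7 × 7 window of the (padded) input, even though its own kernel is only 3 × 3.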
Let’s look at the receptive field again in one-dimension, with no padding, a stride of 1 and a kernel of size 3 × 1.
We can skip some of these connections in order to create a dilated convolution, as shown below.
This dilated convolution works in a similar way to a normal convolution, the major difference being that the receptive field no longer consists of contiguous pixels, but of individual pixels separated by other pixels. The way in which a dilated convolutional layer is applied to an image is shown in the figure below.
The below figure shows dilated convolution on two-dimensional data. The red dots are the inputs to a filter which is 3 × 3, and the green area is the receptive field captured by each of these inputs. The receptive field is the implicit area captured on the initial input by each input (unit) to the next layer.
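To make the idea concrete, here is a small NumPy sketch of a one-dimensional dilated convolution. It is an illustration under my own assumptions (no padding, stride 1, a hypothetical `dilated_conv1d` helper), not the implementation used by any particular library.

```python
import numpy as np


def dilated_conv1d(x, w, dilation=1):
    """Valid 1-D cross-correlation with (dilation - 1) skipped inputs between taps."""
    k = len(w)
    span = (k - 1) * dilation + 1           # effective kernel size
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        taps = x[i:i + span:dilation]       # every dilation-th input in the window
        out[i] = np.dot(taps, w)
    return out


x = np.arange(10, dtype=float)   # 0, 1, ..., 9
w = np.array([1.0, 1.0, 1.0])    # three weights regardless of dilation

dense = dilated_conv1d(x, w, dilation=1)     # each output spans 3 inputs
dilated = dilated_conv1d(x, w, dilation=2)   # each output spans 5 inputs
```

With a dilation of 2, the same three weights cover a window of five inputs, which is how dilation widens the receptive field without adding parameters.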
The motivations behind using dilated convolutions are:
- The detection of fine details by processing inputs in higher resolutions.
- A broader view of the input to capture more contextual information.
- Faster run-time with fewer parameters.
In the next section, we will discuss using saliency maps to examine the performance of convolutional networks.
Saliency maps are a useful technique that data scientists can use to examine convolutional networks. They can be used to study the activation patterns of neurons to see which particular sections of an image are important for a particular feature.
Let’s imagine that you are given an image of a dog and asked to classify it. This is pretty simple for a human to do. However, a deep learning network might not be as smart as you and might instead classify it as a cat or a lion. Why does it do this?
The two main reasons why the network may misclassify the image are:
- bias in training data
- no regularization
We want to understand what made the network give a certain class as output — one way of doing this is to use saliency maps. Saliency maps are a way to measure the spatial support of a particular class in a given image.
“Find me pixels responsible for the class C having score S(C) when the image I is passed through my network”.
How do we do that? We differentiate! For any function f(x, y, z), we can find the impact of the variables x, y, z on f at any specific point (x₁, y₁, z₁) by finding its partial derivatives with respect to these variables at that point. Similarly, to find the responsible pixels, we take the score function S for class C and compute its partial derivatives with respect to every pixel.
This is fairly tedious to implement by hand, but fortunately, automatic differentiation (autograd) can do it for us. The procedure works as follows:
- Forward pass the image through the network.
- Calculate the scores for every class.
- Enforce derivative of score S at last layer for all classes except class C to be 0. For C, set it to 1.
- Backpropagate this derivative through the network.
- Render the resulting pixel gradients as an image and you have your saliency map.
Note: In the second step, instead of applying softmax, we treat the problem as binary classification and use the probabilities.
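The steps above can be sketched without any deep learning framework. The example below uses a hypothetical two-layer network with random weights and writes the backward pass by hand in NumPy, so it stands in for what an autograd library would do on a real CNN; every name and shape in it is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny network: 8 x 8 grayscale image -> 16 hidden units -> 3 class scores.
H, W, HIDDEN, CLASSES = 8, 8, 16, 3
W1 = rng.normal(size=(HIDDEN, H * W))
b1 = rng.normal(size=HIDDEN)
W2 = rng.normal(size=(CLASSES, HIDDEN))
b2 = rng.normal(size=CLASSES)


def scores(image):
    """Forward pass returning the unnormalised class scores."""
    z = W1 @ image.ravel() + b1
    return W2 @ np.maximum(z, 0.0) + b2


def saliency(image, class_idx):
    """Absolute gradient of one class score with respect to every pixel."""
    z = W1 @ image.ravel() + b1        # forward pass up to the hidden layer
    grad_scores = np.zeros(CLASSES)
    grad_scores[class_idx] = 1.0       # 1 for class C, 0 for all other classes
    grad_h = W2.T @ grad_scores        # backpropagate through the output layer
    grad_z = grad_h * (z > 0)          # ... through the ReLU
    grad_x = W1.T @ grad_z             # ... down to the input pixels
    return np.abs(grad_x).reshape(image.shape)


image = rng.random((H, W))
smap = saliency(image, class_idx=1)    # same shape as the image
```

Pixels with large values in `smap` are the ones whose small changes would move the score of class 1 the most.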
Here are some examples of saliency maps.
What do we do with color images? Take the saliency map for each channel and either take the max, average, or use all 3 channels.
Two good papers outlining the functioning of saliency maps are:
- Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
- Attention-based Extraction of Structured Information from Street View Imagery
There is a GitHub repository associated with this article in which I show how to generate saliency maps (the repository can be found here). Here is a snippet of the code from the Jupyter notebook:
import warnings

import matplotlib.pyplot as plt
import numpy as np
from keras import activations
from vis.utils import utils
from vis.visualization import visualize_saliency

warnings.filterwarnings('ignore')
plt.rcParams["figure.figsize"] = (5, 5)

# `model`, `test_images` and `test_labels` (one-hot encoded) are assumed to be
# defined earlier in the notebook.

# Utility to search for layer index by name.
# Alternatively we can specify this as -1 since it corresponds to the last layer.
layer_idx = utils.find_layer_idx(model, 'preds')

# This corresponds to the Dense linear layer.
for class_idx in np.arange(10):
    # Pick the first test image that belongs to this class.
    indices = np.where(test_labels[:, class_idx] == 1.)[0]
    idx = indices[0]

    f, ax = plt.subplots(1, 4)
    ax[0].imshow(test_images[idx][..., 0])

    # One saliency map per backprop modifier, shown next to the input image.
    for i, modifier in enumerate([None, 'guided', 'relu']):
        grads = visualize_saliency(model, layer_idx, filter_indices=class_idx,
                                   seed_input=test_images[idx],
                                   backprop_modifier=modifier)
        if modifier is None:
            modifier = 'vanilla'
        ax[i+1].set_title(modifier)
        ax[i+1].imshow(grads, cmap='jet')
This code generates the following saliency maps (assuming that the keras-vis package, which provides vis.utils and vis.visualization, is installed). Please see the notebook if you want a fuller walkthrough of the implementation.
In the next section, we will discuss the idea of upsampling through the use of transposed convolutions.
So far, the convolutions we have looked at either maintain the size of their input or make it smaller. We can use the same technique to make the input tensor larger. This process is called upsampling. When we do it inside of a convolution step, it is called transposed convolution or fractional striding.
Note: Some authors call upsampling while convolving deconvolution, but that name is already taken by a different idea outlined in this paper.
To illustrate how the transposed convolution works, we will look at some illustrated examples of convolutions.
The first is an example of a typical convolutional layer with no padding, acting on an image of size 5 × 5. After the convolution, we end up with a 3 × 3 image.
Now we look at a convolutional layer with a padding of 1. The original image is 5 × 5, and the output image after the convolution is also 5 × 5.
Now we look at a convolutional layer with a padding of 2. The original image is 3 × 3, and the output image after the convolution is 5 × 5, which is larger than the input.
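The last example is exactly what a stride-1 transposed convolution does. The NumPy sketch below, written with hypothetical helper names and square single-channel inputs as simplifying assumptions, implements the transposed convolution directly and checks that it matches a normal convolution on the input padded by 2 (with the kernel flipped).

```python
import numpy as np


def conv2d_valid(x, w):
    """Plain 'valid' cross-correlation with stride 1."""
    k = w.shape[0]
    n = x.shape[0] - k + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out


def conv2d_transpose(x, w, stride=1):
    """Transposed convolution: each input value stamps a scaled copy of w."""
    k = w.shape[0]
    n = (x.shape[0] - 1) * stride + k
    out = np.zeros((n, n))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            r, c = i * stride, j * stride
            out[r:r + k, c:c + k] += x[i, j] * w
    return out


x = np.arange(9, dtype=float).reshape(3, 3)
w = np.arange(1, 10, dtype=float).reshape(3, 3)

up = conv2d_transpose(x, w)                           # 3 x 3 -> 5 x 5
padded = conv2d_valid(np.pad(x, 2), w[::-1, ::-1])    # pad by 2, flipped kernel
up_stride2 = conv2d_transpose(x, w, stride=2)         # 3 x 3 -> 7 x 7
```

With a stride larger than 1, the stamped copies are spread further apart, so the output grows to (n − 1) · s + k pixels per side.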
In Keras, for example when building a variational autoencoder, upsampling is done either with an UpSampling2D layer followed by a convolution or with a Conv2DTranspose layer. Hopefully, if you have seen this before, it now makes sense how these layers are able to increase the size of the image through the use of transposed convolutions.
In the next section, we will discuss the architectures of some of the classic networks. Each of these networks was revolutionary in some sense in forwarding the field of deep convolutional networks.
In this section, I will go over some of the classic architectures of CNNs. These networks were utilized in some of the seminal work done in the field of deep learning, and are often used for transfer learning purposes (a topic outlined later in this article).
The first piece of research proposing something similar to a convolutional neural network was authored by Kunihiko Fukushima in 1980 and was called the Neocognitron. Fukushima was inspired by discoveries about the visual cortex of mammals and applied the Neocognitron to handwritten character recognition.
By the end of the 1980s, several papers had considerably advanced the field. The idea of backpropagation was published in French by Yann LeCun in 1985 (it was independently discovered by other researchers as well), followed by the time-delay neural network (TDNN) of Waibel et al. in 1989, a convolution-like network trained with backpropagation. One of the first applications was by LeCun et al. in 1989, who applied backpropagation to handwritten zip code recognition.
The formulation of LeNet-5 is a bit outdated in comparison to current practices. This is one of the first neural architectures that was developed during the nascent phase of deep learning at the end of the 20th century.
In November 1998, LeCun published one of his most recognized papers, describing a “modern” CNN architecture for document recognition called LeNet. This was not his first iteration; it was, in fact, LeNet-5, but this paper is the commonly cited publication when talking about LeNet.
It uses convolutional layers followed by pooling layers and finishes with fully connected layers. The network starts with large feature maps and progressively reduces their spatial size while increasing the number of channels. There are around 60,000 parameters in this network.
The AlexNet architecture is one of the most important architectures in deep learning, with more than 25,000 citations — this is practically unheard of in research literature. Developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto in 2012, AlexNet destroyed the competition in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
The network was trained on the ImageNet dataset, a collection of 1.2 million high-resolution (227x227x3) images spanning 1000 different classes, using data augmentation. The model was deeper than any other network at the time and was trained on GPUs for 5–6 days. The network has eight learned layers (five convolutional and three fully connected), used dropout for regularization, was trained with momentum-based stochastic gradient descent, and was one of the first networks to implement the ReLU activation function, which is still widely used today. The network had more than 60 million parameters to optimize (~255 MB).
This network almost single-handedly kickstarted the AI revolution by showing the impressive performance and potential benefits of CNNs. The network won the ImageNet contest with a top-5 error of 15.3%, more than 10.8 percentage points lower than the runner-up.
We will be discussing the remaining networks that have won the ILSVRC, since most of these are the revolutionary networks at the forefront of research in deep learning.
ZFNet was introduced by Matthew Zeiler and Rob Fergus of New York University and won ILSVRC 2013 with an 11.2% error rate. The network decreased the sizes of the filters and was trained for 12 days.
The paper presented a visualization technique named “deconvolutional network”, which helps to examine different feature activations and their relation to the input space.
VGG16 and VGG19
The VGG network was introduced by Simonyan and Zisserman (Oxford) in 2014. This network is revolutionary in its inherent simplicity and its structure. It consists of 16 or 19 layers (hence the name) with a total of 138 million parameters (522 MB) and uses 3×3 convolutional filters exclusively using same padding and a stride of 1, and 2×2 max-pooling layers with a stride of 2.
The authors showed that two stacked 3×3 filters have an effective receptive field of 5×5 and that as spatial size decreases, the depth increases. The network was trained for two to three weeks and is still used today, mainly for transfer learning. The network was originally developed for the ImageNet Challenge in 2014.
- ImageNet Challenge 2014; 16 or 19 layers
- 138 million parameters (522 MB).
- Convolutional layers use ‘same’ padding and stride s = 1.
- Max-pooling layers use a filter size f = 2 and stride s = 2.
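The claim that two stacked 3×3 filters see a 5×5 region, and why that is attractive, can be checked with a short calculation. This is a sketch with a hypothetical channel count of 64 and biases ignored; it compares two 3×3 layers against a single 5×5 layer with the same number of input and output channels.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions stacked on top of each other."""
    r = 1
    for k in kernel_sizes:
        r += k - 1
    return r


def conv_weights(k, channels):
    """Weights of a k x k convolution with `channels` inputs and outputs (no bias)."""
    return k * k * channels * channels


C = 64  # assumed channel count, chosen only for illustration

rf_two_3x3 = stacked_receptive_field([3, 3])   # 5
rf_one_5x5 = stacked_receptive_field([5])      # 5
w_two_3x3 = 2 * conv_weights(3, C)             # 73,728
w_one_5x5 = conv_weights(5, C)                 # 102,400
```

Both options cover the same 5×5 region, but the stacked version uses fewer weights and applies an extra non-linearity in between.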
The GoogLeNet network was introduced by Szegedy et al. (Google) in 2014. The network was the winner of ILSVRC 2014, beating the VGG architecture. The network introduces the concept of the inception module — parallel convolutional layers with different filter sizes.
The idea here is that we do not know a priori which filter size is best, so we let the network decide. The inception network is formed by stacking inception modules. It includes several auxiliary softmax output units to enforce regularization. This was a key idea which has been important in the development of future architectures.
Another interesting feature is that there is no fully connected layer at the end, and this is instead replaced with an average-pooling layer. The removal of this fully connected layer results in a network with 12x fewer parameters than AlexNet, making it much faster to train.
The first residual network was presented by He et al. (Microsoft) in 2015. This network won ILSVRC 2015 in multiple categories. The main idea behind this network is the residual block. The network allows for the development of extremely deep neural networks, which can contain 100 layers or more.
This is revolutionary since up to this point, the development of deep neural networks was inhibited by the vanishing gradient problem, which occurs when propagating and multiplying small gradients across a large number of layers.
The authors argue that it is easier to optimize a residual mapping than the original, unreferenced mapping. Furthermore, a residual block can decide to “shut itself down” if needed. Let’s compare the network structure for a plain network and a residual network. The plain network structure is as follows:
A residual network structure looks like this:
The equations describing this network are:
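A common way to write a two-layer residual block is shown here as a sketch. The notation is my assumption rather than something taken from the figure: a^{[l]} denotes the activations of layer l, W and b the weights and biases, and g the activation function.

```latex
\begin{aligned}
z^{[l+1]} &= W^{[l+1]} a^{[l]} + b^{[l+1]}, & a^{[l+1]} &= g\left(z^{[l+1]}\right) \\
z^{[l+2]} &= W^{[l+2]} a^{[l+1]} + b^{[l+2]}, & a^{[l+2]} &= g\left(z^{[l+2]} + a^{[l]}\right)
\end{aligned}
```

The only difference from a plain network is the extra a^{[l]} term inside the last activation. If W^{[l+2]} and b^{[l+2]} are zero and g is a ReLU, then a^{[l+2]} = g(a^{[l]}) = a^{[l]}, since a^{[l]} is already non-negative.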
With this extra connection, gradients can travel backward more easily. It becomes a flexible block that can expand the capacity of the network, or simply transform into an identity function that would not affect training.
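The identity behaviour is easy to verify numerically. Below is a minimal NumPy sketch of a residual block built from two dense layers instead of convolutions (a simplification I am assuming to keep the example short); all names and sizes are hypothetical.

```python
import numpy as np


def relu(v):
    return np.maximum(v, 0.0)


def residual_block(a, W1, b1, W2, b2):
    """Two dense layers with a skip connection added before the final ReLU."""
    hidden = relu(W1 @ a + b1)
    return relu(W2 @ hidden + b2 + a)   # "+ a" is the shortcut connection


rng = np.random.default_rng(0)
a = relu(rng.normal(size=4))            # non-negative, as if it came out of a ReLU

# With the second layer zeroed, the block passes its input straight through.
W1, b1 = rng.normal(size=(4, 4)), rng.normal(size=4)
W2, b2 = np.zeros((4, 4)), np.zeros(4)
out_identity = residual_block(a, W1, b1, W2, b2)

# With non-zero weights, the block output is the input plus a correction term.
W2 = 0.1 * rng.normal(size=(4, 4))
out_learned = residual_block(a, W1, b1, W2, b2)
```

Because the block only has to push its last layer toward zero to become an identity, adding more blocks should not make the network harder to fit than a shallower one.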
A residual network stacks residual blocks sequentially.
The idea is to allow the network to become deeper without increasing the training complexity.
Residual networks implement blocks with convolutional layers that use the ‘same’ padding option (even when max-pooling). This allows the block to learn the identity function.
The designer may want to reduce the size of the features and use ‘valid’ padding. In such a case, the shortcut path can implement a new set of convolutional layers that reduces the size appropriately.
These networks can get huge and extremely complicated, and their diagrams begin to look akin to those that describe the functioning of a power plant. Here is an example of such a network.
Comparing the error values of the previous ImageNet winners to those of the ResNet formulations, we can see a clear improvement in performance. AlexNet (2012) achieved a top-5 error of 15.3% (second place was 26.2%); ZFNet (2013), which also introduced feature visualization, achieved a top-5 error of 14.8%; GoogLeNet (2014) reached 6.7%; and ResNet (2015) achieved a top-5 error below 5% for the first time.
DenseNet was proposed by Huang et al. in 2016 as a radical extension of the ResNet philosophy. Each layer uses every previous feature map as input, effectively concatenating them. These connections mean that the network has L(L+1)/2 direct connections, where L is the number of layers in the network. One can think of the architecture as an unrolled recurrent neural network.
Each layer adds k feature-maps of its own to this state. The growth rate regulates how much new information each layer contributes to the global state. The idea here is that we have all the previous information available at each point. Counter-intuitively, this architecture reduces the total number of parameters needed.
The network works by allowing maximum information (and gradient) flow at each layer by connecting every layer directly with every other layer. In this way, DenseNets exploit the potential of the network through feature reuse, which means there is no need to learn redundant feature maps. DenseNet layers are relatively narrow (e.g. 12 filters), and they just add a small set of new feature-maps.
The DenseNet architecture typically has superior performance to the ResNet architecture and can achieve the same or better accuracy with fewer parameters overall, and the networks are easier to train.
The network formulation may be a bit confusing at first, but it is essentially a ResNet architecture in which the residual blocks are replaced by dense blocks. The dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.
It is important to note that DenseNets do not sum the output feature maps of a layer with the incoming feature maps; they concatenate them:
The spatial dimensions of the feature maps remain constant within a block, but the number of feature maps grows from layer to layer; the number of maps each layer adds is known as the growth rate, k.
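The bookkeeping for a dense block fits in a few lines. In this sketch, the block input of 16 feature maps and the block length of four layers are illustrative assumptions; the growth rate of 12 is the example width mentioned earlier, and the function names are hypothetical.

```python
def dense_block_channels(k0, growth_rate, n_layers):
    """Input channel count seen by each layer of a dense block, plus the output.

    Layer l receives the block input (k0 maps) concatenated with the
    growth_rate maps produced by each of the l preceding layers.
    """
    inputs = [k0 + growth_rate * l for l in range(n_layers)]
    output = k0 + growth_rate * n_layers
    return inputs, output


def direct_connections(n_layers):
    """Number of direct connections among L densely connected layers: L(L+1)/2."""
    return n_layers * (n_layers + 1) // 2


inputs, output = dense_block_channels(k0=16, growth_rate=12, n_layers=4)
links = direct_connections(4)
# inputs is [16, 28, 40, 52], output is 64, links is 10
```

Each layer only produces 12 new maps, yet the last layer already sees 52 input maps, which is the feature reuse described above.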
Below is the full architecture of a dense network. It is fairly involved when we look at the network in its full resolution, which is why it is typically easier to visualize in an abstracted form (like we did above).
For more information on DenseNet, I recommend this article.
Summary of Networks
As we can see, over the course of just a few years, we have gone from an error rate of around 15% on the ImageNet dataset (which, if you remember, consists of 1.2 million images) to an error rate of around 3–4%. Nowadays the most state-of-the-art networks are able to get below 3% pretty consistently.
There is still quite a long way to go before we are able to obtain perfect scores for these networks, but the rate of progress over the past decade has been staggering, and it should be apparent from this why we are currently undergoing a deep learning revolution: we have gone from a stage where humans had superior visual recognition to a stage where these networks have superior vision (a human cannot achieve a 3% error rate on the ImageNet dataset).
This has fueled the transition of machine learning algorithms into various commercial fields that require heavy use of image analysis, such as medical imaging (examining brain scans, x-rays, mammography scans) and self-driving cars (computer vision). Image analysis is easily extended to video since this is just a rapid succession of multiple image frames every second — although this requires more computing power.
Transfer learning is an important topic, and it is definitely worthy of having an article all to itself. However, for now, I will outline the basic idea behind transfer learning so that the reader is able to do more research on it if they are interested.
How do you make an image classifier that can be trained in a few hours (minutes) on a CPU?
Normally, image classification models can take hours, days, or even weeks to train, especially if they are trained on exceptionally large networks and datasets. However, we know that companies such as Google and Microsoft have dedicated teams of data scientists that have spent years developing exceptional networks for the purpose of image classification — why not just use these networks as a starting point for your own image classification projects?
This is the idea behind transfer learning: to take pre-trained models, i.e. models with known weights, and apply them to a different machine learning problem. Simply transferring the model will not be enough; you must still train the network on your new data, but it is common to freeze the weights of the earlier layers, as these capture more general features that are unlikely to change much during training. You can think of this as an intelligent way of generating a pre-initialized network, as opposed to having a randomly initialized network (the default case when training a network in Keras).
Typically, smaller learning rates are used in transfer learning than in typical network training, as we are essentially tuning the network. If large learning rates are used and the early layers in the network are not frozen, transfer learning may not provide any benefit. Often, it is only the last layer or the last couple of layers that is trained in a transfer learning problem.
Transfer learning works best for problems that are fairly general and for which pre-trained networks are freely available online (such as image analysis), and when the user has a dataset too small to train a neural network from scratch, which is a fairly common problem.
To summarize the main idea: earlier layers of a network learn low-level features, which can be adapted to new domains by changing weights at later and fully-connected layers.
An example of this would be to take any large, sophisticated network trained on ImageNet and then retrain it on a few thousand hotdog images.
The steps involved in transfer learning are as follows:
- Get existing network weights
- Unfreeze the “head” fully connected layers and train on your new images
- Unfreeze the latest convolutional layers and train at a very low learning rate, starting from the previously trained weights. This changes the weights of the last convolutional layers without triggering the large gradient updates that would have occurred had we not done the previous step.
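The schedule above can be illustrated with a toy NumPy model. To be clear about the assumptions: the "pre-trained" feature extractor here is just a random matrix standing in for downloaded weights, the dataset is synthetic, and the hand-written training loop is a sketch of the freeze-then-fine-tune idea, not how Keras or any other library implements it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: "pre-trained" backbone weights and a small new dataset.
W_feat = rng.normal(size=(20, 8)) / np.sqrt(20)   # pretend these were downloaded
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)         # toy binary labels
w_head = np.zeros(8)                              # freshly initialised head


def forward(X, W_feat, w_head):
    z = X @ W_feat
    feats = np.maximum(z, 0.0)                    # backbone output (ReLU features)
    probs = 1.0 / (1.0 + np.exp(-(feats @ w_head)))
    return z, feats, probs


def loss(probs, y):
    eps = 1e-12
    return -np.mean(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))


def train(W_feat, w_head, lr, steps, train_backbone):
    """Gradient descent on the head, and on the backbone only if it is unfrozen."""
    W_feat, w_head = W_feat.copy(), w_head.copy()
    for _ in range(steps):
        z, feats, probs = forward(X, W_feat, w_head)
        err = (probs - y) / len(y)                # gradient of the loss w.r.t. the logits
        grad_head = feats.T @ err
        if train_backbone:
            grad_feats = np.outer(err, w_head) * (z > 0)
            W_feat -= lr * (X.T @ grad_feats)
        w_head -= lr * grad_head
    return W_feat, w_head


start = loss(forward(X, W_feat, w_head)[2], y)

# Backbone frozen: only the head is trained on the new data.
W_frozen, w_head = train(W_feat, w_head, lr=0.1, steps=200, train_backbone=False)
after_head = loss(forward(X, W_frozen, w_head)[2], y)

# Backbone unfrozen: fine-tune everything at a much lower learning rate.
W_tuned, w_head = train(W_frozen, w_head, lr=0.001, steps=200, train_backbone=True)
after_finetune = loss(forward(X, W_tuned, w_head)[2], y)
```

Training the head first means that, by the time the backbone is unfrozen, the errors flowing back into it are already small, so the low learning rate only nudges the transferred weights.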
For more information, I recommend the article How HBO’s Silicon Valley built “Not Hotdog” with mobile TensorFlow, Keras & React Native.
Congratulations on making it to the end of this article! This was a long article that touched on multiple facets of deep learning. The reader should now be fairly well equipped to venture into deep convolutional learning and computer vision literature. I encourage the reader to do more individual research on the topics that I have discussed here so that they can deepen their knowledge.
I have added links to some further reading in the next section, as well as some of the references to research articles that I borrowed images from during this article.
Thanks for reading and happy deep learning!
- MobileNetV2 (https://arxiv.org/abs/1801.04381)
- Inception-Resnet, v1 and v2 (https://arxiv.org/abs/1602.07261)
- Wide-Resnet (https://arxiv.org/abs/1605.07146)
- Xception (https://arxiv.org/abs/1610.02357)
- ResNeXt (https://arxiv.org/pdf/1611.05431)
- ShuffleNet, v1 and v2 (https://arxiv.org/abs/1707.01083)
- Squeeze and Excitation Nets (https://arxiv.org/abs/1709.01507)
- Original DenseNet paper (https://arxiv.org/pdf/1608.06993v3.pdf)
- DenseNet Semantic Segmentation (https://arxiv.org/pdf/1611.09326v2.pdf)
- DenseNet for Optical flow (https://arxiv.org/pdf/1707.06316v1.pdf)
Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012
Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014.
Min Lin, Qiang Chen, and Shuicheng Yan, “Network in network,” 2013.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
Florian Schroff, Dmitry Kalenichenko, and James Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, 2015.
Long, J., Shelhamer, E., & Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. Retrieved from http://arxiv.org/abs/1411.4038v1
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Iclr, 1–14. Retrieved from http://arxiv.org/abs/1412.7062
Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. Iclr, 1–9. http://doi.org/10.16373/j.cnki.ahr.150049
Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio, 1–15. Retrieved from http://arxiv.org/abs/1609.03499
Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. van den, Graves, A., & Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. Arxiv, 1–11. Retrieved from http://arxiv.org/abs/1610.10099
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.