Over the years, variants of CNN architectures have been developed, leading to amazing advances in the field of deep learning. A good measure of this progress is the error rates in competitions such as the ILSVRC ImageNet challenge. In this competition, the top-5 error rate for image classification fell from over 26% to less than 3%. In this article, we will look at some of the popular CNN architectures that stood out in their approach and significantly improved on the error rates as compared to their predecessors. These are LeNet-5, AlexNet, VGG, and ResNet.
AlexNet won the ILSVRC challenge in 2012, VGG was the classification runner-up in 2014, and ResNet won in 2015.
We will explain LeNet-5 in detail until we are comfortable calculating the inputs/outputs of each layer, which makes it easy to understand how a CNN works just by looking at its architecture. This also helps when implementing your own CNN in a framework such as PyTorch or TensorFlow, where you specify layer dimensions yourself (higher-level frameworks will do the calculation for you).
The LeNet-5 architecture is perhaps the most widely known CNN architecture. It was created by Yann LeCun in 1998 and was widely used for handwritten digit recognition (MNIST).
Here is the LeNet-5 architecture.
We start off with a grayscale image (LeNet-5 was trained on grayscale images) with a shape of 32×32×1. The goal of LeNet-5 was to recognize handwritten digits (see the paper).
In the first step, we apply a set of six 5×5 filters with a stride of one and no padding. Because we use six filters, we end up with an output of shape 28x28x6.
The output is calculated as follows:

((n + 2p - f)/s + 1) × ((n + 2p - f)/s + 1) × Nc, where

- n: input width/height
- f: filter size
- p: padding (no padding in LeNet-5)
- s: stride
- Nc: number of channels = number of filters used to convolve our inputs

For the first layer:

((32 + 0 - 5)/1 + 1) × ((32 + 0 - 5)/1 + 1) × 6 = 28 × 28 × 6
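As a sanity check, the whole shape progression of LeNet-5 can be traced with this formula in a few lines of Python (a minimal sketch; the helper name `conv_output_size` is ours, not from the paper):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height of a convolution or pooling layer:
    (n + 2p - f) // s + 1, where n is the input size, f the filter size,
    p the padding and s the stride."""
    return (n + 2 * p - f) // s + 1

# First LeNet-5 convolution: 32x32 input, six 5x5 filters, stride 1, no padding
n = conv_output_size(32, f=5)      # -> 28, so the volume is 28x28x6
# 2x2 average pooling with stride 2
n = conv_output_size(n, f=2, s=2)  # -> 14, volume 14x14x6
# Second convolution: sixteen 5x5 filters
n = conv_output_size(n, f=5)       # -> 10, volume 10x10x16
# Final 2x2 average pooling
n = conv_output_size(n, f=2, s=2)  # -> 5, volume 5x5x16 = 400 values
print(n)  # 5
```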
Notice that the image dimensions shrink from 32×32 down to 28×28. Then LeNet-5 applies pooling. Back when this paper was published, people used average pooling much more than max pooling.
Nowadays, if we’re building a modern variant, we probably would use max pooling instead.
But in this example, let’s stick with the original paper. After a 2×2 average pooling with a stride of two, we end up with a 14x14x6 volume; the same formula (with f=2, s=2) gives the new dimensions.
Next, we apply another convolutional layer, this time with a set of 16 filters of size 5×5, so we end up with a 10x10x16 volume. One last average pooling brings the dimensions down to 5x5x16, which flattens to 400 values. Then we have two fully connected layers: the first connects each of these 400 values to every one of 120 neurons, and the second connects those 120 neurons to every one of 84 neurons. Finally, we have the output layer, where a softmax activation function is used for predictions (the original paper used an RBF output layer, but it has fallen out of use these days). The prediction ŷ takes on 10 possible values, corresponding to the digits 0 through 9.
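To make the architecture concrete, here is a minimal PyTorch sketch of LeNet-5 as described above (a sketch under a few assumptions: tanh activations as in the original paper, a plain linear output layer in place of the original RBF layer, and layer names of our choosing):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),      # 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14x6 -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),      # 10x10x16 -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),       # 400 values -> 120 neurons
            nn.Tanh(),
            nn.Linear(120, 84),               # 120 -> 84
            nn.Tanh(),
            nn.Linear(84, 10),                # 84 -> 10 digit classes
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)               # flatten 5x5x16 to 400
        return self.classifier(x)

model = LeNet5()
out = model(torch.zeros(1, 1, 32, 32))        # one 32x32 grayscale image
print(out.shape)                              # torch.Size([1, 10])
```

A softmax is typically applied to these 10 outputs at prediction time (or folded into the loss during training).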
As we can see, in this architecture the image shrinks from 32x32x1 to 5x5x16 while the number of channels increases: it goes from 1 to 6 to 16 as you go deeper into the network. Back when this paper was written in 1998, people didn’t really use padding; they always used valid convolutions (p=0). So pixels at the edges and corners were used less often than others, and a lot of useful information was thrown away. That is not a problem in this example, because the corners do not contain relevant features, but in other use cases it might be a problem to solve.
On Yann LeCun’s website (LeNet section), you can find great demos of LeNet-5 classifying digits.
The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by a large margin. It achieved a 17% top-5 error rate while the second-best achieved only 26%! It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer.
We won’t explain it in as much detail as LeNet-5, but here is what we can summarize from the figure above:
It is also relevant to know that there is a variant of AlexNet called ZF Net, which was developed by Matthew Zeiler and Rob Fergus. It won the 2013 ILSVRC challenge. It is essentially AlexNet with a few tweaked hyperparameters (number of feature maps, kernel size, stride, etc.).
As seen with AlexNet, CNNs were getting deeper and deeper. The most straightforward way to improve the performance of deep neural networks is to increase their size. The Visual Geometry Group (VGG) at Oxford developed VGG-16, which has 13 convolutional and 3 fully connected layers, keeping the ReLU activation function from AlexNet.
It also took the tradition of stacking layers from AlexNet, but uses smaller filters (3×3 convolutions and 2×2 pooling). It has 138M parameters and takes up about 500MB of storage space. The same group also designed a deeper variant, VGG-19.
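One reason the smaller filters work well: stacking two 3×3 convolutions covers the same 5×5 receptive field as a single 5×5 convolution, but with fewer weights. A quick back-of-the-envelope check in Python (an illustrative sketch with an example channel count, biases ignored):

```python
def conv_params(f, c_in, c_out):
    """Number of weights in a conv layer with f x f filters (biases ignored)."""
    return f * f * c_in * c_out

channels = 64  # example channel count, kept the same in and out

one_5x5 = conv_params(5, channels, channels)      # 5*5*64*64 = 102400 weights
two_3x3 = 2 * conv_params(3, channels, channels)  # 2*3*3*64*64 = 73728 weights

print(one_5x5, two_3x3)  # the stacked 3x3 layers use ~28% fewer weights
```

The stacked version also inserts an extra non-linearity between the two 3×3 layers, which helps the network learn more complex functions.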
Last but not least, the winner of the ILSVRC 2015 challenge was the residual network (ResNet), developed by Kaiming He et al., which delivered an astounding top-5 error rate under 3.6% using an extremely deep CNN composed of 152 layers. The key to training such a deep network is skip connections: the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack. Let’s first look at the ResNet architecture and discuss why it is useful.
ResNets are built out of units called residual blocks. Let’s see exactly what happens inside a residual block and why it is useful.
When we initialize a regular neural network, its weights are close to zero, so the network just outputs values close to zero. If we add a skip connection, the resulting network outputs a copy of its inputs: it initially models the identity function, which speeds up training considerably. The identity function is easy for the residual block to learn; it’s easy to get a[l+2] equal to a[l] because of the skip connection (l – layer, a – activation).
This means that adding these two layers doesn’t hurt the network’s ability to do as well as a simpler network without them, because it’s quite easy to learn the identity function by simply copying a[l] to a[l+2] despite the two extra layers. So adding a residual block in the middle or at the end of a big neural network doesn’t hurt performance. On the contrary, it improves it: if those hidden units actually learn something useful, the network can do even better than just learning the identity function.
Also, very deep neural networks are difficult to train because of vanishing and exploding gradients. But ResNet’s skip connections allow you to take the activation from one layer and feed it directly to another layer much deeper in the network. And during backpropagation, the skip connection’s path passes the gradient update along as well. Conceptually, this update acts similarly to the purpose of synthetic gradients.
Instead of waiting for the gradient to propagate back one layer at a time, the skip connection’s path allows the gradient to reach the early layers with greater magnitude by skipping some layers in between.
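A residual block of this kind can be sketched in PyTorch as follows (a simplified version with equal input/output channels; real ResNet blocks also use batch normalization and, when dimensions change, a 1×1 convolution on the skip path):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """a[l+2] = ReLU(z[l+2] + a[l]): the input skips over two conv layers
    and is added back in before the final activation."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # skip connection adds a[l] to z[l+2]

# With all conv weights at zero, the block reduces to ReLU(x):
# it computes the identity function on non-negative inputs.
block = ResidualBlock(6)
for p in block.parameters():
    nn.init.zeros_(p)
x = torch.rand(1, 6, 14, 14)        # non-negative input
assert torch.allclose(block(x), x)  # the block passes x through unchanged
```

The zero-weight check above illustrates the point made earlier: near initialization, a residual block behaves like the identity, so stacking many of them does not degrade the signal.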
This article was originally published on Medium and re-published to TOPBOTS with permission from the author.