In the last post I introduced convolutional neural networks as a specialized technique for deep learning with images and described how to build one. In this post I will look at record-breaking convolutional neural networks such as VGG and ResNet, focusing on the metrics we use to compare networks and the architectural differences between them.
Considerations to Make When Building a Convolutional Neural Network
There are several choices one makes when building a neural network. In the case of a convolutional neural network these include the filter size for each convolutional layer, the number of filters, the number of convolutional blocks, whether or not to use a pooling layer, the pooling method, the stride, the size of the fully connected group of layers, and so on. All of these choices can significantly change how well the network performs, but unfortunately there is no clear way to find the best choice for each situation. Instead, researchers test different choices based on their intuition, and every so often someone finds a set of choices that results in better, sometimes record-breaking, performance. In fact, there are competitions for the best convolutional neural network. The current record holder is a network called Squeeze-and-Excitation Networks by Jie Hu, Li Shen and Gang Sun (2017).
The performance of all competing networks is measured on specific data sets. For instance, a collection of over 1 million images called ImageNet is used for the ImageNet Large Scale Visual Recognition Challenge, and it is publicly available for anyone to test on. One of the most important measurements for evaluating networks in a competition is the number of images the network identifies correctly among a set of test images that have never been seen by the public or by the creator of the network. It is easy to see why this measurement matters: the purpose of these networks is to look at an image and say what is in it, given a choice of, say, 1000 object categories. Other measurements look at the number of parameters in the network, the amount of computation needed to train the network or to run it, the number of convolutional layers and the training time.
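The accuracy measurement described above is usually called top-1 accuracy: the class the network is most confident about must match the true label. A minimal sketch with numpy, using made-up scores for 4 images and 5 classes (real challenges use 1000):

```python
import numpy as np

# Hypothetical scores: one row per test image, one column per candidate class.
scores = np.array([
    [0.1, 0.6, 0.1, 0.1, 0.1],    # most confident about class 1
    [0.7, 0.1, 0.1, 0.05, 0.05],  # most confident about class 0
    [0.2, 0.2, 0.4, 0.1, 0.1],    # most confident about class 2
    [0.3, 0.1, 0.1, 0.4, 0.1],    # most confident about class 3
])
labels = np.array([1, 0, 2, 0])   # true classes; the last image is misclassified

predictions = scores.argmax(axis=1)             # pick the highest-scoring class per image
top1_accuracy = (predictions == labels).mean()  # fraction identified correctly
print(top1_accuracy)  # 0.75: 3 of the 4 images are correct
```

Competitions often also report top-5 accuracy, where a prediction counts as correct if the true label appears among the network's five highest-scoring classes.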
The network that first popularized convolutional neural networks is LeNet (Yann LeCun et al., 1998). Its goal was to look at images of zip codes and say what the 5 numbers were. The input images were black and white (only one number to represent intensity as opposed to 3 for color), 32×32 pixels, which, considering the speed of computers at the time, was quite large. The network had a group of 6 convolutional layers connected directly to the image, each with 28×28 neurons and a 5×5 filter. This was followed by a group of 6 pooling layers, one for each convolutional layer, which also performed the sigmoid activation. From there, the network had another set of convolutional layers (16 layers and 16 filters) and pooling layers (16 layers with sigmoid activation and a size of 5×5 each). This second set of convolutional layers was connected to the previous pooling layers. Lastly there was a set of fully connected layers.
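One way to appreciate how small LeNet is by modern standards is to count its convolutional parameters. A sketch in plain Python, assuming every output map connects to every input channel (the original paper actually used a sparser connection scheme for the second group, so its real count is a bit lower):

```python
def conv_params(in_channels, out_maps, k):
    # Each output map needs one k×k filter per input channel, plus one bias.
    return (k * k * in_channels + 1) * out_maps

# First convolutional group: 1 grayscale input channel -> 6 maps, 5×5 filters.
c1 = conv_params(1, 6, 5)    # (5*5*1 + 1) * 6  = 156 parameters
# Second convolutional group: 6 channels -> 16 maps, 5×5 filters.
c3 = conv_params(6, 16, 5)   # (5*5*6 + 1) * 16 = 2416 parameters
print(c1, c3)
```

A few thousand parameters per layer was a sensible budget for late-1990s hardware; the record-breaking networks discussed below have millions.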
Networks subsequent to LeNet, like AlexNet (Alex Krizhevsky, Ilya Sutskever and Geoff Hinton, 2012), which won the ImageNet Large Scale Visual Recognition Challenge in 2012, used more convolutional and pooling layers with more neurons. In 2014, GoogLeNet (Szegedy et al., 2014) introduced the idea of inception modules. Instead of committing to a single filter size, an inception module applies several filter sizes in parallel to the same input and concatenates their outputs, letting the network learn which size works best for each feature.
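The key bookkeeping in an inception module is that every branch preserves the spatial size, so the branch outputs can be stacked along the channel axis. A shape-only sketch with numpy, using made-up branch widths similar in spirit to GoogLeNet's (the `fake_branch` helper is a stand-in for a real convolution):

```python
import numpy as np

def fake_branch(x, out_channels):
    # Stand-in for a convolution branch: same height and width, new channel count.
    _, h, w = x.shape
    return np.zeros((out_channels, h, w))

x = np.zeros((192, 28, 28))           # input feature maps: (channels, height, width)
branches = [
    fake_branch(x, 64),               # 1×1 convolution branch
    fake_branch(x, 128),              # 3×3 convolution branch
    fake_branch(x, 32),               # 5×5 convolution branch
    fake_branch(x, 32),               # pooling branch
]
y = np.concatenate(branches, axis=0)  # stack outputs along the channel axis
print(y.shape)  # (256, 28, 28): 64 + 128 + 32 + 32 channels
```

Because the branches only grow the channel dimension, inception modules can be chained one after another just like ordinary convolutional layers.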
VGGNet, a very standard convolutional network with groups of convolutional layers using small 3×3 filters, separated by 2×2 pooling layers and followed by one fully connected group at the end, came second in the ImageNet Large Scale Visual Recognition Challenge 2014. Because the filters were small, the network was able to add layers (up to 19 weight layers in its deepest configuration) to increase performance without making significant changes to the standard structure of a convolutional neural network.
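The appeal of small filters can be made concrete: stacking 3×3 convolutions covers the same input patch as one large filter while using fewer weights. A sketch in plain Python, assuming a hypothetical layer with 64 input and 64 output channels:

```python
def stacked_receptive_field(k, n):
    # n stacked k×k convolutions (stride 1) see a ((k-1)*n + 1)² patch of the input.
    return (k - 1) * n + 1

def conv_weights(in_c, out_c, k):
    return k * k * in_c * out_c  # ignoring biases

c = 64  # assumed channel count, the same in and out
rf = stacked_receptive_field(3, 3)       # 7: three 3×3 layers cover a 7×7 patch
small = 3 * conv_weights(c, c, 3)        # 110592 weights for three 3×3 layers
large = conv_weights(c, c, 7)            # 200704 weights for one 7×7 layer
print(rf, small, large)
```

The stacked version also inserts a nonlinearity between each 3×3 layer, which is part of why depth helped VGG.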
Intuitively, adding more layers to a network should make it better, and VGG sees this improvement up to a point. Beyond that point, adding layers stops helping because the gradient becomes too small (the vanishing gradient problem) for training to continue. ResNet (Kaiming He et al., 2016) addresses this problem using skip connections. Skip connections are added to an otherwise standard convolutional network like VGG; they let the output of one convolutional group travel two groups down the chain, where it is added to that group's output. This allows the network to learn additional, residual information that it would not learn when information simply flows from one convolutional group to the next.
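A minimal sketch of the idea with numpy, using small fully connected layers as stand-ins for the convolutional groups (the structure, compute-then-add-the-input, is the same):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 64)) * 0.01  # toy weights standing in for the
W2 = rng.standard_normal((64, 64)) * 0.01  # two layers inside a residual block

def residual_block(x):
    out = np.maximum(x @ W1, 0)    # first layer followed by ReLU
    out = out @ W2                 # second layer
    return np.maximum(out + x, 0)  # skip connection: add the input, then ReLU

x = rng.standard_normal((1, 64))
y = residual_block(x)
print(y.shape)  # (1, 64): same shape, so blocks can be stacked freely
```

Because the block only has to learn the difference (the residual) between its input and the desired output, gradients can also flow straight through the addition, which is what lets ResNets train at depths where plain networks stall.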
The techniques behind the performance boosts in these networks show that there is no clear formula for building a perfect convolutional network. The field is still in its early stages of development, and we continue to discover new techniques that improve performance and reduce network size. In practice, when applying convolutional neural networks to a problem, it is common to start with one of the best-performing networks, like ResNet, and make changes where necessary.