In the last post I introduced the simple model of a neuron and used it in two examples: comparing two numbers and identifying digits drawn on a 7×7 grid of squares. In this post, I will go a step further and build a network that can see a photo and correctly identify what is in the photo.
Something that became clear when building the neural network for the 7×7 grid digits is how quickly the number of neurons and calculations grows as the input becomes more complex. Working with photos or images demonstrates this even further. For instance, a small photo of size 300×300 is made up of 300×300 = 90,000 little dots, called pixels, arranged in rows of 300. In addition, each pixel is represented by 3 numbers, one each for red, green and blue. Each of these numbers can take any value between 0 and 255 and indicates the intensity of that color in that one pixel of the image. This means our 300×300 image is represented by 90,000 × 3 = 270,000 numbers. That’s how many numbers the input of our neural network will need.
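To make the counting concrete, here is a quick sketch using NumPy, with a blank hypothetical image standing in for a real photo:

```python
import numpy as np

# A hypothetical 300x300 RGB image: one 0-255 value per color channel per pixel.
image = np.zeros((300, 300, 3), dtype=np.uint8)

# Total numbers needed to represent the image.
total_values = image.size
print(total_values)  # 300 * 300 * 3 = 270,000
```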
With 270,000 numbers in the input, each neuron in the first layer alone would have to multiply 270,000 pairs of numbers and add the results together. On top of that, the training process requires repeating this calculation hundreds, if not thousands, of times, and all of that for just a small image.
There is also certain information we know about images. For example, pixels that are next to each other may be related. If the image is a photo of someone’s face, then neighboring pixels might have the same or a similar color because parts of the face close to each other have the same or similar color. The traditional neural network does not take pixels’ proximity to each other into account, and therefore might be leaving valuable information on the table.
In order to take both of these factors into account we are going to have to rethink our neural network. Fortunately, a technique has already been developed to take all of this into consideration, and it’s called the Convolutional Neural Network (Zhang, Wei – 1991).
Convolutional Neural Network Architecture
Convolutional neural networks (CNNs) use most of the same concepts from neural networks with some modifications. CNNs have layers that perform specific tasks and we will go through them one by one.
First, we have the input layer. Unlike last time, where we took the rows of the 7×7 grid and joined them into one input column, this time the input will be a grid of pixels, just like the actual image. After the input, there are three layers in the following order: convolutional layer(s), pooling layer(s) and a fully connected layer.
Instead of connecting each neuron to all the input values, we will connect a neuron only to a small piece of the image inside a square of, say, 3×3 pixels, as shown in fig 3.1 and fig 3.2. The first neuron is connected to the first set of 3×3 pixels, shown with the red square. When we move to the next neuron on the right, the red square also moves one pixel to the right to show the pixels connected to that neuron. (Note: The red square does not have to move by one pixel only. We can choose to move it by two pixels to go to the next neuron, or two pixels down to go to the next row. This step size is called the stride, and it determines the total number of neurons we need: with a large stride we reach the edge of the image in fewer steps than with a small stride.)
That’s a total of nine pixels, and since each pixel is represented by three numbers for its color, the total number of inputs for a single neuron is 9 × 3 = 27 numbers. For 27 input numbers the neuron will need 27 parameters and one bias. Unlike a regular neural network, where each neuron has its own parameters, in a convolutional network all neurons in the same convolutional layer share the same parameters and bias. This shared set of parameters is called a filter, and sharing a filter significantly reduces the number of parameters we have to find in order to train the neural network.
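As a rough sketch of how a convolutional layer with a shared filter works, here is a plain NumPy version. For clarity it uses a single-channel (grayscale) toy image rather than the three color channels, and the filter values are arbitrary:

```python
import numpy as np

def convolve(image, kernel, bias=0.0, stride=1):
    """Slide a shared filter over the image; every output value
    uses the same kernel and bias (parameter sharing)."""
    k = kernel.shape[0]
    h = (image.shape[0] - k) // stride + 1
    w = (image.shape[1] - k) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel) + bias  # 9 multiplications + bias
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 single-channel image
kernel = np.ones((3, 3)) / 9.0                    # one shared 3x3 filter (an average)
result = convolve(image, kernel)
print(result.shape)  # stride 1 on a 5x5 image gives a 3x3 grid of results
```

Notice that with a stride of 2 the same 5×5 image would produce only a 2×2 grid of results, which is the stride/step-count trade-off described above.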
We can choose to add another convolutional layer connected to the input image in the same way. The only difference is that the second layer will have its own set of parameters (its own filter), different from the first; everything else will be the same. Also, the last post introduced activation. The neurons in a convolutional layer also have an activation, but sometimes it is separated into its own layer. In this post we will assume that activation is included in the neurons.
After running the activation on the neurons of a convolutional layer we can end up with a set of results that is as big as the input image. For a 300×300 pixel image that’s a total of 90,000 results. To make sense of this information and extract useful insights we need to pare that number down. For this we use a pooling layer, which extracts a summary of the results of a convolutional layer. As Figure 3.3 illustrates, one way of doing this is to keep the highest result in every 2×2 group of neurons.
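A minimal sketch of that pooling step, assuming max pooling over non-overlapping 2×2 blocks:

```python
import numpy as np

def max_pool(values, size=2):
    """Keep the largest result in each non-overlapping size x size block."""
    h, w = values.shape[0] // size, values.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = values[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

activations = np.array([[1., 3., 2., 0.],
                        [4., 2., 1., 5.],
                        [6., 0., 7., 1.],
                        [2., 8., 3., 4.]])
pooled = max_pool(activations)
print(pooled)  # [[4. 5.]
               #  [8. 7.]]
```

Each 2×2 pooling step cuts the number of results by a factor of four, which is exactly the paring-down we need.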
After adding a pooling layer to each convolutional layer we get a smaller, more manageable set of results. The next step is to add a fully connected layer of neurons. This is a regular neural network like the ones from the previous post. For example, say that after pooling we end up with 2 pooling layers with 25 values each; that means we have 50 input values going into the fully connected neural network. The last layer of the network has as many neurons as there are classes of objects that we want the convolutional neural network to identify in images. For example, if we want the network to report whether the pet in a photo is one of ten different kinds of pets (e.g. cat, dog, turtle, spider, etc.), then the last layer will have ten neurons. Which pet each neuron stands for is decided during training, specifically when we measure correctness: if the first neuron is marked correct when the photo is a dog, then the network will learn to use the first neuron when the image is a dog.
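To illustrate the shape of this final step, here is a sketch of a fully connected output layer with made-up random weights and a hypothetical list of ten pet classes (only cat, dog, turtle and spider come from the example above; the rest are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# The 50 pooled values, flattened into one input column as described above.
pooled_values = rng.random(50)

# A fully connected output layer: ten neurons, one per pet class.
weights = rng.standard_normal((10, 50))  # each neuron has its own 50 parameters...
biases = rng.standard_normal(10)         # ...and its own bias
scores = weights @ pooled_values + biases  # one score per neuron

# Hypothetical class list; training decides which neuron stands for which pet.
class_names = ["cat", "dog", "turtle", "spider", "rabbit",
               "hamster", "parrot", "goldfish", "snake", "lizard"]
prediction = class_names[int(np.argmax(scores))]  # highest-scoring neuron wins
print(len(scores))  # 10 scores, one per class
```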
The training process for a convolutional network is virtually the same as for a regular neural network. The goal is to find the best set of parameters in all layers, both the convolutional layers and the fully connected layers. We measure the correctness of the network on some examples, find the gradient for each parameter, increase or decrease the parameter depending on its gradient, and repeat until the network is as correct as possible.
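As a toy illustration of that loop, here is gradient descent on a single parameter with a hand-derived gradient; a real network repeats the same idea for every parameter in every layer:

```python
# One parameter w, one training example, squared-error loss (w*x - target)^2.
w = 0.0                     # the parameter we are training
x, target = 2.0, 6.0        # input and the correct answer for it
learning_rate = 0.1

for _ in range(100):
    prediction = w * x                        # measure the network's answer
    gradient = 2 * (prediction - target) * x  # how the error changes with w
    w -= learning_rate * gradient             # nudge w against the gradient
print(round(w, 3))  # w approaches 6 / 2 = 3.0, the value that makes w*x == target
```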
The neural network that I described in the previous post was a general neural network: any data that can be represented as a series of numbers can be made to work with it. Convolutional neural networks take the concepts of neural networks and add changes that exploit what we already know about the input. This is a recurring theme in Deep Learning: concepts build on each other. In the next post I will look at specific cases that take the idea of a convolutional neural network and introduce some modifications to increase performance.