Deep Learning with Sequences

In the last two posts I described how convolutional neural networks work and how the design of the network takes into account the grid-like shape of the input images by using a grid of neurons. I also described how sections of the input image are mapped to each neuron in the first layer through the use of a filter, and how the stride and the filter size determine the number of neurons needed. One thing that hopefully became clear is that once a convolutional network is built and trained to take images that are, say, 224 pixels wide and 224 pixels high, no other size can be used on that network without changing and retraining it. Similarly, the neural network in the first post that read digits drawn on a 7×7 grid had the same property: once the network is built and trained to take 7×7=49 inputs, one cannot use, say, an 8×8 grid without changing and retraining the network. This becomes an issue if we want to do deep learning on inputs that vary in size.
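
To make the fixed-size problem concrete, here is a minimal sketch using NumPy. It is not the network from the earlier posts: the single layer, its random (untrained) weights, and the names dense_layer, grid_7x7 and grid_8x8 are all assumptions made up for illustration. The point is only that a layer built for 49 inputs has exactly 49 weights per neuron, so a 64-value input has nowhere to go.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single layer built for a 7x7 grid: 49 inputs, 10 neurons (one per digit).
# The weights are random stand-ins; a real network would learn them in training.
weights = rng.normal(size=(10, 49))
bias = np.zeros(10)

def dense_layer(pixels):
    """Weighted sum of the inputs for each neuron, followed by a ReLU."""
    return np.maximum(0.0, weights @ pixels + bias)

# Hypothetical inputs: random zeros and ones standing in for drawn grids.
grid_7x7 = rng.integers(0, 2, size=49).astype(float)
grid_8x8 = rng.integers(0, 2, size=64).astype(float)

scores = dense_layer(grid_7x7)  # works: 49 inputs match the 49 weights per neuron

try:
    dense_layer(grid_8x8)       # 64 inputs do not fit a (10, 49) weight matrix
    accepted_8x8 = True
except ValueError:
    accepted_8x8 = False        # NumPy rejects the mismatched shapes
```

Making the 8×8 grid work would mean building a new weight matrix with 64 columns and training it again, which is exactly the limitation described above.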

Challenges of Working with Sequences

Imagine that instead of digits drawn on a 7×7 grid of boxes, we had several letters, each drawn on its own 7×7 grid, to make a word. We can call this a sequence of letters, which means a collection of letters that appear in a particular order. We want a neural network to read this sequence of letters and report the word it sees. Both the letters and their order are important in determining the correct word (“aids” and “said” have the same letters, but they are different words because the order of the letters is different), so the neural network has to take both into account.

Fig. 1.0 A sequence of four letters, each drawn on a 7×7 grid, spelling the word “said”

Since different words can have different numbers of letters, no single fixed-size neural network would work on all word lengths. We could build many neural networks, one for each word length (e.g. a network for words with two letters, another for words with three letters, another for four letters, and so on), but that would be wasteful and inefficient. Alternatively, we could build one network that takes one letter at a time and remembers the letters it has seen as well as their order. Once it has seen all the letters, the network should produce an output that tells us which word it saw. It turns out such a network exists, and it’s called a recurrent neural network.

Recurrent Neural Networks

A recurrent neural network takes an input in the form of a sequence, one item at a time, and it remembers what it sees from one item to the next.

Fig. 2.0 A recurrent neural network with an input sequence of four letters, each drawn on a 7×7 grid.

Step 1: The neural network takes the first letter as input and produces an output as well as a value to remember. Since the results produced by a neuron are numbers, the remembered value is actually a set of numbers that represent what the network saw. The output is also part of a set of numbers that represent a word.

Fig. 2.2 Step 2: The neural network moves on to the next letter. Again, the remembered value is not actually the letters “sa”. Instead, it’s a set of numbers produced by a set of neurons, and those numbers are supposed to represent having seen “sa” if the network is perfect.

Fig. 2.3 The process continues to the last letter, at which point there should be enough outputs to sufficiently represent the word “said”.
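
The steps above can be sketched in a few lines of NumPy. This is a minimal, untrained illustration of a simple recurrent step, not the exact network in the figures: the sizes HIDDEN_SIZE and OUTPUT_SIZE, the tanh squashing function, the random weights, and the names rnn_step and read_sequence are all assumptions I picked for the example. What matters is the shape of the loop: the same weights are reused for every letter, and the remembered numbers are passed from one step to the next.

```python
import numpy as np

rng = np.random.default_rng(0)

INPUT_SIZE = 49   # one 7x7 grid, flattened into 49 zeros and ones
HIDDEN_SIZE = 4   # how many numbers the network remembers (assumed)
OUTPUT_SIZE = 3   # how many numbers make up each output (assumed)

# Random stand-ins for weights that training would normally set.
W_in = rng.normal(scale=0.1, size=(HIDDEN_SIZE, INPUT_SIZE))
W_mem = rng.normal(scale=0.1, size=(HIDDEN_SIZE, HIDDEN_SIZE))
W_out = rng.normal(scale=0.1, size=(OUTPUT_SIZE, HIDDEN_SIZE))

def rnn_step(letter, memory):
    """Read one letter; return an output and the new remembered numbers."""
    new_memory = np.tanh(W_in @ letter + W_mem @ memory)
    output = W_out @ new_memory
    return output, new_memory

def read_sequence(letters):
    """Feed the letters in order, carrying the memory from step to step."""
    memory = np.zeros(HIDDEN_SIZE)  # nothing has been seen yet
    outputs = []
    for letter in letters:
        output, memory = rnn_step(letter, memory)
        outputs.append(output)
    return outputs, memory

# Four hypothetical letters: random grids standing in for drawn letters.
word = [rng.integers(0, 2, size=INPUT_SIZE).astype(float) for _ in range(4)]
outputs, final_memory = read_sequence(word)
```

Because read_sequence simply loops over whatever it is given, the same function and the same weights handle a two-letter word or a ten-letter word without any change to the network.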

The input in Figures 2.0 to 2.3 is a series of 7×7=49 zeros and ones that represent the clear and shaded boxes in the grid, respectively, but the neural network does not remember all 49 numbers. Instead, it remembers a smaller set of numbers (e.g. 2, 7, 11, 1) that gives the network a clue that it saw the letter s. How many numbers the network uses to remember clues depends on how we build and train the network (e.g. when we wanted a network that reads digits drawn on a grid, we used ten outputs because there are ten digits, from zero to nine).

In addition to remembering what it has seen, the network also produces an output that represents a partial guess of what the word is supposed to be. This partial guess is also a set of numbers produced by some neurons. For the example given in Figures 2.0 to 2.3, the output is not very useful until the network reaches the end of the sequence. Once at the end, the network’s output should be a set of numbers that represent a word. This means that before we train our network, we should have a list of words and the output we expect for each word.
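
One simple way to set up such a list, sketched below, is to give each word its own position in the output and expect a 1 at that position and 0 everywhere else. This is only one possible encoding and an assumption on my part, as are the three-word vocabulary, the made-up final_output values, and the helper name closest_word.

```python
import numpy as np

# Hypothetical word list the network is supposed to recognize.
vocabulary = ["aids", "said", "dais"]

# Expected output for each word: a 1 in that word's position, 0 elsewhere.
targets = {word: np.eye(len(vocabulary))[i] for i, word in enumerate(vocabulary)}

def closest_word(output):
    """Pick the word whose position holds the largest output value."""
    return vocabulary[int(np.argmax(output))]

# Made-up numbers standing in for the network's output after the last letter.
final_output = np.array([0.1, 0.8, 0.1])
guess = closest_word(final_output)
```

During training, the network's output at the end of the sequence would be compared against the expected row for the correct word, and the weights adjusted to bring the two closer.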

There are many types of sequences, as well as different kinds of recurrent neural networks suited to them or to specific goals. One of the most common kinds of sequence is the sentence. Sentences have variable lengths, and the order of the words in a sentence is an important part of its meaning. In the next post, I will take an in-depth look at Long Short-Term Memory (LSTM), a recurrent neural network often used with sentences.