In the last post, I talked about using trees as a way to make the semantics or meaning of a sentence clearer to a neural network. I also gave an example of a test that can assess this hypothesis: question answering using the SQuAD data set. In this post I look at how the test works and how one of the best performing neural networks, QANet, approaches the challenge.
SQuAD stands for Stanford Question Answering Dataset. It is a collection of passages from Wikipedia, questions on those passages and the correct answers to those questions. The answers are all snippets taken from the passage, which makes the task simpler. A neural network answering a question only has to decide where the answer is located rather than construct English sentences. Here is an example of a passage and questions on that passage:
“When finally Edward the Confessor returned from his father’s refuge in 1041, at the invitation of his half-brother Harthacnut, he brought with him a Norman-educated mind. He also brought many Norman counsellors and fighters, some of whom established an English cavalry force. This concept never really took root, but it is a typical example of the attitudes of Edward. He appointed Robert of Jumièges archbishop of Canterbury and made Ralph the Timid earl of Hereford. He invited his brother-in-law Eustace II, Count of Boulogne to his court in 1051, an event which resulted in the greatest of early conflicts between Saxon and Norman and ultimately resulted in the exile of Earl Godwin of Wessex.”
Question: Who was Edward the Confessor’s half-brother?
Answer: Harthacnut
Question: When did Edward return?
Answer: in 1041 or just 1041
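Because every answer is a snippet of the passage, an answer can be stored as a position in the passage rather than as free text. Here is a minimal sketch of that idea. The record layout and the helper name make_example are my own illustration, not the exact SQuAD file format, and the passage is shortened.

```python
# Hypothetical, simplified record: the answer is stored as its text plus the
# character offsets where it starts and ends in the passage.
passage = (
    "When finally Edward the Confessor returned from his father's refuge "
    "in 1041, at the invitation of his half-brother Harthacnut, he brought "
    "with him a Norman-educated mind."
)

def make_example(passage, question, answer_text):
    """Build a record whose answer is a span (start, end) inside the passage."""
    start = passage.find(answer_text)
    if start == -1:
        raise ValueError("the answer must be a snippet of the passage")
    return {
        "question": question,
        "answer_text": answer_text,
        "answer_start": start,
        "answer_end": start + len(answer_text),  # exclusive end offset
    }

example = make_example(passage, "When did Edward return?", "1041")

# Slicing the passage with the stored offsets recovers the answer.
recovered = passage[example["answer_start"]:example["answer_end"]]
print(recovered)
```

The network's job is then to predict those two positions for a passage and question it has not seen before.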
The neural network used for this challenge has to learn where to extract the answer for a question and passage that it has never seen before. It has to learn, for instance, that for a question that starts with “When” the answer is a number representing a year or time. This is what QANet does, albeit in very innovative ways.
QANet uses convolutions and self-attention, together with skip connections like those in ResNet, whereas most proposed solutions use RNNs. As a result, QANet manages to get 84.4% of the questions right. Even though this is only 0.4 percentage points better than the second runner-up, the network also trains about four to nine times faster.
QANet is split into the following five steps:
Input Embedding layer – Converting input into numbers.
The input to the challenge is a sequence of words, i.e. the words that make up the question and the words in the passage. Since a neural network can only do calculations with numbers, the words have to be converted into numbers. Similar or associated words must have numbers that are closer together so the neural network knows that there is a relationship between them. This is called word embedding. To show the relationships between words more clearly, each word is represented by a series of 300 numbers. Also, only the commonly used words have an embedding (number representation); otherwise we would need more than 300 numbers. This leaves some words without an embedding, so to fix that, these words are given an embedding reserved for unknown words. In addition, 200 more numbers are added to each embedding based on the characters in the word. This way, even the unknown words have a unique embedding.
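The embedding step can be sketched in a few lines of Python. This is a toy version under stated assumptions: the vectors are random rather than pretrained, the vocabulary is a handful of words I picked, and the character part is built by taking the element-wise maximum over one vector per character, which is one simple way to turn characters into a fixed number of values. All names here are hypothetical.

```python
import random

WORD_DIM = 300  # numbers per word, as described in the text
CHAR_DIM = 200  # extra numbers built from the word's characters

rng = random.Random(0)

def random_vector(dim):
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Hypothetical tiny vocabulary; "<unk>" is the embedding reserved for unknown words.
word_vectors = {
    w: random_vector(WORD_DIM)
    for w in ["<unk>", "when", "did", "edward", "return"]
}
char_vectors = {}  # one vector per character, created on first use

def char_part(word):
    """Element-wise maximum over the vectors of the word's characters."""
    vecs = []
    for ch in word:
        if ch not in char_vectors:
            char_vectors[ch] = random_vector(CHAR_DIM)
        vecs.append(char_vectors[ch])
    return [max(column) for column in zip(*vecs)]

def embed(word):
    word = word.lower()
    word_part = word_vectors.get(word, word_vectors["<unk>"])
    return word_part + char_part(word)  # list concatenation: 300 + 200 numbers

known = embed("Edward")
unknown_a = embed("Harthacnut")
unknown_b = embed("Jumièges")

print(len(known))                                    # total numbers per word
print(unknown_a[:WORD_DIM] == unknown_b[:WORD_DIM])  # both share the unknown-word part
print(unknown_a[WORD_DIM:] == unknown_b[WORD_DIM:])  # character parts compared
```

In this sketch the two unknown words share the same first 300 numbers, and it is the character-based part that tells them apart.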
Embedding Encoder – Stacked Neural Network layers
The second step is made up of multiple convolution layers, which work just like those in convolutional neural networks except that the input is one continuous row. The filter (kernel) is a single row of parameters. In addition, this part has fully connected layers and skip connections (like ResNet).
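To make the "single row" idea concrete, here is a minimal sketch of a one-dimensional convolution wrapped in a skip connection. It is a simplification: each position holds one number instead of a full embedding, the kernel values are made up, and the function names are my own.

```python
def conv1d_same(sequence, kernel):
    """Slide a single-row kernel over a row of numbers, zero-padding the ends
    so the output has the same length as the input (odd kernel length assumed)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(sequence) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(sequence))
    ]

def relu(values):
    return [max(0.0, v) for v in values]

def residual_conv_block(sequence, kernel):
    """Skip connection: the block's input is added back to its output."""
    transformed = relu(conv1d_same(sequence, kernel))
    return [x + t for x, t in zip(sequence, transformed)]

row = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.0, -0.5]  # made-up parameters; a real network learns these

print(conv1d_same(row, kernel))          # [-1.0, -1.0, -1.0, 1.5]
print(residual_conv_block(row, kernel))  # [1.0, 2.0, 3.0, 5.5]
```

The skip connection means the layer only has to learn a correction to its input, which is the same trick ResNet uses to make deep stacks of layers trainable.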
Combine the Passage with Question
The first and second steps are applied to the passage and the question separately. This step combines the two by multiplying the pairs of corresponding outputs together and feeding everything (the two outputs and the result of the multiplication) into a layer of neurons.
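The combination described above can be sketched as a score for every pair of passage word and question word. This is a rough illustration, assuming a single linear neuron and made-up weights and vectors; the name similarity is hypothetical.

```python
def similarity(c, q, weights):
    """Score one passage vector c against one question vector q by feeding
    [c, q, c * q] into a single linear neuron."""
    features = list(c) + list(q) + [ci * qi for ci, qi in zip(c, q)]
    return sum(w * f for w, f in zip(weights, features))

# Toy 2-number outputs for three passage words and two question words.
passage_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question_vecs = [[1.0, 0.0], [0.0, 2.0]]

# Made-up weights: ignore c and q themselves and only count their products.
weights = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]

scores = [
    [similarity(c, q, weights) for q in question_vecs]
    for c in passage_vecs
]
print(scores)  # [[1.0, 0.0], [0.0, 2.0], [1.0, 2.0]]
```

Each row of scores says how strongly one passage word relates to each question word, which the later layers can use to focus on the relevant parts of the passage.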
More Stacked Neural Network layers
After combining the passage and the question, more convolutions are applied. This step makes use of convolution layers, fully connected layers and skip connections.
In the final step, the neural network splits into two streams that, together, predict the location of the answer in the passage. One stream predicts the start location of the answer and the other predicts the end location. Both streams are made up of fully connected layers.
QANet learns the same way as the neural networks in the previous posts. The parameters are nudged slowly in the right direction as the examples of passages, questions and answers are tested on it. In essence, what QANet learns are the relationships between words (and word order) in the passage and corresponding questions, as well as where the answer appears in the passage with respect to those relationships.
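The "nudging" can be illustrated with a toy example. This sketch assumes the error is measured as the negative log probability of the true start and end positions, and for simplicity it nudges the scores directly; in a real network the same gradient flows back into the parameters that produced those scores. The function names and the learning rate are my own choices.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def span_loss(start_scores, end_scores, true_start, true_end):
    """Low when the true start and end positions get high probability."""
    p_start = softmax(start_scores)[true_start]
    p_end = softmax(end_scores)[true_end]
    return -math.log(p_start) - math.log(p_end)

def nudge(scores, true_index, learning_rate=0.1):
    """One small gradient step: raise the true position's score and
    lower the others in proportion to their current probability."""
    probs = softmax(scores)
    return [
        s - learning_rate * (p - (1.0 if i == true_index else 0.0))
        for i, (s, p) in enumerate(zip(scores, probs))
    ]

start_scores = [0.0, 0.0, 0.0, 0.0]
end_scores = [0.0, 0.0, 0.0, 0.0]
true_start, true_end = 2, 3

loss_before = span_loss(start_scores, end_scores, true_start, true_end)
start_scores = nudge(start_scores, true_start)
end_scores = nudge(end_scores, true_end)
loss_after = span_loss(start_scores, end_scores, true_start, true_end)

print(loss_before > loss_after)  # the nudge reduced the error
```

Repeating this small step over many examples is what gradually moves the network toward placing the answer correctly.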
It is remarkable that such a process can be used to answer questions about new passages with an accuracy of roughly 84%. Also, once the network has been trained, answering a question takes a few seconds, which is faster than a person can read the passage and answer the questions. This is very useful for applications that require quick searches for answers to simple questions. Here, the benefits far outweigh the costs.
However, there is only so far we can go with a solution that answers questions without understanding the meaning of the sentences in the passage and question. Also, such a solution cannot adapt to changes. For example, if the format changes slightly, such that answering the question correctly requires two or more answers from different parts of the passage, this neural network won’t be able to answer it correctly. In fact, answering such a question with a neural network would require modifying the network, creating a new dataset with two or more answers per question and retraining the network. There is very little we can do to reuse what we already have. From this example alone it looks like we have a long way to go in this area, but in the meantime we can benefit from the strengths of neural networks.