In a post from two weeks ago I talked about trees and how they expose more information about a sentence. This week I trained a model to measure the similarity between two sentences. Here is a summary of the project: what worked, what did not, and the next steps.
Deep learning has spurred significant advances in Natural Language Processing. Models like Neural Machine Translation have transformed our approach to language translation with wild success. However, most natural language models use sentences as their input, and sentences are ambiguous. Two words next to each other in a sentence are not necessarily communicating connected parts of the message. For instance, in the sentence "Nancy took the glass", the words 'took' and 'the' are next to each other but they are not connected parts of the message. As a result, language models that consume raw sentences take advantage of only a subset of the information in the sentence.
In this project I use tree representations of sentences (called dependency trees) to make the relationships between words more apparent to deep learning models. I test multiple approaches, including converting the sentence into a linear representation of a tree before feeding it into an LSTM, and modifying the LSTM cell so it can be applied to tree-structured data as described in Tai et al., 2015. To test the benefit of having clear relationships between words in the sentence, I compare models using regular sentences against models using tree-represented sentences on tasks that require some semantic understanding, such as scoring the relatedness of two sentences and question answering.
Deep Learning Models for Sequences
The most common type of model for data in the form of a sequence, like a sentence, is a recurrent neural network (RNN). An RNN operates on one element of the sequence at a time, starting from the beginning of the sequence, and at each step it generates a new state. The state is a vector that encapsulates all the information the RNN has seen so far, i.e. the previous states and previous inputs. A variant called Long Short-Term Memory (LSTM) improves on this model by adding to the state a second vector, the cell state, that better persists information over long sequences. The key difference between the two states is that no activation function is applied to the cell state as it is carried forward, which is what helps it retain information.
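The update described above can be sketched in a few lines of numpy. This is a minimal, illustrative version of a single LSTM step, not the post's actual model; the function name `lstm_step` and the packed-weight layout are my own choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b pack the input, forget, output,
    and candidate transforms for a hidden size of n."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # all four transforms at once, shape (4n,)
    i = sigmoid(z[0*n:1*n])             # input gate
    f = sigmoid(z[1*n:2*n])             # forget gate
    o = sigmoid(z[2*n:3*n])             # output gate
    g = np.tanh(z[3*n:4*n])             # candidate cell update
    c = f * c_prev + i * g              # cell state: no activation on the carry path
    h = o * np.tanh(c)                  # hidden state
    return h, c

# Run a toy sequence of 5 input vectors through the cell.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h = np.zeros(n_hid)
c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, W, U, b)
```

Note that `c` is updated only by elementwise multiplication and addition, while `h` passes through a `tanh`; that untransformed path is why the cell state preserves information over long sequences.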
This is a short recap from this post. Relationships between words in a sentence are not apparent from the order of the words. When people read sentences they infer these relationships from the context as well as their understanding of the language. To show these relationships or connections one can represent the sentence as a tree. Two parse trees are commonly used for sentences in linguistics: dependency trees and constituency trees. Dependency trees, the simpler of the two, mandate that each word is linked to, or dependent on, exactly one other word, except the word at the root of the tree. In addition, several words can depend on the same word. A sentence like "The young boys are playing outside", therefore, has the following dependency tree:
The tree shows the hierarchical nature of the relationships between words in the sentence. For instance, 'the' is a child of 'boys' because it is a determiner that modifies 'boys' to tell us more about the specific boys to which the sentence refers. Dependency trees were chosen for this project over constituency trees because they are simpler and contain fewer nodes.
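A dependency tree is often stored simply as a head index per word. The sketch below encodes the example sentence this way, assuming the parse described above with 'playing' at the root; the `children` helper is my own illustration, not code from the project:

```python
# Encode a dependency tree as one head index per word (0 marks the root).
# This follows the parse of "The young boys are playing outside" described
# above, where 'playing' is the root and 'the' depends on 'boys'.
words = ["The", "young", "boys", "are", "playing", "outside"]
heads = [3, 3, 5, 5, 0, 5]  # 1-based index of each word's head

def children(i):
    """Return the words whose head is word i (1-based)."""
    return [w for w, h in zip(words, heads) if h == i]

print(children(3))  # → ['The', 'young']           (dependents of 'boys')
print(children(5))  # → ['boys', 'are', 'outside'] (dependents of 'playing')
```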
To convert the dataset to dependency trees I use SyntaxNet, also known as Parsey McParseface. SyntaxNet works by first assigning a part-of-speech (POS) tag (shown in red above) to each word. POS tags describe the role of a word, and there is a finite set of them defined by the Universal Dependencies project. Tagging proceeds from left to right using a feedforward neural network that takes the current word, the surrounding words, and their POS tags, if already assigned, as features. Based on the tags, a head is selected for each word to create the tree. There are other dependency parsers, such as the Stanford Parser, but SyntaxNet stands out for both its breadth and its accuracy.
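Parsers like SyntaxNet emit one token per line in the tab-separated CoNLL table format. As a sketch, here is how one might read the word, fine-grained POS tag, and head index back out of such output (the column positions are assumed from the CoNLL-X convention; check them against your parser's actual output):

```python
def parse_conll(text):
    """Parse CoNLL-formatted parser output into
    (word, pos_tag, head_index) triples, one per token."""
    tokens = []
    for line in text.strip().splitlines():
        cols = line.split("\t")
        # CoNLL-X columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, ...
        tokens.append((cols[1], cols[4], int(cols[6])))
    return tokens

# A hand-written example in the same shape as parser output
# for "The young boys are playing outside".
sample = """1\tThe\t_\tDET\tDT\t_\t3\tdet\t_\t_
2\tyoung\t_\tADJ\tJJ\t_\t3\tamod\t_\t_
3\tboys\t_\tNOUN\tNNS\t_\t5\tnsubj\t_\t_
4\tare\t_\tVERB\tVBP\t_\t5\taux\t_\t_
5\tplaying\t_\tVERB\tVBG\t_\t0\tROOT\t_\t_
6\toutside\t_\tADV\tRB\t_\t5\tadvmod\t_\t_"""

print(parse_conll(sample)[0])  # → ('The', 'DT', 3)
```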
In addition to creating the dependency trees, each word is represented as a set of numbers, or a vector. GloVe provides a file of pretrained vectors for 400,000 English words, in which words that are close in meaning have vectors that are close together. 400,000 words are far more than this project needs; most of them will never be used, so it is necessary to pare the file down to the minimum required to train the model.
The data used to train the network for this project comes from the SICK dataset. SICK consists of roughly 10,000 pairs of sentences, each with a score that says how similar the two sentences are. The score ranges from one to five, with one decimal place. Semantic relatedness is a measure of how closely the meanings of two sentences align. It is not something we can measure directly, but human annotators agree widely on the scores. Both sets of sentences in the dataset have to be run through SyntaxNet separately and the results converted to vectors.
The networks used in the project are based on LSTMs.
Preliminary Results and Next Steps
After training the first version of the first model on the SICK dataset for about twenty hours without fine-tuning, the mean squared error (MSE) is at 0.59. As expected, this is still higher than the benchmarks, which range from 0.45 down to 0.25. However, a better comparison can be achieved by testing the same model with regular sentences under similar training conditions, which is the next step, along with testing the TreeLSTM. A comparison with the benchmarks can then be carried out after fine-tuning the model and its hyperparameters. The ultimate goal is to apply similar concepts to question answering using the SQuAD dataset.
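For reference, the MSE metric is just the average squared gap between predicted and gold scores. A minimal sketch, using the gold and predicted scores from the first three rows of the sample table below:

```python
def mse(predicted, gold):
    """Mean squared error between predicted and gold relatedness scores."""
    assert len(predicted) == len(gold)
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(predicted)

gold = [3.4, 3.8, 4.2]
pred = [3.632302985, 3.928847911, 4.15494952]
print(round(mse(pred, gold), 4))  # → 0.0242
```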
Sample best scores from validation dataset
| Sentence A | Sentence B | Gold score | Predicted score |
| --- | --- | --- | --- |
| A person in a black jacket is doing tricks on a motorbike | A skilled person is riding a bicycle on one wheel | 3.4 | 3.632302985 |
| Four children are doing backbends in the gym | Four girls are doing backbends and playing outdoors | 3.8 | 3.928847911 |
| Five children are standing in front of a wooden hut | Five children are standing in a wooden hut | 4.2 | 4.15494952 |
| Someone is on a black and white motorcycle and is standing on the seat | A motorcycle is riding standing up on the seat of the vehicle | 3.165 | 3.140749584 |
| Two dogs are playing by a tree | Two dogs are playing by a plant | 4.6 | 4.70986345 |
| A girl in white is dancing | The blond girl is dancing in front of the sound equipment | 3.3 | 3.485741673 |
| A boy is standing outside the water | The boy is wading through the blue ocean | 3.135 | 3.205507745 |
| An old, topless woman is covered in paint | A young, topless woman is covered in paint | 3.8 | 3.972011382 |
| A man isn’t tossing a kid into the swimming pool that is near the ocean | The man is tossing a kid into the swimming pool that is near the ocean | 3.5 | 3.435902799 |
| One white dog is staring at the black street | Two dogs of different breeds are looking at each other across a street | 3.4 | 3.338895456 |
| A few men in a competition are running outside | A few men in a competition are running indoors | 3.9 | 4.128556652 |
| A few men in a competition are running outside | A few men are running competitions outside | 3.9 | 3.998302157 |
| The kid is happily sliding in the snow | A boy on a hill covered in snow is wearing a red jacket and a black hat and is sliding on his knees | 4 | 3.922705353 |
| One white dog and one black one are running side by side on the grass | A dog, which has a black coat, and a white dog are running on the grass | 4.2 | 3.952577375 |
Sample worst scores from validation dataset
| Sentence A | Sentence B | Gold score | Predicted score |
| --- | --- | --- | --- |
| Few people are eating at red tables in a restaurant without lights | A large group of Asian people is eating at a restaurant | 2.6 | 3.546279109 |
| A girl in white is dancing | A girl is wearing white clothes and is dancing | 4.9 | 3.305637399 |
| A group of children in uniforms is standing at a gate and there is no one kissing the mother | A crowd of people is near the water | 1.3 | 3.826985499 |
| A man is sitting near a bike and is writing a note | A man is standing near a bike and is writing on a piece paper | 3.6 | 4.598034076 |
| Two men are taking a break from a trip on a snowy road | Two men with cars are on the side of a snowy road | 4 | 2.972829043 |
| A biker is wearing gear which is black | A biker wearing black is breaking the gears | 3.1 | 4.497746795 |
| Not a lot of people are in an ice skating park | An ice skating rink placed outdoors is full of people | 3.5 | 4.405066942 |
| Some corndogs are being eaten by two toddlers who are in a wagon, which is really small | Two toddlers are eating corndogs in a wagon, which is really small | 4.8 | 3.889615639 |
| A man is getting on a horse on a track laid in the wild | A horse is tossing the cowboy wearing pants of blue and red color | 2.1 | 3.568775066 |
| An adult is in the amphitheater and is talking to a boy | There is no adult in the amphitheater talking to a boy | 4.3 | 2.702033484 |
| A potato is being sliced by a woman | A woman is slicing a carrot | 3 | 4.846798799 |
| A man is talking on the radio | A man is spitting | 1.7 | 3.35747112 |
| A man is playing a guitar | A man is strumming a guitar | 4.9 | 3.557543101 |
I spent a lot of the time in this project learning TensorFlow (a deep learning library) and translating the TreeLSTM to TensorFlow. The original idea was to split the trees into blocks that TensorFlow could consume, but that only works when the data fits into tensors whose shapes line up from one block to the next; trees do not work like that. Going forward, I will use another library, TensorFlow Fold, which supports data that does not fit into tables, like trees.
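To see why trees are awkward for a static-graph framework, it helps to write the computation the natural way: as recursion over the tree, where every sentence implies a differently shaped graph. Below is a plain-numpy sketch of the Child-Sum TreeLSTM cell from Tai et al., 2015; the class name, weight initialization, and node representation are my own choices, not the project's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ChildSumTreeLSTM:
    """Sketch of the Child-Sum TreeLSTM cell (Tai et al., 2015).
    Gates are computed from the word vector and the sum of the
    children's hidden states, with one forget gate per child."""

    def __init__(self, n_in, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        # One (W, U, b) triple per transform: input, forget, output, update.
        self.p = {g: (rng.normal(scale=0.1, size=(n_hid, n_in)),
                      rng.normal(scale=0.1, size=(n_hid, n_hid)),
                      np.zeros(n_hid))
                  for g in "ifou"}
        self.n_hid = n_hid

    def _gate(self, g, x, h):
        W, U, b = self.p[g]
        return W @ x + U @ h + b

    def forward(self, node):
        """node = (word_vector, [child nodes]); returns (h, c).
        The recursion mirrors the tree, so each sentence builds
        a different computation graph."""
        x, child_nodes = node
        states = [self.forward(ch) for ch in child_nodes]
        h_sum = sum((h for h, _ in states), np.zeros(self.n_hid))
        i = sigmoid(self._gate("i", x, h_sum))   # input gate
        o = sigmoid(self._gate("o", x, h_sum))   # output gate
        u = np.tanh(self._gate("u", x, h_sum))   # candidate update
        # One forget gate per child, conditioned on that child's hidden state.
        c = i * u + sum((sigmoid(self._gate("f", x, h_k)) * c_k
                         for h_k, c_k in states), np.zeros(self.n_hid))
        h = o * np.tanh(c)
        return h, c

# A tiny two-leaf tree: (word vector, children).
cell = ChildSumTreeLSTM(n_in=3, n_hid=4)
leaf = (np.ones(3), [])
root = (np.ones(3), [leaf, leaf])
h_root, c_root = cell.forward(root)
```

The recursion in `forward` is trivial in plain Python but has no fixed unrolled form, which is exactly the dynamic structure TensorFlow Fold is designed to batch.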