Deep Learning with Trees Instead of Sentences – Project

In a post from two weeks ago I talked about trees and how they expose more information about a sentence. This week I trained a model to measure the similarity between two sentences. Here is a summary of the project: what worked, what did not, and the next steps.

Deep learning has spurred significant advances in Natural Language Processing. Models like Neural Machine Translation have transformed language translation with wild success. However, most natural language models take sentences as their input, and sentences are ambiguous. Two words next to each other in a sentence are not necessarily communicating connected parts of the message. For instance, in the sentence “Nancy took the glass”, the words ‘took’ and ‘the’ are adjacent but they are not connected parts of the message. As a result, language models that operate on raw sentences take advantage of only a subset of the information in the sentence.

In this project I use tree representations of sentences (called dependency trees) as a way to make the relationships between words more apparent to deep learning models. I test multiple approaches, including converting the sentence into a linear representation of a tree before feeding it into an LSTM, and modifying the LSTM cell to operate on tree-structured data as described in Tai et al., 2015. To measure the added benefit of having explicit relationships between words, I compare the performance of models using regular sentences against models using tree-represented sentences on tasks that require some semantic understanding, such as judging the relatedness of two sentences and question answering.


Deep Learning Models for Sequences

The most common type of model for data in the form of a sequence, like a sentence, is a recurrent neural network (RNN). An RNN operates on one element of the sequence at a time, starting from the beginning of the sequence, and each time it generates a new state. The state is a vector that encapsulates all the information the RNN has seen, i.e. previous states and previous inputs. A variant of the RNN called Long Short-Term Memory (LSTM) improves on this model by adding a second state vector, the cell state, that better persists information over long sequences. The difference between the two states is that no activation function is applied to the cell state as it is carried forward, which helps it retain information over long distances.
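The LSTM update can be sketched in a few lines of NumPy. This is a minimal illustration of the standard LSTM equations, not the project’s actual code; the weight shapes and names here are made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: x is the current word vector, h_prev/c_prev are the
    previous hidden and cell states. W maps [x; h_prev] to the four gates."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    c = f * c_prev + i * np.tanh(g)  # cell state: no activation on the carry itself
    h = o * np.tanh(c)               # hidden state: squashed before output
    return h, c

# Run over a toy "sentence" of random word vectors.
dim, hidden = 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * hidden, dim + hidden)) * 0.1
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for word_vec in rng.standard_normal((5, dim)):  # five "words"
    h, c = lstm_step(word_vec, h, c, W, b)
print(h.shape, c.shape)  # (16,) (16,)
```

The final `h` (or `c`) is what a downstream layer would consume as the sentence representation.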

Fig. 1.0 LSTM unit with an input sequence of a phrase that has been converted into a sequence of vectors.


Dependency Trees  

This is a short recap from this post. Relationships between words in a sentence are not apparent from the order of the words. When people read sentences they infer these relationships from the context as well as their understanding of the language. To make these relationships explicit, one can represent the sentence as a tree. Two parse trees are commonly used for sentences in linguistics: dependency trees and constituency trees. Dependency trees, the simpler of the two, mandate that each word is linked to (dependent on) exactly one other word, except the word at the root of the tree. In addition, several words can depend on the same word. A sentence like “The young boys are playing outside”, therefore, has the following dependency tree:

Fig. 2.0 Dependency tree for the sentence “The young boys are playing outside.”

The tree shows the hierarchical nature of the relationships between words in the sentence. For instance, ‘the’ is a child of ‘boys’ because it is a determiner that modifies ‘boys’, telling us more about the specific boys to which the sentence refers. Dependency trees were chosen for this project over constituency trees because they are simpler and contain fewer nodes.
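A dependency tree can be stored compactly as one head index per word. Here is a small sketch for the example sentence; the exact arcs follow a standard UD-style analysis and may differ slightly from the figure.

```python
# Each word stores the index of its head; the root points to -1.
words = ["The", "young", "boys", "are", "playing", "outside"]
heads = [2, 2, 4, 4, -1, 4]  # The->boys, young->boys, boys->playing, are->playing, outside->playing

# Recover each head's children -- the view a Tree LSTM consumes.
children = {i: [] for i in range(len(words))}
root = None
for i, head in enumerate(heads):
    if head == -1:
        root = i
    else:
        children[head].append(i)

print(words[root])                          # playing
print([words[c] for c in children[root]])   # ['boys', 'are', 'outside']
```

Note that several words (‘boys’, ‘are’, ‘outside’) share the same head, while each word has exactly one head, just as the definition above requires.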

Data Preprocessing

To convert the dataset to dependency trees I use SyntaxNet, also known as Parsey McParseface. SyntaxNet works by first assigning a part-of-speech (POS) tag (shown in red above) to each word. POS tags describe the role of a word, and there is a finite set of them defined by the Universal Dependencies project. Tags are assigned from left to right by a feedforward neural network that takes the current word, the surrounding words, and their POS tags, if any, as features. Based on the tags, a head is selected for each word to create a tree. There are other dependency parsers, like the Stanford Parser, but SyntaxNet stands out for its language coverage and its accuracy.
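SyntaxNet writes its parses in the tabular CoNLL style: one word per line with, among other columns, the word form, its POS tag, and the 1-indexed position of its head (0 for the root). A minimal reader for that output might look like this; the example rows below are hand-written to match the sentence above, not actual SyntaxNet output.

```python
# Hand-written CoNLL-U style rows for "The young boys are playing outside".
conll = """\
1\tThe\t_\tDET\t_\t_\t3\tdet\t_\t_
2\tyoung\t_\tADJ\t_\t_\t3\tamod\t_\t_
3\tboys\t_\tNOUN\t_\t_\t5\tnsubj\t_\t_
4\tare\t_\tAUX\t_\t_\t5\taux\t_\t_
5\tplaying\t_\tVERB\t_\t_\t0\troot\t_\t_
6\toutside\t_\tADV\t_\t_\t5\tadvmod\t_\t_
"""

tokens = []
for line in conll.strip().split("\n"):
    cols = line.split("\t")
    # Column 2 is the word, column 4 the POS tag, column 7 the head index.
    tokens.append({"form": cols[1], "pos": cols[3], "head": int(cols[6])})

print([(t["form"], t["pos"], t["head"]) for t in tokens])
```

From the `head` column it is straightforward to rebuild the tree structure the models train on.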


In addition to creating the dependency trees, each word is represented as a vector of numbers. GloVe provides a file of pretrained vectors for 400,000 English words, where words that are close in meaning have vectors that are close together. 400,000 words are far more than this project needs; most of them will never appear in the data, so it is necessary to pare the file down to the minimum vocabulary required to train the model.
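Paring down the file is a single pass over the GloVe text format (each line is a word followed by its space-separated numbers). A sketch, with placeholder paths and vocabulary:

```python
# Keep only the GloVe rows for words that actually appear in the dataset.
# Paths and the vocabulary are placeholders for this example.
def filter_glove(glove_path, vocab, out_path):
    """Copy the lines of a GloVe file whose word is in vocab; return the count kept."""
    kept = 0
    with open(glove_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            word = line.split(" ", 1)[0]  # the word is everything before the first space
            if word in vocab:
                dst.write(line)
                kept += 1
    return kept
```

The vocabulary is simply the set of all words appearing in the sentence pairs, which shrinks the embedding file from 400,000 rows to a few thousand.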



The data used to train the network for this project is the SICK dataset. SICK consists of about 10,000 pairs of sentences, each annotated with a score that says how similar the two sentences are. The scores range from one to five, to one decimal place. Semantic relatedness is a measure of how much the meanings of two sentences align. It is not something we can measure directly, but annotators agree widely on the scores. Both sets of sentences in the dataset have to be run through SyntaxNet separately and the results converted to vectors.
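Loading the pairs is a small amount of glue code. This sketch assumes the dataset is a tab-separated file whose header includes `sentence_A`, `sentence_B`, and `relatedness_score` columns, which matches how SICK is distributed; adjust the column names if your copy differs.

```python
import csv

def load_pairs(path):
    """Yield (sentence_A, sentence_B, score) triples from a SICK-style TSV file."""
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            score = float(row["relatedness_score"])  # 1.0 .. 5.0
            yield row["sentence_A"], row["sentence_B"], score
```

Each sentence in a pair is then parsed by SyntaxNet and its words looked up in the pared-down GloVe table before training.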


The Networks

The networks used in the project are based on LSTMs.

Fig. 3.1 Network architecture. The LSTM units on the left and the right share parameters across each layer.
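With shared parameters, each side of the network encodes one sentence, and the two final states are combined into similarity features. One common choice, used by Tai et al. (2015), concatenates the elementwise product and the absolute difference of the two encodings. A NumPy sketch with made-up shapes and parameters:

```python
import numpy as np

def compare(h_a, h_b, W, b):
    """Combine two sentence encodings into similarity features: elementwise
    product and absolute difference, fed through a small dense layer.
    W and b are placeholder parameters for illustration."""
    feats = np.concatenate([h_a * h_b, np.abs(h_a - h_b)])
    return np.tanh(W @ feats + b)

rng = np.random.default_rng(1)
h_a, h_b = rng.standard_normal(16), rng.standard_normal(16)
W, b = rng.standard_normal((8, 32)) * 0.1, np.zeros(8)
out = compare(h_a, h_b, W, b)
print(out.shape)  # (8,)
```

Both features are symmetric in the two sentences, so swapping the inputs gives the same output, which matches the symmetric nature of the relatedness scores.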


Tree LSTM – Instead of reading a linear sequence, a Tree LSTM is applied to a tree, so that at a given node there can be multiple previous elements feeding into the input.


Preliminary Results and Next Steps

After training the first version of the first model on the SICK dataset for about twenty hours without fine-tuning, the mean squared error (MSE) is 0.59. As expected, this is still higher than the benchmarks, which range from 0.45 down to 0.25. However, a better comparison can be achieved by testing the same model with regular sentences under similar training conditions, which is the next step, along with testing the Tree LSTM. A comparison with the benchmarks can then be carried out after fine-tuning the model and its hyperparameters. The ultimate goal is to apply similar concepts to question answering using the SQuAD dataset.
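For reference, the MSE quoted above is just the mean of the squared differences between the model’s outputs and the gold scores:

```python
import numpy as np

def mse(predicted, gold):
    """Mean squared error between predicted and gold relatedness scores."""
    predicted, gold = np.asarray(predicted), np.asarray(gold)
    return float(np.mean((predicted - gold) ** 2))

# Two of the validation pairs shown below (model output vs. gold score).
print(mse([3.632, 3.929], [3.4, 3.8]))  # about 0.035
```

Since scores live on a 1–5 scale, an MSE of 0.59 means the model is typically off by roughly three quarters of a point.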


Sample best scores from validation dataset


sentence_A | sentence_B | relatedness_score | My Network
A person in a black jacket is doing tricks on a motorbike | A skilled person is riding a bicycle on one wheel | 3.4 | 3.632302985
Four children are doing backbends in the gym | Four girls are doing backbends and playing outdoors | 3.8 | 3.928847911
Five children are standing in front of a wooden hut | Five children are standing in a wooden hut | 4.2 | 4.15494952
Someone is on a black and white motorcycle and is standing on the seat | A motorcycle is riding standing up on the seat of the vehicle | 3.165 | 3.140749584
Two dogs are playing by a tree | Two dogs are playing by a plant | 4.6 | 4.70986345
A girl in white is dancing | The blond girl is dancing in front of the sound equipment | 3.3 | 3.485741673
A boy is standing outside the water | The boy is wading through the blue ocean | 3.135 | 3.205507745
An old, topless woman is covered in paint | A young, topless woman is covered in paint | 3.8 | 3.972011382
A man isn’t tossing a kid into the swimming pool that is near the ocean | The man is tossing a kid into the swimming pool that is near the ocean | 3.5 | 3.435902799
One white dog is staring at the black street | Two dogs of different breeds are looking at each other across a street | 3.4 | 3.338895456
A few men in a competition are running outside | A few men in a competition are running indoors | 3.9 | 4.128556652
A few men in a competition are running outside | A few men are running competitions outside | 3.9 | 3.998302157
The kid is happily sliding in the snow | A boy on a hill covered in snow is wearing a red jacket and a black hat and is sliding on his knees | 4 | 3.922705353
One white dog and one black one are running side by side on the grass | A dog, which has a black coat, and a white dog are running on the grass | 4.2 | 3.952577375


Sample worst scores from validation dataset

sentence_A | sentence_B | relatedness_score | My Network
Few people are eating at red tables in a restaurant without lights | A large group of Asian people is eating at a restaurant | 2.6 | 3.546279109
A girl in white is dancing | A girl is wearing white clothes and is dancing | 4.9 | 3.305637399
A group of children in uniforms is standing at a gate and there is no one kissing the mother | A crowd of people is near the water | 1.3 | 3.826985499
A man is sitting near a bike and is writing a note | A man is standing near a bike and is writing on a piece paper | 3.6 | 4.598034076
Two men are taking a break from a trip on a snowy road | Two men with cars are on the side of a snowy road | 4 | 2.972829043
A biker is wearing gear which is black | A biker wearing black is breaking the gears | 3.1 | 4.497746795
Not a lot of people are in an ice skating park | An ice skating rink placed outdoors is full of people | 3.5 | 4.405066942
Some corndogs are being eaten by two toddlers who are in a wagon, which is really small | Two toddlers are eating corndogs in a wagon, which is really small | 4.8 | 3.889615639
A man is getting on a horse on a track laid in the wild | A horse is tossing the cowboy wearing pants of blue and red color | 2.1 | 3.568775066
An adult is in the amphitheater and is talking to a boy | There is no adult in the amphitheater talking to a boy | 4.3 | 2.702033484
A potato is being sliced by a woman | A woman is slicing a carrot | 3 | 4.846798799
A man is talking on the radio | A man is spitting | 1.7 | 3.35747112
A man is playing a guitar | A man is strumming a guitar | 4.9 | 3.557543101



I spent much of the time in the project learning TensorFlow (a deep learning library) and translating the Tree LSTM into TensorFlow. The original idea was to split the trees into blocks that TensorFlow could process, but that only works when the data fits in a table that lines up from one block to the next; trees do not line up like that. Going forward, I will use another library called TensorFlow Fold, which supports data, like trees, that does not align in a table.
