Long Short Term Memory (LSTM) and How to Build One from Scratch with Tensorflow

Starting with the last two posts, I decided to do a deep dive into Recurrent Neural Networks because they are a very broad topic and some of the cutting-edge applications of Deep Learning use RNNs. This post is the third installment of the RNN series, but it will be slightly different: I’m going to break from precedent and write in technical language in the bottom half of the post. I realized that while there is some fairly substantial learning material for building RNNs from scratch using Tensorflow, like the RNN tutorial here, the material seems to have been written for deep learning experts, leaving a gap for beginners and even intermediates. My hope is that this post will make it easier to approach and write Tensorflow code for a basic RNN.



In the last post, I modified and ran Andrej Karpathy’s basic RNN that learns to predict the next character in a sentence from Shakespeare’s works. The RNN did not come close to composing verse, but it learned the structure of sentences and paragraphs until it started producing what looked like Shakespeare-like paragraphs and some English words. The RNN learned some things, but it did not pick up all the patterns. To fix this we could try something that we learned from regular neural networks and convolutional neural networks: we add layers. Also, it turns out that the basic RNN that I used does not do a good job of remembering clues. As the sequence gets longer, the network loses some clues, and to solve this we use a special RNN called Long Short Term Memory, or LSTM for short. In an LSTM, there is a second set of clues (the cell state) that does not go through the layers of neurons. At each step of the sequence, an LSTM adds or subtracts some numbers from this clue. This corresponds to remembering new clues and forgetting some old ones.


Multilayer LSTM
Fig 1.0 Adding another layer of RNN units and using a special RNN called Long Short Term Memory improves the strength of the network
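To make the “second set of clues” idea concrete, here is a rough sketch of a single LSTM step in plain NumPy. This is a simplified illustration with made-up toy sizes and random weights, not the Tensorflow code used later in the post: the gates decide how much of the old cell state `c` to keep and how much new information to add.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    #One big matmul produces all four gate pre-activations at once
    z = np.concatenate([x, h_prev]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  #input, forget, output gates
    g = np.tanh(g)                                #candidate new clues
    c = f * c_prev + i * g                        #forget some old clues, add new ones
    h = o * np.tanh(c)                            #output is based on the updated clue
    return h, c

#Toy sizes: 4-dimensional input, 3-dimensional hidden state
rng = np.random.RandomState(0)
x_dim, h_dim = 4, 3
W = rng.randn(x_dim + h_dim, 4 * h_dim) * 0.1
b = np.zeros(4 * h_dim)
h, c = np.zeros(h_dim), np.zeros(h_dim)
h, c = lstm_step(rng.randn(x_dim), h, c, W, b)
```

Note how the cell state `c` is updated by elementwise multiplication and addition only; it never passes through a layer of neurons, which is what lets clues survive over long sequences.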


Building An LSTM

The following is a synopsis of how to build an LSTM like the one above, but for the task from the previous post: predicting characters to make up sentences. As before, the network is trained on Shakespeare’s works.


  1. Start building the network from the input. The input is a text file of all of Shakespeare’s works, available here. Split the text into sequences that are 25 characters long (regardless of where actual sentences start and end) and group these sequences into batches of, say, 20 each.
  2. Convert the characters in each sequence to numbers. The character ‘a’ could be 1, ‘b’ could be 2, and so forth. From there, change each number to a series of zeros and ones (a one-hot vector). The character ‘a’, which is represented by one, would be 00000000000000000000000001, ‘b’ would be 00000000000000000000000010, and so forth.
  3. Join two LSTM units to each other such that the output of the first is the input of the second.
  4. Add one layer of neurons to the output of the second LSTM unit. The purpose of this layer is to make sure that the output has enough values to represent the ones and zeros of a character, e.g. 00000000000000000000000100 for ‘c’. This layer also makes sure that the output values are between 0 and 1.
  5. Calculate the correctness score of the network.
  6. Train the network on one batch of inputs at a time, i.e. find the values of the parameters that give the highest correctness score.
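Steps 1 and 2 can be sketched in a few lines of plain NumPy. This uses a toy string and toy sizes just to show the char-to-id and one-hot conversions; the real code below reads the Shakespeare file:

```python
import numpy as np

text = "to be or not to be"
chars = sorted(set(text))                  #sorted only to make the example stable
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

#Each char id selects one row of an identity matrix: its one-hot vector
ids = [char_to_ix[ch] for ch in text]
one_hot = np.identity(len(chars))[ids]     #shape (len(text), vocab_size)

#Inputs are all chars but the last; targets are shifted by one (the "next" char)
input_ids, target_ids = ids[:-1], ids[1:]
```

The identity-matrix trick is the same one the Tensorflow code below uses as its embedding matrix.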

Here is the actual code, with comments explaining each piece (work in progress). Some snippets of the code are from the Tensorflow code for the RNN tutorial.

from __future__ import division
from __future__ import print_function

import time
import numpy as np
import tensorflow as tf
import math
import reader
from tensorflow.python.client import device_lib
batch_size = 500
hidden_size = 200
vocab_size = 0 #set down below
data_type = tf.float64 
num_layers = 2
max_grad_norm = 5

#Read the input file 
#find out how many unique characters are in the file 
#create two dictionaries to convert between a char and a char id 
data = open('shakespeare.txt', 'r').read() 
chars = list(set(data))
epoch_size = ((len(data) // batch_size) - 1) 
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

#Group the char from the input file into random batches
#Each batch contains characters from various parts of the file
batch_contiguous_space_size = (len(data)-1)//(batch_size) #Also number_of_batches
all_input_batches =[[0 for m in range(batch_size)] for l in range(batch_contiguous_space_size)]
all_target_batches = [[0 for m in range(batch_size)] for l in range(batch_contiguous_space_size)]
for i in range(batch_contiguous_space_size):
    for j in range(batch_size):
        all_input_batches[i][j] = char_to_ix[data[i+batch_contiguous_space_size*j]]
        all_target_batches[i][j] = char_to_ix[data[i+batch_contiguous_space_size*j+1]]

#Create the session that will be used to run the training and inference
session = tf.Session()

#Placeholders for the input and target chars
inputs = tf.placeholder(tf.int32, shape=(batch_size,))
targets = tf.placeholder(tf.int32, shape=(batch_size,))

#Create an embedding matrix to convert char ids to a one-hot representation
embedding = tf.constant(np.identity(vocab_size))
input_batch = tf.nn.embedding_lookup(embedding, inputs)
targets_batch = tf.nn.embedding_lookup(embedding, targets)

"""Build the graph of the model"""

#We will be using multiple LSTM units. This function makes it easy to generate the cells
def make_cell():
    return tf.contrib.rnn.BasicLSTMCell(
        hidden_size, forget_bias=0.0, state_is_tuple=True)

#The LSTM has 2 layers. A MultiRNNCell holds the layers together and makes it
#easier to run operations on the two layers at the same time
cell = tf.contrib.rnn.MultiRNNCell(
        [make_cell() for _ in range(num_layers)], state_is_tuple=True)
initial_state = cell.zero_state(batch_size, data_type)

#Feed the embedded input into the LSTM and get the final state and the outputs 
outputs, final_state = cell(input_batch, state=initial_state)

#The output is a vector of size hidden_size. It would be great to get a vector that matches
#the number of characters we have (vocab_size), so add another layer on top of the LSTM
#that produces an output vector of size vocab_size
softmax_w = tf.get_variable("softmax_w", [hidden_size, vocab_size], data_type)
softmax_b = tf.get_variable("softmax_b", [vocab_size], data_type)
logits = tf.nn.xw_plus_b(outputs, softmax_w, softmax_b)

#Calculate the loss between the targets and the logits
loss = tf.losses.softmax_cross_entropy(targets_batch, logits)

#Tensorflow calculates the gradients for us
#Using the gradients and the loss, add a Gradient Descent Optimizer to the graph
lr = tf.Variable(0.01, trainable=False)
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars),
                                  max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))

#What's left is to train the model, but before we do that 
#lets get a sample output from the model. 
#When we sample there is no need to use batches. We just feed in
#one input at a time, so we need a different set of input tensors 
#in order to get a sample. We use the same LSTM cells and the same 
#output layer
single_input = tf.placeholder(tf.int32, shape=(1,))
single_input_embedded = tf.nn.embedding_lookup(embedding, single_input)
inference_intial_state = cell.zero_state(1, data_type)
single_output, final_inference_state = cell(single_input_embedded, state=inference_intial_state)
single_output = tf.reshape(tf.concat(single_output, 1), [-1, hidden_size])

single_logits = tf.nn.xw_plus_b(single_output, softmax_w, softmax_b)

"""Train the model"""

start_time = time.time()
costs = 0.0
iters = 0
writer = tf.summary.FileWriter("./", session.graph)
training_state = session.run(initial_state)

#initialize variables
session.run(tf.global_variables_initializer())

#This is what we want run in the training
fetches = {
  "cost": loss,
  "final_state": final_state,
  "train_op": train_op
}
#One Training loop or epoch for now
start_time = time.time()
for step in range(epoch_size):
    #the batches we want to use as input and target
    feed_dict = {inputs:all_input_batches[step], targets:all_target_batches[step]}
    for i, (c, h) in enumerate(initial_state):
        #Pass on the state from one batch to the next
        feed_dict[c] = training_state[i].c
        feed_dict[h] = training_state[i].h

    vals = session.run(fetches, feed_dict)
    cost = vals["cost"]
    training_state = vals["final_state"]

    costs += cost
    iters += 1

    if step % (epoch_size // 100) == 10:
        print("%.3f perplexity: %.3f speed: %.0f wps" %
                (step * 1.0 / epoch_size, np.exp(costs / iters),
                iters * batch_size / (time.time() - start_time)))

train_perplexity = np.exp(costs / iters)

"""Get a Sample from the model"""
#use the input with no batches since we will feed in one char at a time
sample_size = 100
seed_char = 'A'
predicted_chars = [seed_char]
inference_fetches = {
  "final_inference_state": final_inference_state,
  "single_logits": single_logits
}
inference_state = session.run(inference_intial_state)
for i in range(sample_size):
    input_char_ix = char_to_ix[predicted_chars[-1]]
    inference_feed_dict = {single_input:[input_char_ix]}
    for j, (c, h) in enumerate(inference_intial_state):
        inference_feed_dict[c] = inference_state[j].c
        inference_feed_dict[h] = inference_state[j].h

    inference_vals = session.run(inference_fetches, inference_feed_dict)
    inference_state = inference_vals["final_inference_state"]
    #Pick the most likely next char from the output layer and feed it back in
    next_char_ix = int(np.argmax(inference_vals["single_logits"]))
    predicted_chars.append(ix_to_char[next_char_ix])

print(''.join(predicted_chars))
Simple LSTM for Next Char Prediction
