In the last post I highlighted the precautions that have to be taken when preparing data for deep learning, as well as the problems that may arise when data is not handled and prepared with care. In this post I will talk about the training process: the tools to use, the steps to take, the challenges that may arise and how to deal with them.
Hardware for Deep Learning
Deep learning networks are written in software and therefore run on computers. Someone who knows how to program could write the software for a Recurrent Neural Network (RNN) from scratch following the description I gave in my previous posts, but fortunately there is no need to do that: there are already a lot of free (open source) software packages that have different kinds of neural networks, including RNNs, already written and ready to use. Most deep learning practitioners just pick one of these packages and customize the available networks to solve a particular problem. I use a software package called TensorFlow, which was developed at Google and has been growing in popularity in the last few years. The networks that I build using TensorFlow can run on my laptop, and there is even a version that will run on my phone. However, in a lot of cases the number of calculations needed to train the network is so large that training would take weeks or even months on a laptop. Instead, it is common practice to use powerful computers housed in data centers. Such computers are commonly referred to as the cloud, and several companies offer them as a service, e.g. Amazon AWS, Google Cloud, Microsoft Azure and IBM Cloud.
Inside a computer, the most essential part is the processor, or Central Processing Unit (CPU). That is where all the calculations are carried out and where software is run. In the early days of computers the processor came first; other components, like the screen, developed over time. As screens became bigger and started to display more complex graphics and more colors, computer manufacturers started to add another processor, called a Graphics Processing Unit (GPU), to handle the graphics on the screen. To do its job, the GPU runs many calculations at the same time on numbers that have a lot of decimal places (floating point numbers). The GPU is both very good at that and very fast. It turns out that running lots of calculations at the same time and very fast is extremely useful in deep learning. For example, if we have thousands of neurons in one layer, running each neuron’s calculations at the same time can really reduce training time. In fact, that is exactly how software packages like TensorFlow run the calculations within a layer. When a GPU is available, the software package uses it to speed up the calculations. This is called hardware acceleration.
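The reason a layer's calculations can run at the same time is that no neuron needs another neuron's result from the same layer. The sketch below illustrates only that independence, using plain Python threads rather than a GPU. The function names, the tanh activation and the three-neuron layer are assumptions made up for the example, and this is not how TensorFlow is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor
import math

def neuron_output(weights, bias, inputs):
    """One neuron: weighted sum of the inputs plus a bias, passed through tanh."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(total)

def layer_sequential(layer, inputs):
    """Compute the neurons one after another."""
    return [neuron_output(w, b, inputs) for w, b in layer]

def layer_parallel(layer, inputs, workers=4):
    """Hand each neuron to a worker; no neuron waits on another's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(neuron_output, w, b, inputs) for w, b in layer]
        return [f.result() for f in futures]

# A made-up layer of three neurons, each with two weights and a bias.
layer = [([0.5, -0.2], 0.1), ([0.3, 0.8], -0.4), ([-0.7, 0.6], 0.0)]
inputs = [1.0, 2.0]

print(layer_sequential(layer, inputs))
print(layer_parallel(layer, inputs))
```

Both calls print the same numbers, because the order in which the neurons are computed does not matter. Threads in Python will not make this toy example faster; a GPU gets its speed from running thousands of such independent calculations in hardware.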
Monitoring Training and Testing
The next piece of the process after the software package and the hardware (the computers) is testing the network on actual input data. Cloud services, particularly the ones that have GPUs, can be costly, and the cost is charged per hour, so before training on those computers it’s important to test on a small subset of the data and make sure there are no errors. For example, before training the second version of the RNN from this post, I first tested it to see what it would produce and got the following:
I gave the network the first character in each line and then had it generate the subsequent characters in the line. This result seemed weird because, even without training, the network was repeating the same three characters: a, o and ‘. When I restarted the network the same issue would reappear, but each time with a different set of two or three characters. It took a little research to find out that before training the network is actually guessing all characters as the next character with about the same confidence, which is what you want, but two or three characters had just a fraction more confidence than the others. Changing the input character did not change the confidence because the network had not yet learned to pay attention to the input.
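A small sketch can make this concrete. It assumes the network turns its raw scores into confidences with a softmax, that the scores of an untrained network are tiny random numbers that barely depend on the input, and that generation favours the most confident characters. The alphabet, the score range and the random seed are made up for the illustration.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

alphabet = "abcdefghijklmnopqrstuvwxyz' "

# Assumption: an untrained network's scores are tiny random numbers that
# barely depend on the input, so the same scores are reused for every input.
rng = random.Random(0)
logits = [rng.uniform(-0.01, 0.01) for _ in alphabet]
probs = softmax(logits)

# Every character ends up close to the uniform value 1 / len(alphabet).
print(f"uniform: {1 / len(alphabet):.4f}")
print(f"min: {min(probs):.4f}, max: {max(probs):.4f}")

# The few characters with a fraction more confidence win every time.
top3 = sorted(zip(probs, alphabet), reverse=True)[:3]
print([ch for _, ch in top3])
```

Changing the seed plays the role of restarting the network: the confidences stay almost uniform, but a different handful of characters comes out on top.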
With that mystery solved I trained the network on just a few batches of Shakespeare text and looked at the output. Again, I got some weird output. Apart from the input character that I provided, the output was made up of spaces only:
I looked at the Shakespeare text to see if there was a reason the network would learn to produce spaces and, sure enough, the text of 5.4 million characters contained 1.3 million spaces. In comparison, the most common letter, e, came second at only 0.4 million. The network had actually learned that producing spaces was the easiest way to make an output that looked like Shakespeare’s work, i.e. spaces are closer to Shakespeare than the first output above with its a, o and ‘ characters. From there, I let the network train on cloud computers. I wrote the software in such a way that the network would occasionally be asked to produce an output. That way I could monitor the progress and see if the output was getting closer to Shakespeare. It took a while, but after two nights I started getting the following phrases:
st thou shalt not so strange and the world the worl
wing a dog, and the state the matter than the stren
chard, I will not so far the strong and the strange
FINCE. Well, my lord, the seasons of the stronger o
Enter CLARENCE THIBE COMPERICE. PROHIB
In this case too, I gave the network the first character and had it generate the subsequent characters in each line. Something I might investigate in the future, given such an output, is the repetition of words and why it occurs.
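The character count that explained the all-spaces output earlier is easy to reproduce. The following is a minimal sketch using Python's standard `collections.Counter`; the helper name `most_common_characters` and the short sample sentence are stand-ins invented for the example, and the real training text would be read from a file instead.

```python
from collections import Counter

def most_common_characters(text, n=5):
    """Return the n most frequent characters in text with their counts."""
    return Counter(text).most_common(n)

# A short stand-in for the training text; the real corpus would be read from a file.
sample = "To be, or not to be, that is the question"
for char, count in most_common_characters(sample, n=3):
    print(repr(char), count)
```

Running a check like this before training shows up front which characters dominate the data, and therefore what a lazy network is likely to produce first.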
In cases where there is an expected answer that the network is supposed to produce for a given input, it is extremely important to test the network after training using data that the network has not seen before. Deep learning practitioners will usually split the example inputs into three parts: one for training, a second for validating the training and a third for testing the network. During training, the training examples are used to find the parameters that enable the network to produce the best results for each input. After training, the network is tested against the validation examples. During this step the network can be modified without changing the parameters. Once the practitioner is satisfied, they test the network on the last portion of the examples to make sure the network will behave correctly on data that it has not seen. This last set of examples is kept secret (the practitioner does not look at it or use it) until that final step.
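The three-way split described above can be sketched in a few lines. The function name `split_examples`, the 80/10/10 proportions and the fixed seed are assumptions chosen for illustration; the right proportions depend on how much data is available.

```python
import random

def split_examples(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of the examples and cut it into train/validation/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # untouched until the final check
    return train, val, test

train, val, test = split_examples(range(100))
print(len(train), len(val), len(test))
```

Shuffling with a fixed seed keeps the split the same from run to run, so the test examples stay unseen. For sequential data such as text, it may make more sense to cut contiguous blocks rather than shuffle individual examples.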
It’s important to make sure that the network is learning the correct patterns and the steps described above help monitor training. Also, using the appropriate hardware and software will save a lot of time and make the process a lot easier.