An End-to-End Trainable Neural Network for Handwriting Recognition

I) Overview of handwriting recognition

Research in the field of Optical Character Recognition (OCR) has been going on for the last few decades, a lot of articles have been published, and a large number of OCR systems are available commercially. The literature contains reviews of OCR history and of the various techniques used for OCR development, presented in chronological order. Here, we will try to give a short explanation. The high-level overview is the following.

First, an image is fed to a CNN to extract image features. The next step is to feed these features to an RNN, followed by a special decoding algorithm. This decoding algorithm takes the LSTM outputs from each time step and produces the final labels.
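As a rough sketch of this pipeline (the function names here are illustrative placeholders, not from the article):

```python
# A minimal sketch of the overall pipeline; cnn, rnn and decode are placeholders
# for the components described below, not a specific library API.
def recognize(image, cnn, rnn, decode):
    features = cnn(image)    # CNN extracts a feature map from the image
    outputs = rnn(features)  # RNN (LSTM) produces one output vector per time step
    return decode(outputs)   # decoding turns per-step probabilities into final labels
```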

The detailed architecture is the following. FC stands for a fully connected layer, SM for a softmax layer.

The input image has the following shape: height 64, width 128, and 3 channels.

As you have seen before, we feed this image tensor to the CNN feature extractor, and it produces a tensor with shape 4*8*4. We put the image of the word “apple” on top of the feature tensor so you can understand how to interpret it: height equals 4, width equals 8, and the number of channels equals 4. Thus we transform the input image with 3 channels into a 4-channel tensor. In practice, the number of channels should be much larger, but we constructed a small demo network only so that everything fits on the slide.
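For illustration, here is a minimal PyTorch sketch of a feature extractor with exactly these shapes; the particular layers and filter counts are assumptions for this toy example, not the network from the article:

```python
import torch
import torch.nn as nn

# Toy CNN feature extractor: 3 x 64 x 128 image -> 4 x 4 x 8 feature tensor
# (channels x height x width). Four stride-2 poolings shrink 64 -> 4 and 128 -> 8.
cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 4, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(4, 4, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

image = torch.randn(1, 3, 64, 128)   # a batch with one RGB image
features = cnn(image)
print(features.shape)                # torch.Size([1, 4, 4, 8])
```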

Next we perform a reshape operation and obtain a sequence of 8 vectors of 16 elements. After that we feed these 8 vectors to the LSTM network and get its outputs, which are also vectors of 16 elements. Then we apply a fully connected layer followed by a softmax layer and get a vector of 6 elements. This vector contains the probability distribution over the alphabet symbols at each LSTM step.
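Continuing the toy shapes above, a minimal PyTorch sketch of the reshape, LSTM, and fully connected plus softmax steps could look like this (the hidden size of 16 and the alphabet size of 6 are the demo numbers, not recommendations):

```python
import torch
import torch.nn as nn

features = torch.randn(1, 4, 4, 8)                     # batch x channels x height x width from the CNN
seq = features.permute(0, 3, 1, 2).reshape(1, 8, 16)   # 8 time steps, 16 = 4 channels * 4 height

lstm = nn.LSTM(input_size=16, hidden_size=16, batch_first=True)
fc = nn.Linear(16, 6)                                  # 6 = alphabet size including the blank symbol

lstm_out, _ = lstm(seq)                                # (1, 8, 16): one 16-element vector per step
probs = torch.softmax(fc(lstm_out), dim=-1)            # (1, 8, 6): per-step symbol probabilities
```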

The number of CNN output vectors can reach 32, 64, or more; the choice depends on the specific task. Also, in production it is better to use a multilayer bidirectional LSTM, as sketched below. But this simple example explains only the most important concepts.
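For instance, in PyTorch a stacked bidirectional LSTM is a small change (a sketch using the toy sizes from above):

```python
import torch.nn as nn

# Two stacked bidirectional LSTM layers; with bidirectional=True each time step
# produces a vector of 2 * hidden_size elements, so the following FC layer must match.
rnn = nn.LSTM(input_size=16, hidden_size=16, num_layers=2,
              bidirectional=True, batch_first=True)
```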

But how does the decoding algorithm work? In the above diagram we have eight vectors of probabilities, one at each LSTM time step. Let's take the most probable symbol at each time step. As a result we obtain a string of eight characters: the most probable letter at each time step. Then we glue all consecutive repeating characters into one; in our example, two “e” letters are glued into a single one. A special blank character allows us to keep symbols that are repeated in the original labeling: we add the blank symbol to the alphabet and teach our neural network to predict a blank between such repeated symbols. Finally, we remove all blank symbols. Look at the illustration below.
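In code, this greedy (best-path) decoding could look roughly like the following minimal sketch; the alphabet and the blank symbol “-” are assumptions for the toy example, not taken from the article:

```python
BLANK = "-"                                   # special blank symbol added to the alphabet
ALPHABET = ["a", "p", "l", "e", " ", BLANK]   # toy alphabet of 6 symbols, illustration only

def greedy_decode(probs, alphabet=ALPHABET, blank=BLANK):
    """probs: list of per-time-step probability vectors (one per LSTM step)."""
    # 1. Take the most probable symbol at each time step.
    best_path = [alphabet[p.index(max(p))] for p in probs]
    # 2. Glue consecutive repeating characters into one.
    collapsed = [c for i, c in enumerate(best_path) if i == 0 or c != best_path[i - 1]]
    # 3. Remove all blank symbols.
    return "".join(c for c in collapsed if c != blank)

# A best path like ['a', 'p', '-', 'p', 'l', 'l', 'e', 'e'] decodes to "apple":
# collapsing repeats gives a p - p l e, and removing the blank yields "apple".
```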

After training the model, we apply it to images from the test set and get very high accuracy.