Detecting Text in Natural Image with Connectionist Text Proposal Network

Intelligent character recognition (ICR), or handwriting recognition, is an advanced form of OCR and remains a hard problem in computer science. Why is this problem hard to solve?

Because an ICR system consists of two parts: text localization and text recognition.

We shall go into the details of text localization. The technique we use here is CTPN (Connectionist Text Proposal Network) ( https://arxiv.org/pdf/1609.03605.pdf ).

We use a CNN-LSTM architecture.

So what input do we need to train this architecture?

We use two image datasets, COCO-Text and PASCAL VOC, together with their annotations (ground-truth boxes and labels).
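For training, CTPN divides every ground-truth text line into a sequence of fine-scale boxes with a fixed width of 16 pixels. Here is a minimal sketch of that conversion, assuming boxes in (xmin, ymin, xmax, ymax) format; the helper name is ours:

```python
import numpy as np

def split_ground_truth_box(box, stride=16):
    """Split one ground-truth text-line box (xmin, ymin, xmax, ymax)
    into a sequence of fixed-width boxes, 16 px wide as in the paper.
    Hypothetical helper for preparing training targets."""
    xmin, ymin, xmax, ymax = box
    # Snap the left edge to the anchor grid, then emit one box per stride.
    start = int(np.floor(xmin / stride)) * stride
    boxes = []
    for x in range(start, int(np.ceil(xmax)), stride):
        boxes.append((x, ymin, x + stride - 1, ymax))
    return boxes

# Example: a 100 px wide text line becomes 7 fine-scale boxes.
print(split_ground_truth_box((10, 20, 110, 40)))
```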

The CNN we use is a pre-trained VGG16 model; we drop its fully connected layers and slide a 3×3 window over the last convolutional feature maps (conv5). The features extracted at each window position form the input of the LSTM. The output of the LSTM is connected to a 512-D fully connected layer, followed by the output layer, which predicts the confidence score (text or non-text) and the coordinates of the anchors.
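To make the wiring concrete, here is a minimal PyTorch sketch of this architecture. The hyper-parameters (k = 10 anchors, a bidirectional LSTM with a 128-D hidden state per direction, a 512-D FC layer) follow the paper, but the module itself is our illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torchvision

class CTPN(nn.Module):
    """Sketch of a CTPN-style network: VGG16 conv features, a 3x3
    sliding window (implemented as a 3x3 conv), a bidirectional LSTM
    over each feature-map row, a 512-D FC layer, and prediction heads."""

    def __init__(self, k=10):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
        self.backbone = vgg.features[:-1]                # conv5 features, stride 16
        self.window = nn.Conv2d(512, 512, 3, padding=1)  # 3x3 sliding window
        self.rnn = nn.LSTM(512, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, 512)                    # 512-D fully connected layer
        self.score = nn.Linear(512, 2 * k)               # text / non-text per anchor
        self.coord = nn.Linear(512, 2 * k)               # vertical center and height

    def forward(self, x):
        f = torch.relu(self.window(self.backbone(x)))     # N x 512 x H x W
        n, c, h, w = f.shape
        seq = f.permute(0, 2, 3, 1).reshape(n * h, w, c)  # one sequence per row
        seq, _ = self.rnn(seq)                            # N*H x W x 256
        out = torch.relu(self.fc(seq)).reshape(n, h, w, 512)
        return self.score(out), self.coord(out)           # per-window predictions
```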

The main difference between CTPN and other detection models lies in the anchors. Our target is text in images, and text differs from generic objects, so text anchors differ from object anchors. Text is a sequence that does not have an obvious closed boundary; it may include components at many levels, such as strokes, characters, words, text lines, and text regions, which are not clearly distinguished from each other. The images below show the results we obtained with the two types of anchors (left: the anchors of generic object detection; right: the fixed-width anchors).
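The fixed-width anchors are simple to generate: the paper uses k = 10 anchors, all 16 pixels wide, with heights from 11 to 273 pixels obtained by dividing the previous height by 0.7 each time. A short sketch (the function name and box format are ours):

```python
import numpy as np

def vertical_anchors(k=10, width=16, min_height=11, ratio=0.7):
    """CTPN anchors: fixed width (16 px) and k heights from 11 to ~273 px,
    each height equal to the previous one divided by 0.7, as in the paper.
    Returned as (xmin, ymin, xmax, ymax) centered at the origin."""
    heights = [min_height / (ratio ** i) for i in range(k)]
    anchors = [(-width / 2.0, -h / 2.0, width / 2.0, h / 2.0) for h in heights]
    return np.array(anchors)

print(vertical_anchors().round(1))  # widths all 16, heights 11 ... ~273
```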

“The detection processing is summarized as follows. Given an input image, we get W×H×C conv5 feature maps by using the VGG16 model (W×H is the spatial arrangement and C is the number of feature maps or channels). When the detector is sliding a 3×3 window densely through the conv5, each sliding window takes a convolutional feature of 3×3×C for producing the prediction. For each prediction, the horizontal location (x-coordinates) and k anchor locations are fixed, which can be pre-computed by mapping the spatial window location in the conv5 onto the input image. The detector outputs the text/non-text scores and the predicted y-coordinates for the k anchors at each window location. The detected text proposals are generated from the anchors having a text/non-text score of > 0.7. By the designed vertical anchor and fine-scale detection strategy, the detector is able to handle text lines in a wide range of scales and aspect ratios by using a single-scale image.” ~ https://arxiv.org/pdf/1609.03605.pdf
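The paper encodes each vertical prediction relative to its anchor: a relative center v_c and a log-scaled height v_h. Here is a rough sketch of how such per-window outputs could be decoded into fixed-width proposals and filtered at the 0.7 score threshold; the array shapes and names are our assumptions:

```python
import numpy as np

def decode_proposals(scores, coords, anchors, stride=16, threshold=0.7):
    """Turn per-window predictions into text proposals.
    scores: (H, W, k) text probabilities; coords: (H, W, k, 2) holding the
    relative vertical center v_c and log-height v_h, as parameterized in
    the paper. Window locations map back to the image via the stride."""
    h, w, k = scores.shape
    proposals = []
    for y in range(h):
        for x in range(w):
            cx = x * stride + stride / 2.0         # fixed x-center of the window
            for a in range(k):
                if scores[y, x, a] <= threshold:
                    continue
                ha = anchors[a, 3] - anchors[a, 1]  # anchor height
                cya = y * stride + stride / 2.0     # anchor center y
                vc, vh = coords[y, x, a]
                cy = vc * ha + cya                  # decoded center y
                hp = np.exp(vh) * ha                # decoded height
                proposals.append((cx - stride / 2, cy - hp / 2,
                                  cx + stride / 2, cy + hp / 2,
                                  scores[y, x, a]))
    return proposals
```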

After training the model, we tested it on some images and used non-maximum suppression (NMS) to remove duplicate predicted boxes.
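NMS itself is a standard greedy procedure: keep the highest-scoring box, discard the remaining boxes that overlap it too strongly, and repeat. A compact NumPy sketch (the 0.3 IoU threshold is a typical default, not a value from this post):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression.
    boxes: (N, 4) as (xmin, ymin, xmax, ymax); returns kept indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box against all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```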

The images below show some of the positive results from our tests: