In computer vision, object detection is an important problem with applications in many areas, such as self-driving cars, tracking objects and pedestrians, video surveillance, and anomaly detection in security systems.
Generally, object detection involves detecting instances of objects from a known class, such as ‘people’, ‘car’, or ‘face’, in an image. More specifically, an object detector outputs the location of each object, represented by a bounding box drawn around it, together with its class label.
Thanks to advances in deep learning, the field of object detection has seen tremendous progress: every year, new algorithms outperform the previous ones. Earlier approaches to object detection used pipelines consisting of separate stages run in sequence, but state-of-the-art approaches now build end-to-end models, which not only achieve better accuracy but also improve detection speed enough to process images and video in real time.
This article explains YOLO (You Only Look Once), which can be considered one of the best object detection algorithms. The name expresses the idea of the algorithm: to “look” at the image just once. Instead of running a classifier many times on multiple sub-images, as traditional algorithms do, you pass the whole image through the YOLO model once, which decreases the processing time significantly. In other words, you get all the bounding boxes as well as the object class predictions in one go.
In the YOLO algorithm, the original image is divided into an S×S grid (for example, 13×13 in the image below). Each grid cell predicts a fixed number of bounding boxes and a confidence score for each box, so each bounding box consists of 5 predictions: x, y, w, h, and confidence. The confidence is a binary estimate of whether the box contains an object or not. In addition, each grid cell also predicts C conditional class probabilities. Specifically, if B is the number of bounding boxes per grid cell and C is the number of classes, the predictions are encoded as an S×S×(B*5+C) tensor.
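The output encoding above can be sketched with a few lines of arithmetic. The values S=13, B=2, C=20 below are illustrative choices matching the numbers in this section, not fixed parts of the algorithm:

```python
# Sketch of the YOLO output encoding, assuming S=13, B=2, C=20
# (grid size, boxes per cell, and class count are illustrative choices).
S, B, C = 13, 2, 20

# Each grid cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities.
per_cell = B * 5 + C              # 30 values per cell
output_shape = (S, S, per_cell)   # full prediction tensor shape

print(output_shape)       # (13, 13, 30)
print(S * S * per_cell)   # 5070 values predicted in a single forward pass
```

Note that the C class probabilities are shared by all B boxes of a cell, which is exactly why each cell can only assign a single class, a point we return to at the end of the article.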
- Network design
The architecture of the model is shown below:
In this picture, the input image is resized to 448×448. With S=7, B=2, C=20, the shape of the final prediction is (7,7,2*5+20) = (7,7,30). The YOLO model has 24 convolutional layers followed by 2 fully connected layers. The output tensor of the last convolutional layer has shape (7,7,1024); it is flattened and passed through the 2 fully connected layers, which output 7×7×30 values that are then reshaped to (7,7,30), our desired output.
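The shape flow of those final layers can be sketched in NumPy. This is a minimal sketch with random placeholder weights (and a plain ReLU standing in for the network's actual activation), not a trained model, purely to show the flatten, fully connected, reshape sequence:

```python
import numpy as np

# Minimal sketch of the final layers' shape flow, assuming S=7, B=2, C=20.
# Weights are random placeholders; the hidden width 4096 is an assumption.
rng = np.random.default_rng(0)

conv_out = rng.standard_normal((7, 7, 1024))      # last conv layer output
flat = conv_out.reshape(-1)                       # 7*7*1024 = 50176 values

fc1 = rng.standard_normal((flat.size, 4096)) * 0.01
fc2 = rng.standard_normal((4096, 7 * 7 * 30)) * 0.01

hidden = np.maximum(flat @ fc1, 0)                # first fully connected layer + ReLU
prediction = (hidden @ fc2).reshape(7, 7, 30)     # reshape to the (S, S, B*5+C) grid

print(prediction.shape)   # (7, 7, 30)
```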
- Non-max suppression
At inference time, the YOLO network can output multiple bounding boxes for the same object, and non-max suppression is the technique used to resolve this. Non-max suppression keeps only the box with the maximum confidence score among all boxes of the same class, eliminates every box whose IoU (Intersection over Union) with that box is greater than 0.5, and repeats on the remaining boxes.
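The procedure above can be written in a few lines of NumPy. This is a generic sketch of greedy non-max suppression for boxes of one class, using the common (x1, y1, x2, y2) corner format; the function names are my own, not from the YOLO code:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it above the threshold, repeat."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]   # survivors go to the next round
    return keep

# Example: two overlapping detections of one object, plus one separate box.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = non_max_suppression(boxes, scores)
print(kept)   # [0, 2]: the lower-scoring overlapping box is suppressed
```

In a full detector this would be run per class, after first discarding boxes whose confidence falls below a score threshold.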
Because each grid cell in YOLO can only predict one class, the number of nearby objects that can be detected is limited. This is a major limitation of the algorithm. As a result, variants of YOLO were developed to address it: YOLOv2 and YOLOv3. Both introduce an important modification, borrowing the idea of anchor boxes from Faster R-CNN to improve the model's accuracy. More detailed information about these two algorithms can be found here.