- Introduction
In production industries such as agriculture, food and beverages, and confectionery, detecting and eliminating pests is one of the most important concerns, with the aim of increasing food security, improving product quality and reducing production costs. Without computer vision, and artificial intelligence in particular, effective pest identification usually requires a great deal of manual effort. In the commonly used trap-based pest monitoring process, captured digital images are analysed by human experts who recognise and count the pests [2]. Relying on such a manual method is time-consuming, costly and inevitably error-prone. The solution presented in this paper is derived from a real business requirement of our company in Vietnam. Our customer is a Japanese firm specialising in confectionery production and agriculture. They need an automated system that can identify and classify different types of harmful insects, so that they can decide on appropriate measures to get rid of them based on the results.
Previous work on similar problems has put several methods into practice and reported good results on the authors' own datasets. In Section 2, two representative papers are briefly described in terms of problem, method and result. The first combined a convolutional neural network with a “sliding window” approach, reaching a rate of around 80% in object-level evaluation and 95% in image-level evaluation on its custom dataset. The second applied a very basic image processing technique, adaptive thresholding, and reported a 97% pest detection rate on its own dataset. Both produced more than adequate results; nevertheless, the methods are not well suited to our problem, for reasons detailed in the same section. Although the two methods do not apply directly to our problem, they serve as useful baselines for comparison when searching for the most effective detection method. Moreover, thanks to the development of deep learning, we can now also rely on fast and highly accurate object detection methods such as SSD, YOLO and Faster R-CNN. In this paper, we examine both the methods from previous work and these newer methods in order to find the most effective and accurate one for our problem.
- Previous research
This section presents the two most recent pieces of work on a problem roughly similar to ours, in order to clarify the differences between this paper and other work on detecting and classifying insects.
The first is “Automatic moth detection from trap images for pest management” (Weiguang Ding, Graham Taylor, 2016). It also identifies insects on traps using a convolutional neural network; however, on closer inspection the approach is not directly applicable to our problem. Firstly, the scope is different. Ding and Taylor focus on detecting and classifying a single kind of moth, whereas we need to distinguish six classes of insects. Secondly, in terms of methodology, they combined sliding windows with a convolutional network to predict the objects. It is easy to see why this method gave good results for their specific problem: with only one species of moth, the objects to be detected had nearly the same size, so a fixed sliding window size could be chosen. Our problem cannot be solved in the same way, because our insects are diverse: some are rather large, while others are very small. Likewise, different types of insects have different aspect ratios. It would still be feasible to adopt the same method with windows of multiple sizes and aspect ratios, but only at the expense of speed and efficiency. In addition, they used a neural network architecture similar to LeNet and trained it from scratch, which requires a considerable amount of training data to reach a fair result without overfitting. For all these reasons, in this research we look for another method that can detect multiple kinds of pests simultaneously, trains and predicts in a reasonable amount of time, runs in real time and does not require as much training data.
The second piece of work, by Yogesh Kumar, Ashwani Kumar Dubey and Adityan Jothi, focuses on pest detection using adaptive thresholding. The adaptive thresholding method is combined with a noise filter to perform detection. As the paper shows, the authors used a simple image processing method that appears well suited to the fairly simple background in their examples. Since the method produced a remarkable 97% pest detection rate, we will try it on our problem, which has a more complex background (a typical feature of our dataset), to see whether it still performs effectively. In addition, various object detection methods, namely CNNs and their variants, have recently been developed thanks to advances in deep learning for computer vision. Because they have consistently been shown to deliver superior performance on diverse datasets and across many problems [reference], we believe that one of them can perform well at simultaneously detecting and identifying a wide range of pests. The method chosen for the detection pipeline in this paper is SSD, which has several features that make it suitable for our problem and our custom dataset (these features are presented in detail in Section 4).
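To make the speed trade-off of multi-sized sliding windows concrete, the following minimal sketch counts how many windows a classifier would have to evaluate on one 1920 x 1080 image. The window sizes, aspect ratios and stride are illustrative assumptions and are not taken from either paper; `count_windows` and `window_shapes` are hypothetical helper names.

```python
def count_windows(img_w, img_h, win_w, win_h, stride):
    """Number of positions a win_w x win_h window can take inside the image."""
    if win_w > img_w or win_h > img_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


def window_shapes(base_heights, aspect_ratios):
    """All (width, height) pairs, where aspect ratio = width / height."""
    return [(round(h * r), h) for h in base_heights for r in aspect_ratios]


IMG_W, IMG_H = 1920, 1080   # image resolution used in this paper
STRIDE = 8                  # assumed stride in pixels

# One fixed window, as is possible when all objects have a similar size.
single = count_windows(IMG_W, IMG_H, 64, 64, STRIDE)

# Several sizes and aspect ratios, as would be needed for diverse insects.
shapes = window_shapes([32, 64, 128], [0.5, 1.0, 2.0])   # assumed values
multi = sum(count_windows(IMG_W, IMG_H, w, h, STRIDE) for w, h in shapes)

print(f"single window shape: {single} windows per image")
print(f"{len(shapes)} window shapes: {multi} windows per image")
```

Every additional window shape adds another full pass over the image, so the number of classifier evaluations grows roughly in proportion to the number of shapes.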
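As a rough illustration of the adaptive thresholding idea discussed above, the sketch below compares each pixel with the mean of its local neighbourhood and marks it as foreground if it is clearly darker. This is a generic mean-based variant written with NumPy, not the implementation of the cited paper; the block size, the offset and the assumption that insects are darker than the trap background are all illustrative choices, and the noise filter is omitted.

```python
import numpy as np


def adaptive_threshold(gray, block_size=15, offset=10):
    """Return a boolean mask: True where a pixel is darker than
    (mean of its block_size x block_size neighbourhood) - offset."""
    if block_size % 2 == 0:
        raise ValueError("block_size must be odd")
    pad = block_size // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")

    # Integral image with a leading row and column of zeros, so that any
    # window sum can be computed from four corner lookups.
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)

    h, w = gray.shape
    b = block_size
    window_sum = (integral[b:b + h, b:b + w] - integral[:h, b:b + w]
                  - integral[b:b + h, :w] + integral[:h, :w])
    local_mean = window_sum / (b * b)
    return gray < (local_mean - offset)


# Synthetic example: bright background with one small dark blob.
image = np.full((40, 40), 200, dtype=np.uint8)
image[18:22, 18:22] = 50
mask = adaptive_threshold(image)
print(int(mask.sum()), "foreground pixels")
```

Because the threshold follows the local mean, the method tolerates slow changes in illumination, but a cluttered background can produce many foreground regions that are not insects.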
- Data collection
In this section, we present the three main steps of preparing the training data: collection, augmentation and preprocessing.
3.1. Data collection
Raw data (unprocessed images) were collected by photographing dead insects on traps in real production environments. The photos were provided by our customer, the company that asked for the problem to be solved. Their traps contained special insect-attracting compounds and sticky glue to catch insects, and a camera that automatically took a photo at the end of the day.
To preserve most properties of the photos, we saved them as colour (RGB) JPEG images at a resolution of 1920 x 1080 pixels. Each insect was then marked with a bounding box (or ground truth box) using BBOX LabelTool, together with a name label assigned by insect experts, who also assisted in verifying the research results (see the example photo below).
In this research, we concentrate on identifying five main types of pests: choubae_ka, shoujoubae_ka, yusurika_ka, kurobanekinokobae_ka and nomibae_ka. In addition, we use a special label named “others” for insects that do not belong to any of the five types above, giving six classes in total. There are three reasons for this decision. First, our priority is to identify only the common pests that negatively affect the environment and the production process. Second, classifying every kind of insect would be very time-consuming, and rare insect types are likely to increase the chance of mistakes. Third, the “others” class is diverse and covers many insect shapes, which makes the classification task more challenging and interesting.
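The labelling scheme described above can be sketched as a simple mapping from expert labels to class indices. The class names come from the text; the index order and the helper name `to_class_id` are assumptions made for illustration.

```python
# The five target pest types named in the text, plus the catch-all class.
TARGET_CLASSES = (
    "choubae_ka",
    "shoujoubae_ka",
    "yusurika_ka",
    "kurobanekinokobae_ka",
    "nomibae_ka",
)
OTHERS = "others"

CLASS_NAMES = TARGET_CLASSES + (OTHERS,)
CLASS_TO_ID = {name: index for index, name in enumerate(CLASS_NAMES)}  # assumed order


def to_class_id(expert_label):
    """Map an expert-provided label to a class index.
    Any label outside the five target types falls into "others"."""
    name = expert_label.strip().lower()
    if name not in TARGET_CLASSES:
        name = OTHERS
    return CLASS_TO_ID[name]


print(to_class_id("yusurika_ka"))       # a target class
print(to_class_id("unidentified_fly"))  # hypothetical label, mapped to "others"
```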
3.2. Data augmentation
As mentioned above, we have only 200 original photos containing about 3,000 insects. Deep learning models are commonly known to need a lot of training data to produce good results; we therefore apply data augmentation to increase that number. The 200 original photos are divided into two groups: a training set of 150 photos and a test set of 50 photos. Because only one photo was taken at the end of each day, the photos differ in atmospheric conditions, lighting and number of trapped insects. We then apply augmentation to both sets. We combine flipping and rotation operations on each photo to obtain more photos for training and testing, and a program automatically recomputes the bounding boxes for the new images. After the augmentation process, we have around 1,800 photos for training and 600 for testing.
Original image with bounding box
New image with new bounding box after rotation (Data augmentation)
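Recomputing bounding boxes after flipping or rotating an image can be sketched as follows. This is a minimal illustration for horizontal flips, vertical flips and 90-degree clockwise rotations only; the paper does not state which flips and rotation angles were used, so the choice of transforms, the (x_min, y_min, x_max, y_max) box format and the function names are assumptions.

```python
def hflip_box(box, img_w):
    """Mirror a box left-to-right in an image of width img_w."""
    x_min, y_min, x_max, y_max = box
    return (img_w - x_max, y_min, img_w - x_min, y_max)


def vflip_box(box, img_h):
    """Mirror a box top-to-bottom in an image of height img_h."""
    x_min, y_min, x_max, y_max = box
    return (x_min, img_h - y_max, x_max, img_h - y_min)


def rot90_cw_box(box, img_h):
    """Rotate a box 90 degrees clockwise.
    A point (x, y) in the original image maps to (img_h - y, x) in the
    rotated image, whose width is img_h and whose height is the old width."""
    x_min, y_min, x_max, y_max = box
    return (img_h - y_max, x_min, img_h - y_min, x_max)


W, H = 1920, 1080
box = (100, 200, 300, 260)          # hypothetical insect box

print(hflip_box(box, W))
print(vflip_box(box, H))
print(rot90_cw_box(box, H))
```

After each 90-degree rotation the image width and height swap, so a second rotation must be given the new height (the old width).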
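| Dataset | Total photos | choubae_ka | shoujoubae_ka | yusurika_ka | kurobanekinokobae_ka | nomibae_ka | others |
|---|---|---|---|---|---|---|---|
| Training | 150 | 227 | 313 | 197 | 201 | 237 | 1045 |
| Test | 50 | 87 | 67 | 91 | 71 | 56 | 434 |

Table of original data (number of photos and number of labelled insects per class)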
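| Dataset | Total photos | choubae_ka | shoujoubae_ka | yusurika_ka | kurobanekinokobae_ka | nomibae_ka | others |
|---|---|---|---|---|---|---|---|
| Training | 1800 | 2724 | 3756 | 2364 | 2412 | 2844 | 12540 |
| Test | 600 | 1044 | 804 | 1092 | 852 | 672 | 5208 |

Table of data after data augmentation (number of photos and number of labelled insects per class)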
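The relationship between the two tables can be checked with a few lines of arithmetic. The sketch below uses the training rows as given in the tables and derives the augmentation factor from the photo counts; the dictionary layout is just an illustrative choice.

```python
# Training rows copied from the two tables above.
ORIGINAL_TRAIN = {
    "photos": 150,
    "choubae_ka": 227,
    "shoujoubae_ka": 313,
    "yusurika_ka": 197,
    "kurobanekinokobae_ka": 201,
    "nomibae_ka": 237,
    "others": 1045,
}
AUGMENTED_TRAIN = {
    "photos": 1800,
    "choubae_ka": 2724,
    "shoujoubae_ka": 3756,
    "yusurika_ka": 2364,
    "kurobanekinokobae_ka": 2412,
    "nomibae_ka": 2844,
    "others": 12540,
}

# Augmentation factor implied by the photo counts.
factor = AUGMENTED_TRAIN["photos"] // ORIGINAL_TRAIN["photos"]

# Every per-class count should grow by the same factor, because each
# augmented photo keeps all of the insects of its source photo.
consistent = all(
    AUGMENTED_TRAIN[name] == factor * count
    for name, count in ORIGINAL_TRAIN.items()
)
print(factor, consistent)
```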
3.3. Data preprocessing
This step adjusts the colour of the photos after data augmentation. Since the photos were captured in a real production environment, the lighting is likely to be affected by the trap location as well as by the time of day at which the photo is automatically captured (see the example photos). Therefore, to limit the negative effect of lighting on identification performance, we apply the “grey-world” white balance method. The method assumes that the average values of the red (R), green (G) and blue (B) channels are equal, and rescales the channels accordingly. It is described in detail in paper . The preprocessing output is shown in the figure that follows.
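The grey-world assumption can be sketched in a few lines of NumPy: each channel is multiplied by a gain so that its mean matches the mean over all three channels. This is a minimal illustration of the general technique, not necessarily the exact implementation used in our pipeline; the function name and the rounding and clipping choices are assumptions.

```python
import numpy as np


def grey_world(image):
    """Grey-world white balance for an H x W x 3 uint8 image.
    Scales each channel so that all channel means become equal."""
    img = image.astype(np.float64)
    channel_means = img.reshape(-1, img.shape[-1]).mean(axis=0)
    grey = channel_means.mean()
    # Guard against division by zero for an all-black channel.
    gains = grey / np.maximum(channel_means, 1e-6)
    balanced = img * gains
    return np.clip(np.rint(balanced), 0, 255).astype(np.uint8)


# Synthetic example: a uniform image with a strong colour cast.
cast = np.zeros((4, 4, 3), dtype=np.uint8)
cast[..., 0] = 100
cast[..., 1] = 150
cast[..., 2] = 200
balanced = grey_world(cast)
print(balanced[0, 0])
```

Clipping to the range 0 to 255 means that very bright pixels can saturate after scaling, so the channel means of the output are only approximately equal for real photos.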