A Review of Object Detection Models based on Convolutional Neural Network (1905.01614v3)

Published 5 May 2019 in cs.CV

Abstract: Convolutional Neural Networks (CNNs) have become the state of the art for object detection in images. In this chapter, we explain different state-of-the-art CNN-based object detection models, categorizing them according to two different approaches: the two-stage approach and the one-stage approach. The chapter traces the advancement of object detection models from R-CNN to the latest RefineDet, discusses the model description and training details of each model, and draws a comparison among those models.

Authors (3)
  1. F. Sultana (3 papers)
  2. A. Sufian (7 papers)
  3. P. Dutta (16 papers)
Citations (179)

Summary

Object Detection Models Based on Convolutional Neural Networks: A Technical Overview

Object detection, a fundamental task in computer vision, aims to identify and localize objects within images. Convolutional Neural Networks (CNNs) have emerged as state-of-the-art solutions for this challenge, offering significant advancements over traditional machine learning methods. This paper systematically reviews various CNN-based object detection models, categorizing them according to two distinct approaches: two-stage and one-stage methods. Furthermore, the paper provides a detailed comparison of their architectures, training methodologies, and performance metrics.

Two-Stage Approaches

Two-stage object detection models typically involve generating region proposals in the first stage and subsequently refining these to deliver precise object classifications and bounding box predictions in the second stage.

  1. R-CNN: The Region-based Convolutional Network (R-CNN) laid the groundwork for the two-stage approach: selective search generates region proposals, a CNN extracts features from each warped proposal, and class-specific SVMs classify them. However, the model suffered from poor computational efficiency due to its multi-stage pipeline.
  2. SPP-net: Spatial Pyramid Pooling Network (SPP-net) introduced a pooling layer allowing the handling of multi-scale inputs without resizing, enhancing scale invariance and computational efficiency over R-CNN.
  3. Fast R-CNN: Fast R-CNN improved upon the efficiency of its predecessors by introducing Region of Interest (RoI) pooling directly on the shared convolutional feature map, enabling simultaneous classification and bounding box regression (a minimal RoI pooling sketch follows this list).
  4. Faster R-CNN: This model integrated a Region Proposal Network (RPN), eliminating external proposal generation methods and resulting in faster and more accurate detections (an anchor-generation sketch also follows this list).
  5. FPN: Feature Pyramid Network (FPN) leveraged multiscale feature maps to enhance detection accuracy for objects of varying sizes.
  6. Mask R-CNN: Building on Faster R-CNN, Mask R-CNN addressed instance segmentation, providing pixel-precise masks for detected objects using a modified RoI pooling technique called RoI Align.
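
The RoI pooling step that Fast R-CNN introduces (and that Mask R-CNN's RoIAlign later refines) can be illustrated with a short sketch. The function below is a deliberately naive PyTorch version, assuming RoIs are already given in feature-map coordinates; `naive_roi_pool` is a hypothetical helper, and real pipelines would use an optimized operator instead.

```python
import torch
import torch.nn.functional as F

def naive_roi_pool(feature_map, rois, output_size=(7, 7)):
    """Naive RoI max-pooling sketch in the spirit of Fast R-CNN.

    feature_map: (C, H, W) convolutional feature map of one image.
    rois:        (N, 4) boxes as (x1, y1, x2, y2) in feature-map coordinates.
    Returns:     (N, C, output_h, output_w) fixed-size features, one per RoI.
    """
    pooled = []
    for x1, y1, x2, y2 in rois:
        # Quantize the RoI to whole feature-map cells; RoIAlign in Mask R-CNN
        # avoids this quantization by sampling with bilinear interpolation.
        x1, y1 = int(torch.floor(x1)), int(torch.floor(y1))
        x2, y2 = int(torch.ceil(x2)), int(torch.ceil(y2))
        roi = feature_map[:, y1:y2, x1:x2]
        # Max-pool the variable-size RoI into a fixed grid so a fully connected
        # head can classify it and regress its box in one forward pass.
        pooled.append(F.adaptive_max_pool2d(roi, output_size))
    return torch.stack(pooled)

# Example: pool two RoIs from a 256-channel feature map into 7x7 grids.
fmap = torch.randn(256, 50, 50)
rois = torch.tensor([[4.0, 4.0, 20.0, 30.0], [10.0, 12.0, 40.0, 44.0]])
print(naive_roi_pool(fmap, rois).shape)  # torch.Size([2, 256, 7, 7])
```

The anchor mechanism behind Faster R-CNN's RPN can be sketched in a similar spirit. The snippet below places the paper's default 3 scales x 3 aspect ratios at every feature-map cell; the (x1, y1, x2, y2) layout and the exact width/height convention for aspect ratios are assumptions of this sketch.

```python
import torch

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Dense anchor grid sketch: k = len(scales) * len(ratios) reference boxes
    per feature-map cell, expressed in input-image coordinates."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride  # cell centre
            for s in scales:
                for r in ratios:  # r interpreted here as height/width
                    w, h = s / (r ** 0.5), s * (r ** 0.5)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)

# Example: a 38x50 feature map with 9 anchors per cell yields 17100 anchors,
# which the RPN scores as object/background and refines with box offsets.
print(generate_anchors(38, 50).shape)  # torch.Size([17100, 4])
```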

One-Stage Approaches

One-stage methods unify detection into a single, streamlined step, optimizing both classification and bounding box prediction directly.

  1. YOLO: You Only Look Once (YOLO) is renowned for achieving real-time object detection by predicting bounding boxes and class probabilities from the entire image in a single evaluation, though historically less accurate than two-stage models.
  2. SSD: Single Shot Multibox Detector (SSD) incorporates predictions at multiple feature scales within the network, improving its handling of objects of diverse sizes and achieving a fine balance between speed and accuracy (a toy dense prediction head is sketched after this list).
  3. YOLO9000: This iteration enhances the basic YOLO framework, optimizing detection across over 9000 classes by integrating WordTree hierarchies and multi-scale training.
  4. RetinaNet: This model introduced a novel focal loss function to address the class imbalance inherent in dense detection, achieving competitive performance without sacrificing computational efficiency (a focal loss sketch follows this list).
  5. RefineDet: Combining ideas from both two-stage and one-stage methods, RefineDet improves detection accuracy by refining anchor boxes initially before offering final class predictions.
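
The dense, single-pass prediction idea shared by YOLO and SSD can be illustrated with a toy head. The module below is not any specific published architecture: it is a minimal sketch in which one convolution emits, for every feature-map cell and each of A anchors, 4 box offsets plus C class scores; `OneStageHead` and its default sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OneStageHead(nn.Module):
    """Toy SSD/YOLO-style dense prediction head (illustrative sketch only)."""

    def __init__(self, in_channels=256, num_anchors=3, num_classes=20):
        super().__init__()
        self.num_anchors = num_anchors
        self.num_classes = num_classes
        # A single convolution predicts everything in one shot.
        self.pred = nn.Conv2d(in_channels,
                              num_anchors * (4 + num_classes),
                              kernel_size=3, padding=1)

    def forward(self, feature_map):
        out = self.pred(feature_map)                      # (N, A*(4+C), H, W)
        n, _, h, w = out.shape
        out = out.permute(0, 2, 3, 1)                     # (N, H, W, A*(4+C))
        out = out.reshape(n, h * w * self.num_anchors, 4 + self.num_classes)
        return out[..., :4], out[..., 4:]                 # box offsets, class logits

# Example: a single 38x38 feature map yields 38*38*3 = 4332 dense predictions;
# SSD repeats such heads over several feature scales.
head = OneStageHead()
boxes, logits = head(torch.randn(1, 256, 38, 38))
print(boxes.shape, logits.shape)  # (1, 4332, 4) and (1, 4332, 20)
```

Dense heads like this produce overwhelmingly many easy background predictions, which is the imbalance RetinaNet's focal loss targets: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). Below is a minimal binary-classification sketch; gamma = 2 and alpha = 0.25 are the published defaults, while the mean reduction is an assumption of this example.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss sketch: down-weights easy, well-classified examples.

    logits:  raw per-anchor scores for one class, any shape.
    targets: same shape, 1.0 for the object class, 0.0 for background.
    """
    # Per-element cross-entropy, kept unreduced so it can be re-weighted.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # The (1 - p_t)^gamma factor shrinks the loss of confident predictions,
    # letting the rare hard examples dominate the gradient.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: a confidently classified background anchor contributes far less
# than an uncertain object anchor.
logits = torch.tensor([-4.0, 0.1])   # easy negative, hard positive
targets = torch.tensor([0.0, 1.0])
print(focal_loss(logits, targets))
```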

Comparative Analysis

The performance of object detection models is evaluated based on mean average precision (mAP) across different datasets, including PASCAL VOC and MS COCO. Two-stage models typically achieve higher accuracy due to their more thorough proposal refinement process. For instance, Fast R-CNN and Faster R-CNN show superior precision over YOLO variants, which excel in speed — a crucial factor for real-time applications. RefineDet achieves notable mAP, balancing precision with efficiency, indicating potential directions for future innovation.
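
Since the comparison rests on mAP, it helps to recall how the metric is built: detections of one class are ranked by confidence, each is labelled a true or false positive against the ground truth at an IoU threshold, average precision (AP) is the area under the resulting precision-recall curve, and mAP averages AP over classes (and, for MS COCO, over IoU thresholds). The sketch below covers only the final AP step, assuming the true/false-positive labels have already been assigned and using all-point interpolation; PASCAL VOC's older 11-point protocol differs slightly.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve for a single class.

    scores:           confidence of each detection.
    is_true_positive: 1 if the detection matched an unmatched ground-truth box
                      at the chosen IoU threshold, else 0.
    num_gt:           number of ground-truth boxes for this class.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))   # rank by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # All-point interpolation: make precision monotonically non-increasing
    # from right to left, then integrate over the recall steps.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate([[0.0], recall])
    precision = np.concatenate([[precision[0]], precision])
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

# Example: four detections for a class with three ground-truth boxes.
print(average_precision([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 1], num_gt=3))  # ~0.833
```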

Conclusion and Future Directions

The reviewed CNN-based object detection models reflect a trajectory of significant advancements, culminating in methods adept at precise, efficient object detection. Two-stage approaches prioritize accuracy while one-stage models emphasize real-time performance. The fusion of these approaches, alongside the integration of advanced techniques such as feature pyramids and novel loss functions, suggests fertile ground for future research aimed at overcoming existing limitations. There remains a promising horizon for innovations that will further enhance the precision, speed, and applicability of object detection models in diverse real-world settings.