Object Detection with Deep Learning: A Review (1807.05511v2)

Published 15 Jul 2018 in cs.CV

Abstract: Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.



The paper "Object Detection with Deep Learning: A Review" by Zhong-Qiu Zhao et al. surveys deep learning methods for object detection. It traces the shift from traditional methods to deep learning approaches and compares frameworks by network architecture, training strategy, and optimization function. The review is aimed at researchers working on object detection and related neural network based learning systems.

Introduction and Historical Context

Object detection is a fundamental computer vision task that involves localizing and classifying objects within an image. Traditional methods, such as the Deformable Part-based Model (DPM), relied on handcrafted low-level features and shallow trainable architectures, and their performance stagnated even as increasingly complex ensembles were built on top of them. Deep Neural Networks (DNNs) advanced the field by learning features directly from data.

Convolutional Neural Networks (CNNs)

The review emphasizes the role of CNNs, which brought a paradigm shift in object detection by providing hierarchical feature representations. Classic architectures like AlexNet and VGG16 demonstrated that deeper networks could learn more complex and abstract features. The paper discusses various CNN-based methods:

R-CNN: Classifies externally generated region proposals by extracting features from each one with a pre-trained CNN and scoring them with SVMs.
SPP-net: Addressed the fixed-size input limitation of R-CNN by introducing spatial pyramid pooling.
Fast R-CNN: Integrated feature extraction, classification, and bounding-box regression into a single CNN with the Region of Interest (RoI) pooling layer; region proposals still come from an external method.
Faster R-CNN: Enhanced Fast R-CNN with a Region Proposal Network (RPN) for generating region proposals, leading to end-to-end training.
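
The fixed-length pooling idea behind SPP-net and the RoI pooling layer can be illustrated with a minimal sketch. It assumes a single-channel feature map, integer RoI coordinates given directly on the feature map, and a non-empty RoI; the function name roi_max_pool and the bin-splitting rule are illustrative choices, not the exact procedure of either paper.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI of a 2-D feature map into a fixed-size grid.

    roi is (x0, y0, x1, y1) in feature-map cells, end-exclusive.
    Simplified assumption: one channel, integer coordinates, non-empty RoI.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    out_h, out_w = output_size
    # Bin edges that split the region as evenly as integer indices allow.
    ys = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Make sure every bin covers at least one cell.
            y_end = max(ys[i + 1], ys[i] + 1)
            x_end = max(xs[j + 1], xs[j] + 1)
            pooled[i, j] = region[ys[i]:y_end, xs[j]:x_end].max()
    return pooled

fmap = np.arange(36).reshape(6, 6)
small = roi_max_pool(fmap, (0, 0, 4, 4))  # 4x4 region
large = roi_max_pool(fmap, (1, 0, 6, 6))  # 6x5 region
print(small.shape, large.shape)           # both regions give a 2x2 output
```

Regions of different sizes map to the same output shape, which is what lets the later fully connected layers accept proposals of arbitrary size.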

Advances in Object Detection Techniques

The paper categorizes object detection frameworks into two types: region proposal-based and regression/classification-based:

Region Proposal-Based Frameworks: Methods like R-CNN, SPP-net, Fast R-CNN, and Faster R-CNN first generate region proposals and then classify each proposal.
Regression/Classification-Based Frameworks: Methods such as YOLO and SSD dispense with region proposals and directly regress bounding boxes and class probabilities over a dense grid.
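
To make the "regress over a grid" idea concrete, here is a small sketch of how a ground-truth box might be turned into a YOLO-style regression target: the grid cell containing the box center is responsible for the object, the center is stored as an offset inside that cell, and the size is stored relative to the image. The function name encode_box_on_grid, the (cx, cy, w, h) pixel format, and the 448x448 example image are assumptions made for illustration.

```python
def encode_box_on_grid(box, image_size, grid_size=7):
    """Encode a box as (row, col, x_off, y_off, w_rel, h_rel).

    box is (cx, cy, w, h) in pixels; image_size is (width, height).
    Offsets are relative to the responsible cell, sizes to the image.
    """
    cx, cy, w, h = box
    img_w, img_h = image_size
    # Center position measured in grid-cell units.
    gx = cx / img_w * grid_size
    gy = cy / img_h * grid_size
    # Clamp so a center on the right or bottom edge stays inside the grid.
    col = min(int(gx), grid_size - 1)
    row = min(int(gy), grid_size - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h

target = encode_box_on_grid((224, 112, 64, 128), image_size=(448, 448))
print(target)
```

A network trained on such targets predicts all cells in one forward pass, which is why no separate proposal stage is needed.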

Detailed Architectures:

R-CNN: Involves selective search for region proposals, CNN-based feature extraction, and SVM classification.
SPP-net: Uses spatial pyramid pooling to allow for fixed-length feature encoding from variable-sized input regions.
Fast R-CNN: Employs RoI pooling on shared feature maps, leading to faster training.
Faster R-CNN: Introduces an RPN that shares convolutional features with the detection network, reducing the computational cost of generating proposals.
YOLO: Divides the input image into a grid and predicts bounding boxes and class probabilities directly from image pixels.
SSD: Utilizes default anchor boxes with different aspect ratios and scales, making predictions from multiple feature maps of varying resolutions.
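
The default boxes used by SSD (and the anchors used by the RPN) can be sketched as a set of boxes tiled over each feature-map cell. The sketch below assumes square feature maps, normalized image coordinates, and the width/height rule w = s * sqrt(a), h = s / sqrt(a) for scale s and aspect ratio a; the function name default_boxes and the specific scales and ratios are illustrative, and details such as SSD's extra box for aspect ratio 1 are left out.

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) boxes in normalized [0, 1] image coordinates.

    One box per aspect ratio is centered on every cell of a square
    fmap_size x fmap_size feature map. Values here are illustrative.
    """
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Center of the cell in normalized image coordinates.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Each aspect ratio keeps the box area close to scale**2.
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

coarse = default_boxes(fmap_size=2, scale=0.6)  # few large boxes
fine = default_boxes(fmap_size=4, scale=0.2)    # many small boxes
print(len(coarse), len(fine))
```

Pairing low-resolution maps with large scales and high-resolution maps with small scales is how predictions from multiple feature maps cover objects of different sizes.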

Specific Object Detection Tasks

The review extends the discussion to specialized detection tasks, illustrating adaptations of generic models:

Salient Object Detection: Combines deep learning with saliency maps to focus detection on regions of interest. Methods like DSSC and DSR integrate multi-scale deep features and recurrent neural networks (RNNs) for enhanced saliency prediction.
Face Detection: Explores adaptations of CNNs to handle variations in facial appearance and occlusions. Multi-task learning frameworks like MTCNN demonstrate improved performance through joint optimization of detection and alignment.
Pedestrian Detection: Stresses the importance of part-based detection and context modeling to deal with occlusions and small object sizes. Advanced methods include MS-CNN and CompACT-Deep, which fuse deep features with handcrafted ones.

Numerical Results and Comparative Analysis

The paper provides extensive experimental comparisons, demonstrating superior performance of deep learning-based frameworks on benchmark datasets like PASCAL VOC, Microsoft COCO, and Caltech Pedestrian. Highlights include:

YOLO: Achieves real-time detection rates, significantly outperforming prior real-time detectors.
Faster R-CNN: Consistently improves performance with deeper backbones and end-to-end training.
SSD: Balances speed and accuracy, excelling in detecting objects at various scales.
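
Comparisons on benchmarks such as PASCAL VOC depend on deciding when a predicted box matches a ground-truth box, which is commonly done with intersection over union (IoU). The summary above does not spell out the metric, so the following is only a sketch of the standard IoU computation; the function name iou and the corner-format boxes are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    # Corners of the overlapping rectangle, if there is one.
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
print(overlap)
```

A detection is typically counted as correct when its IoU with a ground-truth box of the same class exceeds a threshold such as 0.5.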

Implications and Future Directions

The review identifies both practical and theoretical implications of deep learning in object detection:

Multi-Task and Multi-Modal Learning: Encourages combining object detection with related tasks (e.g., segmentation, pose estimation) and fusing information from different data modalities for robust detection.
Scale Adaptation: Stresses handling scale variations through multi-scale feature maps and scale-invariant models.
Contextual Modeling: Highlights the benefit of leveraging spatial and contextual information to refine detections.
Real-Time Applications: Discusses the necessity of optimizing network architectures to enable real-time detection on constrained platforms.
Unsupervised Learning: Advocates for exploring unsupervised and weakly supervised approaches to reduce reliance on annotated datasets.

Conclusion

This review analyzes deep learning-based object detection frameworks, covering both generic architectures and task-specific detectors. It outlines future directions for improving detection performance and broadening applications, and serves as a reference for researchers working on object detection and related fields.


Authors (4)

Zhong-Qiu Zhao (8 papers)
Peng Zheng (38 papers)
Shou-tao Xu (1 paper)
Xindong Wu (49 papers)

Citations (3,656)
