DetNAS: Backbone Search for Object Detection
The paper "DetNAS: Backbone Search for Object Detection," authored by Yukang Chen et al., introduces an approach that uses Neural Architecture Search (NAS) to design backbone networks tailored specifically to object detection. Traditional object detection systems typically employ backbone architectures originally designed for image classification. The paper argues that such backbones may be suboptimal for detection because the two tasks differ fundamentally: classification determines the main object in an image, whereas detection must both locate and classify multiple object instances.
Methodology
The authors present a comprehensive framework called DetNAS, which leverages the concept of one-shot supernets, a common NAS technique. This approach allows the representation of all possible network architectures within a defined search space, facilitating efficient NAS processes even in complex tasks like object detection. The framework consists of three main stages:
- Supernet Pre-training: The supernet, which encompasses all candidate architectures, is first trained on the ImageNet classification dataset in a path-wise manner: at each training step a single path (one candidate architecture) is sampled and only its weights are updated. This produces pre-trained weights shared across all candidates, which are crucial to the quality of the subsequent detection training.
- Supernet Fine-tuning: The trained supernet is then fine-tuned on the target detection datasets. The authors stress the use of Synchronized Batch Normalization (SyncBN) to keep normalization statistics reliable, since the features computed on different supernet paths vary and the high-resolution inputs typical of detection force small per-GPU batch sizes.
- Architecture Search: An Evolutionary Algorithm (EA) then searches the space: candidate architectures inherit their weights from the trained supernet and are evaluated on detection performance metrics over subsets of the detection dataset, guiding the selection of optimal backbones.
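The path-wise sampling of stage one and the evolutionary search of stage three can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the block count, the kernel-size encoding of the search space, and the `fitness` proxy are all assumptions standing in for evaluating a candidate's detection mAP with weights inherited from the supernet.

```python
import random

# Toy search space: for each of NUM_BLOCKS positions, choose one candidate op,
# identified here only by its kernel size (hypothetical encoding).
NUM_BLOCKS = 8
CHOICES = [3, 5, 7, 9]

def sample_path():
    """Path-wise sampling: pick one candidate op per block, i.e. a single
    path through the supernet."""
    return tuple(random.choice(CHOICES) for _ in range(NUM_BLOCKS))

def fitness(arch):
    """Stand-in for detection mAP on a validation subset. A real system
    would score the candidate using weights inherited from the supernet."""
    # Toy proxy: reward larger kernels in earlier blocks, with a crude
    # FLOPs-style penalty that grows with kernel size.
    score = sum(k / (i + 1) for i, k in enumerate(arch))
    return score - 0.05 * sum(k * k for k in arch)

def evolutionary_search(pop_size=20, generations=10, mutate_p=0.2, seed=0):
    """Elitist EA: keep the top half, refill via crossover + mutation."""
    random.seed(seed)
    population = [sample_path() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, NUM_BLOCKS)      # single-point crossover
            child = list(a[:cut] + b[cut:])
            for i in range(NUM_BLOCKS):                # per-gene mutation
                if random.random() < mutate_p:
                    child[i] = random.choice(CHOICES)
            children.append(tuple(child))
        population = parents + children
    return max(population, key=fitness)

best = evolutionary_search()
```

Because candidates reuse supernet weights, `fitness` is only an inference pass in the real pipeline, which is what makes searching directly on detection affordable.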
Experimental Results
The DetNAS framework has been validated across multiple configurations and datasets, including COCO and VOC, demonstrating significant improvements in detection performance over hand-crafted networks such as ResNet and ShuffleNetV2. Particularly notable is DetNASNet, which surpasses ResNet-101 in mean Average Precision (mAP) while requiring substantially fewer FLOPs.
Rigorous ablation studies reveal that NAS directly performed on object detection tasks yields superior results compared to NAS based on proxy tasks like image classification. Moreover, one critical observation is that networks designed for object detection exhibited larger kernel sizes in lower layers and deeper structures in higher layers—patterns not prevalent in networks optimized for image classification, highlighting the architectural divergence driven by task-specific demands.
Implications and Future Work
The findings in this paper have substantial implications for the future of neural network design in object detection. By emphasizing the fundamental differences between classification and detection tasks, the paper suggests that backbone designs can gain significant efficiency and performance when tailored to a specific task via NAS rather than inherited from classification.
Future work could explore expanding the search space or incorporating additional architectural components, such as attention mechanisms, into the DetNAS framework. Adapting DetNAS for real-time applications would also require balancing detection accuracy against inference speed, potentially opening new avenues for NAS techniques tailored to a wider range of computer vision settings.
Overall, this paper contributes significantly to the ongoing evolution of deep learning in computer vision, providing a robust methodology to address the unique challenges of object detection through architecture search.