DetNet: A Backbone network for Object Detection (1804.06215v2)

Published 17 Apr 2018 in cs.CV

Abstract: Recent CNN-based object detectors, whether one-stage methods like YOLO, SSD, and RetinaNet or two-stage detectors like Faster R-CNN, R-FCN, and FPN, usually fine-tune directly from ImageNet pre-trained models designed for image classification. There has been little work on backbone feature extractors specifically designed for object detection. More importantly, there are several differences between the tasks of image classification and object detection. (1) Recent object detectors like FPN and RetinaNet usually involve extra stages, compared with image classification networks, to handle objects of various scales. (2) Object detection needs not only to recognize the category of the object instances but also to spatially locate their positions. A large downsampling factor brings a large valid receptive field, which is good for image classification but compromises the ability to localize objects. Due to this gap between image classification and object detection, we propose DetNet in this paper, a novel backbone network specifically designed for object detection. DetNet includes the extra stages that traditional backbone networks for image classification lack, while maintaining high spatial resolution in deeper layers. Without any bells and whistles, state-of-the-art results have been obtained for both object detection and instance segmentation on the MSCOCO benchmark based on our DetNet (4.8G FLOPs) backbone. The code will be released for reproduction.


Overview of "DetNet: A Backbone Network for Object Detection"

The paper "DetNet: A Backbone Network for Object Detection" introduces a novel convolutional neural network (CNN) architecture specifically optimized for object detection tasks. Unlike many conventional object detection systems that adapt backbone networks initially designed for image classification, DetNet is purposefully crafted to address the nuances and demands of object detection.

Key Contributions

DetNet’s design addresses two challenges specific to object detection: keeping enough spatial resolution to localize objects, and providing receptive fields large enough to cover objects of different scales. The authors highlight three contributions:

Dedicated Backbone Structure: DetNet includes additional stages compared to traditional networks like ResNet, so it lines up with the feature pyramid architectures used in detectors such as FPN and RetinaNet. Because these extra stages are part of the backbone itself, they can be pre-trained together with the rest of the network, whereas with existing backbones the extra stages are added only at detection time and are not covered by pre-training.
Spatial Resolution Maintenance: In its deeper layers, DetNet maintains high spatial resolution while still providing large receptive fields. The large receptive field supports recognition, and the retained resolution supports localization, across both large and small objects.
Efficient Computation: High-resolution feature maps in deep layers cost more memory and computation. DetNet offsets this with a low-complexity dilated bottleneck structure, trading off computational cost against detection accuracy.
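The resolution and cost tradeoff described above can be illustrated with a little arithmetic. The following is a minimal sketch, not the authors' code: the per-stage strides, the 224-pixel input, and the channel widths (512 and 256) are illustrative assumptions, and `stage_shapes` and `conv_macs` are hypothetical helper names.

```python
def stage_shapes(input_size, stage_strides):
    """Return (cumulative_stride, feature_map_side) after each stage."""
    shapes, total = [], 1
    for stride in stage_strides:
        total *= stride
        shapes.append((total, input_size // total))
    return shapes


def conv_macs(side, c_in, c_out, kernel):
    """Multiply-accumulates of one conv layer producing a side x side map.

    Dilation spreads the kernel taps apart but does not change this count.
    """
    return side * side * c_in * c_out * kernel * kernel


# Assumed stride patterns (illustrative, not taken from the paper):
# a classification-style backbone halves the resolution at every stage,
# while a DetNet-style backbone stops downsampling in its last two stages.
classification_style = [2, 2, 2, 2, 2, 2]
detnet_style = [2, 2, 2, 2, 1, 1]

cls_shapes = stage_shapes(224, classification_style)
det_shapes = stage_shapes(224, detnet_style)

# Doubling the map side quadruples the cost of a 3x3 conv at equal width;
# halving the channel width (an assumed 512 -> 256) cancels that increase.
wide_low_res = conv_macs(7, 512, 512, 3)
wide_high_res = conv_macs(14, 512, 512, 3)
narrow_high_res = conv_macs(14, 256, 256, 3)

print(cls_shapes[-1], det_shapes[-1])
print(wide_high_res // wide_low_res, narrow_high_res == wide_low_res)
```

Under these assumptions the DetNet-style backbone ends with a 14x14 map at stride 16 instead of a 3x3 map at stride 64, and a narrower layer on the larger map costs the same as a wider layer on the smaller one, which is the kind of balance a low-complexity bottleneck aims for.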

Experimental Results

DetNet achieves compelling results on the MSCOCO benchmark, showcasing its efficacy in both object detection and instance segmentation tasks. Key performance insights include:

DetNet-59, the variant evaluated in the paper, outperforms ResNet-50 and competes closely with the significantly more computationally expensive ResNet-101, which indicates a favorable balance between complexity and accuracy.
In detailed evaluations, DetNet shows improvements in average precision and recall, particularly for large objects, which is consistent with the higher spatial resolution it keeps in deeper layers helping to localize object boundaries.

Analysis of Structural Innovation

DetNet's advantage comes from tailoring the backbone to the requirements of object detection. Adding extra stages and keeping spatial resolution fixed in the deeper layers departs from the progressive downsampling typical of classification networks. The dilated bottleneck compensates for the missing downsampling: dilation enlarges the receptive field of a convolution without shrinking the feature map.
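The effect of dilation on the receptive field can be checked with the standard receptive-field recurrence. This is a minimal sketch with illustrative layer stacks rather than DetNet's actual configuration; `receptive_field` is a hypothetical helper, and each layer is described as an assumed `(kernel, stride, dilation)` tuple.

```python
def receptive_field(layers):
    """Return (receptive_field, cumulative_stride) for a stack of conv layers.

    Each layer is (kernel, stride, dilation). A dilated kernel covers
    dilation * (kernel - 1) + 1 input positions along each axis.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        k_eff = dilation * (kernel - 1) + 1
        rf += (k_eff - 1) * jump  # growth is scaled by the current stride
        jump *= stride
    return rf, jump


# Three 3x3 convs: plain, with one stride-2 layer, and with dilation 2.
plain = [(3, 1, 1)] * 3
strided = [(3, 2, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 2)] * 3

print(receptive_field(plain))    # (7, 1)
print(receptive_field(strided))  # (11, 2)
print(receptive_field(dilated))  # (13, 1)
```

In this toy comparison the dilated stack reaches a larger receptive field than the strided one while keeping the cumulative stride at 1, so the feature map is not downsampled.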

Implications and Future Directions

The DetNet framework prompts vital considerations for designing backbone networks that cater to specific tasks, such as object detection, which intrinsically differ from image classification. By bridging these gaps, DetNet sets a precedent for further exploration into custom-tailored neural network designs outside the broadly adopted multi-purpose backbones.

Future pathways could see DetNet's architectural philosophy applied to other complex visual tasks, such as video instance segmentation or real-time multi-object tracking. Additionally, exploration into multi-task learning could benefit from DetNet's principles, efficiently sharing learned representations across tasks while preserving task-specific performance.

In conclusion, DetNet addresses two limitations of classification-derived backbones, the lack of pre-trained extra stages and the loss of spatial resolution in deep layers, through a restructuring tailored to object detection.


Authors (6)

Zeming Li (53 papers)
Chao Peng (66 papers)
Gang Yu (114 papers)
Xiangyu Zhang (328 papers)
Yangdong Deng (8 papers)
Jian Sun (414 papers)

Citations (253)
