Efficient DETR: Improving End-to-End Object Detector with Dense Prior (2104.01318v1)

Published 3 Apr 2021 in cs.CV

Abstract: The recently proposed end-to-end transformer detectors, such as DETR and Deformable DETR, have a cascade structure of stacking 6 decoder layers to update object queries iteratively, without which their performance degrades seriously. In this paper, we investigate that the random initialization of object containers, which include object queries and reference points, is mainly responsible for the requirement of multiple iterations. Based on our findings, we propose Efficient DETR, a simple and efficient pipeline for end-to-end object detection. By taking advantage of both dense detection and sparse set detection, Efficient DETR leverages dense prior to initialize the object containers and brings the gap of the 1-decoder structure and 6-decoder structure. Experiments conducted on MS COCO show that our method, with only 3 encoder layers and 1 decoder layer, achieves competitive performance with state-of-the-art object detection methods. Efficient DETR is also robust in crowded scenes. It outperforms modern detectors on CrowdHuman dataset by a large margin.

Authors

Zhuyu Yao
Jiangbo Ai
Boxun Li
Chi Zhang

Summary

Efficient DETR: A New Approach in End-to-End Object Detection

The paper proposes Efficient DETR, an end-to-end object detection framework that uses dense prior information to initialize the decoder. It builds on recent transformer-based detectors, notably DETR and Deformable DETR, which stack six decoder layers in a cascade to update object queries iteratively and lose considerable accuracy without them. The paper attributes this dependence mainly to the random initialization of object containers, namely object queries and reference points, and addresses that initialization directly.

Key Contributions

The primary contribution is a redesign of object query initialization within the DETR framework using dense detection. Instead of starting from randomly initialized object containers, Efficient DETR initializes them from the output of a dense detection stage. With this better starting point, a single decoder layer is enough to reach competitive results, which reduces computational cost relative to a 6-layer decoder.

Methodology

The Efficient DETR architecture comprises two parts, dense detection and sparse detection, which use the same detection head. Unlike DETR and Deformable DETR, where iterative refinement over six decoder layers is essential, Efficient DETR relies on the dense part to generate proposals, which then initialize the object queries and reference points of the sparse part. The sparse part refines these initialized object containers with a single decoder layer, which supports fast convergence and competitive accuracy.
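As a rough illustration of what refining an initialized reference point can look like, the sketch below applies one box update in the style of Deformable DETR's box refinement: a predicted offset is added in logit space and mapped back to [0, 1] with a sigmoid. The summary does not state the update rule, so using this rule here is an assumption. The names `refine_box` and `inverse_sigmoid` and the example values are hypothetical, not the authors' code.

```python
import math
from typing import Sequence, Tuple


def inverse_sigmoid(x: float, eps: float = 1e-5) -> float:
    """Map a value in (0, 1) to logit space, clamping to avoid log(0)."""
    x = min(max(x, eps), 1.0 - eps)
    return math.log(x / (1.0 - x))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def refine_box(reference: Sequence[float], delta: Sequence[float]) -> Tuple[float, ...]:
    """One refinement step for a normalized (cx, cy, w, h) reference box.

    Assumption: the offset `delta` would come from the detection head applied
    to the decoder output; here it is simply passed in.
    """
    return tuple(sigmoid(inverse_sigmoid(r) + d) for r, d in zip(reference, delta))


# Hypothetical reference box from the dense part and a made-up offset.
reference = (0.50, 0.50, 0.20, 0.20)
delta = (0.10, -0.10, 0.00, 0.30)
refined = refine_box(reference, delta)
print(refined)
```

A zero offset leaves the reference box unchanged, so a well-initialized reference point only needs a small correction. This is consistent with the claim that one decoder layer can be enough.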

The dense part uses a ResNet backbone to extract multi-scale features and makes class-specific predictions over them in a sliding-window fashion. The top-k scoring proposals are then selected to initialize the object containers, which narrows the gap between 1-decoder and 6-decoder models.
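The top-k selection step can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each dense position carries a feature vector, per-class scores, and a predicted box, that a position's score is its maximum class score, and that the selected feature becomes the object query while the selected box becomes the reference point. `ObjectContainer` and `init_object_containers` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ObjectContainer:
    query: List[float]             # object query (assumed: feature at the selected position)
    reference: Tuple[float, ...]   # reference point (assumed: predicted box, cx, cy, w, h)
    score: float                   # score that ranked this proposal


def init_object_containers(
    features: Sequence[Sequence[float]],
    class_scores: Sequence[Sequence[float]],
    boxes: Sequence[Sequence[float]],
    k: int,
) -> List[ObjectContainer]:
    """Pick the k highest-scoring dense proposals and turn them into containers."""
    # Class-specific prediction: rank each position by its best class score.
    best = [max(scores) for scores in class_scores]
    top = sorted(range(len(best)), key=lambda i: best[i], reverse=True)[:k]
    return [
        ObjectContainer(list(features[i]), tuple(boxes[i]), best[i])
        for i in top
    ]


# Three hypothetical dense positions, two classes each.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
class_scores = [[0.2, 0.1], [0.9, 0.3], [0.4, 0.6]]
boxes = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3), (0.7, 0.7, 0.1, 0.1)]

containers = init_object_containers(features, class_scores, boxes, k=2)
for c in containers:
    print(c.score, c.reference)
```

In this toy input the second and third positions are kept, in that order, because their best class scores (0.9 and 0.6) exceed that of the first position (0.2).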

Experimental Insights

Experiments on the MS COCO dataset show that Efficient DETR, with only 3 encoder layers and 1 decoder layer, comes close to state-of-the-art detectors that use deeper and more computationally intensive configurations. Specifically, with a ResNet50 backbone, Efficient DETR achieves a mean Average Precision (mAP) of 44.2. Efficient DETR is also robust in crowded scenes, outperforming modern detectors on the CrowdHuman dataset by a large margin.

Implications and Future Directions

The Efficient DETR framework has several implications for object detection. Using dense priors to initialize object queries allows a simpler network architecture while keeping accuracy competitive. This suggests potential gains in energy efficiency and processing speed, which matter when deploying detection models on resource-constrained hardware.

The paper motivates further exploration of hybrid detection mechanisms that combine dense and sparse detection. Future research could further optimize the interplay between encoder and decoder, or extend the method to other complex scenarios involving variable object scales and densities. Exploring alternatives to transformer architectures for better efficiency could also open new directions in computer vision applications.

In conclusion, Efficient DETR represents a significant stride towards more practical and resource-efficient object detection models, establishing a benchmark for hybrid detection strategies in computer vision.
