
Anchor DETR: Query Design for Transformer-Based Object Detection (2109.07107v2)

Published 15 Sep 2021 in cs.CV

Abstract: In this paper, we propose a novel query design for transformer-based object detection. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding has no explicit physical meaning, and we cannot explain where it will focus. Optimization is also difficult because the prediction slot of each object query does not have a specific mode; in other words, each object query will not focus on a specific region. To solve these problems, our query design bases the object queries on anchor points, which are widely used in CNN-based detectors, so that each object query focuses on the objects near its anchor point. Moreover, our query design can predict multiple objects at one position, resolving the difficulty of "one region, multiple objects". In addition, we design an attention variant that reduces the memory cost while achieving similar or better performance than the standard attention in DETR. Thanks to the query design and the attention variant, the proposed detector, which we call Anchor DETR, achieves better performance and runs faster than DETR with 10$\times$ fewer training epochs. For example, it achieves 44.2 AP at 19 FPS on the MS COCO dataset when using the ResNet50-DC5 feature and training for 50 epochs. Extensive experiments on the MS COCO benchmark prove the effectiveness of the proposed methods. Code is available at \url{https://github.com/megvii-research/AnchorDETR}.

Authors (4)
  1. Yingming Wang (5 papers)
  2. Xiangyu Zhang (328 papers)
  3. Tong Yang (154 papers)
  4. Jian Sun (415 papers)
Citations (46)

Summary

Anchor DETR: Query Design for Transformer-Based Object Detection

The paper "Anchor DETR: Query Design for Transformer-Based Object Detection" presents a refined approach to object detection using transformer architectures, specifically focusing on the design and optimization of query embeddings. This research aims to address limitations in existing transformer-based detectors, such as the Detection Transformer (DETR), which utilizes learned object queries that lack explicit physical meaning and positional clarity.

Methodology and Innovations

The authors introduce a novel query design grounded in anchor points, a concept traditionally employed in CNN-based detectors. By encoding these anchor points as object queries, each query gains an explicit physical meaning: it focuses on objects near its anchor point. This design alleviates the positional ambiguity of previous methods, in which each object query attended over a broad area with no specific regional focus.
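To make the anchor-based query design concrete, here is a minimal PyTorch sketch, not the official implementation: learnable anchor points are encoded with a standard sine positional encoding and a small MLP, and shared pattern embeddings are added so that each point yields several queries. Layer sizes, initialization, and all names are illustrative assumptions; the actual code is in the linked repository.

```python
import math
import torch
import torch.nn as nn

def sine_embed(points, num_feats=128, temperature=10000):
    """points: (A, 2) anchor coordinates in [0, 1] -> (A, 2*num_feats)."""
    scale = 2 * math.pi
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_feats)
    pos = points.unsqueeze(-1) * scale / dim_t            # (A, 2, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(1)                                  # (A, 2*num_feats)

class AnchorQueries(nn.Module):
    """Object queries derived from anchor points, each with several patterns."""
    def __init__(self, num_anchors=300, num_patterns=3, d_model=256):
        super().__init__()
        # Learnable 2D anchor points; a fixed grid works too. Kept in [0, 1]
        # by initialization here; the official code handles this differently.
        self.anchors = nn.Parameter(torch.rand(num_anchors, 2))
        # Shared pattern embeddings let one position predict several objects.
        self.patterns = nn.Embedding(num_patterns, d_model)
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model))

    def forward(self):
        pos = self.proj(sine_embed(self.anchors))          # (A, d_model)
        # Each anchor spawns num_patterns queries: "one region, multiple objects".
        q = self.patterns.weight.unsqueeze(1) + pos.unsqueeze(0)  # (P, A, d)
        return q.flatten(0, 1)                             # (P*A, d_model)
```

Calling `AnchorQueries()()` with these defaults yields 900 queries of width 256; because each query's positional part is tied to a concrete 2D point, its prediction slot is anchored to a specific region of the image.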

Additionally, the authors attach multiple patterns to each anchor point, allowing several object predictions per position. This addresses scenarios where multiple objects occupy the same region, a common challenge in complex scenes. The attention mechanism is also enhanced through a variant termed Row-Column Decoupled Attention (RCDA), which reduces memory consumption while matching or improving on the standard attention used in DETR.
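The following is a minimal single-head PyTorch sketch of the row-column decoupling, assuming the paper's general scheme: the 2D key feature map is reduced to a 1D row key and a 1D column key, and each query attends along the two axes separately, so the full query-by-pixel attention map is never materialized. Tensor names, shapes, and the pooling choice are illustrative; the multi-head version in the official repository differs in detail.

```python
import torch
import torch.nn.functional as F

def rcda(q_x, q_y, k_x, k_y, v):
    """Row-Column Decoupled Attention, single head, for illustration.

    q_x: (N, C)    query embedding for the x (column) coordinate
    q_y: (N, C)    query embedding for the y (row) coordinate
    k_x: (W, C)    row key, e.g. the feature map pooled over its height
    k_y: (H, C)    column key, e.g. the feature map pooled over its width
    v:   (H, W, C) value feature map
    returns (N, C): one output vector per query
    """
    scale = q_x.shape[-1] ** -0.5
    a_x = F.softmax(q_x @ k_x.t() * scale, dim=-1)   # (N, W)
    a_y = F.softmax(q_y @ k_y.t() * scale, dim=-1)   # (N, H)
    # Attend along the width first: (N, W) x (H, W, C) -> (N, H, C).
    z = torch.einsum("nw,hwc->nhc", a_x, v)
    # Then along the height: (N, H) x (N, H, C) -> (N, C).
    return torch.einsum("nh,nhc->nc", a_y, z)
```

Relative to standard attention, which materializes an (N, H·W) map, this stores only (N, W) and (N, H) maps plus an (N, H, C) intermediate, which is where the memory savings on high-resolution feature maps come from.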

Empirical Validation

The paper provides extensive empirical validation on the MS COCO dataset. Notably, the proposed method achieves 44.2 AP at 19 FPS using the ResNet50-DC5 feature after only 50 training epochs, whereas DETR requires roughly 500 epochs to reach comparable accuracy, a significant reduction in training time and computational resources. The detailed experiments confirm that both the anchor-based query design and the RCDA attention mechanism improve speed and accuracy.

Implications and Future Directions

The query design's interpretability and ease of optimization have noteworthy practical implications, enabling more efficient and explainable object detection systems. Like DETR, the method requires no hand-crafted anchor boxes or non-maximum suppression (NMS), which enhances its applicability in real-world scenarios. Because RCDA avoids the random memory access patterns of alternatives such as deformable attention, the model also remains friendly to hardware deployment, a critical consideration for scalable applications.

Looking forward, the developments presented here could pave the way for new advancements in object detection using transformers. Future research might explore the integration of these techniques into more complex multi-task learning frameworks or their adaptation to video-based object detection where temporal consistency is required.

Overall, the paper contributes significantly to the body of work on transformer-based object detection, offering a method that is not only performant but also more aligned with practical deployment requirements.
