DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR (2201.12329v4)

Published 28 Jan 2022 in cs.CV

Abstract: We present in this paper a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer. Using box coordinates not only helps using explicit positional priors to improve the query-to-feature similarity and eliminate the slow training convergence issue in DETR, but also allows us to modulate the positional attention map using the box width and height information. Such a design makes it clear that queries in DETR can be implemented as performing soft ROI pooling layer-by-layer in a cascade manner. As a result, it leads to the best performance on MS-COCO benchmark among the DETR-like detection models under the same setting, e.g., AP 45.7% using ResNet50-DC5 as backbone trained in 50 epochs. We also conducted extensive experiments to confirm our analysis and verify the effectiveness of our methods. Code is available at https://github.com/SlongLiu/DAB-DETR.


An Analysis of DAB-DETR: Dynamic Anchor Boxes for Transformer-Based Object Detection

The paper "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR" proposes a significant modification to the existing DETR (DEtection TRansformer) framework by introducing dynamic anchor boxes as queries. This approach aims to enhance query formulation and training convergence. In this essay, we will explore the methodologies, results, and implications presented by the authors.

Methodological Advancement

DETR brought a Transformer-based architecture to object detection, forgoing traditional hand-crafted components such as anchors. However, it converges slowly, needing 500 training epochs to reach competitive performance, which the paper attributes to an ineffective query design. The paper argues that using dynamic anchor boxes, i.e., four-dimensional coordinates $(x, y, w, h)$, as queries in the Transformer decoder can address this inefficiency.
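To make the idea of a box used as a query concrete, here is a minimal NumPy sketch of how a normalized anchor $(x, y, w, h)$ might be turned into a positional vector by encoding each coordinate with sines and cosines and concatenating the four results. The function names, the embedding size of 128 per coordinate, and the default temperature of 10000 (the usual Transformer constant) are assumptions made for illustration; the learned projection that would map this vector into the decoder's query space is omitted.

```python
import numpy as np

def sine_embed(value, dim=128, temperature=10000.0):
    """Sinusoidal encoding of one normalized coordinate in [0, 1].

    `dim` and `temperature` are illustrative defaults, not values
    taken from the paper.
    """
    i = np.arange(dim // 2)
    freqs = temperature ** (2 * i / dim)      # geometric ladder of wavelengths
    angles = 2 * np.pi * value / freqs
    out = np.empty(dim)
    out[0::2] = np.sin(angles)                # even slots hold sines
    out[1::2] = np.cos(angles)                # odd slots hold cosines
    return out

def anchor_to_embedding(anchor, dim=128, temperature=10000.0):
    """Concatenate the encodings of x, y, w and h into one vector."""
    return np.concatenate([sine_embed(v, dim, temperature) for v in anchor])

anchor = np.array([0.5, 0.5, 0.2, 0.3])       # (x, y, w, h), normalized
emb = anchor_to_embedding(anchor)
print(emb.shape)                              # (512,)
```

Because the query is derived from only four numbers, it can be read directly as a box, which is what makes the formulation easier to interpret than a free-form learned embedding.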

Key Innovations

Layer-by-Layer Anchor Updates: Anchor boxes serve as explicit positional priors, dynamically updated across Transformer decoder layers. This successive refinement approach aids in centering the anchors on target objects effectively.
Positional Attention Modulation: The cross-attention mechanism is modulated by using anchor box dimensions, enabling it to accommodate object scales. This feature addresses the challenge of fixed-size positional priors that can misalign with objects of varying sizes.
Temperature Adjustment: The authors make the temperature of the positional encoding tunable, which controls how flat the positional attention maps are and lets them better suit objects of different scales.
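The three ideas above can be sketched together in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the offsets `delta` stand in for what a decoder layer would predict, the update is applied in logit space (one common way to keep coordinates inside [0, 1]), `ref_wh` stands in for a reference size that the model would predict from the query content, and the temperature of 20 and embedding size of 64 are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inverse_sigmoid(p, eps=1e-5):
    p = np.clip(p, eps, 1.0 - eps)            # avoid log(0)
    return np.log(p / (1.0 - p))

def refine_anchor(anchor, delta):
    """One layer's update: add the (hypothetical) predicted offset in logit space."""
    return sigmoid(inverse_sigmoid(anchor) + delta)

def sine_embed(value, dim=64, temperature=20.0):
    """Sinusoidal encoding of one coordinate; temperature sets how fast it decorrelates."""
    i = np.arange(dim // 2)
    angles = 2 * np.pi * value / temperature ** (2 * i / dim)
    return np.concatenate([np.sin(angles), np.cos(angles)])

def modulated_similarity(key_xy, anchor, ref_wh, dim=64, temperature=20.0):
    """Positional similarity between a key location and the anchor center.

    The x and y terms are rescaled by reference size over anchor size, so a
    wider anchor down-weights the x term and flattens attention along x.
    """
    x, y, w, h = anchor
    sim_x = sine_embed(key_xy[0], dim, temperature) @ sine_embed(x, dim, temperature)
    sim_y = sine_embed(key_xy[1], dim, temperature) @ sine_embed(y, dim, temperature)
    return (sim_x * ref_wh[0] / w + sim_y * ref_wh[1] / h) / np.sqrt(dim)

# Refine one anchor over three layers with made-up offsets.
anchor = np.array([0.40, 0.60, 0.20, 0.20])   # (x, y, w, h), normalized
for delta in [np.array([0.3, -0.2, 0.1, 0.0])] * 3:
    anchor = refine_anchor(anchor, delta)
print(anchor)

# Compare the similarity at the anchor center for a narrow and a wide anchor.
center = anchor[:2]
narrow = modulated_similarity(center, anchor, ref_wh=anchor[2:])
wide_anchor = anchor * np.array([1, 1, 2, 1]) # same center, twice the width
wide = modulated_similarity(center, wide_anchor, ref_wh=anchor[2:])
print(narrow, wide)
```

In this sketch the wide anchor scores lower at its own center than the narrow one, because its x term is scaled down. After a softmax over key locations, that smaller scale yields a flatter attention map along x, which is the intended effect of size modulation.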

Experimental Results

The proposed DAB-DETR model achieved the best performance among DETR-like models on the MS-COCO benchmark under the same setting. Notably, it reached an average precision (AP) of 45.7% with a ResNet-50-DC5 backbone trained for 50 epochs. The improvement suggests that dynamic anchor boxes provide a more effective spatial prior for feature pooling, leading to both faster convergence and higher detection accuracy. The authors report extensive experiments that confirm their analysis and verify the effectiveness of the method.

Theoretical and Practical Implications

Theoretically, this paper provides a deeper understanding of query roles within the DETR framework. It challenges existing perspectives by demonstrating that queries can be effectively represented by low-dimensional spatial information, offering streamlined design and interpretability.

Practically, the reduced training time and improved detection accuracy matter for real-world applications where computational resources and time are constrained. The architecture fits the broader push toward efficient, end-to-end object detection frameworks.

Future Directions in AI

This research opens several avenues for further exploration:

Integration with Multi-Scale Architectures: Applying dynamic anchor boxes in conjunction with multi-scale feature extraction could further enhance detection performance for objects at varying scales.
Cross-Domain Applications: The methodology can be generalized to domains such as medical imaging, where accurate localization and scalability are paramount.
Advanced Attention Mechanisms: Further exploration into adaptable attention mechanisms might yield improvements, potentially integrating learned positional priors.

In summary, the DAB-DETR paper contributes to the object detection field by reformulating DETR queries as dynamic anchor boxes, an enhancement that shows promise for both theoretical understanding and practical application.


Authors (8)

Shilong Liu
Feng Li
Hao Zhang
Xiao Yang
Xianbiao Qi
Hang Su
Jun Zhu
Lei Zhang

