Conditional DETR for Fast Training Convergence (2108.06152v3)

Published 13 Aug 2021 in cs.CV

Abstract: The recently-developed DETR approach applies the transformer encoder and decoder architecture to object detection and achieves promising performance. In this paper, we handle the critical issue, slow training convergence, and present a conditional cross-attention mechanism for fast DETR training. Our approach is motivated by that the cross-attention in DETR relies highly on the content embeddings for localizing the four extremities and predicting the box, which increases the need for high-quality content embeddings and thus the training difficulty. Our approach, named conditional DETR, learns a conditional spatial query from the decoder embedding for decoder multi-head cross-attention. The benefit is that through the conditional spatial query, each cross-attention head is able to attend to a band containing a distinct region, e.g., one object extremity or a region inside the object box. This narrows down the spatial range for localizing the distinct regions for object classification and box regression, thus relaxing the dependence on the content embeddings and easing the training. Empirical results show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101. Code is available at https://github.com/Atten4Vis/ConditionalDETR.

References (53)
  1. YOLOv4: Optimal speed and accuracy of object detection. CoRR, abs/2004.10934, 2020.
  2. Cascade R-CNN: delving into high quality object detection. In CVPR, 2018.
  3. End-to-end object detection with transformers. In ECCV, 2020.
  4. Dynamic convolution: Attention over convolution kernels. In CVPR, 2020.
  5. UP-DETR: unsupervised pre-training for object detection with transformers. CoRR, abs/2011.09094, 2020.
  6. CenterNet: Keypoint triplets for object detection. In ICCV, 2019.
  7. Fast convergence of DETR with spatially modulated co-attention. CoRR, abs/2101.07448, 2021.
  8. Bottom-up human pose estimation via disentangled keypoint regression. In CVPR, 2021.
  9. Ross B. Girshick. Fast R-CNN. In ICCV, 2015.
  10. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
  11. Gaussian transformer: A lightweight approach for natural language inference. In AAAI, 2019.
  12. Deep residual learning for image recognition. In CVPR, 2016.
  13. Gather-excite: Exploiting feature context in convolutional neural networks. In NeurIPS, 2018.
  14. Squeeze-and-excitation networks. In CVPR, 2018.
  15. DenseBox: Unifying landmark localization with end to end object detection. CoRR, abs/1509.04874, 2015.
  16. Dynamic filter networks. In NeurIPS, 2016.
  17. Rethinking positional encoding in language pre-training. CoRR, abs/2006.15595, 2020.
  18. T-GSA: transformer with gaussian-weighted self-attention for speech enhancement. In ICASSP, 2020.
  19. FoveaBox: Beyond anchor-based object detector. CoRR, abs/1904.03797, 2019.
  20. Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955.
  21. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
  22. CornerNet-Lite: Efficient keypoint based object detection. In BMVC, 2020.
  23. Scale-aware trident networks for object detection. In ICCV, 2019.
  24. Focal loss for dense object detection. TPAMI, 2020.
  25. Microsoft COCO: common objects in context. In ECCV, 2014.
  26. SSD: single shot multibox detector. In ECCV, 2016.
  27. Fixing weight decay regularization in Adam. In ICLR, 2017.
  28. Grid R-CNN. In CVPR, 2019.
  29. Libra R-CNN: towards balanced learning for object detection. In CVPR, 2019.
  30. You only look once: Unified, real-time object detection. In CVPR, 2016.
  31. YOLO9000: better, faster, stronger. In CVPR, 2017.
  32. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
  33. Faster R-CNN: towards real-time object detection with region proposal networks. TPAMI, 2017.
  34. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.
  35. Revisiting the sibling head in object detector. In CVPR, 2020.
  36. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
  37. Rethinking transformer-based set prediction for object detection. CoRR, abs/2011.10881, 2020.
  38. Conditional convolutions for instance segmentation. In ECCV, 2020.
  39. FCOS: fully convolutional one-stage object detection. In ICCV, 2019.
  40. Attention is all you need. In NeurIPS, 2017.
  41. Deep high-resolution representation learning for visual recognition. TPAMI, 2019.
  42. SOLOv2: Dynamic and fast instance segmentation. In NeurIPS, 2020.
  43. Line segment detection using transformers without edges. In CVPR, 2021.
  44. CondConv: Conditionally parameterized convolutions for efficient inference. In NeurIPS, 2019.
  45. Lite-HRNet: A lightweight high-resolution network. In CVPR, 2021.
  46. UnitBox: An advanced object detection network. In MM, 2016.
  47. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, 2020.
  48. End-to-end object detection with adaptive clustering transformer. CoRR, abs/2011.09315, 2020.
  49. Objects as points. CoRR, abs/1904.07850, 2019.
  50. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.
  51. Soft anchor-point object detection. In ECCV, 2020.
  52. Feature selective anchor-free module for single-shot object detection. In CVPR, 2019.
  53. Deformable DETR: deformable transformers for end-to-end object detection. CoRR, abs/2010.04159, 2020.
Authors (8)
  1. Depu Meng (7 papers)
  2. Xiaokang Chen (39 papers)
  3. Zejia Fan (4 papers)
  4. Gang Zeng (40 papers)
  5. Houqiang Li (236 papers)
  6. Yuhui Yuan (42 papers)
  7. Lei Sun (138 papers)
  8. Jingdong Wang (236 papers)
Citations (515)

Summary

An Examination of Conditional DETR for Enhanced Training Convergence in Object Detection

The paper, "Conditional DETR for Fast Training Convergence," introduces an innovative approach to improving training speed in object detection models that leverage the DEtection TRansformer (DETR) architecture. The authors focus on addressing the slow convergence issue inherent in DETR by introducing a novel conditional cross-attention mechanism. This mechanism is pivotal for accelerating the training process without sacrificing performance. The research primarily targets the dependency on high-quality content embeddings for box prediction and classification, which is a bottleneck in the original DETR framework.

Methodological Advancements

The core contribution of the paper is the conditional DETR framework, which learns a conditional spatial query from the decoder embedding for use in decoder multi-head cross-attention. The spatial query narrows the focus of each cross-attention head to a distinct region, such as an object extremity or an area inside the bounding box, which reduces the model's reliance on content-embedding quality and thereby speeds up training; the decomposition sketched below makes this division of labor explicit.
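To make the mechanism concrete: as we read the paper's formulation, each query and key is the concatenation of a content part and a spatial part, so the cross-attention dot product decomposes into two independent terms (the notation here is ours):

$$
\text{score} = \underbrace{c_q^{\top} c_k}_{\text{content attention}} \;+\; \underbrace{p_q^{\top} p_k}_{\text{spatial attention}}
$$

where $c_q$ is the content query produced by decoder self-attention, $p_q$ is the conditional spatial query, and $c_k$, $p_k$ are the content and positional parts of the encoder keys. Because the spatial term depends only on geometry, each head can form its attention band without first learning strong content embeddings.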

The conditional spatial query is computed by transforming the positional embedding of a reference point with a mapping predicted from the decoder embedding, which carries the displacement and scale information of the regions to be attended. This transformation enables more precise localization of the extremities and of the regions relevant to classification and bounding-box regression; a code sketch follows.
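A minimal PyTorch-style sketch of this construction, assuming the spatial query is formed as an element-wise product between a transformation predicted from the decoder embedding and the sinusoidal embedding of a reference point; all names here are illustrative, and the authors' released code (linked in the abstract) is the authoritative implementation:

```python
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(ref_points, num_feats=128, temperature=10000):
    """Map normalized 2-D reference points (B, N, 2) in [0, 1] to
    sinusoidal positional embeddings (B, N, 2 * num_feats)."""
    scale = 2 * math.pi
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos = ref_points * scale                            # (B, N, 2)
    pos = pos.unsqueeze(-1) / dim_t                     # (B, N, 2, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(),
                       pos[..., 1::2].cos()), dim=-1)   # (B, N, 2, nf/2, 2)
    return pos.flatten(-3)                              # (B, N, 2 * num_feats)


class ConditionalSpatialQuery(nn.Module):
    """Form q_s = T * p_s: an element-wise transformation T predicted from
    the decoder embedding, applied to the reference point's embedding p_s."""

    def __init__(self, d_model=256):
        super().__init__()
        # Small FFN mapping the decoder embedding to the scaling vector T.
        self.to_transform = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, decoder_embedding, ref_points):
        # decoder_embedding: (B, N, d_model); ref_points: (B, N, 2) in [0, 1]
        d_model = decoder_embedding.shape[-1]
        p_s = sinusoidal_embedding(ref_points, num_feats=d_model // 2)
        T = self.to_transform(decoder_embedding)        # (B, N, d_model)
        return T * p_s                                  # conditional spatial query


# Usage: 100 object queries, 256-dim model, batch of 2.
q_s = ConditionalSpatialQuery(d_model=256)(
    torch.randn(2, 100, 256), torch.rand(2, 100, 2))
print(q_s.shape)  # torch.Size([2, 100, 256])
```

The element-wise product acts as a learned diagonal transformation: it rescales each channel of the reference-point embedding so that, after the dot product with the positional keys, attention concentrates near the displaced regions encoded in the decoder embedding.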

Empirical Evaluation

The authors validate conditional DETR empirically on the COCO 2017 dataset. For the standard backbones R50 and R101, conditional DETR converges approximately 6.7 times faster than the original DETR; for the stronger dilated backbones DC5-R50 and DC5-R101, the speedup rises to 10 times. In practical terms, some configurations match the performance of 500-epoch DETR training in only about 50 epochs.

Comparisons and Practical Implications

Comparative analyses show that conditional DETR not only trains faster but also performs competitively against other single-scale DETR variants such as UP-DETR and the single-scale deformable DETR (deformable DETR-SS). This positions conditional DETR as a promising candidate for settings where both accuracy and short training schedules are critical.

Although conditional DETR does not use multi-scale attention or the higher-resolution (8×) feature maps found in some more advanced DETR variants, its performance remains close to theirs. This suggests that combining conditional cross-attention with multi-scale or higher-resolution features could further strengthen object detection frameworks.

Conclusion and Future Directions

The conditional cross-attention mechanism proposed in the paper eases training by restructuring how spatial queries are formed: each attention head needs to search only a narrow spatial band, which relaxes the dependence on high-quality content embeddings. This approach paves the way for more efficient training paradigms in transformer-based detection architectures.

Future research could apply conditional cross-attention to related tasks such as human pose estimation and line segment detection. Combining the mechanism with multi-scale or higher-resolution inputs also holds potential to advance real-time object detection without the extensive computational overhead typical of current high-performance models.