Conditional DETR for Fast Training Convergence (2108.06152v3)

Published 13 Aug 2021 in cs.CV

Abstract: The recently-developed DETR approach applies the transformer encoder and decoder architecture to object detection and achieves promising performance. In this paper, we handle the critical issue, slow training convergence, and present a conditional cross-attention mechanism for fast DETR training. Our approach is motivated by that the cross-attention in DETR relies highly on the content embeddings for localizing the four extremities and predicting the box, which increases the need for high-quality content embeddings and thus the training difficulty. Our approach, named conditional DETR, learns a conditional spatial query from the decoder embedding for decoder multi-head cross-attention. The benefit is that through the conditional spatial query, each cross-attention head is able to attend to a band containing a distinct region, e.g., one object extremity or a region inside the object box. This narrows down the spatial range for localizing the distinct regions for object classification and box regression, thus relaxing the dependence on the content embeddings and easing the training. Empirical results show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101. Code is available at https://github.com/Atten4Vis/ConditionalDETR.

References (53)
  1. YOLOv4: Optimal speed and accuracy of object detection. CoRR, abs/2004.10934, 2020.
  2. Cascade R-CNN: delving into high quality object detection. In CVPR, 2018.
  3. End-to-end object detection with transformers. In ECCV, 2020.
  4. Dynamic convolution: Attention over convolution kernels. In CVPR, 2020.
  5. UP-DETR: unsupervised pre-training for object detection with transformers. CoRR, abs/2011.09094, 2020.
  6. CenterNet: Keypoint triplets for object detection. In ICCV, 2019.
  7. Fast convergence of DETR with spatially modulated co-attention. CoRR, abs/2101.07448, 2021.
  8. Bottom-up human pose estimation via disentangled keypoint regression. In CVPR, 2021.
  9. Ross B. Girshick. Fast R-CNN. In ICCV, 2015.
  10. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
  11. Gaussian transformer: A lightweight approach for natural language inference. In AAAI, 2019.
  12. Deep residual learning for image recognition. In CVPR, 2016.
  13. Gather-excite: Exploiting feature context in convolutional neural networks. In NeurIPS, 2018.
  14. Squeeze-and-excitation networks. In CVPR, 2018.
  15. DenseBox: Unifying landmark localization with end to end object detection. CoRR, abs/1509.04874, 2015.
  16. Dynamic filter networks. In NeurIPS, 2016.
  17. Rethinking positional encoding in language pre-training. CoRR, abs/2006.15595, 2020.
  18. T-GSA: transformer with gaussian-weighted self-attention for speech enhancement. In ICASSP, 2020.
  19. FoveaBox: Beyond anchor-based object detector. CoRR, abs/1904.03797, 2019.
  20. Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955.
  21. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
  22. CornerNet-Lite: Efficient keypoint based object detection. In BMVC, 2020.
  23. Scale-aware trident networks for object detection. In ICCV, 2019.
  24. Focal loss for dense object detection. TPAMI, 2020.
  25. Microsoft COCO: common objects in context. In ECCV, 2014.
  26. SSD: single shot multibox detector. In ECCV, 2016.
  27. Fixing weight decay regularization in Adam. In ICLR, 2017.
  28. Grid R-CNN. In CVPR, 2019.
  29. Libra R-CNN: towards balanced learning for object detection. In CVPR, 2019.
  30. You only look once: Unified, real-time object detection. In CVPR, 2016.
  31. YOLO9000: better, faster, stronger. In CVPR, 2017.
  32. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
  33. Faster R-CNN: towards real-time object detection with region proposal networks. TPAMI, 2017.
  34. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.
  35. Revisiting the sibling head in object detector. In CVPR, 2020.
  36. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
  37. Rethinking transformer-based set prediction for object detection. CoRR, abs/2011.10881, 2020.
  38. Conditional convolutions for instance segmentation. In ECCV, 2020.
  39. FCOS: fully convolutional one-stage object detection. In ICCV, 2019.
  40. Attention is all you need. In NeurIPS, 2017.
  41. Deep high-resolution representation learning for visual recognition. TPAMI, 2019.
  42. SOLOv2: Dynamic and fast instance segmentation. In NeurIPS, 2020.
  43. Line segment detection using transformers without edges. In CVPR, 2021.
  44. CondConv: Conditionally parameterized convolutions for efficient inference. In NeurIPS, 2019.
  45. Lite-HRNet: A lightweight high-resolution network. In CVPR, 2021.
  46. UnitBox: An advanced object detection network. In MM, 2016.
  47. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, 2020.
  48. End-to-end object detection with adaptive clustering transformer. CoRR, abs/2011.09315, 2020.
  49. Objects as points. CoRR, abs/1904.07850, 2019.
  50. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.
  51. Soft anchor-point object detection. In ECCV, 2020.
  52. Feature selective anchor-free module for single-shot object detection. In CVPR, 2019.
  53. Deformable DETR: deformable transformers for end-to-end object detection. CoRR, abs/2010.04159, 2020.
Authors (8)
  1. Depu Meng (7 papers)
  2. Xiaokang Chen (39 papers)
  3. Zejia Fan (4 papers)
  4. Gang Zeng (40 papers)
  5. Houqiang Li (236 papers)
  6. Yuhui Yuan (42 papers)
  7. Lei Sun (138 papers)
  8. Jingdong Wang (236 papers)
Citations (515)

Summary

An Examination of Conditional DETR for Enhanced Training Convergence in Object Detection

The paper, "Conditional DETR for Fast Training Convergence," introduces an innovative approach to improving training speed in object detection models that leverage the DEtection TRansformer (DETR) architecture. The authors focus on addressing the slow convergence issue inherent in DETR by introducing a novel conditional cross-attention mechanism. This mechanism is pivotal for accelerating the training process without sacrificing performance. The research primarily targets the dependency on high-quality content embeddings for box prediction and classification, which is a bottleneck in the original DETR framework.

Methodological Advancements

The core contribution of the paper is the conditional DETR framework, which learns a conditional spatial query from the decoder embedding for use in decoder multi-head cross-attention. The spatial query narrows the focus of each cross-attention head to a distinct region, such as an object extremity or an area inside the bounding box, which reduces the model's reliance on content-embedding quality and thereby speeds up training; the decomposition sketched below makes this division of labor explicit.
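To make the mechanism concrete: as we read the paper's formulation, each query and key is the concatenation of a content part and a spatial part, so the cross-attention dot product decomposes into two independent terms (the notation here is ours):

$$
\text{score} = \underbrace{c_q^{\top} c_k}_{\text{content attention}} \;+\; \underbrace{p_q^{\top} p_k}_{\text{spatial attention}}
$$

where $c_q$ is the content query produced by decoder self-attention, $p_q$ is the conditional spatial query, and $c_k$, $p_k$ are the content and positional parts of the encoder keys. Because the spatial term depends only on geometry, each head can form its attention band without first learning strong content embeddings.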

The conditional spatial query is computed by transforming the positional embedding of a reference point with a mapping predicted from the decoder embedding, which carries the displacement and scale information of the regions to be attended. This transformation enables more precise localization of the extremities and of the regions relevant to classification and bounding-box regression; a code sketch follows.
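A minimal PyTorch-style sketch of this construction, assuming the spatial query is formed as an element-wise product between a transformation predicted from the decoder embedding and the sinusoidal embedding of a reference point; all names here are illustrative, and the authors' released code (linked in the abstract) is the authoritative implementation:

```python
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(ref_points, num_feats=128, temperature=10000):
    """Map normalized 2-D reference points (B, N, 2) in [0, 1] to
    sinusoidal positional embeddings (B, N, 2 * num_feats)."""
    scale = 2 * math.pi
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos = ref_points * scale                            # (B, N, 2)
    pos = pos.unsqueeze(-1) / dim_t                     # (B, N, 2, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(),
                       pos[..., 1::2].cos()), dim=-1)   # (B, N, 2, nf/2, 2)
    return pos.flatten(-3)                              # (B, N, 2 * num_feats)


class ConditionalSpatialQuery(nn.Module):
    """Form q_s = T * p_s: an element-wise transformation T predicted from
    the decoder embedding, applied to the reference point's embedding p_s."""

    def __init__(self, d_model=256):
        super().__init__()
        # Small FFN mapping the decoder embedding to the scaling vector T.
        self.to_transform = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, decoder_embedding, ref_points):
        # decoder_embedding: (B, N, d_model); ref_points: (B, N, 2) in [0, 1]
        d_model = decoder_embedding.shape[-1]
        p_s = sinusoidal_embedding(ref_points, num_feats=d_model // 2)
        T = self.to_transform(decoder_embedding)        # (B, N, d_model)
        return T * p_s                                  # conditional spatial query


# Usage: 100 object queries, 256-dim model, batch of 2.
q_s = ConditionalSpatialQuery(d_model=256)(
    torch.randn(2, 100, 256), torch.rand(2, 100, 2))
print(q_s.shape)  # torch.Size([2, 100, 256])
```

The element-wise product acts as a learned diagonal transformation: it rescales each channel of the reference-point embedding so that, after the dot product with the positional keys, attention concentrates near the displaced regions encoded in the decoder embedding.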

Empirical Evaluation

The authors validate conditional DETR empirically on the COCO 2017 dataset. For the standard backbones R50 and R101, conditional DETR converges approximately 6.7 times faster than the original DETR; for the stronger dilated backbones DC5-R50 and DC5-R101, the speedup rises to 10 times. In practical terms, some configurations match the performance of 500-epoch DETR training in only about 50 epochs.

Comparisons and Practical Implications

Comparative analyses show that conditional DETR not only trains faster but also performs competitively against other single-scale DETR variants such as UP-DETR and the single-scale deformable DETR (deformable DETR-SS). This positions conditional DETR as a promising candidate for settings where both accuracy and short training schedules are critical.

Although conditional DETR does not use multi-scale attention or the higher-resolution (8×) feature maps found in some more advanced DETR variants, its performance remains close to theirs. This suggests that combining conditional cross-attention with multi-scale or higher-resolution features could further strengthen object detection frameworks.

Conclusion and Future Directions

The conditional cross-attention mechanism proposed in the paper eases training by restructuring how spatial queries are formed: each attention head needs to search only a narrow spatial band, which relaxes the dependence on high-quality content embeddings. This approach paves the way for more efficient training paradigms in transformer-based detection architectures.

Future research could apply conditional cross-attention to related tasks such as human pose estimation and line segment detection. Combining the mechanism with multi-scale or higher-resolution inputs also holds potential to advance real-time object detection without the extensive computational overhead typical of current high-performance models.