Conditional DETR for Fast Training Convergence (2108.06152v3)
Abstract: The recently developed DETR approach applies the transformer encoder-decoder architecture to object detection and achieves promising performance. In this paper, we address its critical issue, slow training convergence, and present a conditional cross-attention mechanism for fast DETR training. Our approach is motivated by the observation that the cross-attention in DETR relies heavily on the content embeddings for localizing the four extremities and predicting the box, which increases the need for high-quality content embeddings and thus the training difficulty. Our approach, named conditional DETR, learns a conditional spatial query from the decoder embedding for decoder multi-head cross-attention. The benefit is that, through the conditional spatial query, each cross-attention head is able to attend to a band containing a distinct region, e.g., one object extremity or a region inside the object box. This narrows the spatial range for localizing the distinct regions used for object classification and box regression, thus relaxing the dependence on the content embeddings and easing the training. Empirical results show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for the stronger backbones DC5-R50 and DC5-R101. Code is available at https://github.com/Atten4Vis/ConditionalDETR.
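The idea of a conditional spatial query can be illustrated with a small NumPy sketch. The abstract does not say how the query is computed, so the construction here is an assumption made for illustration: a vector projected from the decoder embedding modulates, element-wise, a sinusoidal embedding of a per-query reference point, and the resulting spatial term is added to the content term before the softmax. The names `sinusoidal_embedding`, `conditional_attention_weights`, `W_t`, and `ref_points` are hypothetical, the sketch covers a single attention head, and it is not the released implementation.

```python
import numpy as np

def sinusoidal_embedding(xy, dim):
    """Map normalized 2D points of shape (N, 2) to (N, dim) embeddings.
    Assumes dim is divisible by 4 (sin and cos for each of x and y)."""
    half = dim // 2
    freqs = 10000.0 ** (-np.arange(0, half, 2) / half)
    parts = []
    for c in range(2):
        angles = xy[:, c:c + 1] * 2 * np.pi * freqs
        parts.append(np.concatenate([np.sin(angles), np.cos(angles)], axis=1))
    return np.concatenate(parts, axis=1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conditional_attention_weights(dec_emb, ref_points, key_content, key_pos, W_t):
    """Single-head cross-attention weights with a conditional spatial query.

    dec_emb:     (Q, d) decoder embeddings, also used as content queries
    ref_points:  (Q, 2) normalized reference points (assumed given)
    key_content: (K, d) encoder content features
    key_pos:     (K, d) positional embeddings of the keys
    W_t:         (d, d) hypothetical projection of the decoder embedding
    """
    d = dec_emb.shape[1]
    # Assumption: the spatial query is conditioned on the decoder embedding
    # by element-wise modulation of the reference-point embedding.
    modulation = dec_emb @ W_t
    spatial_query = modulation * sinusoidal_embedding(ref_points, d)
    content_logits = dec_emb @ key_content.T
    spatial_logits = spatial_query @ key_pos.T
    # Scaling by sqrt(d) is a simplification chosen for this sketch.
    return softmax((content_logits + spatial_logits) / np.sqrt(d), axis=-1)

# Toy usage: 3 queries attending over a 4x4 grid of keys.
rng = np.random.default_rng(0)
Q, d = 3, 8
grid = (np.arange(4) + 0.5) / 4
key_xy = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)
weights = conditional_attention_weights(
    dec_emb=rng.normal(size=(Q, d)),
    ref_points=rng.uniform(size=(Q, 2)),
    key_content=rng.normal(size=(key_xy.shape[0], d)),
    key_pos=sinusoidal_embedding(key_xy, d),
    W_t=rng.normal(size=(d, d)) * 0.1,
)
print(weights.shape)  # (3, 16)
```

Keeping the content and spatial terms as separate dot products is what lets the spatial term be shaped by the decoder embedding alone in this sketch; with several heads, each head would receive its own slice of the spatial query.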
Authors:
- Depu Meng
- Xiaokang Chen
- Zejia Fan
- Gang Zeng
- Houqiang Li
- Yuhui Yuan
- Lei Sun
- Jingdong Wang