DEYO: DETR with YOLO for End-to-End Object Detection (2402.16370v1)
Abstract: The training paradigm of DETRs depends heavily on pre-training their backbone on the ImageNet dataset. However, the limited supervisory signals provided by the image classification task and the one-to-one matching strategy leave the neck of DETRs inadequately pre-trained. In addition, unstable matching in the early stages of training makes the optimization objectives of DETRs inconsistent. To address these issues, we propose a training method termed step-by-step training. In the first stage, a classic detector pre-trained with a one-to-many matching strategy is used to initialize the backbone and neck of the end-to-end detector. In the second stage, we freeze the backbone and neck of the end-to-end detector and train the decoder from scratch. With step-by-step training, we introduce the first real-time end-to-end object detection model that uses a purely convolutional encoder, DETR with YOLO (DEYO). Without relying on any additional training data, DEYO surpasses all existing real-time object detectors in both speed and accuracy. Moreover, the entire DEYO series can complete its second-stage training on the COCO dataset using a single 8GB RTX 4060 GPU, significantly reducing training cost. Source code and pre-trained models are available at https://github.com/ouyanghaodong/DEYO.
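The two stages described in the abstract can be illustrated with a minimal, framework-free sketch. It is not taken from the DEYO repository: the `Param` and `Detector` classes, the parameter names (`backbone.conv1`, `neck.fuse`, `decoder.query`, `head.cls`) and all weight values are hypothetical and chosen only for illustration. In a framework such as PyTorch, the same idea would typically be expressed by loading a filtered state dict and disabling gradients on the frozen modules.

```python
from dataclasses import dataclass, field


@dataclass
class Param:
    """A toy parameter: a value plus a flag saying whether training may update it."""
    value: float
    trainable: bool = True


@dataclass
class Detector:
    """Toy detector whose parameters are keyed 'backbone.*', 'neck.*', 'decoder.*'."""
    params: dict = field(default_factory=dict)


def init_from_pretrained(model, pretrained_weights, prefixes=("backbone.", "neck.")):
    """Stage 1 hand-off: copy only backbone/neck weights from the classic detector."""
    copied = []
    for name, value in pretrained_weights.items():
        if name.startswith(prefixes) and name in model.params:
            model.params[name].value = value
            copied.append(name)
    return copied


def freeze(model, prefixes=("backbone.", "neck.")):
    """Stage 2: mark backbone/neck as frozen so that only the decoder is trained."""
    for name, p in model.params.items():
        if name.startswith(prefixes):
            p.trainable = False


def trainable_names(model):
    """Names of the parameters an optimizer would receive in stage 2."""
    return [n for n, p in model.params.items() if p.trainable]


# Hypothetical weights from a detector trained with one-to-many matching
# (values are made up; the detection head is not transferred).
yolo_weights = {"backbone.conv1": 0.5, "neck.fuse": -0.2, "head.cls": 0.9}

deyo = Detector({
    "backbone.conv1": Param(0.0),
    "neck.fuse": Param(0.0),
    "decoder.query": Param(0.0),  # trained from scratch in stage 2
})

copied = init_from_pretrained(deyo, yolo_weights)
freeze(deyo)
stage2_params = trainable_names(deyo)
```

After these steps, only the decoder parameters remain trainable. Freezing the backbone and neck means no gradients or optimizer state are needed for them, which is one plausible reason the second stage fits on a single small GPU.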
- Haodong Ouyang