
DEYO: DETR with YOLO for Step-by-Step Object Detection (2211.06588v3)

Published 12 Nov 2022 in cs.CV

Abstract: Object detection is an important topic in computer vision, and post-processing, an essential part of the typical object detection pipeline, poses a significant bottleneck for traditional object detection models. The detection transformer (DETR), the first end-to-end object detection model, discards hand-crafted components such as anchors and non-maximum suppression (NMS), greatly simplifying the object detection process. However, compared with most traditional object detection models, DETR converges very slowly, and the meaning of its queries is obscure. Inspired by the step-by-step concept, this paper proposes a new two-stage object detection model, DETR with YOLO (DEYO), which relies on progressive inference to solve the above problems. DEYO is a two-stage architecture comprising a classic object detection model as the first stage and a DETR-like model as the second. Specifically, the first stage provides high-quality queries and anchors that are fed into the second stage, improving its performance and efficiency compared with the original DETR model; meanwhile, the second stage compensates for the performance degradation caused by the limitations of the first-stage detector. Extensive experiments demonstrate that DEYO attains 50.6 AP and 52.1 AP in 12 and 36 epochs, respectively, using a ResNet-50 backbone and multi-scale features on the COCO dataset. Compared with DINO, a state-of-the-art DETR-like model, DEYO delivers significant performance improvements of 1.6 AP and 1.2 AP under the two epoch settings.
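The abstract's two-stage pipeline can be sketched in a few lines: a classic detector proposes scored boxes, the top-k proposals become anchors and initial queries, and a DETR-like decoder refines them. This is a minimal illustrative sketch only; the function names (`first_stage`, `select_queries`, `second_stage`), the dummy proposals, and the toy refinement step are assumptions for illustration, not the paper's actual implementation or API.

```python
def first_stage(image):
    """Stand-in for a classic YOLO-style detector.

    Returns (confidence, [cx, cy, w, h]) proposals; real detectors
    would run a CNN over the image. The values below are dummies.
    """
    return [(0.9, [0.5, 0.5, 0.2, 0.3]),
            (0.4, [0.1, 0.2, 0.1, 0.1]),
            (0.75, [0.7, 0.6, 0.3, 0.2])]

def select_queries(proposals, num_queries=2):
    """Keep the top-k proposals by confidence.

    Their boxes serve as the second stage's anchors, and their scores
    stand in for the features that would initialize the queries.
    """
    ranked = sorted(proposals, key=lambda p: p[0], reverse=True)
    top = ranked[:num_queries]
    queries = [score for score, _ in top]
    anchors = [box for _, box in top]
    return queries, anchors

def second_stage(queries, anchors):
    """Stand-in for a DETR-like decoder that refines each anchor."""
    refined = []
    for q, (cx, cy, w, h) in zip(queries, anchors):
        # A real decoder would cross-attend to image features; here we
        # merely shrink each box slightly to illustrate refinement.
        refined.append((q, [cx, cy, w * 0.95, h * 0.95]))
    return refined

queries, anchors = select_queries(first_stage(None))
detections = second_stage(queries, anchors)
```

The key point the sketch captures is that the second stage never starts from random or learned-from-scratch queries: it always begins from the first stage's high-confidence proposals, which is what the paper credits for the improved convergence and performance.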

References (45)
  1. Yolov4: Optimal speed and accuracy of object detection. ArXiv, abs/2004.10934, 2020.
  2. Soft-nms — improving object detection with one line of code. 2017 IEEE International Conference on Computer Vision (ICCV), pages 5562–5570, 2017.
  3. End-to-end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 213–229, Cham, 2020. Springer International Publishing.
  4. Hybrid task cascade for instance segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4969–4978, 2019.
  5. Group detr: Fast training convergence with decoupled one-to-many label assignment. ArXiv, abs/2207.13085, 2022.
  6. Dynamic head: Unifying object detection heads with attentions. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7369–7378, 2021.
  7. Dynamic detr: End-to-end object detection with dynamic attention. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2968–2977, 2021.
  8. Centernet: Keypoint triplets for object detection. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6568–6577, 2019.
  9. The stable marriage problem: An interdisciplinary review from the physicist’s perspective. 2021.
  10. Fast convergence of detr with spatially modulated co-attention. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3601–3610, 2021.
  11. Yolox: Exceeding yolo series in 2021. ArXiv, abs/2107.08430, 2021.
  12. Ross B. Girshick. Fast r-cnn. 2015 IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
  13. Rich feature hierarchies for accurate object detection and semantic segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
  14. Glenn Jocher. Yolov5 release v6.2. https://github.com/ultralytics/yolov5/releases/tag/v6.2, 2022.
  15. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  16. Softer-nms: Rethinking bounding box regression for accurate object detection. ArXiv, abs/1809.08545, 2018.
  17. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  18. Large language models are zero-shot reasoners. ArXiv, abs/2205.11916, 2022.
  19. Cornernet: Detecting objects as paired keypoints. International Journal of Computer Vision, 128:642–656, 2018.
  20. Yolov6: A single-stage object detection framework for industrial applications. ArXiv, abs/2209.02976, 2022.
  21. Dn-detr: Accelerate detr training by introducing query denoising. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13609–13617, 2022.
  22. Feature pyramid networks for object detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.
  23. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:318–327, 2017.
  24. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
  25. Adaptive nms: Refining pedestrian detection in a crowd. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6452–6461, 2019.
  26. Dab-detr: Dynamic anchor boxes are better queries for detr. In International Conference on Learning Representations, 2022.
  27. Path aggregation network for instance segmentation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8759–8768, 2018.
  28. Ssd: Single shot multibox detector. In European Conference on Computer Vision, 2016.
  29. Swin transformer: Hierarchical vision transformer using shifted windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, 2021.
  30. Conditional detr for fast training convergence. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3631–3640, 2021.
  31. You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
  32. Yolo9000: Better, faster, stronger. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, 2017.
  33. Yolov3: An incremental improvement. ArXiv, abs/1804.02767, 2018.
  34. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1137–1149, 2015.
  35. Objects365: A large-scale, high-quality dataset for object detection. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8429–8438, 2019.
  36. What makes for end-to-end object detection? In International Conference on Machine Learning, 2020.
  37. Sparse r-cnn: End-to-end object detection with learnable proposals. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14449–14458, 2021.
  38. Rethinking transformer-based set prediction for object detection. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3591–3600, 2021.
  39. Fcos: Fully convolutional one-stage object detection. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9626–9635, 2019.
  40. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  41. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. ArXiv, abs/2207.02696, 2022.
  42. You only learn one representation: Unified network for multiple tasks. ArXiv, abs/2105.04206, 2021.
  43. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. ArXiv, abs/2203.03605, 2022.
  44. What are expected queries in end-to-end object detection? ArXiv, abs/2206.01232, 2022.
  45. Deformable detr: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations, 2021.
Authors (1)
  1. Haodong Ouyang (4 papers)
Citations (7)