DETRs Beat YOLOs on Real-time Object Detection (2304.08069v3)

Published 17 Apr 2023 in cs.CV

Abstract: The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However, we observe that the speed and accuracy of YOLOs are negatively affected by the NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector to our best knowledge that addresses the above dilemma. We build RT-DETR in two steps, drawing on the advanced DETR: first we focus on maintaining accuracy while improving speed, followed by maintaining speed while improving accuracy. Specifically, we design an efficient hybrid encoder to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed. Then, we propose the uncertainty-minimal query selection to provide high-quality initial queries to the decoder, thereby improving accuracy. In addition, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to adapt to various scenarios without retraining. Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy. We also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and M models). Furthermore, RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy and about 21 times in FPS. After pre-training with Objects365, RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP. The project page: https://zhao-yian.github.io/RTDETR.

DETRs Beat YOLOs on Real-time Object Detection

The paper "DETRs Beat YOLOs on Real-time Object Detection," authored by Wenyu Lv et al. from Baidu Inc., presents a detailed comparison between Detection Transformers (DETRs) and the You Only Look Once (YOLO) family of models regarding their performance in real-time object detection tasks.

Abstract and Introduction

The authors begin by outlining advancements in object detection, highlighting the impact of YOLO models in real-time applications thanks to their favorable speed-accuracy trade-off. They observe, however, that both the speed and accuracy of YOLOs are degraded by the non-maximum suppression (NMS) post-processing step, and they challenge the dominance of YOLOs by showing that a carefully designed DETR, which needs no NMS, can achieve superior performance under real-time constraints.
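
To make the NMS point concrete, the following minimal sketch contrasts the two post-processing regimes using torchvision's NMS operator; the boxes and scores are invented for illustration and do not come from the paper.

    import torch
    from torchvision.ops import nms

    # Hypothetical raw detections from a dense (YOLO-style) detector: two of
    # the three boxes cover the same object, so NMS must prune the duplicate.
    boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0],
                          [1.0, 1.0, 11.0, 11.0],    # heavy overlap with box 0
                          [50.0, 50.0, 60.0, 60.0]])
    scores = torch.tensor([0.90, 0.85, 0.80])

    keep = nms(boxes, scores, iou_threshold=0.5)
    print(keep)  # tensor([0, 2]) -- the duplicate box is suppressed

    # An end-to-end detector such as DETR instead predicts a fixed set of
    # queries trained with one-to-one matching, so duplicates are rare by
    # construction: the top-scoring predictions are used directly, removing
    # both the NMS latency and the IoU-threshold hyperparameter.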

Related Work

The related work section provides an extensive review of both YOLO-based models and the relatively newer DETR models. The YOLO models, renowned for their lightweight architecture and rapid inference capabilities, have been widely adopted in various real-time applications. Conversely, DETR models, benefiting from the transformer architecture, have shown promise in achieving higher accuracy and robustness in object detection tasks but at a higher computational cost. This section sets the stage for the authors' argument by highlighting the strengths and weaknesses of each approach.

Speed Considerations

One of the central sections of the paper examines the speed-accuracy trade-off between DETRs and YOLOs. The improvements come not from generic optimizations but from two specific designs: an efficient hybrid encoder that expedites multi-scale feature processing by decoupling intra-scale interaction from cross-scale fusion, which improves speed, and uncertainty-minimal query selection, which provides high-quality initial queries to the decoder and improves accuracy. Their analysis of inference times shows that with these designs, RT-DETR achieves competitive, and in several cases superior, real-time performance compared to YOLO models.
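
The decoupling idea can be illustrated with a minimal PyTorch-style sketch; the module below is a simplified stand-in for the paper's hybrid encoder, with illustrative names and shapes rather than the authors' actual code. Self-attention runs only within the smallest, highest-level feature map, while cross-scale fusion falls back to cheap convolutions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoupledEncoderSketch(nn.Module):
        # Illustrative only: intra-scale interaction = self-attention on the
        # top-level map; cross-scale fusion = upsample + concat + 1x1 conv.
        def __init__(self, dim=256, num_heads=8):
            super().__init__()
            self.intra = nn.TransformerEncoderLayer(
                d_model=dim, nhead=num_heads, dim_feedforward=1024,
                batch_first=True)
            self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

        def forward(self, c4, c5):
            # c4: (B, C, H, W) mid-level map; c5: (B, C, H/2, W/2) top level.
            b, c, h, w = c5.shape
            tokens = c5.flatten(2).transpose(1, 2)   # (B, H*W, C)
            tokens = self.intra(tokens)              # attention on one scale
            f5 = tokens.transpose(1, 2).reshape(b, c, h, w)
            up = F.interpolate(f5, scale_factor=2.0, mode="nearest")
            return self.fuse(torch.cat([c4, up], dim=1))  # cross-scale fusion

    enc = DecoupledEncoderSketch()
    out = enc(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 20, 20))

Because attention cost grows quadratically with the number of tokens, restricting it to the single smallest map rather than all scales is what keeps the encoder cheap.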

Methodology

The methodology section describes the experimental setup used to evaluate the models: accuracy is reported as AP on COCO, speed as frames per second (FPS) on a T4 GPU, and Objects365 is used for optional pre-training. The authors detail the specific configurations of the compared DETR and YOLO models, along with the hyperparameters and training protocols, to ensure a fair comparison. This rigorous approach makes the reported results robust and reproducible.
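
For context, COCO-style AP evaluation of detector outputs is conventionally run with pycocotools; the file names below are placeholders, not artifacts released with the paper.

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    # Ground truth and detections in standard COCO JSON format (placeholders).
    coco_gt = COCO("annotations/instances_val2017.json")
    coco_dt = coco_gt.loadRes("detections.json")  # [{image_id, category_id, bbox, score}, ...]

    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
    ev.evaluate()
    ev.accumulate()
    ev.summarize()  # AP averaged over IoU 0.50:0.95, plus AP50, AP75, etc.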

Experimental Results

The experimental results form the core contribution of this paper. On COCO, RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP at 108 / 74 FPS on a T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy, and the scaled RT-DETRs likewise outperform the lighter YOLO S and M models. RT-DETR-R50 also surpasses DINO-R50 by 2.2% AP while running about 21 times faster, and pre-training on Objects365 lifts RT-DETR-R50 / R101 to 55.3% / 56.2% AP. These numbers show that an appropriately designed DETR not only matches but in several cases outperforms YOLO models while maintaining real-time inference speeds.

Conclusions and Implications

In the conclusion section, the authors summarize their findings, asserting that RT-DETR presents a viable end-to-end alternative to YOLOs for real-time object detection tasks. They discuss the practical implications of their research, noting that RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to match different latency budgets without retraining, and suggesting that industries reliant on real-time object detection might transition to DETR-based models to leverage their enhanced accuracy and robustness. The paper also points to potential future work, such as further optimization techniques for DETR models and exploring their applicability in other real-time computer vision tasks.
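
The decoder-depth knob deserves a concrete illustration. The toy sketch below, which assumes a standard PyTorch decoder and is not the authors' code, shows why truncation works without retraining: every layer refines the same query set and feeds the same prediction head, so inference can stop after any prefix of layers.

    import copy
    import torch
    import torch.nn as nn

    class TruncatableDecoder(nn.Module):
        # Toy model of the speed/accuracy knob: stop after `num_active`
        # decoder layers at inference time, with no retraining required.
        def __init__(self, dim=256, num_layers=6, num_classes=80):
            super().__init__()
            layer = nn.TransformerDecoderLayer(
                d_model=dim, nhead=8, batch_first=True)
            self.layers = nn.ModuleList(
                [copy.deepcopy(layer) for _ in range(num_layers)])
            self.head = nn.Linear(dim, num_classes)  # shared across depths

        def forward(self, queries, memory, num_active=None):
            n = num_active if num_active is not None else len(self.layers)
            for layer in self.layers[:n]:
                queries = layer(queries, memory)  # iterative refinement
            return self.head(queries)

    dec = TruncatableDecoder()
    q = torch.randn(1, 100, 256)      # 100 object queries
    mem = torch.randn(1, 400, 256)    # flattened encoder features
    fast = dec(q, mem, num_active=3)  # fewer layers -> lower latency
    full = dec(q, mem)                # full stack -> best accuracy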

Theoretical and Practical Implications

From a theoretical perspective, the paper's findings challenge the prevailing notion that transformer-based models are unsuitable for real-time applications due to their computational complexity. It opens avenues for further research into optimizing transformers for speed without compromising their accuracy benefits. Practically, this research could influence the design of next-generation real-time detection systems, potentially leading to more accurate and reliable applications in fields such as autonomous driving, surveillance, and robotics.

Future Developments

Future developments following this research might include deeper investigations into more efficient transformer architectures, the integration of hardware accelerators to further reduce inference times, and broader evaluations across different real-time scenarios to validate the generalizability of the findings.

In conclusion, Wenyu Lv et al.'s paper makes a compelling case for the adoption of DETR models in real-time object detection, provided that appropriate optimizations are implemented. This work is a significant step towards bridging the gap between the high accuracy of transformer models and the speed requirements of real-time applications, offering promising directions for both research and practical implementations in AI-driven object detection.

Authors (8)
  1. Wenyu Lv
  2. Yian Zhao
  3. Shangliang Xu
  4. Jinman Wei
  5. Guanzhong Wang
  6. Qingqing Dang
  7. Yi Liu
  8. Jie Chen