YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information (2402.13616v2)

Published 21 Feb 2024 in cs.CV

Abstract: Today's deep learning methods focus on designing the most appropriate objective functions so that a model's predictions can be as close as possible to the ground truth. Meanwhile, an appropriate architecture that facilitates the acquisition of sufficient information for prediction must be designed. Existing methods ignore the fact that when input data undergo layer-by-layer feature extraction and spatial transformation, a large amount of information is lost. This paper delves into the important issues that arise when data are transmitted through deep networks, namely the information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required for deep networks to achieve multiple objectives. PGI can provide complete input information for the target task when calculating the objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture, the Generalized Efficient Layer Aggregation Network (GELAN), is designed based on gradient path planning. GELAN's architecture confirms that PGI achieves superior results on lightweight models. We verified the proposed GELAN and PGI on MS COCO object detection. The results show that GELAN, using only conventional convolution operators, achieves better parameter utilization than state-of-the-art methods based on depth-wise convolution. PGI can be used for a variety of models, from lightweight to large, to obtain complete information, allowing train-from-scratch models to achieve better results than state-of-the-art models pre-trained on large datasets; the comparison results are shown in Figure 1. The source code is available at: https://github.com/WongKinYiu/yolov9.


Summary

  • The paper introduces Programmable Gradient Information (PGI) to mitigate information loss in deep networks.
  • It presents GELAN, a lightweight, adaptable architecture that enhances detection accuracy and efficiency.
  • Experiments on MS COCO demonstrate YOLOv9’s superior performance using fewer parameters than previous models.

YOLOv9: Enhancing Object Detection with Programmable Gradient Information and Generalized Efficient Layer Aggregation Network

Introduction

The relentless pursuit of optimizing deep learning systems for object detection has led to an array of innovations; however, models still grapple with information loss as data propagate through the layers of a deep network. This paper introduces YOLOv9, which leverages Programmable Gradient Information (PGI) and a novel network architecture named the Generalized Efficient Layer Aggregation Network (GELAN). Together, these innovations address the information bottleneck and exploit reversible functions, aiming to retain as much input information as possible for accurate prediction. The two concepts are stated formally below.
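
For readers unfamiliar with these two concepts, the following sketch states them in standard information-theoretic notation. The symbols f_θ, g_φ, r_ψ, and v_ζ denote generic layer and inverse functions chosen here for illustration, not necessarily the paper's exact notation:

```latex
% Information bottleneck (data-processing inequality): as the input X
% passes through successive layers f_theta and g_phi, the mutual
% information retained about X can only shrink:
I(X; X) \;\ge\; I\bigl(X; f_{\theta}(X)\bigr) \;\ge\; I\bigl(X; g_{\phi}(f_{\theta}(X))\bigr)

% A reversible function r_psi, with inverse v_zeta, loses nothing:
X = v_{\zeta}\bigl(r_{\psi}(X)\bigr)
\quad\Longrightarrow\quad
I(X; X) = I\bigl(X; r_{\psi}(X)\bigr)
```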

Programmable Gradient Information (PGI)

The authors propose PGI to counteract the gradual loss of information that accumulates in traditional deep-network training. PGI generates reliable gradients via an auxiliary reversible branch, ensuring that deep features retain the characteristics critical to the target task. These gradients guide the network toward retaining relevant features rather than erroneous or irrelevant ones, providing a sturdier foundation for weight updates and, ultimately, more accurate predictions. A toy sketch of this training setup follows.
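
As a concrete illustration, below is a minimal, hypothetical PyTorch sketch of PGI-style training: an auxiliary branch re-injects the (resized) input image alongside intermediate features, contributes an extra loss term during training, and is discarded at inference. All module names, shapes, and the 0.25 auxiliary-loss weight are assumptions made for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Stand-in backbone: two stride-2 conv stages (1/4 resolution output)."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, ch, 3, 2, 1), nn.SiLU())
        self.stage = nn.Sequential(nn.Conv2d(ch, ch, 3, 2, 1), nn.SiLU())

    def forward(self, x):
        return self.stage(self.stem(x))

class AuxBranch(nn.Module):
    """Train-time-only branch: re-injects the raw input alongside the
    intermediate features, so the auxiliary loss is computed from more
    complete information and feeds 'reliable' gradients to the backbone."""
    def __init__(self, ch: int, num_classes: int):
        super().__init__()
        self.fuse = nn.Conv2d(ch + 3, ch, 1)
        self.cls = nn.Conv2d(ch, num_classes, 1)

    def forward(self, f, x):
        x_small = F.interpolate(x, size=f.shape[-2:], mode="bilinear",
                                align_corners=False)
        return self.cls(F.silu(self.fuse(torch.cat([f, x_small], dim=1))))

backbone = TinyBackbone()
head = nn.Conv2d(32, 80, 1)          # main prediction head
aux = AuxBranch(32, 80)              # auxiliary head, dropped at inference
opt = torch.optim.SGD([*backbone.parameters(), *head.parameters(),
                       *aux.parameters()], lr=1e-2)

x = torch.randn(2, 3, 64, 64)                # dummy images
target = torch.randint(0, 80, (2, 16, 16))   # dummy per-cell class labels

feats = backbone(x)
loss = F.cross_entropy(head(feats), target) \
     + 0.25 * F.cross_entropy(aux(feats, x), target)  # assumed weighting
loss.backward()   # auxiliary gradients also update the shared backbone
opt.step()
# At inference, predictions come from head(backbone(x)) alone.
```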

GELAN: A New Lightweight Architecture

Alongside PGI, the paper introduces GELAN, a lightweight network architecture inspired by ELAN but extended to support arbitrary computational blocks. This versatility lets GELAN adapt to different computational budgets and devices without compromising performance. Results confirm that GELAN, combined with PGI, outperforms existing lightweight models in parameter efficiency and accuracy across a range of conditions. A structural sketch follows.
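
The following is a minimal, hypothetical sketch of a GELAN-style block in PyTorch, combining a CSPNet-style channel split with ELAN-style aggregation of every intermediate output. The specific block type, channel split, and layer counts are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Any computational block can be slotted in here (plain conv,
    residual block, CSP block, ...); GELAN's point is that the
    choice of block is interchangeable."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, 1, 1),
                                  nn.BatchNorm2d(ch), nn.SiLU())

    def forward(self, x):
        return self.conv(x)

class GELANBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, n_blocks: int = 2,
                 block=ConvBlock):
        super().__init__()
        mid = in_ch // 2
        self.split = nn.Conv2d(in_ch, 2 * mid, 1)   # 1x1 conv, then channel split
        self.blocks = nn.ModuleList(block(mid) for _ in range(n_blocks))
        # Transition fuses the bypass path plus every intermediate output.
        self.transition = nn.Conv2d((2 + n_blocks) * mid, out_ch, 1)

    def forward(self, x):
        a, b = self.split(x).chunk(2, dim=1)        # bypass path a, working path b
        outs = [a, b]
        for blk in self.blocks:
            b = blk(b)
            outs.append(b)                          # aggregate every stage (ELAN-style)
        return self.transition(torch.cat(outs, dim=1))

y = GELANBlock(64, 128)(torch.randn(1, 64, 32, 32))  # -> [1, 128, 32, 32]
```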

Validation on MS COCO Dataset

Extensive experiments on the MS COCO dataset underline the effectiveness of YOLOv9, notably in comparison against other high-performing object detectors such as YOLOv8 and YOLO-MS. YOLOv9 delivers higher accuracy while using fewer parameters and less computation, validating the proposed approach's potential to set new benchmarks for real-time object detection.

Implications and Future Directions

The introduction of PGI addresses a critical issue that has hindered the full exploitation of deep neural networks in object detection—information loss. By ensuring the retention of essential information throughout the model's layers, YOLOv9 presents a promising pathway for developing efficient and accurate object detection systems. Looking ahead, the explorations into reversible functions and their integration into deep learning architectures could unlock further improvements in model performance and efficiency.

Moreover, the flexibility and efficiency of GELAN mark a significant step towards adaptable and scalable architectural designs. Such designs could cater to a broader range of applications and computational settings, from mobile devices with limited processing capabilities to high-performance computing systems.

This work not only contributes to the ongoing evolution of object detection systems but also lays a foundation for future research in optimizing neural network architectures and training processes. The open-source availability of YOLOv9's implementation ensures that the wider research community can build upon these findings, fostering further innovation and development in the field.

In sum, YOLOv9 represents a notable advancement in object detection, combining innovative strategies to overcome longstanding challenges in the field. Its adaptability, efficiency, and superior performance underscore the potential for continued progress in designing more capable and resource-efficient models for real-time object detection and beyond.
