YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information (2402.13616v2)

Published 21 Feb 2024 in cs.CV

Abstract: Today's deep learning methods focus on designing the most appropriate objective functions so that a model's predictions can be as close as possible to the ground truth. Meanwhile, an appropriate architecture that facilitates the acquisition of sufficient information for prediction must be designed. Existing methods ignore the fact that when input data undergo layer-by-layer feature extraction and spatial transformation, a large amount of information is lost. This paper delves into the important issues that arise when data are transmitted through deep networks, namely the information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required for deep networks to achieve multiple objectives. PGI can provide complete input information for the target task when calculating the objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture, the Generalized Efficient Layer Aggregation Network (GELAN), is designed based on gradient path planning. GELAN's architecture confirms that PGI achieves superior results on lightweight models. We verified the proposed GELAN and PGI on MS COCO object detection. The results show that GELAN, using only conventional convolution operators, achieves better parameter utilization than state-of-the-art methods based on depth-wise convolution. PGI can be used for a variety of models, from lightweight to large, to obtain complete information, allowing train-from-scratch models to achieve better results than state-of-the-art models pre-trained on large datasets; the comparison results are shown in Figure 1. The source code is available at: https://github.com/WongKinYiu/yolov9.


Summary

  • The paper introduces Programmable Gradient Information (PGI) to mitigate information loss in deep networks.
  • It presents GELAN, a lightweight, adaptable architecture that enhances detection accuracy and efficiency.
  • Experiments on MS COCO demonstrate YOLOv9’s superior performance using fewer parameters than previous models.

YOLOv9: Enhancing Object Detection with Programmable Gradient Information and Generalized Efficient Layer Aggregation Network

Introduction

The relentless pursuit of optimizing deep learning systems for object detection has led to an array of innovations; however, models still grapple with information loss as data propagate through the layers of a deep network. This paper introduces YOLOv9, which leverages Programmable Gradient Information (PGI) and a novel network architecture named the Generalized Efficient Layer Aggregation Network (GELAN). Together, these innovations address the information bottleneck and exploit reversible functions, aiming to retain as much input information as possible for accurate prediction. The two concepts are stated formally below.
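
For readers unfamiliar with these two concepts, the following sketch states them in standard information-theoretic notation. The symbols f_θ, g_φ, r_ψ, and v_ζ denote generic layer and inverse functions chosen here for illustration, not necessarily the paper's exact notation:

```latex
% Information bottleneck (data-processing inequality): as the input X
% passes through successive layers f_theta and g_phi, the mutual
% information retained about X can only shrink:
I(X; X) \;\ge\; I\bigl(X; f_{\theta}(X)\bigr) \;\ge\; I\bigl(X; g_{\phi}(f_{\theta}(X))\bigr)

% A reversible function r_psi, with inverse v_zeta, loses nothing:
X = v_{\zeta}\bigl(r_{\psi}(X)\bigr)
\quad\Longrightarrow\quad
I(X; X) = I\bigl(X; r_{\psi}(X)\bigr)
```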

Programmable Gradient Information (PGI)

The authors propose PGI to counteract the gradual loss of information that accumulates in traditional deep-network training. PGI generates reliable gradients via an auxiliary reversible branch, ensuring that deep features retain the characteristics critical to the target task. These gradients guide the network toward retaining relevant features rather than erroneous or irrelevant ones, providing a sturdier foundation for weight updates and, ultimately, more accurate predictions. A toy sketch of this training setup follows.
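
As a concrete illustration, below is a minimal, hypothetical PyTorch sketch of PGI-style training: an auxiliary branch re-injects the (resized) input image alongside intermediate features, contributes an extra loss term during training, and is discarded at inference. All module names, shapes, and the 0.25 auxiliary-loss weight are assumptions made for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Stand-in backbone: two stride-2 conv stages (1/4 resolution output)."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, ch, 3, 2, 1), nn.SiLU())
        self.stage = nn.Sequential(nn.Conv2d(ch, ch, 3, 2, 1), nn.SiLU())

    def forward(self, x):
        return self.stage(self.stem(x))

class AuxBranch(nn.Module):
    """Train-time-only branch: re-injects the raw input alongside the
    intermediate features, so the auxiliary loss is computed from more
    complete information and feeds 'reliable' gradients to the backbone."""
    def __init__(self, ch: int, num_classes: int):
        super().__init__()
        self.fuse = nn.Conv2d(ch + 3, ch, 1)
        self.cls = nn.Conv2d(ch, num_classes, 1)

    def forward(self, f, x):
        x_small = F.interpolate(x, size=f.shape[-2:], mode="bilinear",
                                align_corners=False)
        return self.cls(F.silu(self.fuse(torch.cat([f, x_small], dim=1))))

backbone = TinyBackbone()
head = nn.Conv2d(32, 80, 1)          # main prediction head
aux = AuxBranch(32, 80)              # auxiliary head, dropped at inference
opt = torch.optim.SGD([*backbone.parameters(), *head.parameters(),
                       *aux.parameters()], lr=1e-2)

x = torch.randn(2, 3, 64, 64)                # dummy images
target = torch.randint(0, 80, (2, 16, 16))   # dummy per-cell class labels

feats = backbone(x)
loss = F.cross_entropy(head(feats), target) \
     + 0.25 * F.cross_entropy(aux(feats, x), target)  # assumed weighting
loss.backward()   # auxiliary gradients also update the shared backbone
opt.step()
# At inference, predictions come from head(backbone(x)) alone.
```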

GELAN: A New Lightweight Architecture

Alongside PGI, the paper introduces GELAN, a lightweight network architecture inspired by ELAN but extended to support arbitrary computational blocks. This versatility lets GELAN adapt to different computational budgets and devices without compromising performance. Results confirm that GELAN, combined with PGI, outperforms existing lightweight models in parameter efficiency and accuracy across a range of conditions. A structural sketch follows.
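
The following is a minimal, hypothetical sketch of a GELAN-style block in PyTorch, combining a CSPNet-style channel split with ELAN-style aggregation of every intermediate output. The specific block type, channel split, and layer counts are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Any computational block can be slotted in here (plain conv,
    residual block, CSP block, ...); GELAN's point is that the
    choice of block is interchangeable."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, 1, 1),
                                  nn.BatchNorm2d(ch), nn.SiLU())

    def forward(self, x):
        return self.conv(x)

class GELANBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, n_blocks: int = 2,
                 block=ConvBlock):
        super().__init__()
        mid = in_ch // 2
        self.split = nn.Conv2d(in_ch, 2 * mid, 1)   # 1x1 conv, then channel split
        self.blocks = nn.ModuleList(block(mid) for _ in range(n_blocks))
        # Transition fuses the bypass path plus every intermediate output.
        self.transition = nn.Conv2d((2 + n_blocks) * mid, out_ch, 1)

    def forward(self, x):
        a, b = self.split(x).chunk(2, dim=1)        # bypass path a, working path b
        outs = [a, b]
        for blk in self.blocks:
            b = blk(b)
            outs.append(b)                          # aggregate every stage (ELAN-style)
        return self.transition(torch.cat(outs, dim=1))

y = GELANBlock(64, 128)(torch.randn(1, 64, 32, 32))  # -> [1, 128, 32, 32]
```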

Validation on MS COCO Dataset

Extensive experiments on the MS COCO dataset underline the effectiveness of YOLOv9, notably in comparison against other high-performing object detectors such as YOLOv8 and YOLO-MS. YOLOv9 delivers higher accuracy while using fewer parameters and less computation, validating the proposed approach's potential to set new benchmarks for real-time object detection.

Implications and Future Directions

The introduction of PGI addresses a critical issue that has hindered the full exploitation of deep neural networks in object detection—information loss. By ensuring the retention of essential information throughout the model's layers, YOLOv9 presents a promising pathway for developing efficient and accurate object detection systems. Looking ahead, the explorations into reversible functions and their integration into deep learning architectures could unlock further improvements in model performance and efficiency.

Moreover, the flexibility and efficiency of GELAN mark a significant step towards adaptable and scalable architectural designs. Such designs could cater to a broader range of applications and computational settings, from mobile devices with limited processing capabilities to high-performance computing systems.

This work not only contributes to the ongoing evolution of object detection systems but also lays a foundation for future research in optimizing neural network architectures and training processes. The open-source availability of YOLOv9's implementation ensures that the wider research community can build upon these findings, fostering further innovation and development in the field.

In sum, YOLOv9 represents a notable advancement in object detection, combining innovative strategies to overcome longstanding challenges in the field. Its adaptability, efficiency, and superior performance underscore the potential for continued progress in designing more capable and resource-efficient models for real-time object detection and beyond.
