
YOLOv10: Real-Time End-to-End Object Detection (2405.14458v2)

Published 23 May 2024 in cs.CV

Abstract: Over the past years, YOLOs have emerged as the predominant paradigm in the field of real-time object detection owing to their effective balance between computational cost and detection performance. Researchers have explored the architectural designs, optimization objectives, data augmentation strategies, and others for YOLOs, achieving notable progress. However, the reliance on the non-maximum suppression (NMS) for post-processing hampers the end-to-end deployment of YOLOs and adversely impacts the inference latency. Besides, the design of various components in YOLOs lacks the comprehensive and thorough inspection, resulting in noticeable computational redundancy and limiting the model's capability. It renders the suboptimal efficiency, along with considerable potential for performance improvements. In this work, we aim to further advance the performance-efficiency boundary of YOLOs from both the post-processing and model architecture. To this end, we first present the consistent dual assignments for NMS-free training of YOLOs, which brings competitive performance and low inference latency simultaneously. Moreover, we introduce the holistic efficiency-accuracy driven model design strategy for YOLOs. We comprehensively optimize various components of YOLOs from both efficiency and accuracy perspectives, which greatly reduces the computational overhead and enhances the capability. The outcome of our effort is a new generation of YOLO series for real-time end-to-end object detection, dubbed YOLOv10. Extensive experiments show that YOLOv10 achieves state-of-the-art performance and efficiency across various model scales. For example, our YOLOv10-S is 1.8$\times$ faster than RT-DETR-R18 under the similar AP on COCO, meanwhile enjoying 2.8$\times$ smaller number of parameters and FLOPs. Compared with YOLOv9-C, YOLOv10-B has 46\% less latency and 25\% fewer parameters for the same performance.

YOLOv10: Real-Time End-to-End Object Detection

The research paper titled "YOLOv10: Real-Time End-to-End Object Detection" presents a significant advancement in the domain of real-time object detection. Building upon the strengths and addressing the inefficiencies of prior YOLO versions, the authors introduce a novel approach that aims to enhance both the performance and efficiency of these models.

Key Contributions

  1. NMS-Free Training with Consistent Dual Assignments: YOLO models have traditionally relied on non-maximum suppression (NMS) for post-processing, which adds to inference latency and blocks end-to-end deployment. To address this, the paper introduces a consistent dual assignments strategy that enables NMS-free training: a one-to-many (o2m) branch provides rich supervisory signals during training, while a one-to-one (o2o) branch is used for efficient inference. A consistent matching metric aligns the supervision of the o2o branch with that of the o2m branch, reducing the performance gap typically observed between the two.
  2. Holistic Efficiency-Accuracy Driven Model Design: The paper proposes several architectural innovations to reduce computational redundancy and enhance performance:
    • Lightweight Classification Head: By reducing the overhead of the classification head, the new design helps balance the computational load more effectively.
    • Spatial-Channel Decoupled Downsampling: This technique decouples spatial reduction and channel modulation, significantly reducing computational cost.
    • Rank-Guided Block Design: This involves analyzing the intrinsic rank of different stages in the model to adaptively implement compact block designs in redundant stages, allowing for more efficient parameter utilization.
  3. Integrating Large-Kernel Convolution and Partial Self-Attention: To boost accuracy at low cost, the model employs large-kernel convolutions in selected deep stages and a Partial Self-Attention (PSA) module. PSA applies self-attention to only a portion of the features, curbing the computational cost of full self-attention while still enhancing global feature representation.
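The core of the dual-assignment idea can be sketched in a few lines. Below is a minimal, hedged illustration of a TAL-style matching metric m = s · p^α · IoU^β and of why ranking anchors with the same metric in both branches makes the one-to-one positive a subset of the one-to-many positives; the specific α/β values and toy numbers here are illustrative assumptions, not the exact YOLOv10 hyperparameters.

```python
# Sketch of a consistent matching metric for dual label assignment:
#   m(alpha, beta) = s * p**alpha * IoU**beta
# where s is a spatial prior (1 if the anchor lies inside the instance),
# p is the classification score, and IoU measures box overlap.
# alpha/beta defaults below follow common task-aligned-assignment
# settings and are an assumption, not the paper's exact values.

def matching_score(s, p, iou, alpha=0.5, beta=6.0):
    return s * (p ** alpha) * (iou ** beta)

# Toy predictions for one ground-truth object: (s, p, iou) per anchor.
preds = [(1, 0.9, 0.8), (1, 0.7, 0.9), (1, 0.4, 0.6), (0, 0.95, 0.9)]
scores = [matching_score(s, p, iou) for s, p, iou in preds]

# One-to-many assignment: the top-k anchors receive positive supervision.
k = 2
o2m_positives = sorted(range(len(scores)),
                       key=lambda i: scores[i], reverse=True)[:k]

# One-to-one assignment: only the single best anchor is positive.
o2o_positive = max(range(len(scores)), key=lambda i: scores[i])

# Because both branches rank anchors with the same metric, the o2o
# positive always falls inside the o2m positive set, so the two heads
# receive harmonized supervision.
assert o2o_positive in o2m_positives
```

Using a shared ranking for both branches is what the paper calls "consistent": the one-to-one head is supervised toward the same best anchor the one-to-many head emphasizes most.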
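To make the savings from spatial-channel decoupled downsampling concrete, here is a hedged back-of-the-envelope parameter count: a single 3×3 stride-2 convolution doing both channel doubling and spatial reduction, versus a 1×1 pointwise convolution for channel modulation followed by a 3×3 depthwise stride-2 convolution for spatial reduction. The 3×3 kernel size and the C → 2C channel-doubling convention are common YOLO defaults assumed here, not verified YOLOv10 hyperparameters.

```python
# Parameter-count comparison: coupled vs. decoupled downsampling.

def conv_params(k, c_in, c_out, groups=1):
    """Parameter count of a k x k convolution (bias terms ignored)."""
    return k * k * (c_in // groups) * c_out

def coupled_downsample_params(c):
    # One 3x3 stride-2 conv doing both jobs: C -> 2C and H/2 x W/2.
    return conv_params(3, c, 2 * c)            # 18 * C^2

def decoupled_downsample_params(c):
    # 1x1 pointwise conv changes channels; 3x3 depthwise stride-2 conv
    # (groups == channels) reduces spatial resolution.
    pointwise = conv_params(1, c, 2 * c)       # 2 * C^2
    depthwise = conv_params(3, 2 * c, 2 * c, groups=2 * c)  # 18 * C
    return pointwise + depthwise

c = 256
print(coupled_downsample_params(c))    # 1,179,648 params
print(decoupled_downsample_params(c))  # 135,680 params
```

At C = 256 the decoupled form needs roughly 9× fewer parameters (2C² + 18C versus 18C²), which is the kind of redundancy reduction the holistic design strategy targets.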

Experimental Results

The proposed YOLOv10 framework was rigorously evaluated on the COCO dataset and demonstrated state-of-the-art performance. A few notable results include:

  • YOLOv10-S attains 46.3% AP on COCO while running 1.8× faster than RT-DETR-R18 at similar accuracy, with 2.8× fewer parameters and FLOPs.
  • YOLOv10-B achieves 46% lower latency and 25% fewer parameters than YOLOv9-C at the same performance.

These are more than modest enhancements and substantiate the authors' claims about the efficacy of their design choices. The COCO benchmark results reported in the paper demonstrate significant efficiency gains across various model scales.

Implications and Future Directions

The practical and theoretical implications of this research are profound. Practically, the enhanced efficiency and real-time performance of YOLOv10 can significantly benefit applications in autonomous driving, robotics, and real-time video analytics, among other domains. Theoretically, the method of consistent dual assignments and holistic model redesign could inform future work in machine learning model architecture optimization and training efficiency.

Looking forward, future research could delve into pretraining strategies on more extensive datasets or further optimization of the NMS-free approach to close the minor performance gap observed in smaller models. Such developments could cement YOLOv10 and its successors as the premier choice for real-time object detection tasks.

In conclusion, the paper presents a comprehensive and well-validated enhancement to the YOLO series, addressing critical pain points and pushing the boundaries of what real-time object detection models can achieve. The strategies and insights offered by this research will likely influence and inspire subsequent advancements in the field.

Authors (7)
  1. Ao Wang (43 papers)
  2. Hui Chen (298 papers)
  3. Lihao Liu (38 papers)
  4. Kai Chen (512 papers)
  5. Zijia Lin (43 papers)
  6. Jungong Han (111 papers)
  7. Guiguang Ding (79 papers)
Citations (291)