Overview of Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism
The paper under discussion presents Gold-YOLO, an advancement in the field of object detection models, with a particular focus on the YOLO series. The research addresses inherent limitations in previous models related to information fusion: in architectures like FPN and PANet, non-adjacent pyramid levels exchange information only indirectly through intermediate layers, which causes information loss. Gold-YOLO introduces a novel Gather-and-Distribute (GD) mechanism using convolution and self-attention operations to enhance multi-scale feature fusion. This design achieves a strong latency-accuracy trade-off across diverse model scales and incorporates MAE-style pretraining in the YOLO series for the first time.
Key Contributions
- Gather-and-Distribute Mechanism (GD): Gold-YOLO's central innovation is the GD mechanism, which enhances information sharing across different layers. This mechanism involves feature alignment, information fusion, and information injection, significantly improving feature fusion without increasing latency. The model integrates features from multiple levels to effectively detect objects of varying sizes.
- MAE-Style Pretraining: For the first time, the YOLO series benefits from unsupervised pretraining techniques, yielding faster convergence and better accuracy.
- Performance Metrics: Gold-YOLO-N achieves 39.9% AP on COCO val2017 and operates at 1030 FPS on a T4 GPU, showcasing superior performance compared to state-of-the-art models like YOLOv6-3.0-N.
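The gather-fuse-inject flow described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`align`, `gather_and_distribute`), the nearest-neighbor resizing, and the random channel-mixing matrix standing in for a learned 1x1 convolution are all illustrative assumptions.

```python
import numpy as np

def align(feat, target_hw):
    # Nearest-neighbor resize (up- or downsample) of a CxHxW feature map.
    c, h, w = feat.shape
    th, tw = target_hw
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    return feat[:, rows][:, :, cols]

def gather_and_distribute(feats, target_hw=(8, 8)):
    # Gather: align every pyramid level to a common size, then stack channels.
    aligned = [align(f, target_hw) for f in feats]
    gathered = np.concatenate(aligned, axis=0)      # (sum of C, H, W)
    # Fuse: a random channel-mixing matrix stands in for a learned 1x1 conv.
    rng = np.random.default_rng(0)
    mix = rng.standard_normal((gathered.shape[0], gathered.shape[0])) * 0.01
    fused = np.einsum('oc,chw->ohw', mix, gathered)
    # Distribute/inject: resize the fused global feature back to each level.
    return [align(fused, f.shape[1:]) for f in feats]

# Three pyramid levels at different resolutions (16 channels each).
feats = [np.ones((16, 32, 32)), np.ones((16, 16, 16)), np.ones((16, 8, 8))]
outs = gather_and_distribute(feats)
# Each output keeps its level's spatial size but carries the fused 48 channels.
print([o.shape for o in outs])  # [(48, 32, 32), (48, 16, 16), (48, 8, 8)]
```

The key property is that every level receives information from every other level in one step, rather than relaying it through adjacent layers as in FPN-style necks.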
Detailed Analysis
The GD mechanism is examined through its subcomponents—low-stage and high-stage gather-and-distribute branches. The low-stage branch fuses high-resolution features to preserve the detail needed for small-object recognition, while the high-stage branch fuses lower-resolution, semantically richer features suited to larger objects. The mechanism incorporates lightweight adjacent-layer fusion modules, leveraging convolutions in the low-stage branch and attention operations in the high-stage branch to enable efficient global information exchange.
Experimental Results
Gold-YOLO demonstrates substantial improvements over previous YOLO versions, especially in terms of accuracy without sacrificing speed. Compared to existing models, Gold-YOLO delivers a notable performance boost:
- YOLOv8-N and YOLOv6-3.0-N: Gold-YOLO-N improves AP by 2.6% and 2.4%, respectively, at similar throughput.
- YOLOX-S and PPYOLOE-S: Gold-YOLO-S outperforms these models by 5.9% and 3.1% AP, respectively, while running at faster speeds.
Extensive ablation studies confirm the efficacy of the GD mechanism and its subcomponents, demonstrating enhanced AP and efficient feature integration capabilities.
Implications and Future Directions
The introduction of the GD mechanism offers promising advancements in object detection, emphasizing the critical role of information fusion. By integrating attention-based operations, Gold-YOLO potentially opens avenues for improved multi-task learning and transferability across various vision tasks. Additionally, the adoption of MAE-style pretraining sets a precedent for further exploration of unsupervised pretraining methods within CNN-dominated frameworks.
For future work, extending the GD mechanism to other architectures and tasks could unlock further gains in speed and accuracy, especially in edge computing environments where real-time processing is crucial. The adaptability of this framework across model sizes remains an area for continued research and innovation.