Scaled-YOLOv4: Scaling Cross Stage Partial Network
The paper "Scaled-YOLOv4: Scaling Cross Stage Partial Network," authored by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, presents significant advancements in object detection by building on the YOLOv4 architecture with Cross Stage Partial Network (CSPNet) techniques. The proposed network scaling approach modifies not only depth, width, and input resolution but also the structure of the network itself.
Summary of Contributions
The core contributions of the paper are multi-faceted, aiming to balance model accuracy and speed across various device capabilities:
- Model Scaling Methodology: The paper introduces scaled YOLOv4 variants, YOLOv4-large and YOLOv4-tiny, that adapt the architecture to different model sizes while achieving strong speed/accuracy trade-offs.
- YOLOv4-large Performance: Demonstrating state-of-the-art results, the YOLOv4-large model achieves 55.5% AP (73.4% AP50) on the MS COCO dataset, operating at roughly 16 FPS on a Tesla V100.
- YOLOv4-tiny Performance: Targeting low-end devices, YOLOv4-tiny records 22.0% AP (42.0% AP50) at 443 FPS on an RTX 2080Ti. Using TensorRT with FP16 precision and batch size 4, it reaches an impressive 1774 FPS.
- Robust Scaling for Diverse Hardware: The research investigates scaling pathways that include adjustments in image size, network width, and stage count while maintaining high computational efficiency on a range of devices from embedded systems to high-end GPUs.
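The scaling pathways above can be illustrated with a back-of-the-envelope cost model. This is a minimal sketch, not the paper's formulation: it only encodes the standard observation that convolutional FLOPs grow roughly linearly with depth and quadratically with width (channel count) and input resolution.

```python
def relative_cost(depth=1.0, width=1.0, resolution=1.0):
    """Approximate relative FLOPs of a scaled CNN.

    Conv cost grows linearly with depth and quadratically with
    both the channel width and the input resolution.
    """
    return depth * width**2 * resolution**2

# Doubling width quadruples compute; doubling depth only doubles it;
# a 1.5x resolution increase costs 2.25x.
print(relative_cost(width=2.0))       # 4.0
print(relative_cost(depth=2.0))       # 2.0
print(relative_cost(resolution=1.5))  # 2.25
```

This asymmetry is why a scaling method must coordinate all three factors rather than growing any single one in isolation.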
Detailed Approach
Backbone and Neck Transformations
- CSPNet Integration: CSPDarknet53 forms the backbone of the scaled-YOLOv4. By modifying this architecture with CSP, the authors optimized it to substantially reduce parameters and computational load.
- Path Aggregation Network (PAN): The PAN neck used in YOLOv4 was likewise enhanced with CSP connections, yielding notable computational reductions and streamlining the network without sacrificing accuracy.
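The core CSP idea in both backbone and neck can be sketched as a channel split: only part of the feature map flows through the compute-heavy stage, and the bypassed part is concatenated back afterwards. The sketch below is illustrative (NumPy, channels-first, with a stand-in `transform`), not the actual Darknet implementation.

```python
import numpy as np

def csp_block(x, transform):
    """Cross Stage Partial connection (illustrative sketch).

    Half the channels pass through the compute-heavy transform;
    the other half bypass it and are concatenated back, roughly
    halving the work done inside the stage.
    """
    c = x.shape[0] // 2           # channels-first layout assumed
    part1, part2 = x[:c], x[c:]
    return np.concatenate([part1, transform(part2)], axis=0)

# Toy "stage": an elementwise op standing in for a conv sequence.
x = np.arange(8.0).reshape(8, 1, 1)   # 8 channels, 1x1 feature map
y = csp_block(x, lambda t: t * 2.0)
print(y.shape)  # (8, 1, 1): first 4 channels untouched, last 4 doubled
```

Because the transform only ever sees half the channels, its parameter count and FLOPs shrink accordingly, which is the source of the reductions described above.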
YOLOv4-large Models
For high-end configurations, YOLOv4-large comes in three variants: YOLOv4-P5, YOLOv4-P6, and YOLOv4-P7. Each variant is incrementally scaled up in input resolution, stage count, and width. In particular, YOLOv4-P6 and YOLOv4-P7 are designed to keep real-time performance viable while scaling network capacity to handle higher resolutions.
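The variant family can be summarized in a small configuration sketch. The input resolutions (896/1280/1536) follow the paper; the stride computation is just the standard FPN convention that each pyramid stage halves the spatial resolution, included here for illustration.

```python
# Illustrative configuration of the YOLOv4-large family.
# "top_stage" is the highest pyramid level each variant detects on.
VARIANTS = {
    "YOLOv4-P5": {"input_size": 896,  "top_stage": "P5"},
    "YOLOv4-P6": {"input_size": 1280, "top_stage": "P6"},
    "YOLOv4-P7": {"input_size": 1536, "top_stage": "P7"},
}

def strides(top_stage):
    """Output strides for pyramid levels P3..top (stride doubles per stage)."""
    top = int(top_stage[1])
    return [2 ** i for i in range(3, top + 1)]

print(strides("P7"))  # [8, 16, 32, 64, 128]
```

Adding a stage (P6, then P7) extends detection to coarser strides, which is what lets the larger variants exploit higher input resolutions.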
Experimental Results
Performance evaluations were conducted on the MS COCO 2017 dataset:
- Strong Results on COCO: Scaling up the YOLOv4 architecture demonstrated remarkable improvements in AP while maintaining inference speed. Notably, the YOLOv4-P7 model with test-time augmentation (TTA) peaked at 56.0% AP.
- Embedded Systems: With YOLOv4-tiny optimized for lower-end GPUs, the model ensured real-time performance on embedded devices like NVIDIA's Jetson series, aligning with practical deployment scenarios in IoT and edge computing applications.
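The TTA mentioned above typically pools detections from transformed views of the same image. A minimal, hypothetical sketch of the flip variant: boxes detected on a mirrored image are mapped back into original coordinates and merged with the original detections (in practice NMS would follow).

```python
def flip_boxes(boxes, img_w):
    """Map (x1, y1, x2, y2) boxes found on a horizontally flipped
    image back into original-image coordinates."""
    return [(img_w - x2, y1, img_w - x1, y2) for (x1, y1, x2, y2) in boxes]

def merge_tta(orig_boxes, flipped_boxes, img_w):
    """Pool original-view and (un-flipped) mirrored-view detections."""
    return list(orig_boxes) + flip_boxes(flipped_boxes, img_w)

# A box at x in [10, 30] on a mirrored 100px-wide image maps back
# to x in [70, 90] on the original.
print(flip_boxes([(10, 0, 30, 20)], 100))  # [(70, 0, 90, 20)]
```

The extra forward passes are why TTA trades inference speed for the AP gain reported for YOLOv4-P7.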
Implications and Future Work
The primary theoretical implication lies in the demonstration that FPN-like architectures can serve as naïve once-for-all models, supporting sub-network utilization as effective object detectors at various scales. Practically, the scalable nature of YOLOv4 variants broadens the usability spectrum from resource-constrained environments to high-performance scenarios, emphasizing flexibility and performance.
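The once-for-all observation can be sketched in miniature: a detector trained up to the P7 level can be truncated at a lower pyramid stage and still operate as a standalone detector. The stage names below are illustrative placeholders, not the paper's code.

```python
# "Naive once-for-all" sketch: sub-networks of an FPN-like detector
# are themselves usable detectors at smaller scales.
FULL_MODEL_STAGES = ["P3", "P4", "P5", "P6", "P7"]

def sub_detector(top_stage):
    """Select the sub-network that only computes up to `top_stage`."""
    idx = FULL_MODEL_STAGES.index(top_stage) + 1
    return FULL_MODEL_STAGES[:idx]

print(sub_detector("P5"))  # ['P3', 'P4', 'P5']
```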
Future research avenues may explore further refinements in the trade-off between speed and accuracy, possibly integrating advanced NAS techniques or further optimizing CSPNet configurations to adapt to newer hardware capabilities and applications.
The research successfully illustrates that through systematic scaling and architecture optimization, significant gains in object detection performance can be achieved across a diverse range of computational resources, underlining the adaptability and robustness of the YOLOv4 framework enhanced by CSPNet techniques.