
YOLOv4: Optimal Speed and Accuracy of Object Detection (2004.10934v1)

Published 23 Apr 2020 in cs.CV and eess.IV

Abstract: There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet

YOLOv4: Optimal Speed and Accuracy of Object Detection

The paper "YOLOv4: Optimal Speed and Accuracy of Object Detection" presents a highly optimized framework for object detection that balances speed and accuracy, making it suitable for real-time applications on standard GPUs. The authors, Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao, introduce numerous improvements over previous YOLO (You Only Look Once) models, particularly YOLOv3.

Overview

YOLOv4 integrates several advancements in Convolutional Neural Networks (CNNs) and modern training techniques to deliver an object detector with state-of-the-art performance on the MS COCO dataset. It achieves this without relying on high-end, multi-GPU setups, instead targeting training efficiency on a single conventional GPU. The paper includes an extensive exploration of various techniques, termed "Bag of Freebies" (BoF) and "Bag of Specials" (BoS), to enhance both the training and inference stages.

Contributions and Techniques

  1. Model Architecture: The backbone of YOLOv4 is CSPDarknet53, which improves learning capacity while remaining efficient in both speed and memory. Enhancements such as Spatial Pyramid Pooling (SPP) and a Path Aggregation Network (PAN) are integrated to enlarge the receptive field and improve multi-scale feature aggregation.
  2. Training Optimizations:
    • Bag of Freebies (BoF): These refer to the techniques that improve training without adding computational cost during inference, such as Mosaic data augmentation, Self-Adversarial Training (SAT), and CutMix augmentation.
    • Bag of Specials (BoS): These enhancements add a marginal computational overhead but significantly improve detection accuracy. Examples include Mish activation, Squeeze-and-Excitation (SE) blocks, and Spatial Attention Module (SAM).
  3. Extensive Ablation Studies: The paper systematically investigates the individual and combined effects of many training strategies and architecture modifications. For example, the introduction of Cross mini-Batch Normalization (CmBN) and bounding box regression losses such as CIoU, DIoU, and GIoU are shown to improve the model's robustness and performance.
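The Mosaic augmentation mentioned above stitches four training images into a single canvas around a random split point, exposing the detector to objects outside their usual context. The following is a minimal sketch assuming four equally sized source images; the paper's Darknet implementation additionally jitters scale and merges the corresponding bounding-box labels, which is omitted here.

```python
import numpy as np

def mosaic(images, out_size=416, rng=None):
    """Tile four images into one out_size x out_size canvas around a
    random split point (cx, cy). Didactic sketch: each source image is
    simply cropped from its top-left corner to fill its quadrant."""
    assert len(images) == 4
    rng = np.random.default_rng(rng)
    cx = int(rng.integers(out_size // 4, 3 * out_size // 4))
    cy = int(rng.integers(out_size // 4, 3 * out_size // 4))
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # Row/column slices for top-left, top-right, bottom-left, bottom-right
    quads = [(slice(0, cy), slice(0, cx)),
             (slice(0, cy), slice(cx, out_size)),
             (slice(cy, out_size), slice(0, cx)),
             (slice(cy, out_size), slice(cx, out_size))]
    for img, (rs, cs) in zip(images, quads):
        h, w = rs.stop - rs.start, cs.stop - cs.start
        canvas[rs, cs] = img[:h, :w]  # crop each source to its quadrant
    return canvas
```

Because four images (and their labels) land in every training sample, batch statistics see more objects per image, which is part of why Mosaic pairs well with CmBN in the paper's ablations.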
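The Mish activation used in the backbone is defined as Mish(x) = x * tanh(softplus(x)); it is smooth and non-monotonic, unlike ReLU. A one-line NumPy version:

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)).
    log1p(exp(x)) computes softplus(x) = ln(1 + e^x)."""
    return x * np.tanh(np.log1p(np.exp(x)))
```

For large positive x it approaches the identity, while for negative x it decays smoothly toward zero rather than clipping hard, which the paper credits with better gradient flow than ReLU in deep backbones.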
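The CIoU loss adopted for bounding-box regression augments IoU with two penalties: normalized center distance and aspect-ratio consistency, following CIoU = IoU - rho^2/c^2 - alpha*v (Zheng et al.), with loss = 1 - CIoU. Below is a didactic NumPy sketch for single boxes in (x1, y1, x2, y2) format, not the paper's vectorized Darknet implementation:

```python
import numpy as np

def ciou_loss(box_a, box_b, eps=1e-9):
    """Complete-IoU loss between two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection-over-union
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / (union + eps)

    # Squared distance between box centers (rho^2)
    rho2 = (((ax1 + ax2) - (bx1 + bx2)) ** 2
            + ((ay1 + ay2) - (by1 + by2)) ** 2) / 4.0

    # Squared diagonal of the smallest enclosing box (c^2)
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its trade-off weight alpha
    v = (4 / np.pi ** 2) * (np.arctan((ax2 - ax1) / (ay2 - ay1 + eps))
                            - np.arctan((bx2 - bx1) / (by2 - by1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - (iou - rho2 / c2 - alpha * v)
```

Unlike plain IoU loss, the distance term still provides a gradient when the boxes do not overlap, which is why the ablations favor CIoU over IoU and GIoU for convergence speed.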

Numerical Results

YOLOv4 demonstrates significant improvements in both speed and accuracy compared to prior models. Noteworthy results include:

  • Achieving 43.5% AP and 65.7% AP_50 on the MS COCO dataset at a real-time speed of approximately 65 FPS on a Tesla V100 GPU, a considerable improvement over YOLOv3's metrics.
  • When evaluated on different GPU architectures like Maxwell, Pascal, and Volta, YOLOv4 consistently outperforms contemporary models, maintaining a favorable balance between high FPS (Frames Per Second) and high accuracy.

Implications and Future Directions

From a practical standpoint, YOLOv4's ability to be trained and executed on accessible hardware broadens its applicability in varied fields, including autonomous driving, surveillance, and real-time video analytics, where rapid processing and reliable detections are critical. The efficiency gains and enhanced performance metrics also encourage further research into hybrid approaches combining multiple secondary optimizations.

Theoretically, the paper sets a precedent for future development in CNN-based object detectors by consolidating and validating a diverse range of augmentation and architecture techniques. Future work could explore the extension of these methodologies to more advanced hardware, additional optimization algorithms, and further refinements in balancing computational load versus accuracy.

In conclusion, YOLOv4 exemplifies how exhaustive empirical research combined with thoughtful integration of diverse enhancements can yield a robust object detection framework that is both performant and accessible, fostering further advancements and adoption in the domain of real-time object detection.

Authors (3)
  1. Alexey Bochkovskiy (5 papers)
  2. Chien-Yao Wang (15 papers)
  3. Hong-Yuan Mark Liao (17 papers)
Citations (10,831)