YOLO v4: Real-Time Object Detection

Updated 22 June 2026
  • YOLOv4 is a one-stage, anchor-based real-time object detector that combines a CSPDarknet-53 backbone, SPP+PANet neck, and dense prediction head for high accuracy and speed.
  • Its composite loss combines CIoU for bounding box regression with binary cross-entropy for objectness and class prediction.
  • Innovations such as mosaic augmentation, CutMix, self-adversarial training, and cross mini-Batch Normalization enhance robustness and enable efficient deployment on both high-end GPUs and edge devices.

YOLOv4 is a one-stage, anchor-based real-time object detector that marked a major advancement in the YOLO family by unifying a highly optimized feature-extraction backbone, a multi-scale feature aggregation neck, and an efficient dense prediction head. Its architecture, regularization, and training methodology collectively push the accuracy-speed Pareto frontier in object detection, achieving substantial gains over prior versions while remaining deployable on commodity hardware (Kotthapalli et al., 4 Aug 2025, Bochkovskiy et al., 2020, Geetha, 6 Feb 2025, Ramos et al., 24 Apr 2025, Terven et al., 2023).

1. Architecture

The overall pipeline adopts a three-module organization—backbone, neck, and detection head—each selected or designed for accuracy, speed, and memory efficiency:

  • Backbone: CSPDarknet-53
  • Neck: Spatial Pyramid Pooling (SPP) + Path Aggregation Network (PANet)
    • The SPP module applies parallel max pooling with kernels {5×5, 9×9, 13×13} (fixed stride and padding) to the deepest feature map. Outputs are concatenated channel-wise to the original map, enlarging the effective receptive field without additional downsampling (Kotthapalli et al., 4 Aug 2025, Bochkovskiy et al., 2020).
    • The PANet provides a top-down and bottom-up information flow, fusing shallow, intermediate, and deep features. Lateral connections use concatenation, producing three feature maps at different scales, effective for multi-scale object localization (Geetha, 6 Feb 2025, Kotthapalli et al., 4 Aug 2025).
  • Detection Head
    • YOLOv4 employs three parallel dense prediction heads (one per PANet output), each predicting for every cell and anchor: box offsets $(\Delta x, \Delta y, \Delta w, \Delta h)$, an objectness score, and $C$ class probabilities. Mish is used in the backbone/neck, and Leaky ReLU in the heads (Kotthapalli et al., 4 Aug 2025).
    • Prediction heads operate at three resolutions (typically $1/32$, $1/16$, $1/8$ of input). Outputs are post-processed with DIoU or CIoU-guided Non-Maximum Suppression.
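
The DIoU-guided suppression step mentioned above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the function names, the corner-format boxes `(x1, y1, x2, y2)`, and the `threshold` default are assumptions made for the example. A candidate is suppressed when its IoU with the kept box, minus the normalized center-distance term, reaches the threshold.

```python
def iou_and_center_term(a, b):
    """Return (IoU, rho^2 / c^2) for two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between the two box centers.
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
         + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou, (rho2 / c2 if c2 > 0 else 0.0)


def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS that keeps a box when IoU - rho^2/c^2 stays below threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        survivors = []
        for i in order:
            iou, center_term = iou_and_center_term(boxes[best], boxes[i])
            if iou - center_term < threshold:
                survivors.append(i)  # far enough apart (or low overlap): keep
        order = survivors
    return keep
```

For instance, with boxes `(0, 0, 10, 10)`, `(1, 1, 11, 11)`, and `(50, 50, 60, 60)` scored 0.9, 0.8, and 0.7, the second box overlaps the first heavily with nearly coincident centers and is dropped, while the third survives. Compared with plain IoU-based NMS, the center-distance term makes it less likely that two overlapping boxes with clearly separated centers are merged into one detection.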

Table: YOLOv4 Layerwise Structure (excerpt; Geetha, 6 Feb 2025)

| Layer | Kernel | Output Size (608 input) |
|------------|----------|--------------------------|
| Conv (CSP) | 3×3 | 608×608 |
| ... | ... | ... |
| SPP | 1×1/5×5/9×9/13×13 | 19×19 |
| PANet | — | 19×19, 38×38, 76×76 |
| Head | — | As above |
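
The output sizes in the table follow from simple shape arithmetic, sketched below. The helper names are hypothetical, and two values are assumptions not stated in the text: 80 classes (as in COCO) and 3 anchors per scale. The 512-channel input used in the usage note is likewise only illustrative.

```python
def head_shapes(input_size=608, num_classes=80, anchors_per_scale=3,
                strides=(32, 16, 8)):
    """Return (grid_h, grid_w, channels) for each of the three heads.

    Each anchor predicts 4 box offsets + 1 objectness score + num_classes
    class probabilities.
    """
    channels = anchors_per_scale * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


def same_pool_size(size, kernel, stride=1):
    """Spatial size after max pooling with padding = kernel // 2."""
    padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1


def spp_output_channels(in_channels, kernels=(5, 9, 13)):
    """Channels after concatenating the input with one pooled copy per kernel."""
    return in_channels * (1 + len(kernels))
```

Under these assumptions, `head_shapes()` gives grids of 19, 38, and 76 cells per side with 255 channels each, matching the table. `same_pool_size(19, 13)` stays at 19, which is why the SPP branches can be concatenated channel-wise without resizing, and `spp_output_channels(512)` shows the fourfold channel growth that concatenation produces.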

2. Loss Functions

YOLOv4 uses a composite loss comprising box regression, objectness, and classification:

  • Box Regression: Complete IoU (CIoU) loss is used for bounding box regression, improving upon IoU, GIoU, and DIoU by incorporating overlap area, center distance, and aspect ratio consistency. The CIoU loss is:

$$L_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2(\mathbf{b},\mathbf{b}^{gt})}{c^2} + \alpha\,\nu$$

where $\rho^2$ is the squared distance between box centers, $c$ is the diagonal length of the smallest enclosing box, $\nu$ penalizes aspect ratio discrepancy, and $\alpha$ balances the shape term (Bochkovskiy et al., 2020, Ramos et al., 24 Apr 2025, Geetha, 6 Feb 2025).
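
A minimal sketch of this loss for a single predicted/ground-truth pair follows. The text does not spell out the shape terms, so the code assumes the standard CIoU definitions $\nu = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ and $\alpha = \nu / (1 - \mathrm{IoU} + \nu)$. The function name, the center-format boxes `(cx, cy, w, h)`, and the `eps` guard are choices made for this example.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two boxes given as (cx, cy, w, h) with positive w, h."""
    (bx, by, bw, bh), (gx, gy, gw, gh) = box, gt
    # Corner coordinates of both boxes.
    b_x1, b_y1, b_x2, b_y2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2
    # Overlap area and IoU.
    inter_w = max(0.0, min(b_x2, g_x2) - max(b_x1, g_x1))
    inter_h = max(0.0, min(b_y2, g_y2) - max(b_y1, g_y1))
    inter = inter_w * inter_h
    union = bw * bh + gw * gh - inter
    iou = inter / (union + eps)
    # Center-distance term: rho^2 over squared enclosing-box diagonal c^2.
    rho2 = (bx - gx) ** 2 + (by - gy) ** 2
    c2 = (max(b_x2, g_x2) - min(b_x1, g_x1)) ** 2 \
       + (max(b_y2, g_y2) - min(b_y1, g_y1)) ** 2
    # Aspect-ratio term (assumed standard CIoU form, see lead-in).
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(bw / bh)) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / (c2 + eps) + alpha * v
```

Identical boxes give a loss of essentially zero, while two disjoint boxes give a value above 1 because the center-distance term still provides a signal when IoU is zero. That non-vanishing signal for non-overlapping boxes is the practical advantage over a plain IoU loss.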

  • Objectness/Classification: Binary cross-entropy (BCE) loss is applied to both objectness and multi-label class predictions.
  • Total Loss:

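$$L_{\mathrm{total}} = \lambda_{\mathrm{box}}\,L_{\mathrm{CIoU}} + \lambda_{\mathrm{obj}}\,L_{\mathrm{obj}} + \lambda_{\mathrm{cls}}\,L_{\mathrm{cls}}$$

with $\lambda$ weights typically set heuristically (Kotthapalli et al., 4 Aug 2025). Anchors assigned responsibility for a ground-truth box participate in $L_{\mathrm{CIoU}}$ and $L_{\mathrm{cls}}$; all negative anchors contribute to $L_{\mathrm{obj}}$.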
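
The weighted combination of box, objectness, and classification terms can be sketched as follows. The dictionary layout, the function names, and the default weights of 1.0 are hypothetical. The sketch also assumes that positive anchors contribute an objectness term with target 1, and it omits any normalization by the number of anchors.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and target y."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def total_loss(anchors, w_box=1.0, w_obj=1.0, w_cls=1.0):
    """Sum the three loss terms over a list of anchor records.

    Each record is a dict with 'positive' (bool) and 'obj' (predicted
    objectness). Positive records also carry 'ciou_loss' (float),
    'cls_pred' and 'cls_target' (equal-length lists of probabilities/labels).
    """
    box = obj = cls = 0.0
    for a in anchors:
        # Every anchor gets an objectness target: 1 if positive, else 0.
        obj += bce(a["obj"], 1.0 if a["positive"] else 0.0)
        if a["positive"]:
            # Only anchors responsible for a ground-truth box add box
            # regression and per-class (multi-label) BCE terms.
            box += a["ciou_loss"]
            cls += sum(bce(p, y) for p, y in zip(a["cls_pred"], a["cls_target"]))
    return w_box * box + w_obj * obj + w_cls * cls
```

A single negative anchor with predicted objectness 0.5 contributes only `bce(0.5, 0)`, which equals ln 2. Raising `w_obj` scales that contribution directly, which is the knob the heuristic weights control.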

3. Core Innovations and Training Techniques

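YOLOv4 systematically introduces a "Bag of Freebies" (training-time augmentations and regularization that leave inference cost unchanged) and a "Bag of Specials" (lightweight modules that add a small inference cost). Of the techniques named in this article, Mosaic and CutMix augmentation, DropBlock, self-adversarial training (SAT), cross mini-Batch Normalization, and the CIoU loss belong to the former; Mish activation, SPP, PANet, and DIoU-NMS belong to the latter.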
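
Mosaic augmentation, named in the summary at the top of this article, combines four training images into one sample so that each batch element shows objects in more varied contexts. The sketch below is a deliberately simplified version: it tiles four equally sized images in a fixed 2×2 grid and shifts their boxes, whereas practical implementations typically also pick a random split point and rescale or crop each tile. The function name and the nested-list image representation are assumptions made for the example.

```python
def mosaic(images, boxes_per_image):
    """Tile four H x W images into one 2H x 2W image and shift their boxes.

    images: list of four images, each a list of H rows of W pixel values.
    boxes_per_image: list of four lists of (x1, y1, x2, y2) pixel boxes,
    expressed in the coordinates of their own image.
    """
    if len(images) != 4 or len(boxes_per_image) != 4:
        raise ValueError("mosaic needs exactly four images and box lists")
    h, w = len(images[0]), len(images[0][0])
    # Top half: image 0 next to image 1; bottom half: image 2 next to image 3.
    top = [images[0][r] + images[1][r] for r in range(h)]
    bottom = [images[2][r] + images[3][r] for r in range(h)]
    canvas = top + bottom
    # Pixel offset of each tile inside the mosaic.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    boxes = []
    for (dx, dy), tile_boxes in zip(offsets, boxes_per_image):
        for x1, y1, x2, y2 in tile_boxes:
            boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, boxes
```

Because the ground-truth boxes are translated along with their tiles, the combined sample can be fed to the detector like any other labeled image.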

4. Performance and Benchmarking

YOLOv4 established itself as a leading one-stage detector in both speed and accuracy. Key benchmarks on the MS COCO dataset (test-dev split), using a single NVIDIA V100 or RTX 2080 Ti GPU:

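| Model | Backbone | mAP@0.5:0.95 | mAP@0.5 | FPS | Input |
|--------|---------------|--------------|---------|--------|---------|
| YOLOv3 | Darknet-53 | 33.0% | 57.9% | 30–45* | 608×608 |
| YOLOv4 | CSPDarknet-53 | 43.5% | 65.7% | 62 | 608×608 |
| YOLOv4 | CSPDarknet-53 | 43.0% | 64.9% | 83 | 512×512 |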

*YOLOv3's mAP@0.5:0.95 is ≈33%; FPS varies by GPU, with 20–30 FPS typical on Maxwell.

YOLOv4 matches or surpasses the COCO accuracy of two-stage detectors (e.g., Faster R-CNN with ResNet-101 FPN at 42.1% mAP) while delivering order-of-magnitude higher frame rates (Kotthapalli et al., 4 Aug 2025, Ramos et al., 24 Apr 2025, Geetha, 6 Feb 2025). Ablation studies attribute the cumulative AP gains to CSP (+1–2%), SPP (+1%), PANet (+1%), Mish (+0.5%), CutMix/Mosaic (+1.5–2%), DropBlock (+0.5%), and SAT (+1%) (Kotthapalli et al., 4 Aug 2025, Bochkovskiy et al., 2020).

5. Deployment and Implementation Considerations

YOLOv4’s design is geared toward both high-throughput research scenarios and affordable industrial deployment:

  • Parameter Count and Model Size: Full YOLOv4 (CSPDarknet-53, SPP, PANet, heads) contains ≈64M parameters (Kotthapalli et al., 4 Aug 2025, Geetha, 6 Feb 2025), which corresponds to roughly 256 MB of weights in FP32 at 4 bytes per parameter.
  • Hardware: Achieves 62–65 FPS on a Tesla V100 or RTX 2080 Ti for 608×608 input, and up to 96 FPS at 416×416 (with a minor mAP drop, to ~41.2%) (Geetha, 6 Feb 2025, Ramos et al., 24 Apr 2025).
  • Edge Device Support: YOLOv4-Tiny and quantized (FP16/INT8) models are suitable for Jetson TX2/TX/AGX Xavier, typically at 8–12 FPS.
  • Quantization and Pruning: Compatible with standard TensorRT and OpenVINO pipelines for batch-norm folding, Conv+activation fusion, and structured channel pruning, though these are not detailed in the survey (Kotthapalli et al., 4 Aug 2025).
  • Hyperparameters: Recommended training batch size is 64 (with subdivisions for lower-memory GPUs), SGD momentum 0.949, initial learning rate 0.01, decayed twice, and block size 7 for DropBlock on 38×38 maps (Geetha, 6 Feb 2025).
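
The batch and learning-rate settings above can be made concrete with a small sketch. The text says only that the rate is "decayed twice"; the decay factor of 0.1 and the boundaries at 400,000 and 450,000 steps are assumptions taken from the commonly reported YOLOv4 training setup, and the function names and the `subdivisions` default are hypothetical.

```python
def step_decay_lr(step, base_lr=0.01, decay_steps=(400_000, 450_000), factor=0.1):
    """Learning rate at a given step under a two-stage step decay.

    decay_steps and factor are assumed values (see lead-in); only the
    initial rate of 0.01 comes from the text.
    """
    lr = base_lr
    for boundary in decay_steps:
        if step >= boundary:
            lr *= factor  # apply one decay for every boundary already passed
    return lr


def images_per_forward_pass(batch=64, subdivisions=8):
    """Images processed at once when a batch is split into subdivisions.

    Gradients are accumulated over all subdivisions, so the effective batch
    size stays at `batch` while peak memory scales with the returned value.
    """
    if subdivisions <= 0 or batch % subdivisions:
        raise ValueError("batch must be a positive multiple of subdivisions")
    return batch // subdivisions
```

With these assumed values the rate is 0.01 until step 400,000, 0.001 until step 450,000, and 0.0001 afterwards. Setting `subdivisions=16` with a batch of 64 means only 4 images occupy GPU memory at a time, which is how lower-memory GPUs keep the recommended batch size.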

6. Comparative Advances over YOLOv3

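YOLOv4 surpasses YOLOv3 through architectural, loss, and optimization changes. It replaces the Darknet-53 backbone with CSPDarknet-53, adds the SPP+PANet neck, adopts CIoU loss for box regression, and layers on the training techniques described above, raising COCO mAP@0.5:0.95 from 33.0% to 43.5% at 608×608 input.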

7. Significance and Impact

YOLOv4 established a new benchmark for the real-time object detection community, combining accuracy exceeding previous one-stage and many two-stage models with real-time inference on standard GPUs, and accessibility for single-GPU research and industrial scenarios. Its mixture of architectural engineering, carefully layered regularization, and training tricks—alongside the modular design of CSPDarknet-53, SPP, and PANet—provided a reusable template for subsequent YOLO generations, shaping advances in robust and efficient computer vision (Kotthapalli et al., 4 Aug 2025, Ramos et al., 24 Apr 2025, Terven et al., 2023).
