YOLOv4-Tiny Object Detector

Updated 19 March 2026

YOLOv4-Tiny is a lightweight, real-time object detector that uses a pruned CSP backbone and streamlined FPN to deliver rapid inference on resource-constrained devices.
Its architecture employs multi-scale detection with dual detection heads and CSPOSANet blocks to maintain essential features while reducing latency and memory usage.
Optimized for embedded and edge applications, the model leverages quantization and advanced augmentations to achieve impressive speeds with only a marginal drop in accuracy.

YOLOv4-Tiny is a lightweight, real-time object detector designed for high inference speed with a reduced computational footprint, making it suitable for deployment on mobile, embedded, and edge devices. Built as an aggressively downscaled variant of YOLOv4, it retains critical structural elements such as multi-scale detection while streamlining the architecture (notably through CSP-based modules with partial connections) to minimize latency and memory usage. Benchmarks indicate that YOLOv4-Tiny runs roughly 5–6× faster than its full-sized counterpart, with an accuracy trade-off suited to applications where resource constraints are paramount (Wang et al., 2020, Boddu et al., 10 Jun 2025, Boddu et al., 10 Jun 2025, Verma et al., 2024, Ganesh et al., 2021, Jiang et al., 2020).

1. Network Architecture

YOLOv4-Tiny is based on a pruned Cross Stage Partial (CSP) network backbone, typically CSPDarknet53-tiny or its equivalents, interfaced to a simplified Feature Pyramid Network (FPN) neck and two detection heads. Each backbone block is implemented as a CSPOSANet with a partial-in computational block (PCB) to halve memory access cost. The characteristic structure includes:

Backbone: 2–3 CSPOSANet-PCB blocks of depth $k=3$ per block, where the per-layer channel growth rate is $g = b/2$ and the total block output width is $b_\text{out} = 2b$.
Neck: Two-scale FPN; the two deepest (lowest-resolution) backbone feature maps are fed to the detection heads through convolution and upsampling layers, with a lightweight feature-fusion path.
Heads: Two YOLO detection heads at 26×26 and 13×13 (for a 416×416 input), each comprising a sequence of 1×1 and 3×3 convolutions and using three anchor boxes per grid cell. Activation is Leaky ReLU, with batch normalization applied after each convolution except the final linear prediction layer.
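The channel bookkeeping in the backbone item can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that each block starts from a partial path of $b/2$ channels and that each of the $k$ layers adds $g = b/2$ channels, so the concatenated width is $b/2 + kg = 2b$ for $k = 3$. The function name `osanet_pcb_widths` is hypothetical.

```python
def osanet_pcb_widths(b: int, k: int = 3) -> dict:
    """Channel bookkeeping for one CSPOSANet-PCB block (illustrative only).

    Assumption: the computational block starts from b/2 channels and each
    of the k layers contributes g = b/2 new channels that are concatenated.
    """
    if b % 2:
        raise ValueError("base width b must be even")
    g = b // 2                  # per-layer growth rate
    partial_in = b // 2         # channels entering the computational block
    b_out = partial_in + k * g  # width after concatenating all k layers
    return {"b": b, "g": g, "k": k, "b_out": b_out}


for base in (32, 64):
    print(osanet_pcb_widths(base))
```

With $b = 32$ and $b = 64$ this gives output widths of 64 and 128, matching $b_\text{out} = 2b$.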

A representative block diagram and detailed per-layer structure are provided in (Wang et al., 2020), with the following summary table highlighting core architectural components (input size 416×416):

Layer	Output Tensor	Notes
Initial Conv	208×208×32	stride 2
CSPOSANet Block 1	208×208×64	$b=32$, $g=16$, $k=3$
Conv (downsample)	104×104×64	stride 2
CSPOSANet Block 2	104×104×128	$b=64$, $g=32$, $k=3$
Conv (downsample)	52×52×128	stride 2
FPN/Neck	26×26 and 13×13	upconv and concatenation
Detection Heads	26×26×[filters], 13×13×[filters]	anchor-based, 3 anchors per head; [filters] set by the number of classes and anchors

PCB splits are performed only at the block output, reducing memory access but preserving gradient signal (Wang et al., 2020, Khoramdel et al., 2023).
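The head output sizes in the table follow from the input resolution, the head strides, and the number of anchors and classes. A minimal sketch, assuming the standard YOLO head layout in which every anchor predicts 4 box offsets, 1 objectness score, and one score per class, and assuming strides of 16 and 32 for the two heads (the function name `head_shapes` is hypothetical):

```python
def head_shapes(input_size: int = 416, num_classes: int = 80,
                anchors_per_head: int = 3, strides=(16, 32)):
    """Return (grid_h, grid_w, filters) for each detection head."""
    # Each anchor predicts 4 box values + 1 objectness + num_classes scores.
    filters = anchors_per_head * (5 + num_classes)
    shapes = []
    for stride in strides:
        if input_size % stride:
            raise ValueError(f"input size must be divisible by stride {stride}")
        grid = input_size // stride
        shapes.append((grid, grid, filters))
    return shapes


print(head_shapes())               # 80 classes (e.g. COCO) at 416x416
print(head_shapes(num_classes=2))  # a hypothetical two-class detector
```

For a 416×416 input and 80 classes this yields 26×26×255 and 13×13×255.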

2. Training Procedures and Loss Functions

Training is typically conducted on datasets such as MS COCO, custom surveillance/health datasets, or domain-specific corpora (e.g., aerial emergency response, PPE).

Augmentation and Optimization:

Augmentations: Mosaic, random flip, hue/saturation/exposure, random cropping, and, optionally, MixUp (Wang et al., 2020, Khoramdel et al., 2023, Verma et al., 2024, Boddu et al., 10 Jun 2025).
Optimizers: Stochastic Gradient Descent with momentum (0.9–0.973), weight decay ($5\times10^{-4}$), and cosine-annealed or stepwise learning-rate schedules (Khoramdel et al., 2023, Wang et al., 2020).
Initialization: Pretrained weights from COCO or full-scale YOLOv4 serve as initialization for transfer learning scenarios (Verma et al., 2024, Khoramdel et al., 2023).
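The optimizer settings listed above can be illustrated in plain Python. This is a sketch under stated assumptions rather than the training code of any cited work: the base learning rate of 0.01 and the step counts are hypothetical, and the update rule follows the common convention in which weight decay is added to the gradient before the momentum update.

```python
import math


def cosine_lr(step: int, total_steps: int,
              base_lr: float = 0.01, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate (base_lr is a hypothetical value)."""
    progress = min(1.0, step / max(1, total_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sgd_momentum_step(w: float, grad: float, velocity: float, lr: float,
                      momentum: float = 0.9, weight_decay: float = 5e-4):
    """One SGD update for a scalar weight, with momentum and L2 weight decay."""
    g = grad + weight_decay * w          # weight decay folded into the gradient
    velocity = momentum * velocity + g   # accumulate momentum
    return w - lr * velocity, velocity


# Toy usage: minimise f(w) = w^2 (gradient 2w) for 100 steps.
w, v = 1.0, 0.0
for t in range(100):
    w, v = sgd_momentum_step(w, 2.0 * w, v, cosine_lr(t, 100))
print(f"final w = {w:.4f}")
```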

Loss Function:

YOLOv4-Tiny employs the canonical multi-part YOLO loss: $L = L_\text{coord} + L_\text{conf} + L_\text{cls}$ with

Localization loss: Weighted squared error (MSE) on the box parameters $(x, y, w, h)$.
Objectness (confidence) loss: Binary cross-entropy (BCE) on predicted objectness.
Classification loss: Sum-squared error or BCE over the predicted class scores (Khoramdel et al., 2023, Boddu et al., 10 Jun 2025).

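Some variants implement the CIoU loss for box regression: $L_\mathrm{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(\mathbf{b},\mathbf{b}_\mathrm{gt})}{c^2} + \alpha v$ where $\rho^2(\mathbf{b},\mathbf{b}_\mathrm{gt})$ is the squared distance between the centers of the predicted box $\mathbf{b}$ and the ground-truth box $\mathbf{b}_\mathrm{gt}$, $c$ is the diagonal length of the smallest box enclosing both, $v = \frac{4}{\pi^2}\left(\arctan\frac{w_\mathrm{gt}}{h_\mathrm{gt}} - \arctan\frac{w}{h}\right)^2$ measures aspect-ratio consistency, and $\alpha = \frac{v}{(1-\mathrm{IoU}) + v}$ is a trade-off parameter (Bochkovskiy et al., 2020).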
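The CIoU expression can be evaluated directly for a single pair of boxes. The following is a minimal scalar sketch, assuming boxes are given as (center x, center y, width, height) with positive width and height; the function name `ciou_loss` and the small `eps` stabiliser are illustrative choices, not part of any cited implementation.

```python
import math


def ciou_loss(box, gt, eps: float = 1e-9) -> float:
    """CIoU loss between two boxes in (cx, cy, w, h) format."""
    (cx, cy, w, h), (gx, gy, gw, gh) = box, gt
    # Corner coordinates of both boxes.
    x1, y1, x2, y2 = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    a1, b1, a2, b2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2
    # Intersection over union.
    inter = max(0.0, min(x2, a2) - max(x1, a1)) * max(0.0, min(y2, b2) - max(y1, b1))
    iou = inter / (w * h + gw * gh - inter + eps)
    # Squared center distance and squared diagonal of the enclosing box.
    rho2 = (cx - gx) ** 2 + (cy - gy) ** 2
    c2 = (max(x2, a2) - min(x1, a1)) ** 2 + (max(y2, b2) - min(y1, b1)) ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / ((1.0 - iou) + v + eps)
    return 1.0 - iou + rho2 / c2 + alpha * v


print(ciou_loss((5, 5, 4, 4), (5, 5, 4, 4)))   # identical boxes: close to 0
print(ciou_loss((0, 0, 2, 2), (10, 0, 2, 2)))  # disjoint boxes: greater than 1
```

Unlike plain IoU, the loss keeps growing as disjoint boxes move further apart, because the center-distance term remains informative when the overlap is zero.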

Quantization and Model Compression:

INT8 (full-integer) post-training quantization is routinely applied via ONNX Runtime or TensorFlow Lite, reducing model size by ~70% and providing up to 44% faster inference with negligible drop in mAP (<0.5% in reported studies) (Boddu et al., 10 Jun 2025, Boddu et al., 10 Jun 2025).
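To make the size reduction concrete, the arithmetic behind affine INT8 quantization can be sketched in plain Python. This is not the ONNX Runtime or TensorFlow Lite API; it only illustrates the per-tensor scale and zero-point mapping that such tools apply, and the function names are hypothetical. Storing each weight in 1 byte instead of 4 would shrink the weights alone by 75%, which is close to the ~70% reported for whole model files.

```python
def quantize_int8(values):
    """Affine (asymmetric) per-tensor quantization of floats to int8."""
    lo = min(min(values), 0.0)         # representable range must include 0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0   # avoid a zero scale for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [scale * (qi - zero_point) for qi in q]


weights = [-0.4, 0.0, 0.3, 1.1]        # toy values, not real model weights
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.5f}", f"zero_point={zp}", f"max_err={max_err:.5f}")
```

The round-trip error stays within about one quantization step (`scale`), which is consistent with the small mAP changes reported after post-training quantization.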

3. Accuracy, Speed, and Trade-offs

Reported accuracy and speed for YOLOv4-Tiny vary with the dataset, input resolution, numeric precision, and hardware:

Model	mAP	FPS	Model Size	Inference Hardware
YOLOv4-Tiny	22.0% (COCO)	443	22.5 MB	RTX2080Ti (FP32)
YOLOv4-Tiny-3l	28.7% (COCO)	252	–	RTX2080Ti (320×320)
Improved (Jiang et al., 2020)	38.0% (COCO)	294	1,003 MB (VRAM)	1080Ti
Mask detection	85.31% (IoU 0.5)	50.66	~23 MB	Tesla K80
INT8 Quantized	Within 0.5% of FP32	~5.5	6.4 MB	RPi5, TFLite

Typical reported throughput is 300–440 FPS (FP32) on an RTX2080Ti, 25–50 FPS on an A100 or K80, and 5–6 FPS on ARM CPUs (Raspberry Pi). INT8 quantization further reduces runtime (by up to 44%) and lowers power draw to ~4–14 W on ARM edge boards, with negligible accuracy loss (<0.5% mAP) (Wang et al., 2020, Boddu et al., 10 Jun 2025, Boddu et al., 10 Jun 2025, Khoramdel et al., 2023). On the Raspberry Pi 5, INT8 models are measured at 28.2 ms/image with 13.85 W average power (Boddu et al., 10 Jun 2025).
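Two derived quantities follow from the figures quoted above: the relative size reduction and the energy spent per image. The sketch below assumes that the 6.4 MB INT8 model is compared against a 22.5 MB FP32 baseline, and that the 28.2 ms latency and 13.85 W average power describe the same workload. Neither assumption is confirmed by the cited sources, so the results are indicative only.

```python
def size_reduction(fp32_mb: float, int8_mb: float) -> float:
    """Fractional reduction in model size after quantization."""
    return 1.0 - int8_mb / fp32_mb


def energy_per_image(power_w: float, latency_ms: float) -> float:
    """Energy in joules = average power (W) x time per image (s)."""
    return power_w * latency_ms / 1000.0


print(f"size reduction:   {size_reduction(22.5, 6.4):.1%}")        # 71.6%
print(f"energy per image: {energy_per_image(13.85, 28.2):.2f} J")  # 0.39 J
```

The computed size reduction of about 71.6% agrees with the ~70% figure quoted for INT8 post-training quantization.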

A clear trade-off emerges: YOLOv4-Tiny achieves roughly half the COCO AP of full YOLOv4 (22.0% vs. 43.5%), but at 5–6× the speed (Bochkovskiy et al., 2020, Wang et al., 2020).

4. Deployment in Embedded and Edge Environments

YOLOv4-Tiny is widely deployed in applications where compute, memory, and energy are constrained:

Low-power inference: INT8-quantized YOLOv4-Tiny achieves real-time rates (5–6 FPS) on ARM Cortex-A76 (Raspberry Pi 5), with up to 71% model-size reduction and 59% power savings relative to FP32 (Boddu et al., 10 Jun 2025, Boddu et al., 10 Jun 2025).
Mobile embedded: Single-threaded inference is "real-time" for many surveillance and monitoring tasks on Raspberry Pi 4/5 CPUs (Verma et al., 2024).
Healthcare/PPE: Embedded systems for real-time PPE compliance are feasible, leveraging the streamlined YOLOv4-Tiny and, optionally, fine-tuning with custom donning/doffing logic, as in (Verma et al., 2024).

The conversion from Darknet to ONNX, TensorRT, or TFLite is direct and preserves accuracy due to the absence of exotic operations (no Mish activation, SPP, or DropBlock) (Wang et al., 2020, Khoramdel et al., 2023).

5. Enhancements and Variants

Several works enhance YOLOv4-Tiny using structural and multi-scale fusion modules:

Raw Feature Collection and Redistribution (RFCR): Directly combines features from backbone scales via soft attention, followed by MBConv layers, boosting AP₅₀ by ~1 point on COCO for a minor speed penalty (Ganesh et al., 2021).
Backbone Truncation: Applying truncated, transfer-learned backbones (e.g., MobileNetV2×0.75 with classification-only layers pruned) reduces memory and computation with minimal mAP loss (Ganesh et al., 2021).
Attention-Aware Auxiliary Blocks: Auxiliary side branches with a 5×5 receptive field (via stacked 3×3 convolutions) and CBAM-style channel and spatial attention can replace early CSP blocks, recovering accuracy after heavy pruning while yielding a 9% speedup on GPU and up to 72% on Raspberry Pi (Jiang et al., 2020).

Empirical results consistently show that deletion of computationally expensive modules, judicious feature fusion, and systematic quantization are key to maximizing efficiency–accuracy trade-offs in YOLOv4-Tiny deployment scenarios.

6. Practical Considerations and Application Domains

YOLOv4-Tiny is validated across domains:

Aerial and Emergency Response: Used for drone-based vehicle/incident detection, benefitting from low memory, fast inference, and robust quantization (Boddu et al., 10 Jun 2025, Boddu et al., 10 Jun 2025).
Health and PPE Compliance: Real-time detection of PPE donning/doffing provides actionable feedback via edge-only inference (Verma et al., 2024).
Pandemic Response: Face mask detection in COVID-19 contexts demonstrates high mAP and real-time throughput (~85.3% mAP, 50.66 FPS) using compact (23 MB) models (Khoramdel et al., 2023).

Application-specific limitations include class imbalance (e.g., minority classes in aerial datasets), reduced robustness to severe occlusion or low-resolution inputs, and sensitivity to hyperparameter tuning. Quantization and pruning have been shown to yield further improvements in memory/power efficiency at acceptably small costs in detection performance (Boddu et al., 10 Jun 2025, Boddu et al., 10 Jun 2025).

7. Limitations and Future Directions

Limitations of YOLOv4-Tiny include a significant accuracy drop relative to full-scale detectors on challenging datasets (COCO: 22.0% AP vs. 43.5%), reduced capacity to detect multiple small objects per image (it has only two detection scales and lacks the higher-resolution third head of full YOLOv4), and increased class confusion for rare categories. Future improvements proposed in the literature include advanced quantization (mixed-precision), structured pruning, richer data augmentation, class-balancing strategies, and explainability modules for operator insight (e.g., Grad-CAM) (Boddu et al., 10 Jun 2025). The methodology for exporting and deploying YOLOv4-Tiny models generalizes to other tasks and domains (Boddu et al., 10 Jun 2025, Ganesh et al., 2021).

References:

(Wang et al., 2020, Khoramdel et al., 2023, Boddu et al., 10 Jun 2025, Boddu et al., 10 Jun 2025, Verma et al., 2024, Ganesh et al., 2021, Jiang et al., 2020, Bochkovskiy et al., 2020)