Fast YOLO: Speed-Optimized Detection Models

Updated 14 March 2026

Fast YOLO is a family of detection models that streamline the classic YOLO architecture for high-speed inference and efficient resource use.
They employ techniques such as network pruning, compact backbones, dynamic routing, and quantization to balance the trade-off between speed and accuracy.
These design strategies enable real-time object detection in applications ranging from mobile devices and industrial inspection to medical imaging.

Fast YOLO refers to a class of YOLO-derived object detectors optimized for maximal inference speed and high throughput, typically with targeted design constraints for embedded, mobile, and low-resource environments. These networks exploit architectural simplification, parameter reduction, adaptive computation, quantization, custom fusion modules, or hardware-aware design to achieve an improved speed–accuracy–resource trade-off relative to canonical YOLO models.

1. Evolution of Fast YOLO Architectures

The concept of Fast YOLO dates to the introduction of a trimmed-down YOLO network ("Fast YOLO") that delivers 155 fps with double the mAP of prior real-time detectors, achieved by reducing depth, number of filters per layer, and size of the output head compared to full YOLO (Redmon et al., 2015). This approach was systematized in subsequent models using explicit architectural pruning, parametric search, and dynamic adaptation.

Key historical examples include:

Fast YOLO (YOLO v1): 9 convolutional layers vs. 24 in full YOLO; strong simplification of the prediction grid; prioritization of inference speed (Redmon et al., 2015).
O-YOLOv2 / Fast YOLO (YOLO v2 variant): Evolutionary pruning for parameter count minimization and motion-adaptive skipping for real-time video on embedded devices (Shafiee et al., 2017).
Tiny YOLO and derivatives: Aggressive channel-width and filter reduction, often at a significant accuracy penalty, with successors introducing further compression and quantization for microcontrollers, e.g., xYOLO, μYOLO, LeYOLO, LF-YOLO (Barry et al., 2019, Deutel et al., 2024, Hollard et al., 2024, Liu et al., 2021).
Recent advances: Instance-conditional computation via sparse Mixture-of-Experts (MoE), reparameterization, and hardware-oriented fusion, e.g., YOLO-Master, RCS-YOLO (Lin et al., 29 Dec 2025, Kang et al., 2023).

2. Architectural Techniques for Acceleration

The archetypal Fast YOLO architecture employs a blend of the following acceleration strategies:

Network Pruning and Evolutionary Synthesis: Automatic removal of less informative filters/synapses (O-YOLOv2), guided by accuracy-preserving objectives (IoU retention), with parameter count reductions up to 2.8× and only minor IoU loss (~2%) (Shafiee et al., 2017).
Compact Backbones and Feature Extractors: Use of slim architectures (e.g., Darknet-19, Darknet-20, MobileNet variants), lightweight inverted bottleneck modules, and Depthwise Separable Convolutions (Betti, 2022, Hollard et al., 2024).
Split-Transform-Merge (Ghost/EFE Modules): In LF-YOLO, Efficient Feature Extraction (EFE) divides activations into “identity” (reused) and “expand” (Ghost Conv) branches, merging for high representational efficiency at low compute cost (Liu et al., 2021).
Reparameterization and Channel Shuffle: Training-time dense/multi-branch blocks (RCS/RepVGG) are algebraically fused at inference to single-path Conv + Shuffle, achieving large speedups without diluting training expressivity (RCS-YOLO) (Kang et al., 2023).
Microcontroller-Targeted Compression: μYOLO (microYOLO) employs depthwise-separable convolutions, extreme parameter pruning (~800 KB flash), and 8-bit quantization for on-MCU execution at 3–4 fps (Deutel et al., 2024).
Instance-Conditional Sparse MoE: YOLO-Master dynamically gates each spatial “instance” to a (tiny) subset of specialized Transformer experts for adaptive compute allocation; diversity loss discourages gate collapse (Lin et al., 29 Dec 2025).
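The reparameterization idea in the list above can be checked numerically. The following is a minimal sketch, not the RCS-YOLO implementation: it assumes a single channel, stride 1, zero padding, and no batch normalization or channel shuffle, and the helper names (`conv2d_same`, `fuse_branches`, `multi_branch`) are hypothetical. It shows that a 3×3 branch, a 1×1 branch, and an identity branch summed at training time give the same output as one fused 3×3 kernel at inference.

```python
def conv2d_same(image, kernel):
    """3x3 cross-correlation, single channel, stride 1, zero padding."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = acc
    return out


def fuse_branches(k3, k1_weight, use_identity=True):
    """Merge a 3x3 branch, a 1x1 branch, and an optional identity into one 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1_weight   # a 1x1 kernel is a 3x3 kernel that is zero off-centre
    if use_identity:
        fused[1][1] += 1.0     # identity is a 3x3 kernel with 1 at the centre
    return fused


def multi_branch(image, k3, k1_weight):
    """Training-time form: three parallel branches summed element-wise."""
    y3 = conv2d_same(image, k3)
    return [
        [y3[r][c] + k1_weight * image[r][c] + image[r][c] for c in range(len(image[0]))]
        for r in range(len(image))
    ]


# Illustrative values only.
image = [[1.0, 2.0, 0.0], [0.0, 3.0, 1.0], [2.0, 1.0, 4.0]]
k3 = [[0.1, -0.2, 0.0], [0.3, 0.5, -0.1], [0.0, 0.2, 0.4]]
k1 = 0.25

fused = fuse_branches(k3, k1)
train_out = multi_branch(image, k3, k1)
infer_out = conv2d_same(image, fused)
max_gap = max(
    abs(a - b) for row_a, row_b in zip(train_out, infer_out) for a, b in zip(row_a, row_b)
)
print("largest difference between the two forms:", max_gap)
```

The two forms agree up to floating-point rounding because convolution is linear in the kernel. In a real network the same merge is applied per output channel, after folding each branch's batch-normalization statistics into its weights and bias.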

3. Loss Function and Detection Heads

Fast YOLO models consistently retain the core grid-based, direct-regression paradigm of YOLO, with sum-squared error loss combining localization, objectness, and classification terms. Most use the canonical YOLO loss structure:

$\begin{align*} &\mathcal{L} = \lambda_\mathrm{coord} \sum_{i=1}^{S^2} \sum_{j=1}^B \mathbb{1}_{ij}^\mathrm{obj}[(x_{ij}-\hat{x}_{ij})^2 + (y_{ij}-\hat{y}_{ij})^2] \\ &+ \lambda_\mathrm{coord} \sum_{i=1}^{S^2} \sum_{j=1}^B \mathbb{1}_{ij}^\mathrm{obj}[(\sqrt{w_{ij}}-\sqrt{\hat{w}_{ij}})^2 + (\sqrt{h_{ij}}-\sqrt{\hat{h}_{ij}})^2] \\ &+ \sum_{i=1}^{S^2} \sum_{j=1}^B \mathbb{1}_{ij}^\mathrm{obj}(C_{ij}-\hat{C}_{ij})^2 \\ &+ \lambda_\mathrm{noobj} \sum_{i=1}^{S^2} \sum_{j=1}^B \mathbb{1}_{ij}^\mathrm{noobj}(C_{ij}-\hat{C}_{ij})^2 \\ &+ \sum_{i=1}^{S^2} \mathbb{1}_{i}^\mathrm{obj} \sum_{c=1}^C(p_i(c) - \hat{p}_i(c))^2 \end{align*}$

where all terms and notations conform to the YOLO conventions for grid size $S$, boxes per cell $B$, and class count $C$ (Redmon et al., 2015, Shafiee et al., 2017, Barry et al., 2019, Deutel et al., 2024).
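The five terms of the loss can be traced with a small pure-Python sketch. The data layout here is an assumption made for illustration: one dict per grid cell, and a `responsible` index marking which box predictor owns the object. The default weights $\lambda_\mathrm{coord} = 5$ and $\lambda_\mathrm{noobj} = 0.5$ are the values from the original YOLO paper. The target confidence is supplied directly, whereas YOLO training derives it from the IoU between the predicted and ground-truth boxes.

```python
import math


def yolo_loss(preds, targets, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared YOLO loss over a flat list of S*S grid cells.

    preds[i]   = {"boxes": [(x, y, w, h, conf), ...B entries], "classes": [...C entries]}
    targets[i] = {"responsible": j or None, "box": (x, y, w, h),
                  "conf": float, "classes": [...C entries]}
    """
    loss = 0.0
    for pred, tgt in zip(preds, targets):
        resp = tgt["responsible"]
        for j, (x, y, w, h, conf) in enumerate(pred["boxes"]):
            if j == resp:
                tx, ty, tw, th = tgt["box"]
                # Term 1: centre coordinates.
                loss += lambda_coord * ((x - tx) ** 2 + (y - ty) ** 2)
                # Term 2: square-rooted width and height.
                loss += lambda_coord * (
                    (math.sqrt(w) - math.sqrt(tw)) ** 2
                    + (math.sqrt(h) - math.sqrt(th)) ** 2
                )
                # Term 3: confidence of the responsible box.
                loss += (conf - tgt["conf"]) ** 2
            else:
                # Term 4: down-weighted confidence of boxes with no object (target 0).
                loss += lambda_noobj * conf ** 2
        if resp is not None:
            # Term 5: class probabilities, only for cells that contain an object.
            loss += sum((p - t) ** 2 for p, t in zip(pred["classes"], tgt["classes"]))
    return loss


# Two cells, B = 2 boxes per cell, C = 2 classes (illustrative values).
preds = [
    {"boxes": [(0.5, 0.5, 0.25, 0.25, 0.8), (0.1, 0.1, 0.04, 0.04, 0.2)],
     "classes": [0.9, 0.1]},
    {"boxes": [(0.3, 0.3, 0.1, 0.1, 0.4), (0.0, 0.0, 0.01, 0.01, 0.0)],
     "classes": [0.5, 0.5]},
]
targets = [
    {"responsible": 0, "box": (0.5, 0.5, 0.25, 0.25), "conf": 1.0, "classes": [1.0, 0.0]},
    {"responsible": None, "box": None, "conf": 0.0, "classes": [0.0, 0.0]},
]
total = yolo_loss(preds, targets)
print("loss:", total)
```

In this example the first cell predicts the box exactly, so only the confidence, no-object, and class terms contribute. The second cell is empty and contributes only through the down-weighted no-object term.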

Advanced heads (YOLO-Master, LeYOLO) may incorporate discrete distributional regression (e.g., DFL), parametric decoupling of cls/reg heads, or anchor re-clustering, but always operate in a fully convolutional, end-to-end trainable fashion (Lin et al., 29 Dec 2025, Hollard et al., 2024).

4. Quantitative Performance and Benchmarks

Empirical benchmarks highlight characteristic trade-offs of Fast YOLO variants. Representative results:

Model	Accuracy (%; metric and dataset as noted)	FPS / Latency	Parameter Count	FLOPs	Reference
Fast YOLO (v1)	52.7	155 fps (VOC07)	–	–	(Redmon et al., 2015)
O-YOLOv2	65.1 IoU	11.8 fps (TX1)	17.1M	–	(Shafiee et al., 2017)
Fast YOLO(v2)	65.1 IoU	17.8 fps (TX1)	17.1M	–	(Shafiee et al., 2017)
YOLO-Master-S	45.6	5.2 ms @224x224	–	–	(Lin et al., 29 Dec 2025)
YOLO-Master-M	51.8	9.6 ms @224x224	–	–	(Lin et al., 29 Dec 2025)
LF-YOLO (1.0×)	47.8 (COCO)	61.5 fps (X-ray)	7.3M	4.0G (weld)	(Liu et al., 2021)
YOLO-S	46.7 (AIRES)	8–25 fps (RTX/CPU)	7.85M	34.6 BFLOPs	(Betti, 2022)
LeYOLO-Small@640	38.2	24 fps (TX2, TRT)	1.9M	4.51 GFLOPs	(Hollard et al., 2024)
RCS-YOLO	94.6 (AP50)	114.8 fps (Br35H)	45.7M	94.5G	(Kang et al., 2023)
μYOLO	~56.4 (cls)	3.49 fps (OpenMV M7)	<1M (post-prune)	–	(Deutel et al., 2024)
xYOLO	66.8	9.66 fps (rPi3 B)	0.82M	0.039 BFLOPs	(Barry et al., 2019)

Fast YOLO variants consistently outperform their unpruned or generic baselines in fps and resource occupancy, at the cost of a deliberately bounded loss in accuracy. Modern adaptations (YOLO-Master, RCS-YOLO) are reported to surpass full-size YOLOs in both accuracy and latency by leveraging instance-conditional compute and computation-graph reparameterization (Lin et al., 29 Dec 2025, Kang et al., 2023).
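The benchmark table reports speed either as frames per second or as per-frame latency. A small sketch for putting both on one scale follows; it assumes frames are processed one at a time (batch size 1, no pipelining), so that latency in milliseconds is simply 1000 divided by fps. The figures are copied from the table, and the helper names are hypothetical.

```python
def fps_to_latency_ms(fps):
    """Per-frame latency in milliseconds, assuming one frame is processed at a time."""
    return 1000.0 / fps


def latency_ms_to_fps(latency_ms):
    """Throughput in frames per second under the same one-frame-at-a-time assumption."""
    return 1000.0 / latency_ms


# (model, unit of the reported figure, reported value)
reported = [
    ("Fast YOLO (v1)", "fps", 155.0),
    ("YOLO-Master-S", "ms", 5.2),
    ("YOLO-Master-M", "ms", 9.6),
    ("RCS-YOLO", "fps", 114.8),
    ("μYOLO", "fps", 3.49),
]

normalized = {}
for name, unit, value in reported:
    fps = value if unit == "fps" else latency_ms_to_fps(value)
    normalized[name] = (round(fps, 1), round(fps_to_latency_ms(fps), 2))

for name, (fps, ms) in normalized.items():
    print(f"{name:16s} {fps:7.1f} fps  {ms:8.2f} ms")
```

The conversion only aligns units. The rows were measured on different hardware, datasets, and input resolutions, so the normalized numbers are still not directly comparable across models.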

5. Adaptive and Hardware-Specific Design Strategies

Fast YOLO detectors for embedded or edge deployment exploit several adaptive design mechanisms:

Motion-Adaptive Inference: Skips model runs on frames with little change (Fast YOLO v2), using a $1 \times 1$ conv-based motion map and running deep inference only on non-trivial inputs, for a 38% reduction in deep inference (Shafiee et al., 2017).
Dynamic Routing / Instance-Conditional Compute: ES-MoE blocks in YOLO-Master activate only relevant experts for a given spatial position, guided by a trainable routing MLP, yielding targeted FLOP expenditure and substantially reduced runtime in simple scenes (Lin et al., 29 Dec 2025).
Split-Scale Network Heads and Feature Fusion: Heads and feature fusers (e.g., FPANet in LeYOLO, RMF in LF-YOLO) are pruned to depthwise/pointwise convs, Ghost convolutions or minimal pyramidal fusion units to accelerate both forward and fusion time (Hollard et al., 2024, Liu et al., 2021).
Binary/Quantized Layers and Pruning: Selective use of XNOR (binary) convolutions and 8-bit quantization, especially in later stages, ensures minimal memory footprint and inference time at the modest penalty of representational richness, suitable for MCUs and very low-end CPUs (Barry et al., 2019, Deutel et al., 2024).
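The control flow of motion-adaptive inference from the list above can be sketched as follows. This is a simplification under stated assumptions: the learned $1 \times 1$ conv-based motion map is replaced by a hand-written mean absolute pixel difference, the threshold is arbitrary, and `MotionGatedDetector` and `fake_detector` are hypothetical names introduced for illustration.

```python
def motion_score(reference, frame):
    """Mean absolute pixel difference; a stand-in for the learned motion map."""
    total, count = 0.0, 0
    for row_ref, row_new in zip(reference, frame):
        for a, b in zip(row_ref, row_new):
            total += abs(a - b)
            count += 1
    return total / count


class MotionGatedDetector:
    """Runs the expensive detector only when the frame differs enough from the reference."""

    def __init__(self, detector, threshold):
        self.detector = detector
        self.threshold = threshold
        self.reference = None   # last frame on which deep inference ran
        self.cached = None      # detections from that frame
        self.deep_runs = 0

    def __call__(self, frame):
        if self.reference is None or motion_score(self.reference, frame) > self.threshold:
            self.cached = self.detector(frame)
            self.reference = frame
            self.deep_runs += 1
        return self.cached


def fake_detector(frame):
    """Placeholder for a real network: returns the pixel sum as its 'detection'."""
    return sum(sum(row) for row in frame)


frames = [
    [[0, 0], [0, 0]],   # first frame: always run
    [[0, 0], [0, 1]],   # small change (score 0.25): reuse cached result
    [[4, 4], [4, 4]],   # large change (score 4.0): run again
    [[4, 4], [4, 4]],   # no change: reuse cached result
]
gate = MotionGatedDetector(fake_detector, threshold=0.5)
outputs = [gate(f) for f in frames]
print("deep runs:", gate.deep_runs, "of", len(frames), "frames")
```

Comparing against the last frame that was actually processed, rather than the immediately preceding frame, keeps slow gradual change from being skipped indefinitely. This is a design choice of the sketch, not a detail taken from the cited work.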

6. Applications and Deployment Domains

Fast YOLO has driven real-time object detection across diverse low-resource environments:

Video object detection on mobile GPUs / embedded SoCs: e.g., Fast YOLO achieving ~18 fps on Nvidia Jetson TX1 (Shafiee et al., 2017).
Industrial inspection: LF-YOLO in weld-defect X-ray analysis, >60 fps on modern GPU (GTX 2070), >30 fps FP16 on portable edge accelerators (Jetson Xavier NX) (Liu et al., 2021).
Aerial image analysis with small targets: YOLO-S for helicopter-view car/person detection, working robustly at small object sizes with fast inference (Betti, 2022).
Medical imaging: RCS-YOLO for brain tumor detection, achieving ~115 fps on RTX 3090 and outperforming YOLOv6/YOLOv7/YOLOv8 in AP₅₀ (Kang et al., 2023).
Microcontroller and IoT: μYOLO, LeYOLO-Nano, and xYOLO target sub-1 MB flash budgets and run on ARM Cortex-M series microcontrollers or Raspberry Pi boards with substantially reduced (sub-10 MB) memory footprints (Deutel et al., 2024, Hollard et al., 2024, Barry et al., 2019).
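To see why 8-bit quantization matters for the flash budgets mentioned above, the sketch below estimates weight storage from a parameter count and applies symmetric per-tensor int8 quantization to a toy weight list. It assumes 1 MB = 10^6 bytes and counts weights only, not activations, code, or metadata. The 0.82M figure is the xYOLO parameter count from the benchmark table, used purely for arithmetic; the function names are hypothetical and this is not any cited model's quantization scheme.

```python
def weight_footprint_mb(param_count, bits_per_weight):
    """Storage needed for the weights alone, in megabytes (10^6 bytes)."""
    return param_count * bits_per_weight / 8 / 1e6


def quantize_int8(weights):
    """Symmetric per-tensor quantization to signed 8-bit integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point weights."""
    return [q * scale for q in quantized]


params = 820_000  # 0.82M parameters, as listed for xYOLO in the table
fp32_mb = weight_footprint_mb(params, 32)
int8_mb = weight_footprint_mb(params, 8)
print(f"float32: {fp32_mb:.2f} MB, int8: {int8_mb:.2f} MB")

weights = [0.42, -1.0, 0.07, 0.0, 0.9]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", q, "max reconstruction error:", max_error)
```

Moving from 32-bit floats to 8-bit integers cuts weight storage by a factor of four, and the rounding error per weight is bounded by half the quantization step `scale`.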

A plausible implication is that further integration of dynamic computation (routing, reparameterization), model compression (pruning, quantization), and hardware-tailored operators will define the ongoing evolution of Fast YOLO models.

7. Limitations and Future Directions

The trade-off for extreme speed or resource optimization manifests primarily as reduced localization precision, decreased recognition of small or overlapping objects (due to coarser grids or shallow heads), and diminished robustness in highly complex scenes. Approaches such as sparse MoE, reparameterization, and advanced neck fusion (FPANet) offer mitigation but may introduce hardware constraints or parallelization bottlenecks. Further work is likely on optimizing channel expansion, dynamic kernels, and fused operator design to maintain the speed–accuracy Pareto frontier as hardware and application requirements evolve (Lin et al., 29 Dec 2025, Hollard et al., 2024).

In summary, Fast YOLO encompasses a spectrum of methods unified by an emphasis on high inference speed, efficient computation, and architecture–hardware co-design, maintaining competitive accuracy for real-time object detection across a range of application contexts (Redmon et al., 2015, Shafiee et al., 2017, Lin et al., 29 Dec 2025, Liu et al., 2021, Kang et al., 2023, Hollard et al., 2024, Deutel et al., 2024, Betti, 2022, Barry et al., 2019).