YOLOBench: Embedded YOLO Detection Benchmark

Updated 14 March 2026

YOLOBench is a comprehensive benchmarking suite for over 550 YOLO-based one-stage detectors on embedded systems.
It employs a uniform training and inference protocol to objectively compare model accuracy and latency across diverse hardware platforms.
The framework also integrates zero-cost accuracy estimators to facilitate rapid neural architecture search in resource-constrained environments.

YOLOBench is a large-scale benchmarking suite for the objective evaluation and comparison of more than 550 YOLO-based one-stage object detectors on embedded systems. It provides a unified experimental protocol that measures accuracy against inference latency across diverse hardware targets and application datasets under a strictly controlled training regime. The benchmark centers on Pareto-optimality analysis, which allows a principled study of the trade-off between model accuracy and real-world deployment efficiency. YOLOBench also systematically assesses zero-cost accuracy estimators for neural architecture search (NAS), identifying practical surrogates for rapid YOLO variant selection in resource-constrained inference scenarios (Lazarevich et al., 2023).

1. Scope and Design Rationale

YOLOBench addresses the absence of standardized benchmarks tailored to the deployment characteristics of object detectors on embedded and edge-class hardware. Unlike prior academic benchmarks that focus primarily on classification or rely on isolated, hand-picked YOLO variants, YOLOBench systematically evaluates a broad cross-section of model architectures, scales, and training configurations. The evaluated detector family spans YOLOv3, YOLOv4, YOLOv5, YOLOv6 (v3.0), YOLOv7, and YOLOv8, encompassing a wide range of architectural backbones (e.g., DarkNet53, CSPDarkNet53, EfficientRep, E-ELAN), neck designs, and detection heads. Each model is tested at multiple width and depth multipliers and across different input image resolutions, yielding thorough coverage of the design space relevant for real-time and low-latency applications.

2. Benchmark Methodology and Evaluated Models

The evaluation protocol enforces a tightly controlled training and inference environment to ensure cross-model comparability. All models are trained using a uniform codebase (Ultralytics YOLOv8 and Deeplite-Torch-Zoo), applying identical detection heads (anchor-free, decoupled), loss functions (CIoU and Distribution Focal Loss), optimizers, and data augmentation schemes. No ImageNet-pretrained backbones are used, ensuring the observed performance is attributable to detector architecture and not initialization artifacts.

Models are trained on MS COCO train2017 for pretraining (over 300 epochs), followed by fine-tuning on domain-specific datasets (PASCAL VOC, SKU-110k, WIDER FACE) where appropriate. Fine-tuning is performed with best COCO-pretrained weights, consistent batch sizes, and rigorous selection of optimal checkpoints based on mAP@[.50:.95]. Model scale variants—achieved via width multipliers (0.25, 0.5, 0.75, 1.0), depth multipliers (0.33, 0.67, 1.0), and input resolutions (from 160×160 to 480×480)—enable fine-grained exploration of computational and accuracy trade-offs (Lazarevich et al., 2023).
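The scaling grid described above can be enumerated in a few lines. The following is a minimal sketch, not the benchmark's own tooling: the width and depth multipliers come from the text, while the 32-pixel resolution step is an assumption (only the endpoints 160 and 480 are stated; a step of 32 gives 11 resolutions, which is consistent with 52 × 11 = 572 models per dataset reported in Section 4). The `variant_name` helper and its label format are hypothetical.

```python
from itertools import product

WIDTHS = (0.25, 0.5, 0.75, 1.0)   # width multipliers listed in the text
DEPTHS = (0.33, 0.67, 1.0)        # depth multipliers listed in the text
# Assumption: resolutions advance in steps of 32 px between the stated endpoints.
RESOLUTIONS = tuple(range(160, 480 + 1, 32))


def scale_variants(widths=WIDTHS, depths=DEPTHS):
    """Return every (depth, width) scaling pair for one backbone/neck design."""
    return list(product(depths, widths))


def variant_name(family, depth, width, resolution):
    """Build a hypothetical label; the benchmark's own naming may differ."""
    return f"{family}_d{depth}_w{width}_{resolution}"


variants = scale_variants()
print(len(variants))      # number of depth/width pairs per design
print(len(RESOLUTIONS))   # number of resolutions under the step-32 assumption
print(variant_name("yolov6s", 0.67, 0.25, 480))
```

Not every pair in this grid necessarily reaches the final benchmark, since the preselection step in Section 4 prunes the candidate set before full evaluation.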

3. Hardware Platforms and Latency Measurement

YOLOBench targets inference on four representative embedded hardware classes:

Hardware Platform	Inference Framework	Precision
NVIDIA Jetson Nano GPU	ONNX Runtime	FP32
Khadas VIM3 NPU	AML NPU SDK	INT16
Raspberry Pi 4 Model B CPU	TFLite + XNNPACK	FP32
Intel Core i7-10875H CPU	OpenVINO	FP32

For each platform, latency is the mean inference time per image over 200 runs at batch size 1, preceded by 5 warm-up runs. Post-processing (NMS, box decoding) is excluded where the platform does not support it (the NPU).
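The timing procedure can be sketched as a small harness. This is an illustrative outline only: `measure_latency_ms` and `dummy_infer` are hypothetical names, and a real measurement would replace the stand-in workload with a call into the platform's inference runtime (ONNX Runtime, TFLite, OpenVINO, or the NPU SDK) on a single preprocessed image.

```python
import time
from statistics import mean


def measure_latency_ms(infer, sample, warmup=5, runs=200):
    """Mean wall-clock time per call of infer(sample), in milliseconds."""
    for _ in range(warmup):
        infer(sample)  # warm-up calls are executed but not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings.append((time.perf_counter() - start) * 1e3)
    return mean(timings)


def dummy_infer(image):
    """Stand-in workload; a real run would invoke the device's inference session."""
    return sum(image)


latency = measure_latency_ms(dummy_infer, [0.0] * 1000)
print(f"{latency:.4f} ms per image")
```

Passing one sample per call corresponds to batch size 1; whether post-processing is inside `infer` decides whether it is counted in the reported latency.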

4. Metrics and Pareto Optimality Analysis

YOLOBench quantitatively evaluates candidates along three primary axes:

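Accuracy: mAP@[.50:.95], defined as $\mathrm{mAP} = \frac{1}{|T|} \sum_{t \in T} \mathrm{AP}_t$, where $T$ is the set of IoU thresholds from 0.50 to 0.95.
Computational Complexity: Multiply-accumulate operations per model, $\mathrm{MACs} = \sum_{l=1}^{L} H_l W_l K_l^2 C_{l-1} C_l$, summed over the $L$ convolutional layers, where $H_l \times W_l$ is the output feature-map size, $K_l$ the kernel size, and $C_{l-1}$ and $C_l$ the input and output channel counts of layer $l$.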
Latency: Hardware-measured inference time.
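The accuracy and complexity formulas above can be checked on small numbers. The sketch below assumes the usual COCO convention of ten IoU thresholds spaced 0.05 apart for the set $T$, and treats each layer as a plain (ungrouped, bias-free) convolution; the per-threshold AP values are made up purely for illustration.

```python
def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 (assumed to be the set T)."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def mean_ap(ap_by_threshold):
    """Average the per-threshold AP values: (1/|T|) * sum of AP_t."""
    return sum(ap_by_threshold.values()) / len(ap_by_threshold)


def conv_macs(h_out, w_out, kernel, c_in, c_out):
    """MACs of one standard convolution: H * W * K^2 * C_in * C_out."""
    return h_out * w_out * kernel * kernel * c_in * c_out


# Illustrative AP values only; they are not results from the benchmark.
ap = {t: 1.0 - t for t in coco_iou_thresholds()}
print(round(mean_ap(ap), 3))

# One 3x3 convolution mapping 16 to 32 channels on an 80x80 output map.
print(conv_macs(80, 80, 3, 16, 32))  # 29491200
```

Summing `conv_macs` over all layers gives the model-level MAC count; grouped or depthwise convolutions would need the channel term adjusted.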

Pareto-optimality is defined such that model $A$ dominates model $B$ iff $\mathrm{latency}_A \leq \mathrm{latency}_B$ and $\mathrm{mAP}_A \geq \mathrm{mAP}_B$, with at least one strict inequality. The first Pareto frontier is the set of non-dominated models per device/dataset configuration.
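The dominance rule translates directly into code. The benchmark itself computes fronts with OApackage; the pure-Python version below is a swapped-in quadratic-time sketch meant only to make the definition concrete. The `Model` type and all latency and mAP values are hypothetical.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    latency_ms: float
    map_score: float


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.map_score >= b.map_score
    strictly_better = a.latency_ms < b.latency_ms or a.map_score > b.map_score
    return no_worse and strictly_better


def pareto_front(models):
    """Return the models that no other model dominates, in input order."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Hypothetical measurements for one device/dataset configuration.
models = [
    Model("a", 40, 0.42),
    Model("b", 55, 0.40),
    Model("c", 80, 0.51),
    Model("d", 80, 0.47),
    Model("e", 120, 0.55),
]
print([m.name for m in pareto_front(models)])  # ['a', 'c', 'e']
```

Here `b` is dominated by `a` (slower and less accurate), and `d` by `c` (same latency, lower mAP), so neither appears on the front.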

Proxy-based preselection is employed to efficiently explore the architecture/resolution trade space: approximately 1,000 candidates are trained from scratch on VOC, and the top-2 Pareto fronts by VOC mAP are selected for full multi-dataset, multi-resolution evaluation (~52 backbone/neck combinations propagate to the COCO stage, yielding 572 models/dataset). Pareto fronts are computed using OApackage (Lazarevich et al., 2023).
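Selecting the top-2 Pareto fronts amounts to peeling: compute the first front, remove it, and compute the front of what remains. The following is a minimal sketch of that idea, not the OApackage-based procedure used by the benchmark; the function names and the candidate values are hypothetical.

```python
def first_front(points):
    """Names of points not dominated by any other point.

    points maps a candidate name to a (latency, mAP) tuple.
    """
    def is_dominated(p, q):
        # q dominates p: no worse on both axes and not identical to p.
        return q[0] <= p[0] and q[1] >= p[1] and q != p

    return {
        name
        for name, p in points.items()
        if not any(is_dominated(p, q) for q in points.values())
    }


def top_k_fronts(points, k=2):
    """Peel off up to k successive Pareto fronts."""
    remaining = dict(points)
    fronts = []
    for _ in range(k):
        if not remaining:
            break
        front = first_front(remaining)
        fronts.append(front)
        for name in front:
            del remaining[name]
    return fronts


# Hypothetical (latency in ms, VOC mAP) pairs for six candidates.
candidates = {
    "a": (40, 0.42), "b": (55, 0.40), "c": (80, 0.51),
    "d": (80, 0.47), "e": (120, 0.55), "f": (130, 0.45),
}
fronts = top_k_fronts(candidates, k=2)
print([sorted(front) for front in fronts])  # [['a', 'c', 'e'], ['b', 'd']]
```

Keeping a second front retains near-optimal candidates such as `b` and `d`, which may move onto the first front once they are retrained on another dataset or measured on another device.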

5. Main Results and Observations

Analysis reveals that under a harmonized training protocol, multiple YOLO versions—spanning v3 through v8—appear on Pareto frontiers, sometimes counter to prior assumptions about the obsolescence of older architectures. Depth and width scaling reductions are generally prioritized over input resolution decreases to meet strict latency constraints. Notable findings include:

On Jetson Nano GPU, YOLOv5 through YOLOv8 variants are equally represented on optimal fronts under 100 ms latency (approximately 0.556 mAP for YOLOv6ₛ with $d_{0.67}w_{0.25}$ at 480×480 resolution).
On Khadas VIM3 NPU, YOLOv6 variants dominate for sub-50 ms latency.
On ARM and x86 CPUs, mixes of YOLOv5, YOLOv7, and some YOLOv3/v4 at higher latency are optimal.
Even older architectures (YOLOv3/v4) remain Pareto-efficient at higher latencies or on certain devices.
In case studies (e.g., Raspberry Pi 4), alternative backbones (FBNetV3-D+PAN-C3) identified via zero-cost proxies outperform YOLOv8-small both in speed and mAP (test mAP: 43.87% vs. 43.17%; latency: 1,355 ms vs. 1,476 ms) (Lazarevich et al., 2023).

6. Zero-Cost Accuracy Estimators for Neural Architecture Search

YOLOBench rigorously benchmarks common zero-cost (training-free) accuracy estimators, including Jacobian Covariance, ZiCo, Zen, Fisher, SNIP, SynFlow, NWOT, MAC count, and parameter count. Their ranking fidelity is measured using Kendall’s $\tau$ (overall and on upper-mAP quantiles) and recall of actual Pareto models. While the MAC count predictor outperforms most methods (global $\tau = 0.739$, Pareto recall 12.3%), pre-activation NWOT substantially improves recall (global $\tau = 0.827$, Pareto recall 29.2%), approaching the VOC scratch mAP proxy ($\tau = 0.847$, recall 36.9%). By considering multiple zero-cost Pareto fronts, up to 90% of true Pareto models are recoverable after evaluating only 30% of model candidates, highlighting the practical utility of these proxies in NAS settings (Lazarevich et al., 2023).
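Both fidelity measures are simple to compute. The sketch below assumes the tau-a variant of Kendall's coefficient (ties count as neither concordant nor discordant), since the text does not say which variant is used, and defines Pareto recall as the fraction of true Pareto models that also appear on the proxy-predicted front. All scores and model names are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def pareto_recall(predicted_front, true_front):
    """Fraction of true Pareto models that the proxy front recovers."""
    true_set = set(true_front)
    return len(set(predicted_front) & true_set) / len(true_set)


# Hypothetical proxy scores and trained mAP values for five models.
proxy_scores = [1, 2, 3, 4, 5]
trained_map = [0.30, 0.35, 0.33, 0.50, 0.48]
print(kendall_tau(proxy_scores, trained_map))  # 0.6

print(pareto_recall({"a", "c", "d"}, {"a", "c", "e"}))
```

In practice `scipy.stats.kendalltau` can replace the hand-written loop; it computes the tau-b variant by default, which differs from tau-a only when the inputs contain ties.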

7. Conclusions and Recommendations

YOLOBench establishes reproducible, transparent baselines for multi-platform, multi-dataset evaluation of YOLO-based detectors. Core findings include:

Even under a uniform training environment, legacy YOLO models can remain competitive, particularly under hardware constraints that are rarely reported in published comparisons.
Device- and latency-targeted model selection should reduce width and depth before reducing input resolution.
For sub-100 ms GPU deployment, YOLOv6ₛ/m and YOLOv5 are preferred; for NPUs, YOLOv6 variants are optimal at low latency; for CPUs, larger YOLOv6 or YOLOv7 models suffice within 500 ms.
Pre-activation NWOT provides a practical, rapid proxy for zero-cost NAS candidate discovery.
All code, evaluation data, and full model statistics are open-sourced, supporting further NAS research and efficient deployment design (Lazarevich et al., 2023).

A plausible implication is that YOLOBench can facilitate both device-aware detector design and principled NAS strategy development, especially when extending to new hardware or emerging YOLO-style architectures. The methodology underscores the importance of systematic benchmarking as a prerequisite for fair, transferable NAS and model selection in the rapidly evolving object detection landscape.


References

Lazarevich et al. (2023). YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems.
