MLPerf Tiny Benchmark

Updated 27 January 2026
  • MLPerf Tiny Benchmark is an industry-standard suite that evaluates ultra-low-power neural network inference on resource-constrained devices with strict power, memory, and accuracy targets.
  • It employs a modular workflow with Python training scripts, model quantization pipelines, and C/C++ firmware to enable consistent, reproducible measurements across various hardware platforms.
  • The benchmark drives research in model compression, quantization strategies, and energy-efficient on-device inference, fostering innovation in both academic and industry settings.

MLPerf Tiny Benchmark is an industry-standard suite for measuring the inference performance (accuracy, latency, and energy) of ultra-low-power neural networks on resource-constrained embedded platforms. Developed through collaboration among more than fifty organizations from academia and industry, MLPerf Tiny fills a previously unmet need for reproducible, apples-to-apples benchmarks targeting sub-milliwatt microcontroller-class devices and specialized inference accelerators. The suite supports both strictly controlled (closed) and flexible (open) benchmarking divisions and has directly enabled research into model compression, quantization, and efficient on-device execution (Banbury et al., 2021, Banbury et al., 2020).

1. Motivations and Design Objectives

MLPerf Tiny addresses the absence of standardized evaluation infrastructure for TinyML, defined as ML inference under 1 mW active power on embedded hardware. Its primary goal is to enable fair comparison across hardware, software, and algorithmic stacks by measuring cost–performance trade-offs with high granularity.

Key requirements covered by the benchmark are:

  • Sub-milliwatt operational range (μW–mW), reflecting target application constraints in battery-powered and always-on edge devices.
  • Strict RAM (≤512 kB) and flash (≤1 MB) resource thresholds, chosen to be compatible with contemporary MCU-class parts.
  • Support for a wide spectrum of hardware (MCUs, DSPs, FPGAs, ASICs, and in-memory compute chips) and software stacks (from hand-optimized C to full compilers and interpreters).
  • Explicit handling of measurement challenges, such as end-to-end energy profiling and memory/latency reporting (Banbury et al., 2021, Banbury et al., 2020).

2. Benchmark Suite Structure and Workflow

MLPerf Tiny implements a modular benchmarking setup designed for maximal reproducibility and adaptability:

  • Each benchmark includes Python training scripts (typically TensorFlow or PyTorch), reference model quantization pipelines (TFLite or ONNX-based), and C/C++ implementations for on-device inference (TFLite Micro or custom backends).
  • The reference codebase is divided into five main components: dataset preprocessing, model training, model conversion/quantization, firmware inference code, and the hardware device under test (DUT).
  • The measurement harness consists of an energy meter (e.g., Joulescope or LPM01A), host/target communication (serial/GPIO with synchronization), and an automated GUI for control and result logging.
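
The GPIO-synchronized energy measurement can be sketched as a simple integration of meter samples over the marked inference window. The sample format, function names, and 3.3 V supply below are illustrative assumptions, not the actual harness API:

```python
# Sketch: integrate energy-meter current samples over a GPIO-marked
# inference window. Sample format and supply voltage are assumptions.

def energy_in_window(samples, t_start, t_end, v_supply=3.3):
    """Trapezoidal integration of V * I dt over [t_start, t_end].

    samples: list of (time_s, current_A) tuples, sorted by time.
    Returns energy in joules.
    """
    window = [(t, i) for t, i in samples if t_start <= t <= t_end]
    energy = 0.0
    for (t0, i0), (t1, i1) in zip(window, window[1:]):
        energy += v_supply * 0.5 * (i0 + i1) * (t1 - t0)
    return energy

# Synthetic trace: device draws ~30 mA, sampled at 1 kHz for 20 ms;
# the GPIO edges mark a ~10 ms inference window.
samples = [(k * 1e-3, 0.030) for k in range(20)]
e = energy_in_window(samples, t_start=0.0045, t_end=0.0155)
# ~3.3 V * 30 mA * 10 ms, i.e. roughly 0.99 mJ
```

In the real harness the window boundaries come from GPIO edge timestamps recorded by the energy meter, which is what keeps host/communication overhead out of the measurement.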

Closed division submissions must use the reference datasets, models, and quality targets with post-training quantization only, ensuring strict comparability. Open division submissions may alter the architecture, quantization, or training but must report performance on the standard test datasets, providing documentation of any changes (Banbury et al., 2021).

3. Benchmark Tasks and Model Reference Set

MLPerf Tiny v0.5/v0.7 comprises four canonical TinyML workloads:

  • Keyword Spotting (KWS): SpeechCommands v1/v2 (49×10 input); reference model DS-CNN (38.6K params); closed-division target ≥ 90% top-1 accuracy.
  • Visual Wake Words (VWW): MSCOCO-derived VWW dataset (96×96 RGB); MobileNetV1-0.25× (325K params); target ≥ 80% top-1 accuracy.
  • Image Classification (IC): CIFAR-10 (32×32×3); Mini-ResNet (96K params); target ≥ 85% top-1 accuracy.
  • Anomaly Detection (AD): ToyADMOS (5×128 input); FC-autoencoder (270K params); target ROC-AUC ≥ 0.85.
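
The AD task is scored by ROC-AUC rather than top-1 accuracy. ROC-AUC equals the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one; a minimal sketch with synthetic scores:

```python
# Sketch: ROC-AUC via the Mann-Whitney U statistic -- the probability
# that a random anomalous sample outscores a random normal sample.
# Scores and labels below are synthetic.

def roc_auc(scores, labels):
    """labels: 1 = anomalous, 0 = normal; higher score = more anomalous."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.75, 0.4, 0.3, 0.2]
labels = [1,   1,   0,    1,   0,   0]
auc = roc_auc(scores, labels)  # 8 of 9 pairs ranked correctly
```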

4. Metrics, Measurement Methodology, and Instrumentation

Performance evaluation centers on:

  • Accuracy: Top-1 classification accuracy (ROC-AUC for AD), computed as $\text{Accuracy} = \#\{\text{correct predictions}\}/N$.

  • Latency: Median time per inference, $L = T_\mathrm{end} - T_\mathrm{start}$, over $N$ trials (minimum 10), reported in ms.

  • Energy per Inference: Average energy consumed per sample, $E_\text{inf} = E_\text{total}/N$, with $E_\text{total}$ obtained either by integrating instantaneous power or as $(I_\text{active} - I_\text{idle}) \cdot V_\text{supply} \cdot T_\text{inf}$ (Borras et al., 2022).

  • Throughput: Inferences per second, $\text{IPS} = 1/L$.

Composite metrics such as the Energy–Delay Product (EDP, $E_\text{inf} \times L$) are also reported.
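
Putting the definitions above together, the reported metrics can be derived from per-trial measurements as follows (trial values and variable names are synthetic):

```python
# Sketch: deriving MLPerf Tiny metrics from per-trial measurements.
from statistics import median

latencies_s = [0.0102, 0.0100, 0.0101, 0.0099, 0.0100]  # >= 10 trials in practice
e_total_j = 0.0050            # energy integrated over all N trials (J)
n = len(latencies_s)

lat = median(latencies_s)     # L: median latency per inference (s)
e_inf = e_total_j / n         # E_inf: energy per inference (J)
ips = 1.0 / lat               # throughput, inferences per second
edp = e_inf * lat             # composite energy-delay product
```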

The on-device protocol requires strict isolation of power measurement to exclude host/communication overhead, ensures all model weights and activations reside in on-chip memory, and logs full configuration data for reproducibility (Banbury et al., 2021). Energy measurement is synchronized via GPIO edge tracing (Borras et al., 2022).

5. Representative Implementations and Results

The benchmark suite has catalyzed open-source implementations across a variety of platforms:

  • Reference MCUs: The ST NUCLEO-L4R5ZI (Cortex-M4 @ 80 MHz, 1 MB flash, 128 kB RAM) achieves ≈1,000 inf/s at ≈100 µJ/inference for KWS; ≈120 inf/s at ≈1.2 mJ for VWW; ≈100 inf/s at ≈900 µJ for IC; and ≈25 inf/s at ≈1.5 mJ for AD, all at or above target accuracies (Banbury et al., 2021).
  • FPGA Open-Division Submissions: hls4ml and FINN workflows implement, for example, image classification with “tiny-ResNet-8” (58K params, 83.5% accuracy, 8-bit QAT) and binary CNNs (CNV-W1A1, 84.5%), and KWS with MLPs at 3-bit quantization (82.5%) (Borras et al., 2022). Empirical energy per inference for AD/KWS can drop below 30 µJ at ≈20–33 µs latency on Arty A7-100T.
  • Early-Exit Architectures: T-RecX demonstrates, across ResNet-8/CIFAR-10, DS-CNN/SpeechCommands, and MobileNetV1/VWW, that inserting a single intermediate classifier can reduce average FLOPS by 20%–38% for ≤1% accuracy loss, outperforming BranchyNet and SDN in parameter efficiency and realized FLOPS reductions (Ghanathe et al., 2022).

FPGA and neural-processing accelerators demonstrate order-of-magnitude improvements in inference rate (IPS > 5K) and energy per inference, at the expense of higher quiescent power and greater toolchain customization.

6. Submission, Scoring, and Reproducibility Protocols

MLPerf Tiny requires all submitters, in both divisions, to:

  • Provide full source code, binary builds, and calibration logs.
  • Document measurement scripts and dataset/model modifications (open division).
  • Report median latency and energy per inference over at least five independent trials.
  • Offer a compliance checklist covering model, activation, and stack RAM/flash footprints.

The primary scoring metric in the closed division is latency, with energy per inference as a tiebreaker and an optional composite EDP ranking. All closed-division entries are validated by independent rebuilding on pre-specified reference hardware (e.g., STM32H7, Arduino Nano 33 BLE), ensuring exact reproducibility (Banbury et al., 2021, Banbury et al., 2020).
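
The latency-primary, energy-tiebreaker ordering amounts to a lexicographic sort; a minimal sketch with hypothetical submissions:

```python
# Sketch: closed-division ranking -- primary key latency, tiebreaker
# energy per inference. Submission records are hypothetical.
submissions = [
    {"name": "mcu-a",  "latency_ms": 10.0, "energy_uj": 120.0},
    {"name": "mcu-b",  "latency_ms": 10.0, "energy_uj": 95.0},
    {"name": "fpga-c", "latency_ms": 0.03, "energy_uj": 28.0},
]
ranked = sorted(submissions, key=lambda s: (s["latency_ms"], s["energy_uj"]))
order = [s["name"] for s in ranked]
# fpga-c ranks first on latency; mcu-b beats mcu-a on the energy tiebreak
```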

7. Research Impact, Challenges, and Future Directions

The MLPerf Tiny benchmark has become central infrastructure for research on TinyML model compression, quantization, and hardware/software co-design. Its modular, open framework enables:

  • Comparative study of quantization granularity effects, e.g., per-layer mixed-precision, binary/ternary networks, quantization-aware training versus post-training quantization.
  • Transferability of methods such as layer fusion, spatial streaming, and activation pipelining from FPGAs to MCUs and ASIC accelerators (Borras et al., 2022).
  • Early-exit and conditional computation empirical study on resource-constrained networks (Ghanathe et al., 2022).
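
As an illustration of quantization granularity, the following sketch contrasts symmetric int8 per-tensor scaling (one scale for the whole weight matrix) with per-channel scaling (one scale per output channel) on synthetic weights. All names are hypothetical; this is not the MLPerf Tiny reference pipeline:

```python
# Sketch: symmetric int8 quantization at two granularities.
# Per-channel scales preserve resolution when channel ranges differ widely.

def quantize_sym_int8(w, scale):
    return max(-128, min(127, round(w / scale)))

def per_tensor(weights):
    # Single scale derived from the global max-abs weight.
    scale = max(abs(w) for row in weights for w in row) / 127.0
    return [[quantize_sym_int8(w, scale) for w in row] for row in weights], scale

def per_channel(weights):
    # One scale per row (output channel).
    qrows, scales = [], []
    for row in weights:
        scale = max(abs(w) for w in row) / 127.0
        scales.append(scale)
        qrows.append([quantize_sym_int8(w, scale) for w in row])
    return qrows, scales

weights = [[0.50, -0.25],   # channel 0: large dynamic range
           [0.02,  0.01]]   # channel 1: much smaller range
qt, s_t = per_tensor(weights)
qc, s_c = per_channel(weights)
# With a per-tensor scale, channel 1 collapses onto a few int8 levels;
# with per-channel scales it uses the full [-127, 127] range.
```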

Ongoing technical challenges include: accurate ultra-low-power measurement (μW–mW dynamic range), memory and code-size constraints, and fair normalization across diverse hardware and software stacks. Present limitations are the focus on four core tasks, the closed-division restriction to 8-bit quantization, and the lack of multi-core/heterogeneous pipeline or genuinely streaming workloads.

Planned extensions propose: real-time streaming detection (e.g., fall detection), expanded evaluation of binary/mixed-precision models, new task types (object/NLP detection, multi-sensor fusion), refined power domain partitioning, and addition of quality-of-service metrics such as false-alarm rates. All tools, benchmarks, and rules are continuously stewarded via the MLCommons open repository and working group structure (Banbury et al., 2021, Banbury et al., 2020, Borras et al., 2022).
