MLPerf Tiny Benchmark

Updated 27 January 2026
  • MLPerf Tiny Benchmark is an industry-standard suite that evaluates ultra-low-power neural network inference on resource-constrained devices with strict power, memory, and accuracy targets.
  • It employs a modular workflow with Python training scripts, model quantization pipelines, and C/C++ firmware to enable consistent, reproducible measurements across various hardware platforms.
  • The benchmark drives research in model compression, quantization strategies, and energy-efficient on-device inference, fostering innovation in both academic and industry settings.

MLPerf Tiny Benchmark is an industry-standard suite for measuring the inference performance (accuracy, latency, and energy) of ultra-low-power neural networks on resource-constrained embedded platforms. Developed through collaboration among more than fifty organizations from academia and industry, MLPerf Tiny fills a previously unmet need for reproducible, apples-to-apples benchmarks targeting sub-milliwatt microcontroller-class devices and specialized inference accelerators. The suite supports both strictly controlled (closed) and flexible (open) benchmarking divisions and has directly enabled research into model compression, quantization, and efficient on-device execution (Banbury et al., 2021, Banbury et al., 2020).

1. Motivations and Design Objectives

MLPerf Tiny addresses the absence of standardized evaluation infrastructure for TinyML, defined as ML inference under 1 mW active power on embedded hardware. Its primary goal is to enable fair comparison across hardware, software, and algorithmic stacks by measuring cost–performance trade-offs with high granularity.

Key requirements covered by the benchmark are:

  • Sub-milliwatt operational range (μW–mW), reflecting target application constraints in battery-powered and always-on edge devices.
  • Strict RAM (≤512 kB) and flash (≤1 MB) resource thresholds, chosen to be compatible with contemporary MCU-class parts.
  • Support for a wide spectrum of hardware (MCUs, DSPs, FPGAs, ASICs, and in-memory compute chips) and software stacks (from hand-optimized C to full compilers and interpreters).
  • Explicit handling of measurement challenges, such as end-to-end energy profiling and memory/latency reporting (Banbury et al., 2021, Banbury et al., 2020).

2. Benchmark Suite Structure and Workflow

MLPerf Tiny implements a modular benchmarking setup designed for maximal reproducibility and adaptability:

  • Each benchmark includes Python training scripts (typically TensorFlow or PyTorch), reference model quantization pipelines (TFLite or ONNX-based), and C/C++ implementations for on-device inference (TFLite Micro or custom backends).
  • The reference codebase is divided into five main components: dataset preprocessing, model training, model conversion/quantization, firmware inference code, and the hardware device under test (DUT).
  • The measurement harness consists of an energy meter (e.g., Joulescope or LPM01A), host/target communication (serial/GPIO with synchronization), and an automated GUI for control and result logging.
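
The GPIO-synchronized energy measurement can be sketched as a simple integration of meter samples over the marked inference window. The sample format, function names, and 3.3 V supply below are illustrative assumptions, not the actual harness API:

```python
# Sketch: integrate energy-meter current samples over a GPIO-marked
# inference window. Sample format and supply voltage are assumptions.

def energy_in_window(samples, t_start, t_end, v_supply=3.3):
    """Trapezoidal integration of V * I dt over [t_start, t_end].

    samples: list of (time_s, current_A) tuples, sorted by time.
    Returns energy in joules.
    """
    window = [(t, i) for t, i in samples if t_start <= t <= t_end]
    energy = 0.0
    for (t0, i0), (t1, i1) in zip(window, window[1:]):
        energy += v_supply * 0.5 * (i0 + i1) * (t1 - t0)
    return energy

# Synthetic trace: device draws ~30 mA, sampled at 1 kHz for 20 ms;
# the GPIO edges mark a ~10 ms inference window.
samples = [(k * 1e-3, 0.030) for k in range(20)]
e = energy_in_window(samples, t_start=0.0045, t_end=0.0155)
# ~3.3 V * 30 mA * 10 ms, i.e. roughly 0.99 mJ
```

In the real harness the window boundaries come from GPIO edge timestamps recorded by the energy meter, which is what keeps host/communication overhead out of the measurement.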

Closed division submissions must use the reference datasets, models, and quality targets with post-training quantization only, ensuring strict comparability. Open division submissions may alter the architecture, quantization, or training but must report performance on the standard test datasets, providing documentation of any changes (Banbury et al., 2021).

3. Benchmark Tasks and Model Reference Set

MLPerf Tiny v0.5/v0.7 comprises four canonical TinyML workloads:

  • Keyword Spotting (KWS): SpeechCommands v1/v2 (49×10 input); reference model DS-CNN (38.6K params); closed-division target ≥ 90% top-1 accuracy.
  • Visual Wake Words (VWW): MSCOCO-derived VWW dataset (96×96 RGB); MobileNetV1-0.25× (325K params); target ≥ 80% top-1 accuracy.
  • Image Classification (IC): CIFAR-10 (32×32×3); Mini-ResNet (96K params); target ≥ 85% top-1 accuracy.
  • Anomaly Detection (AD): ToyADMOS (5×128 input); FC-autoencoder (270K params); target ROC-AUC ≥ 0.85.
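
The AD task is scored by ROC-AUC rather than top-1 accuracy. ROC-AUC equals the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one; a minimal sketch with synthetic scores:

```python
# Sketch: ROC-AUC via the Mann-Whitney U statistic -- the probability
# that a random anomalous sample outscores a random normal sample.
# Scores and labels below are synthetic.

def roc_auc(scores, labels):
    """labels: 1 = anomalous, 0 = normal; higher score = more anomalous."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.75, 0.4, 0.3, 0.2]
labels = [1,   1,   0,    1,   0,   0]
auc = roc_auc(scores, labels)  # 8 of 9 pairs ranked correctly
```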

4. Metrics, Measurement Methodology, and Instrumentation

Performance evaluation centers on:

  • Accuracy: Top-1 classification accuracy (ROC-AUC for AD), computed as $\text{Accuracy} = \#\{\text{correct predictions}\}/N$.

  • Latency: Median time per inference, $L = T_\mathrm{end} - T_\mathrm{start}$, over $N$ trials (minimum 10), reported in ms.

  • Energy per Inference: Average energy consumed per sample, $E_\text{inf} = E_\text{total}/N$, with $E_\text{total}$ obtained either by integrating instantaneous power or as $(I_\text{active} - I_\text{idle}) \cdot V_\text{supply} \cdot T_\text{inf}$ (Borras et al., 2022).

  • Throughput: Inferences per second, $\text{IPS} = 1/L$.

Composite metrics such as the Energy–Delay Product (EDP, $E_\text{inf} \times L$) are also reported.
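
Putting the definitions above together, the reported metrics can be derived from per-trial measurements as follows (trial values and variable names are synthetic):

```python
# Sketch: deriving MLPerf Tiny metrics from per-trial measurements.
from statistics import median

latencies_s = [0.0102, 0.0100, 0.0101, 0.0099, 0.0100]  # >= 10 trials in practice
e_total_j = 0.0050            # energy integrated over all N trials (J)
n = len(latencies_s)

lat = median(latencies_s)     # L: median latency per inference (s)
e_inf = e_total_j / n         # E_inf: energy per inference (J)
ips = 1.0 / lat               # throughput, inferences per second
edp = e_inf * lat             # composite energy-delay product
```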

The on-device protocol requires strict isolation of power measurement to exclude host/communication overhead, ensures all model weights and activations reside in on-chip memory, and logs full configuration data for reproducibility (Banbury et al., 2021). Energy measurement is synchronized via GPIO edge tracing (Borras et al., 2022).

5. Representative Implementations and Results

The benchmark suite has catalyzed open-source implementations across a variety of platforms:

  • Reference MCUs: The ST NUCLEO-L4R5ZI (Cortex-M4 @ 80 MHz, 1 MB flash, 128 kB RAM) achieves ≈1,000 inf/s at ≈100 µJ/inference for KWS; ≈120 inf/s at ≈1.2 mJ for VWW; ≈100 inf/s at ≈900 µJ for IC; and ≈25 inf/s at ≈1.5 mJ for AD, all at or above target accuracies (Banbury et al., 2021).
  • FPGA Open-Division Submissions: hls4ml and FINN workflows implement, for example, image classification with “tiny-ResNet-8” (58K params, 83.5% accuracy, 8-bit QAT) and binary CNNs (CNV-W1A1, 84.5%), and KWS with MLPs at 3-bit quantization (82.5%) (Borras et al., 2022). Empirical energy per inference for AD/KWS can drop below 30 µJ at ≈20–33 µs latency on Arty A7-100T.
  • Early-Exit Architectures: T-RecX demonstrates, across ResNet-8/CIFAR-10, DS-CNN/SpeechCommands, and MobileNetV1/VWW, that inserting a single intermediate classifier can reduce average FLOPS by 20%–38% for ≤1% accuracy loss, outperforming BranchyNet and SDN in parameter efficiency and realized FLOPS reductions (Ghanathe et al., 2022).

FPGA and neural-processing accelerators demonstrate order-of-magnitude improvements in inference rate (IPS > 5K) and energy per inference, at the expense of higher quiescent power and greater toolchain customization.

6. Submission, Scoring, and Reproducibility Protocols

MLPerf Tiny requires all submitters, in both divisions, to:

  • Provide full source code, binary builds, and calibration logs.
  • Document measurement scripts and dataset/model modifications (open division).
  • Report median latency and energy per inference over at least five independent trials.
  • Offer a compliance checklist covering model, activation, and stack RAM/flash footprints.

The primary scoring metric in the closed division is latency, with energy per inference as a tiebreaker and an optional composite EDP ranking. All closed-division entries are validated by independent rebuilding on pre-specified reference hardware (e.g., STM32H7, Arduino Nano 33 BLE), ensuring exact reproducibility (Banbury et al., 2021, Banbury et al., 2020).
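
The latency-primary, energy-tiebreaker ordering amounts to a lexicographic sort; a minimal sketch with hypothetical submissions:

```python
# Sketch: closed-division ranking -- primary key latency, tiebreaker
# energy per inference. Submission records are hypothetical.
submissions = [
    {"name": "mcu-a",  "latency_ms": 10.0, "energy_uj": 120.0},
    {"name": "mcu-b",  "latency_ms": 10.0, "energy_uj": 95.0},
    {"name": "fpga-c", "latency_ms": 0.03, "energy_uj": 28.0},
]
ranked = sorted(submissions, key=lambda s: (s["latency_ms"], s["energy_uj"]))
order = [s["name"] for s in ranked]
# fpga-c ranks first on latency; mcu-b beats mcu-a on the energy tiebreak
```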

7. Research Impact, Challenges, and Future Directions

The MLPerf Tiny benchmark has become central infrastructure for research on TinyML model compression, quantization, and hardware/software co-design. Its modular, open framework enables:

  • Comparative study of quantization granularity effects, e.g., per-layer mixed-precision, binary/ternary networks, quantization-aware training versus post-training quantization.
  • Transferability of methods such as layer fusion, spatial streaming, and activation pipelining from FPGAs to MCUs and ASIC accelerators (Borras et al., 2022).
  • Early-exit and conditional computation empirical study on resource-constrained networks (Ghanathe et al., 2022).
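
As an illustration of quantization granularity, the following sketch contrasts symmetric int8 per-tensor scaling (one scale for the whole weight matrix) with per-channel scaling (one scale per output channel) on synthetic weights. All names are hypothetical; this is not the MLPerf Tiny reference pipeline:

```python
# Sketch: symmetric int8 quantization at two granularities.
# Per-channel scales preserve resolution when channel ranges differ widely.

def quantize_sym_int8(w, scale):
    return max(-128, min(127, round(w / scale)))

def per_tensor(weights):
    # Single scale derived from the global max-abs weight.
    scale = max(abs(w) for row in weights for w in row) / 127.0
    return [[quantize_sym_int8(w, scale) for w in row] for row in weights], scale

def per_channel(weights):
    # One scale per row (output channel).
    qrows, scales = [], []
    for row in weights:
        scale = max(abs(w) for w in row) / 127.0
        scales.append(scale)
        qrows.append([quantize_sym_int8(w, scale) for w in row])
    return qrows, scales

weights = [[0.50, -0.25],   # channel 0: large dynamic range
           [0.02,  0.01]]   # channel 1: much smaller range
qt, s_t = per_tensor(weights)
qc, s_c = per_channel(weights)
# With a per-tensor scale, channel 1 collapses onto a few int8 levels;
# with per-channel scales it uses the full [-127, 127] range.
```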

Ongoing technical challenges include: accurate ultra-low-power measurement (μW–mW dynamic range), memory and code-size constraints, and fair normalization across diverse hardware and software stacks. Present limitations are the focus on four core tasks, the closed-division restriction to 8-bit quantization, and the lack of multi-core/heterogeneous pipeline or genuinely streaming workloads.

Planned extensions propose: real-time streaming detection (e.g., fall detection), expanded evaluation of binary/mixed-precision models, new task types (object/NLP detection, multi-sensor fusion), refined power domain partitioning, and addition of quality-of-service metrics such as false-alarm rates. All tools, benchmarks, and rules are continuously stewarded via the MLCommons open repository and working group structure (Banbury et al., 2021, Banbury et al., 2020, Borras et al., 2022).
