MCUBench: Benchmarking Object Detection on MCUs
- MCUBench is a benchmarking suite and methodology that evaluates YOLO-based object detection models on resource-constrained MCUs by measuring mAP, latency, RAM, and Flash usage.
- It standardizes model deployment through a uniform ONNX-to-TFLite quantization pipeline and supports both legacy and modern YOLO families across seven diverse MCU platforms.
- The framework extends to μNPU platforms, offering actionable insights for hardware-software co-design and exposing performance trade-offs in real-world edge AI applications.
MCUBench is a benchmarking suite and methodology for evaluating the end-to-end performance of object detection models on microcontroller units (MCUs) with stringent memory and compute constraints. It systematically quantifies the trade-offs among mean Average Precision (mAP), inference latency, RAM usage, and Flash memory footprint for over 100 YOLO-based model architectures across seven representative MCU platforms, under a uniform training and deployment pipeline (Sah et al., 2024). MCUBench also provides the basis for cross-target evaluation, supporting both legacy and modern object detection families and furnishing actionable insights for model selection under practical deployment constraints.
1. Benchmark Scope and Supported Hardware
MCUBench targets MCUs typical of edge deployments, encompassing a broad spectrum of architectures and memory configurations. It directly benchmarks the following seven MCU platforms, which vary by core type, frequency, and embedded/external memory composition:
| MCU Board | Core / Frequency | Flash (Int+Ext) | RAM (Int+Ext) |
|---|---|---|---|
| NUCLEO-H743ZI | Cortex-M7 @ 480 MHz | 2 MB | 1 MB |
| B-U585I-IOT02A | Cortex-M33 @ 160 MHz | 2 MB + 64 MB | 768 kB |
| STM32F469I-DISCO | Cortex-M4 @ 180 MHz | 2 MB + 16 MB | 384 kB + 16 MB |
| STM32F769I-DISCO | Cortex-M7 @ 216 MHz | 2 MB + 64 MB | 512 kB + 16 MB |
| STM32H573I-DK | Cortex-M33 @ 250 MHz | 2 MB + 64 MB | 640 kB |
| STM32H747I-DISCO | Dual M4+M7 @ 400 MHz | 1 MB + 128 MB | 704 kB + 8 MB |
| STM32L4R9I-DISCO | Cortex-M4 @ 120 MHz | 2 MB + 64 MB | 640 kB |
The inclusion of MCUs with a range of integrated and external Flash/RAM allows assessment of both ultra-tiny and memory-rich deployment scenarios.
2. YOLO Model Families, Scaling Factors, and Training Pipeline
MCUBench evaluates both legacy and modern YOLO one-stage object detector families, explicitly covering:
- Legacy: YOLOv3 (DarkNet53 backbone + FPN neck), YOLOv4 (CSPDarkNet53 + SPP-PAN), YOLOv5 (CSPDarkNet53-C3 + SPPF-PAN-C3)
- Modern: YOLOv6-3.0 (EfficientRep + RepBi-PAN), YOLOv7 (E-ELAN + SPPF-ELAN-PAN), YOLOv8 (CSPDarkNet53-C2f + SPPF-PAN-C2f)
Key scaling parameters include width multiplier (0.05–0.25, family-dependent), depth multiplier (0.085–0.25), activation function (ReLU or SiLU), and input resolution (128, 160, 192, or 224). Each model is trained under a fixed regime: 100 epochs at 448×448 on PASCAL VOC (20 classes), using the anchor-free YOLOv8 detection head and a loss composed of Complete IoU (CIoU) for localization and Distribution Focal Loss (DFL) for box regression. Fine-tuning for each target input size (128–224) proceeds for an additional 10 epochs, with no ImageNet pretraining and no hyperparameter changes.
Uniformity in the training pipeline and YOLO head permits controlled, architecture-agnostic comparisons on mAP@50, latency, RAM, and Flash metrics.
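The uniform sweep described above can be sketched as a simple grid enumeration. The value lists below are illustrative placeholders, not the paper's exact per-family grids:

```python
from itertools import product

# Hypothetical search grid in the spirit of the MCUBench sweep; the
# actual multiplier lists are family-dependent and differ from these.
FAMILIES = ["yolov3", "yolov5", "yolov6", "yolov7", "yolov8"]
WIDTHS = [0.25, 0.50, 0.85]          # assumed width multipliers
DEPTHS = [0.33, 0.85, 1.25]          # assumed depth multipliers
ACTIVATIONS = ["relu", "silu"]
RESOLUTIONS = [128, 160, 192, 224]   # input sizes from the paper

def enumerate_variants():
    """Yield one config dict per (family, width, depth, activation, resolution) combination."""
    for fam, w, d, act, res in product(FAMILIES, WIDTHS, DEPTHS, ACTIVATIONS, RESOLUTIONS):
        yield {"family": fam, "width": w, "depth": d,
               "activation": act, "input_size": res}

variants = list(enumerate_variants())
```

Each variant is then trained once under the shared recipe, which is what makes the downstream mAP/latency comparisons architecture-agnostic.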
3. Evaluation Workflow and Measurement Methodology
Deployment is rigorously standardized. Models are exported to ONNX, converted to TensorFlow Lite with INT8 per-tensor quantization and UINT8 I/O, then compiled with X-CUBE-AI for each MCU's Flash/RAM topology. Inference is performed with batch size 1, timing the full convolutional graph and bounding-box decoding on hardware. Non-Maximum Suppression (NMS) is excluded from all latency measurements.
RAM usage is the peak working set (activations and intermediate tensors) as reported by X-CUBE-AI, partitioned into internal and external memory where applicable. The Flash metric sums model code and weights across internal and external regions. All deployment leverages ST's Developer Cloud REST API for automated, parallel inference runs across boards.
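To make the INT8/UINT8 per-tensor scheme concrete, a minimal asymmetric quantizer can be sketched in pure Python. This is a conceptual illustration of per-tensor affine quantization, not the TFLite or X-CUBE-AI implementation:

```python
def quantize_per_tensor(values, num_bits=8):
    """Asymmetric per-tensor quantization to unsigned integers.

    A single (scale, zero_point) pair covers the whole tensor, mirroring
    the per-tensor UINT8 scheme described above. Conceptual sketch only.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)        # range must contain zero
    scale = (hi - lo) / (qmax - qmin) or 1.0   # guard against all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate real values."""
    return [(qi - zero_point) * scale for qi in q]
```

The round-trip error is bounded by the scale, which is the source of the small mAP drop typically observed after post-training quantization.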
4. Empirical Results and Pareto-Optimal Analysis
A four-step search is executed: 240 variants are initially trained and deployed at the base input resolution, yielding 818 successful evaluation runs and 159 device-specific Pareto-optimal models (mAP vs. latency). These sets are merged and fine-tuned at the remaining input resolutions, producing 288 fine-tuned models and 1,191 total runs with 296 device-model Pareto-optimal configurations, 131 of which are unique across all hardware.
Key quantitative highlights include:
- Fastest configuration (NUCLEO-H743ZI): YOLOv6 d0.85 w0.50 at 128×128, 0.10 s latency, 0.08 mAP
- Most accurate (STM32H747I-DISCO): YOLOv7 d2.50 w2.00 at 224×224, 2.21 s latency, 0.41 mAP
- Overall metric ranges: latency 0.10–2.21 s, mAP 0.08–0.41
- Pareto front analysis:
  - Low-latency: YOLOv6 with small width/depth multipliers, ReLU, and low input resolution (128)
  - High-accuracy: YOLOv7/YOLOv8 with maximal width and the highest input resolution (224)
  - Legacy models (YOLOv3): remain Pareto-optimal, especially on MCUs with large external Flash, when modernized with the YOLOv8 head and training recipe
  - Example on B-U585I-IOT02A: among 27 Pareto models, 15 are YOLOv3, 9 YOLOv5, 2 YOLOv6, and 1 YOLOv8
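Pareto-optimality over (latency, mAP) as used above reduces to a simple dominance filter: a configuration survives unless another is at least as fast and at least as accurate, and strictly better in one metric. The example data below uses the two extreme configurations reported earlier plus one invented, dominated model:

```python
def pareto_front(models):
    """Return the models not dominated on (latency, mAP).

    Each model is a (name, latency_s, map50) tuple; lower latency and
    higher mAP are better.
    """
    front = []
    for m in models:
        dominated = any(
            o[1] <= m[1] and o[2] >= m[2] and (o[1] < m[1] or o[2] > m[2])
            for o in models
        )
        if not dominated:
            front.append(m)
    return front

models = [
    ("yolov6_d0.85_w0.50@128", 0.10, 0.08),  # fastest config from the text
    ("yolov7_d2.50_w2.00@224", 2.21, 0.41),  # most accurate config from the text
    ("hypothetical_model",     2.50, 0.30),  # invented; dominated by the YOLOv7 entry
]
front = pareto_front(models)
```

Running this per device and merging the fronts is how the 296 device-model Pareto configurations reduce to 131 unique models.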
Model scaling insights:
- Increasing input resolution and width consistently boosts mAP, in some cases roughly doubling it
- Depth scaling yields inconsistent accuracy returns and penalizes Flash and computation
- SiLU activation increases latency and adds 10–30 kB of RAM compared to ReLU
5. Model Selection Guidelines and Deployment Heuristics
MCUBench synthesizes experimental findings into actionable selection rules:
- Ultra-tiny applications (Flash < 512 kB, RAM < 64 kB, latency < 50 ms): YOLOv6 d0.85 w0.50 at 128×128 (ReLU), mAP ≈ 0.07–0.08
- Moderate accuracy (mAP ≈ 0.20, Flash < 1 MB, RAM < 256 kB): YOLOv6 d1.25 w0.85 at 160×160, latency ≈ 0.3–0.5 s
- High accuracy (mAP ≥ 0.35, Flash < 2 MB): YOLOv7 d2.50 w1.60 at 224×224, mAP ≈ 0.35, latency ≈ 1.3 s (high-end M7)
Deployment recommendations:
- Prioritize input resolution and width before increasing depth
- Use ReLU unless marginal mAP gain from SiLU justifies increased resource demand
- Leverage external Flash/RAM to enable larger backbones/higher inputs
- Match target mAP/latency via MCUBench's published Pareto tables, avoiding retraining for most scenarios
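The selection rules above amount to a constrained lookup over the published Pareto tables. A minimal helper might look like the following; the table entries are hypothetical, loosely based on the figures quoted in the guidelines:

```python
def select_model(pareto_table, max_latency_s, min_map, max_flash_kb):
    """Return the most accurate Pareto entry meeting all constraints, or None.

    Sketch of the lookup implied by the deployment heuristics; entries
    are dicts with latency_s, map50, and flash_kb fields.
    """
    candidates = [
        m for m in pareto_table
        if m["latency_s"] <= max_latency_s
        and m["map50"] >= min_map
        and m["flash_kb"] <= max_flash_kb
    ]
    return max(candidates, key=lambda m: m["map50"]) if candidates else None

# Hypothetical Pareto table, roughly echoing the guideline tiers above.
table = [
    {"name": "yolov6_128", "latency_s": 0.10, "map50": 0.08, "flash_kb": 480},
    {"name": "yolov6_160", "latency_s": 0.40, "map50": 0.20, "flash_kb": 900},
    {"name": "yolov7_224", "latency_s": 1.30, "map50": 0.35, "flash_kb": 1900},
]
```

For example, a 0.5 s latency budget with 1 MB of Flash would select the mid-tier YOLOv6 entry, avoiding any retraining.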
6. MCUBench-Style Framework Extension to μNPU Platforms
The MCUBench methodology has been extended to microcontroller-scale neural processing units (μNPUs) as outlined in "Benchmarking Ultra-Low-Power NPUs" (Millar et al., 2025). This framework evaluates both standard MCUs and purpose-built μNPUs such as the MAX78000, GAP8, HX6538 WE2, and others.
The μNPU benchmarking paradigm mirrors MCUBench in model and operator uniformity, INT8 per-tensor quantization, and standardized toolchain compilation. Performance metrics encompass latency (via on-chip timers), energy efficiency (inferences per mJ), and memory footprint (from compiled binaries), with disaggregation into initialization, memory I/O, inference, and post-processing stages.
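The staged metrics described above can be aggregated as follows. This is a sketch: the stage names and the constant-average-power energy model are simplifying assumptions, not the benchmark's measurement code:

```python
def profile_summary(stage_times_ms, avg_power_mw):
    """Aggregate per-stage timings into total latency and energy metrics.

    stage_times_ms: dict of stage name -> duration in ms (e.g. init,
    memory I/O, inference, post-processing, as disaggregated above).
    Energy assumes constant average power: mW * s = mJ (simplification).
    """
    total_ms = sum(stage_times_ms.values())
    energy_mj = avg_power_mw * total_ms / 1000.0
    return {
        "total_latency_ms": total_ms,
        "energy_mj": energy_mj,
        # Efficiency metric used in the μNPU benchmark: inferences per mJ.
        "inferences_per_mj": 1.0 / energy_mj if energy_mj else float("inf"),
    }
```

Disaggregating this way is what exposes, for example, boot-time and memory-I/O overheads that a single end-to-end number would hide.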
Empirical results reveal that published peak compute specifications (GOPS) correlate poorly with actual end-to-end latency. Memory bandwidth and software stack maturity dominate real-world efficiency, with, for example, the MAX78000 outpacing a 512 GOPS μNPU for small models due to weight-stationary dataflow. Idle and boot-time power are frequently underreported by vendors, with discrepancies noted between specification and measurement.
A plausible implication is that MCUBench's uniform benchmarking standard both exposes and mitigates vendor-driven performance mischaracterization and supports hardware-software co-design by highlighting optimization bottlenecks not captured by synthetic or single-metric benchmarks.
7. Public Accessibility and Community Impact
MCUBench and supporting YOLO training/deployment scripts are available at github.com/Deeplite/deeplite-torch-zoo, enabling reproduction and extension by other researchers and practitioners. As a unified benchmark covering diverse architectures and hardware topologies under controlled settings, MCUBench informs model selection, toolchain development, and hardware-software co-design within the resource-constrained edge AI domain (Sah et al., 2024). It has catalyzed the adoption of consistent benchmarking methods across the μNPU ecosystem (Millar et al., 2025).