Tinyissimo YOLO: Compact Object Detection
- Tinyissimo YOLO is a family of ultra-compact, fully quantized detection networks designed for microcontrollers with extreme memory (<1 MB flash) and energy constraints.
- They employ aggressive compression techniques, quantization-aware training, and iterative pruning to achieve practical mAP levels (often >50%) on datasets like PascalVOC and COCO.
- Optimized for hardware-specific mapping, these models enable always-on, real-time detection in IoT and wearable applications with energy per inference as low as 150 μJ.
Tinyissimo YOLO refers to a class of object detection networks and implementation strategies targeting extreme resource constraints, in particular enabling single-stage, real-time detection using less than 1 MB—and often less than 500 kB—of total model storage, with full 8-bit quantization for microcontroller-centric deployment scenarios. These networks are direct descendants of the YOLO (You Only Look Once) family, but undergo aggressive architectural compression, quantization-aware optimization, and hardware-specific adaptation to achieve operational feasibility on devices with milliwatt-level power budgets, ≤0.5 MB SRAM, and no hardware floating-point units (Moosmann et al., 2023; Deutel et al., 2024).
1. Design Principles and Motivations
Tinyissimo YOLO models are formulated around several core principles: (1) minimizing total parameter count and memory footprint, (2) maximizing multiply-accumulate (MAC) efficiency on integer datapaths, (3) accommodating deployment constraints specific to microcontroller units (MCUs) and MCUs with hardware accelerators, and (4) achieving practically useful detection accuracy (e.g., mAP>50%) on datasets such as PascalVOC or COCO, even for multi-class scenarios. The primary design constraints are <0.5 MB flash (for weights), minimal SRAM utilization, and inference energy at or below 1 mJ/inference, enabling always-on and wearable applications. Avoidance of deep residual modules, large kernel widths, and non-quantizable or memory-inefficient layers is universal (Moosmann et al., 2023, Hollard et al., 2024).
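These budget constraints can be checked mechanically. A minimal sketch in Python, assuming one byte per int8 weight; the threshold values mirror the constraints stated above and the helper name is our own:

```python
def fits_budget(params, bits_per_weight=8, flash_budget_kb=500,
                peak_act_bytes=0, sram_budget_kb=512):
    """Check whether a quantized model respects the flash and SRAM budgets
    described in the text (thresholds illustrative, not normative)."""
    weight_bytes = params * bits_per_weight // 8
    return (weight_bytes <= flash_budget_kb * 1024
            and peak_act_bytes <= sram_budget_kb * 1024)

# A 422k-parameter int8 model with ~350 kB peak activations fits;
# a 4M-parameter model does not.
print(fits_budget(422_000, peak_act_bytes=350 * 1024))  # True
print(fits_budget(4_000_000))                           # False
```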
2. Representative Network Architectures
The architectural spectrum includes several canonical instantiations:
- TinyissimoYOLO Baseline
- Input: 88×88×3
- Four 3×3 Conv+ReLU layers with channel progression 16 → 128, each followed by 2×2 max pooling
- Flattened output (3,200) passed to FC of 256, then to output FC sized for the prediction grid
- Typical parameterization (B=2, C=1): ≈422k params; (B=1, C=3): ≈398k params
- Only 3×3 convolutions; no residuals, squeeze–excite, or depthwise splits
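The baseline's feature-map bookkeeping can be traced in a few lines. A sketch assuming each of the four conv stages is followed by a 2×2 max pool and that channels double per stage (16 → 32 → 64 → 128, one plausible reading of the "16 → 128" progression):

```python
def baseline_shapes(input_hw=88, channels=(16, 32, 64, 128)):
    """Trace spatial size through the 4-stage conv/pool backbone of the
    TinyissimoYOLO baseline: 3x3 'same' convs preserve size, each 2x2
    max pool halves it (floor division on odd sizes)."""
    hw = input_hw
    for _ in channels:
        hw //= 2
    flat = hw * hw * channels[-1]
    return hw, flat

hw, flat = baseline_shapes()
print(hw, flat)  # 5 3200 -- matches the 3,200-unit flattened vector
```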
- Variant: TinyissimoYOLO “TY:3-3-88” Six 3×3 Conv+BN+ReLU layers (channels 3→64), five interleaved 2×2 max pools, and a final FC layer outputting S²(B·5+C) predictions. The smallest release (B=1, C=3, S=2) fits in 441 kB flash and ≤200 kB peak activations (Moosmann et al., 2023).
- TinyissimoYOLO for Smart Glasses (v1.3, v5, v8) Deeper 256×256 input, single-scale detection, backbone with 3×3 Conv+BN+ReLU (channels 16–128), using YOLOv5/YOLOv8-inspired head variants, all sub-million parameter (min ≈403k, max ≈960k), always quantized to 8 bits throughout (Moosmann et al., 2023).
- microYOLO 128×128 input, 7-stage depthwise separable backbone, detection head via two-layer FC, S=5, B=2, N=C for tasks, achieving Flash < 800 kB and SRAM <350 kB (Deutel et al., 2024).
- xYOLO Tiny-YOLOv3-pruned, only 205k params, up to 70× inference speedup vs Tiny-YOLO on CPU/MCU, using input/channel/layer pruning and selective XNOR binary convolution (Barry et al., 2019).
3. Quantization and Compression Techniques
A universal component is aggressive quantization:
- QAT (Quantization-Aware Training)
- Training runs in float32 for the first epochs (typically 350), followed by roughly 300 epochs of QAT, inserting "fake quant" (simulated int8 quantization) nodes after every conv/FC layer and before activations (Moosmann et al., 2023).
- Both weights and activations are mapped to 8 bit integers, with symmetric (weights) and asymmetric (activations) per-tensor quantization and storage of zero-points and scales per layer.
- Memory formula for weights: Mem_w = N_params · b_w/8 bytes, i.e., one byte per weight for b_w = 8-bit encoding, and similarly for activations (Moosmann et al., 2023).
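The per-tensor schemes above can be sketched in plain Python. This is a simplified "fake quant" round-trip (quantize, clamp, dequantize) of the kind inserted during QAT, not the papers' actual training code:

```python
def fake_quant_symmetric(w, bits=8):
    """Simulated quantize-dequantize for weights: symmetric per-tensor
    int8, scale chosen from the largest absolute value."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    scale = (max(abs(min(w)), abs(max(w))) / qmax) or 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return [qi * scale for qi in q], scale

def fake_quant_asymmetric(a, bits=8):
    """Simulated quantize-dequantize for activations: asymmetric per-tensor
    uint8, storing a zero-point alongside the scale."""
    qmax = 2 ** bits - 1                            # 255 for uint8
    lo, hi = min(a), max(a)
    scale = ((hi - lo) / qmax) or 1.0
    zp = round(-lo / scale)
    q = [max(0, min(qmax, round(x / scale) + zp)) for x in a]
    return [(qi - zp) * scale for qi in q], scale, zp
```

The round-trip error is bounded by half a quantization step, which is why PTQ on well-ranged tensors costs so little mAP.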
- Post-Training Quantization (PTQ)
- In deployment-only scenarios, all weights and activations are mapped to int8, with an empirical maximum drop of <1 pp mAP compared to FP32 (Moosmann et al., 2023).
- Combined with iterative gradual pruning (up to 70%), yielding sparse quantized networks suitable for direct C code generation and execution via libraries such as ARM CMSIS-NN (Deutel et al., 2024).
- Parameter Pruning
- Unstructured (lowest-|w|) pruning is commonly run late in training, e.g., in the last 20–100 epochs (Deutel et al., 2024).
- Both convolutional and FC layers are pruned to approximately 50–70% sparsity, balancing speed gains against accuracy losses.
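A minimal version of the lowest-|w| criterion, assuming a flat weight list and a target sparsity (a sketch of the idea, not the cited pruning schedule):

```python
def magnitude_prune(weights, sparsity=0.7):
    """Unstructured magnitude pruning: zero the `sparsity` fraction of
    weights with the smallest absolute values (ties may prune slightly
    more than the exact fraction)."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9, -1.0]
pruned = magnitude_prune(w, sparsity=0.7)
print(sum(1 for x in pruned if x == 0.0))  # 7
```

In the gradual schedule described above, this criterion is applied repeatedly over the final epochs, letting the surviving weights re-adapt between pruning steps.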
4. Compute, Memory, and Deployment Characteristics
The resource footprint of Tinyissimo YOLO models is summarized by key metrics (referencing Moosmann et al., 2023; Deutel et al., 2024):
| Network | Params (k) | Flash (kB) | Peak RAM (kB) | MACs (MMAC) | Input | mAP (%) | Inference (Platform) |
|---|---|---|---|---|---|---|---|
| TinyissimoYOLO | 422 | 422 | 350 | ~32.5 | 88² | 58.5–63 | 5.5 ms (MAX78000, 180 fps) |
| TY:3-3-88 | 440 | 441 | 190 | ~32.5 | 88² | 61.8 | 2.12 ms (GAP9 NE16) |
| xYOLO | 205 | 820 | — | 39 | 256² | 67 | 9.66 fps (Raspberry Pi 3) |
| microYOLO | 500–700 | <800 | <350 | — | 128² | up to 56 | 3.45 fps (Cortex-M7) |
| TinyissimoYOLO v1.3 | 403–960 | — | — | 0.9–2.1 | 256² | >42 | 17 ms (GAP9 NE16) |
- Peak activation RAM typically stays in the 150–375 kB range, fitting within modern MCU SRAMs (512 kB–1 MB).
- Inference energy varies sharply by platform: 196 μJ on MAX78000 @88² (CNN accelerator), 150 μJ on GAP9 NE16, and 7.8 mJ on Apollo4b (Cortex-M4) for the same workload (Moosmann et al., 2023).
- MAC/cycle efficiencies up to 107 have been observed on dedicated accelerators (MAX78000); pure SW MCUs remain at 0.25–0.5 MAC/cycle.
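Latency and energy figures like those above follow from simple arithmetic on the MAC count, the platform's MAC/cycle efficiency, clock rate, and power draw. A back-of-the-envelope helper; the platform numbers in the example are illustrative, not measured values from the cited papers:

```python
def inference_cost(macs, macs_per_cycle, clock_hz, power_w):
    """Estimate latency (s) and energy (J) per inference from the MAC
    count and platform efficiency; ignores memory stalls and I/O."""
    cycles = macs / macs_per_cycle
    latency_s = cycles / clock_hz
    return latency_s, latency_s * power_w

# Hypothetical accelerator: 32.5 MMAC, 100 MAC/cycle, 50 MHz, 30 mW
lat, energy = inference_cost(32.5e6, 100, 50e6, 0.030)
print(f"{lat * 1e3:.1f} ms, {energy * 1e6:.0f} uJ")  # 6.5 ms, 195 uJ
```

The same arithmetic explains the platform gap: at 0.25–0.5 MAC/cycle, a software-only MCU needs two orders of magnitude more cycles, and correspondingly more energy, for the identical workload.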
5. Detection Accuracy and Trade-offs
- Single-class detection on WiderFace yields 45.3–46.2% mAP (IoU=0.5, no restriction), up to 75.4% mAP when restricting test set to ≤5 faces per image (Moosmann et al., 2023).
- Multi-class VOC evaluation (3 classes): person 57.4%, chair 30.2%, car 65.1%, overall 58.5% mAP.
- For more complex datasets (PascalVOC, 10–20 classes, higher input res):
- 10-class, 112×112: 60.4% mAP; 3-class, 112×112: 63.1% mAP
- For 20-class, 224×224, mAP drops to 53.1% (Moosmann et al., 2023).
- For microYOLO at 128×128, task-dependent mAP: fridge items 56.4%, humans 27.7%, vehicles 12.3% (with “simplified” 3-object max, up to 50% mAP) (Deutel et al., 2024).
- Even without QAT, post-training quantization (PTQ) with a representative calibration set typically costs <1 pp mAP. Extreme model pruning can incur larger accuracy losses but enables higher FPS and mobile deployment (Moosmann et al., 2023; Deutel et al., 2024).
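All mAP numbers above hinge on the IoU = 0.5 match criterion. A small sketch of the box-overlap computation (corner-format boxes; the helper name is our own):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes. A detection
    matches a ground-truth box when IoU >= 0.5 under the mAP@0.5 metric."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333... -> not a match at 0.5
```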
6. Hardware Mapping and Platform-Specific Optimizations
Tinyissimo YOLO networks are tailored for a spectrum of microcontroller and MCU-accelerator hybrid platforms:
- Analog Devices MAX78000: 8-bit CNN accelerator, all weights in on-chip SRAM (max ~442 kB), pipelined 2D tiling per layer, achieving up to 180 fps at 196 μJ/inference (Moosmann et al., 2023).
- Greenwaves GAP9 RISC-V:
- CPU cluster execution (up to 8 cores) with ILP-based hierarchical tiling between L1/L2, "DORY" tiling for overlays, and custom SIMD dot-product kernels (up to 8 MACs/cycle for convolutions) (Moosmann et al., 2023).
- NE16 CNN accelerator: hardware mapping via NNTool+Autotiler, fusing batchnorm+ReLU, delivering 2.12 ms and 149 μJ/inference at 88×88 (Moosmann et al., 2023).
- Power-cycling: cluster off except during inference, idle standby <1 mW (smart glasses) (Moosmann et al., 2023).
- ARM Cortex M4/M7 (STM32, Apollo4b):
- On-device execution via TensorFlow Lite Micro or CMSIS-NN yields orders-of-magnitude longer inference times (200–500 ms) and higher energy (mJ-range per inference).
- Specialized kernels for depthwise conv, pointwise conv, and FC layers; aggressive SRAM buffer reuse (Moosmann et al., 2023; Deutel et al., 2024).
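A core building block of these integer-only kernel libraries is requantization: collapsing an int32 accumulator back to int8 using a fixed-point multiplier and shift instead of a floating-point scale. A simplified sketch of the idea (round-then-saturate; the exact CMSIS-NN rounding mode differs in detail):

```python
def requantize(acc, multiplier, shift):
    """Rescale an int32 accumulator to int8: multiply by a fixed-point
    multiplier, round, arithmetic-shift right by `shift` (> 0), and
    saturate to the int8 range [-128, 127]."""
    rounded = (acc * multiplier + (1 << (shift - 1))) >> shift
    return max(-128, min(127, rounded))

# multiplier/shift encode the float scale ~0.05 as 6554 / 2**17
print(requantize(1000, 6554, 17))       # 50  (~= 1000 * 0.05)
print(requantize(1_000_000, 6554, 17))  # 127 (saturated)
```

Because every layer stores its own scale and zero-point (as noted in Section 3), a per-layer multiplier/shift pair is all the kernel needs at run time.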
7. Comparative Perspective and Evolution of “Tinyissimo” Design
Tinyissimo YOLO sits at the extreme compression/efficiency end of the YOLO ecosystem, distinguished from:
- YOLO Nano (4 MB, ~4 M params, 69.1% mAP, 4.57 GOPs, >25 fps @15 W) (Wong et al., 2019)
- YOLO-TLA (9.49 M params, 25.3 GFLOPs; mAP@0.5: 60.3%, with C3CrossConv and GAM for small-object focus) (Ji et al., 2024)
- LeYOLO-Nano (1.1 M params, 0.66 GFLOP, 37.7% AP₅₀, 99.6 QPS Jetson TX2) (Hollard et al., 2024)
Compared to typical Tiny-YOLO variants, Tinyissimo YOLO achieves a two-orders-of-magnitude size reduction and a similar decrease in operations, and is universally quantized and pruned. The remaining margin for further compression (flash ≤250 kB, RAM ≤150 kB) is pursued via more extensive pruning, lower input resolutions, deeper quantization, and elimination of large multi-scale heads.
In the domain of resource-constrained object detection, Tinyissimo YOLO provides a blueprint for (1) single-stage detection architectures pared down to minimum width/depth, (2) universal 8-bit quantization with minimal loss, (3) hardware-specific mapping to near-peak integer MAC utilization, and (4) operational feasibility for always-on IoT and wearable vision deployment contexts (Moosmann et al., 2023; Deutel et al., 2024).