GAP9 SoC: Ultra-Low-Power Edge AI
- The GAP9 SoC is an ultra-low-power, highly parallel chip that integrates RISC-V cores with hardware neural and DSP acceleration.
- It features a three-level memory hierarchy and aggressive power management, enabling efficient real-time edge computing.
- Optimized for nano-drones and wearable biosignal processing, GAP9 supports multi-precision computation for AI inference.
GAP9 is an ultra-low-power, highly parallel System-on-Chip (SoC) targeting edge AI and digital signal processing (DSP) workloads. It combines extensive support for multi-precision computation with a three-level memory hierarchy and aggressive power management. Built around a tightly coupled RISC-V cluster and a hardware neural accelerator, GAP9 runs inference and signal processing tasks within strict energy budgets for resource-constrained embedded applications such as nano-drones (Müller et al., 27 Jun 2024) and wearable biosignal acquisition (Frey et al., 2023).
1. Microarchitecture and System Overview
GAP9 has a heterogeneous multicore layout built around a tightly coupled compute cluster and a dedicated Fabric Controller (FC). The cluster comprises 8 or 9 RISC-V cores, described as RV32IMC or RV32IMFC (the reported core count and ISA string vary between documents and deployment contexts). Each core has a transprecision floating-point unit (FP32, FP16, BF16) and Xpulp DSP extensions (16-/8-bit SIMD, dot-product, saturating arithmetic), and all cores share a Tightly Coupled Data Memory (TCDM). The NE16 hardware accelerator offloads convolutional and MAC-centric workloads, supporting quantized computation with 2- to 8-bit activations and weights (including INT8, INT4, and INT2) and asymmetric quantization.
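To make the term "asymmetric quantization" concrete, the following is a minimal sketch of the generic scale-plus-zero-point scheme in plain Python. It illustrates the numeric idea only; the function names are hypothetical and it is not the NE16 data path or any GAP9 toolchain API.

```python
def quantize_asymmetric(values, num_bits=8):
    """Map floats to unsigned integers in [0, 2**num_bits - 1] using a zero-point."""
    qmin, qmax = 0, (1 << num_bits) - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    quantized = [
        max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the integer codes."""
    return [scale * (q - zero_point) for q in quantized]


if __name__ == "__main__":
    q, scale, zp = quantize_asymmetric([-1.0, 0.0, 0.5, 3.0], num_bits=8)
    print(q, scale, zp)
    print(dequantize(q, scale, zp))
```

Because the zero-point shifts the integer range, a tensor whose values are mostly positive (for example post-ReLU activations) can use nearly all available codes, which matters most at the 2- and 4-bit widths.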
The FC is an independently clocked and power-domain-separated RV32IMC core, responsible for I/O handling, DMA setup, host communication, and power-state transitions, ensuring that the compute cluster can operate in isolation from the system-control logic.
GAP9 SoC Key Architectural Features
| Feature | Specification |
|---|---|
| Compute Cores | 8–9× RV32IMC cores (cluster) |
| Fabric Controller | 1× RV32IMC (I/O, power, DMA) |
| Neural Accelerator | NE16, 32 MACs/cycle, INT8 support |
| L1 TCDM | 512 kB (banked, shared) |
| L2 SRAM | 1.6 MB |
| NVM | 2 MB Spin-Transfer MRAM/eFlash |
| External PSRAM | 256 Mbit |
| External Flash | 512 Mbit (BioGAP: 128 Mbit) |
| Peak Integer Throughput | 15.6 GOPS @ 370 MHz |
| Peak Quantized MAC | 32.2 GMAC/s via NE16 |
| Peak Aggregate Throughput | 150 GOPS |
| Voltage Range | 0.6–1.2 V |
| Frequency Range | 10–370 MHz |
| Sleep Leakage | 45 µW |
Power domains are structured to facilitate independent and dynamic gating: cluster, FC, and I/O can be clock- or power-gated according to workload demands. This strategy allows for deep retention states and fine-tuned tradeoffs between latency, performance, and energy cost (Müller et al., 27 Jun 2024, Frey et al., 2023).
2. Memory Hierarchy and I/O Interfaces
GAP9's three-level memory structure is designed for low-latency access and efficient buffering of high-throughput data streams:
- L1/TCDM: 512 kB banked SRAM, shared by cluster cores, enabling scratchpad operation for both code and data. This region is optimized for minimal arbitration delays among the parallel cores and is directly addressable by DMA engines.
- L2 SRAM: 1.6 MB on-chip, serving as a scratchpad for staging, intermediate results, and sharing between FC and cluster.
- Nonvolatile and External Memory: 2 MB in-package MRAM or eFlash for persistent storage, 256 Mbit external PSRAM for fast-access data/code, and large external flash (up to 512 Mbit) for datasets and static program images (Müller et al., 27 Jun 2024).
- Octal-SPI I/F: All external memory, including PSRAM and Flash, is accessed via high-speed (up to 400 MB/s) octal-SPI interfaces (Frey et al., 2023).
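As a rough illustration of how the hierarchy constrains buffer placement, the sketch below picks the smallest level that can hold a buffer and estimates a best-case external transfer time. The capacities and the 400 MB/s figure are taken from this article; interpreting kB/MB as binary units, and all function names, are assumptions made for the example.

```python
# Capacities as listed in this article, converted to bytes (binary units assumed).
MEMORY_LEVELS = [
    ("L1 TCDM", 512 * 1024),
    ("L2 SRAM", int(1.6 * 1024 * 1024)),
    ("External PSRAM", 256 * 1024 * 1024 // 8),  # 256 Mbit -> bytes
]
OCTAL_SPI_PEAK_BYTES_PER_S = 400e6  # "up to 400 MB/s", i.e. an upper bound


def smallest_level_that_fits(num_bytes):
    """Return the name of the first (fastest) level large enough, or None."""
    for name, capacity in MEMORY_LEVELS:
        if num_bytes <= capacity:
            return name
    return None


def min_external_transfer_ms(num_bytes):
    """Lower bound on octal-SPI transfer time, ignoring protocol overhead."""
    return num_bytes / OCTAL_SPI_PEAK_BYTES_PER_S * 1e3


if __name__ == "__main__":
    qvga_gray = 320 * 240        # one byte per pixel
    vga_rgb = 640 * 480 * 3      # three bytes per pixel
    print(smallest_level_that_fits(qvga_gray))
    print(smallest_level_that_fits(vga_rgb))
    print(min_external_transfer_ms(vga_rgb))
```

In practice a level is shared with code, stacks, and other buffers, so the usable capacity is smaller than the nominal one; the sketch deliberately ignores that.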
Peripheral interfaces support a diverse range of sensors and off-chip components. Major interfaces include MIPI-CSI2 for camera input, I²S/SAI for audio, I²C/I³C and SPI for sensor connectivity, and UART, PWM, and GPIO for general-purpose use. These interfaces underpin deployment in applications as varied as biosignal acquisition (e.g., EEG- or PPG-sensor front-ends (Frey et al., 2023)) and multi-modal drone perception stacks (Müller et al., 27 Jun 2024).
3. Compute, DSP, and Neural Acceleration Capabilities
GAP9 provides native support for multiple numeric precisions relevant to edge AI and signal processing:
- Transprecision Compute: FP32, FP16, and BF16 floating point supported per cluster core for classical DSP workloads.
- Fixed-Point DSP: Xpulp SIMD extensions and saturating arithmetic for efficient FIR/IIR/FFT computations.
- Quantized AI Acceleration: The NE16 block operates at up to 32.2 GMAC/s on 8-bit integer multiply-accumulate workloads (convolutions), with support for INT8/INT4/INT2 weights/activations, asymmetric quantization, and kernel libraries for common CNN/RNN primitives (Frey et al., 2023).
- Aggregate Throughput: The sum of cluster and accelerator provides up to 150 GOPS, enabling low-latency execution of DNN workloads.
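The peak MAC rate can be turned into a best-case latency bound for a single convolution layer. The sketch below counts multiply-accumulates for a standard 2D convolution and divides by the 32.2 GMAC/s peak quoted above; the layer shape is a hypothetical example, and real latency will be higher because of data movement and imperfect utilization.

```python
NE16_PEAK_MAC_PER_S = 32.2e9  # peak figure quoted in this article


def conv2d_macs(out_h, out_w, in_ch, out_ch, kernel_h, kernel_w):
    """Multiply-accumulate count of a dense 2D convolution (no grouping)."""
    return out_h * out_w * in_ch * out_ch * kernel_h * kernel_w


def ideal_latency_us(macs, mac_per_s=NE16_PEAK_MAC_PER_S):
    """Lower-bound latency in microseconds at a given peak MAC rate."""
    return macs / mac_per_s * 1e6


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 32 -> 64 channels, 80x80 output map.
    macs = conv2d_macs(80, 80, 32, 64, 3, 3)
    print(macs, ideal_latency_us(macs))
```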
Example workload profiles include 1024-point real FFTs (FP32) completed in ≈13,000 cycles per core (≈35 µs at 370 MHz), with all eight cluster cores used in parallel. Uniform 8-bit quantized YOLOv5-tiny inference takes 17 ms per frame at 93.5 mW, which corresponds to roughly 59 FPS and demonstrates real-time object detection on nano-UAVs (Müller et al., 27 Jun 2024).
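The timing figures in this paragraph follow from two simple conversions, sketched here with hypothetical helper names: cycles to wall-clock time at a given clock, and per-frame latency to an upper bound on frame rate.

```python
def cycles_to_us(cycles, clock_hz):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_hz * 1e6


def frames_per_second(latency_ms):
    """Upper bound on frame rate when frames are processed back to back."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    print(cycles_to_us(13_000, 370e6))   # about 35.1 microseconds
    print(frames_per_second(17.0))       # about 58.8 frames per second
```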
4. Real-World Applications and Benchmarks
Nano-Drone Perception: In the GAP9Shield, GAP9 is paired with a 5 MP OV5647 camera (MIPI-CSI2, raw RGB, measured QVGA@15 FPS, VGA@45 FPS, 80–120 mW camera power), and a five-sensor VL53L1 ToF array (40 samples/s aggregate, omnidirectional obstacle avoidance, 150 mW) (Müller et al., 27 Jun 2024).
On representative tasks:
- YOLOv5-tiny (8-bit): 17 ms inference, 1.59 mJ, 93.5 mW avg power
- MCL localization: 15 Hz ToF + 5 Hz camera, 23 mW avg power
- SLAM (NanoSLAM): iterative pose-graph, <250 ms update, 87.9 mW (Müller et al., 27 Jun 2024)
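The YOLOv5-tiny energy figure is consistent with average power multiplied by inference time. A minimal check, using a hypothetical helper name:

```python
def energy_mj(avg_power_mw, duration_ms):
    """Energy in millijoules from average power (mW) and duration (ms)."""
    # mW * ms = microjoules, so divide by 1000 to get millijoules.
    return avg_power_mw * duration_ms / 1000.0


if __name__ == "__main__":
    print(energy_mj(93.5, 17.0))  # about 1.59 mJ per inference
```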
Wearable Biosignal Processing: In BioGAP, GAP9 executes parallel FFTs and SSVEP routines for BCI at 5–8 mW (GAP9 cluster + FC active). Streaming raw data requires ≈8 mW for GAP9; adding the external AFE and BLE brings the system total to ≈38 mW, while computing the FFT on the edge and sending only the result lowers the total to ≈23 mW (see BioGAP Table 1 below) (Frey et al., 2023).
| Mode | GAP9 cluster + FC | AFE | BLE | Total Power | Energy/Sample (1kSPS) |
|---|---|---|---|---|---|
| Streaming (raw data) | ~8 mW | ~18 mW | ~12 mW | ~38 mW | 3.6 µJ |
| On-edge FFT & send | ~5 mW | ~18 mW | ~0.2 mW | ~23 mW | 2.2 µJ |
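The totals in the table are the sums of the per-component figures. The sketch below reproduces them and derives the relative saving of on-edge processing; the dictionary layout and function names are illustrative only, and the inputs are the approximate values from the table.

```python
# Approximate per-component power (mW) from the table above.
BUDGET_MW = {
    "streaming": {"gap9": 8.0, "afe": 18.0, "ble": 12.0},
    "on_edge_fft": {"gap9": 5.0, "afe": 18.0, "ble": 0.2},
}


def total_power_mw(mode):
    """Sum of all component powers for one operating mode."""
    return sum(BUDGET_MW[mode].values())


def saving_fraction(baseline, candidate):
    """Fraction of total power saved by `candidate` relative to `baseline`."""
    return 1.0 - total_power_mw(candidate) / total_power_mw(baseline)


if __name__ == "__main__":
    print(total_power_mw("streaming"))
    print(total_power_mw("on_edge_fft"))
    print(saving_fraction("streaming", "on_edge_fft"))
```

Most of the saving comes from the BLE radio, which drops from about 12 mW to about 0.2 mW once only FFT results are transmitted; the AFE cost is unchanged.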
In object detection and SLAM, system-level inference power is consistently maintained below 100 mW, supporting real-time autonomy and extended mission durations in mobile platforms.
5. Power Management and Energy-Efficiency Metrics
GAP9 implements aggressive Dynamic Voltage and Frequency Scaling (DVFS), fine-grained clock gating, and per-domain power gating.
- Active Mode: Cluster and FC can both run at up to 370 MHz within the 0.6–1.2 V supply range, and DVFS can scale the clock down to roughly 10 MHz at reduced voltage for ultra-low-power operation. Operating points such as 240 MHz at 0.65 V are identified as optimal for edge workloads (Frey et al., 2023).
- Sleep & Retention: Deep-sleep (with full retention) is characterized by 45 μW leakage for the entire SoC; peripheral domains (analog front ends, camera I/F) are separately gated (Müller et al., 27 Jun 2024, Frey et al., 2023).
- Power Modes: Run (fully active, roughly 100–200 mW or less including peripherals), standby-idle (oscillator, TCDM), and deep sleep (full logic retention) (Müller et al., 27 Jun 2024).
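Two back-of-the-envelope models help in reasoning about these modes. The first is the textbook first-order CMOS relation, in which dynamic power scales with V²·f; the second is the average power of a duty-cycled workload that sleeps between bursts. Both are generic approximations rather than GAP9 measurements. Pairing 1.2 V with 370 MHz as the reference point and the 100 ms period are assumptions, and wake-up and transition costs are ignored.

```python
def relative_dynamic_power(v, f_hz, v_ref, f_ref_hz):
    """First-order dynamic power ratio, P ~ C * V^2 * f (leakage ignored)."""
    return (v / v_ref) ** 2 * (f_hz / f_ref_hz)


def duty_cycled_average_mw(active_mw, active_ms, period_ms, sleep_mw):
    """Average power when active for `active_ms` out of every `period_ms`."""
    if not 0 <= active_ms <= period_ms:
        raise ValueError("active_ms must lie within [0, period_ms]")
    duty = active_ms / period_ms
    return duty * active_mw + (1.0 - duty) * sleep_mw


if __name__ == "__main__":
    # 240 MHz at 0.65 V versus an assumed 370 MHz at 1.2 V reference point.
    print(relative_dynamic_power(0.65, 240e6, 1.2, 370e6))
    # One 17 ms, 93.5 mW inference every 100 ms, sleeping at 45 uW in between.
    print(duty_cycled_average_mw(93.5, 17.0, 100.0, 0.045))
```

Under these assumptions the sleep term contributes well under 0.1 mW, so the average is dominated by how often the active burst runs.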
Key energy-efficiency results:
- Peak: 3030 GOPS/W by datasheet, equivalent to about 330 µW per GOPS (roughly 0.33 pJ per operation)
- Task-level: YOLOv5-tiny achieves 157 GOPS/W (250 MOP in 1.59 mJ); BioGAP reports 16.7 MFLOPS/mW on FFT tasks (Müller et al., 27 Jun 2024, Frey et al., 2023).
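Since GOPS/W is the same quantity as giga-operations per joule, both efficiency figures can be checked from an operation count and an energy. A short sketch with hypothetical helper names:

```python
def gops_per_watt(operations, energy_j):
    """Efficiency in GOPS/W, which equals giga-operations per joule."""
    return operations / energy_j / 1e9


def picojoules_per_op(efficiency_gops_per_w):
    """Energy per operation in picojoules for a given GOPS/W figure."""
    return 1e12 / (efficiency_gops_per_w * 1e9)


if __name__ == "__main__":
    print(gops_per_watt(250e6, 1.59e-3))   # about 157 GOPS/W
    print(picojoules_per_op(3030.0))       # about 0.33 pJ per operation
```

The gap between the peak and task-level numbers is expected: the peak figure assumes ideal utilization, while a full network also pays for memory traffic, control code, and layers that map less well to the accelerator.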
6. Comparative Assessment and System Integration
GAP9-based modules (e.g., GAP9Shield) advance the state of the art in edge AI by combining high throughput, I/O density, and miniaturization. Noteworthy comparisons from Müller et al. (27 Jun 2024):
| Metric | AI-Deck + Ranger Deck | GAP9Shield | Improvement |
|---|---|---|---|
| Weight | 7 g | 6 g | –15 % |
| Volume | 6480 mm³ | 4050 mm³ | –37 % |
| RGB QVGA Frame Rate | 5.8 FPS | 7 FPS | +20 % |
| Max Inference Power | ~150 mW | 93 mW | –38 % |
| Multi-dir. Ranging | 60 Hz single sensor | 40 Hz ×5 | Higher throughput |
These system-level advantages are attributable to the co-design of compute, acceleration, and interface circuitry, enabling real-time object detection, localization, and closed-loop control at power/performance points previously unattainable for palm-sized platforms.
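The "Improvement" column in the comparison table is a relative change with respect to the AI-Deck + Ranger Deck baseline. A minimal sketch of that calculation, using a hypothetical helper name and rounded inputs from the table:

```python
def percent_change(baseline, new):
    """Signed relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    print(percent_change(6480, 4050))  # volume in mm^3
    print(percent_change(150, 93))     # max inference power in mW
    print(percent_change(5.8, 7))      # RGB QVGA frame rate in FPS
```

Because the table entries are themselves rounded, recomputed percentages can differ from the published ones by a point or so.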
7. Implications and Research Significance
Deployments leveraging GAP9 have demonstrated state-of-the-art results in edge-AI for nano-drones—enabling dense sensor fusion, multi-modal SLAM, and sub-20 ms vision inference within strict thermal and mass budgets (Müller et al., 27 Jun 2024). In wearable biomedicine, the same SoC enables near-sensor ML for BCI and biosignal interpretation, achieving >2× bandwidth reduction for wireless data and sub-3 μJ/sample energy profiles (Frey et al., 2023). These capabilities position GAP9 as a competitive platform for next-generation ultra-low-power robotics, human-computer interfaces, and cognitive edge analytics, where throughput, programmability, and power efficiency are all critical design determinants.