RISC-V AME: Matrix and Tensor Processing Extension
- RISC-V AME is an ISA extension that introduces explicit matrix and tensor instructions to accelerate linear algebra tasks such as GEMM and GEMV.
- It employs a tile-based execution model with PEP micro-kernels on HBM-PIM hardware, achieving up to 14.9 GFLOP/s on Aquabolt-XL systems.
- The design maps tile and accumulator registers to memory banks and uses a reduction-free outer-product dataflow to streamline computation and reduce host intervention.
The RISC-V Attached Matrix Extension (AME) is an architectural extension to the RISC-V instruction set architecture (ISA) that introduces explicit matrix and tensor processing instructions to offload and accelerate linear algebra workloads. AME is designed around an abstract tile-based execution model, targeting efficient support for general matrix-matrix multiplication (GEMM), matrix-vector multiplication (GEMV), and element-wise operations. When mapped onto modern High Bandwidth Memory with Processing-in-Memory (HBM-PIM) technologies, such as Samsung Aquabolt-XL, AME enables direct computation within memory, minimizing data movement, host CPU intervention, and off-chip memory traffic. The AME-PIM approach realizes the semantics of AME on real PIM hardware via per-pseudo-channel micro-kernels and a reduction-free outer-product dataflow, achieving up to 14.9 GFLOP/s (59.4 FLOP/cycle) on a single pseudo-channel (Venieri et al., 30 Apr 2026).
1. AME Instruction Set Architecture
AME extends the baseline RISC-V ISA with matrix- and tile-oriented registers, control and status registers (CSRs), and fixed-width instructions for configuration, computation, and data movement. The register state comprises:
- Four tile registers (tr0–tr3), each holding an $M \times K$ or $K \times N$ matrix block.
- Four accumulator registers (acc0–acc3), each holding an $M \times N$ output block.
- CSRs mtilem, mtilek, mtilen, which define the active tile dimensions $M$, $K$, and $N$.
The instruction set is organized into several groups:
| Group | Example Instructions | Description |
|---|---|---|
| Configuration | msettilem, msettilek, msettilen, mrelease | Set tile dimensions, retire tile/acc state |
| Matrix Multiply-Accumulate | mfmacc.h.mm, mfmacc.h.mv.i | Tile-tile and tile-vector MAC |
| Element-wise | mfadd.h.mm/mv, mfmul.h.mm/mv, mfsub.h.mm/mv | Per-tile add/mul/sub |
| Load/Store | mldtile, mstadtile | Tile load/store (with transposed variants) |
| Miscellaneous | mbcast, mpack, mslide | Broadcast, packing, slice/tile manipulation |
| Control | JUMP imm8, EXIT | PEP micro-kernel control flow |
Operands select the destination register (rd) and the source tiles, scalars, or vectors (rs1/rs2), or are encoded as immediates. Function codes encode the operation mode (tile-tile, tile-vector, half-precision, integer). Element-wise min/max cannot be implemented on current PIM hardware because it lacks comparison instructions (Venieri et al., 30 Apr 2026).
2. PEP-Based Execution Model
To realize AME semantics on HBM-PIM, each AME instruction is mapped to a PEP (Processing Element Program) micro-kernel. These micro-kernels are loaded into a per-pseudo-channel Command Register File (CRF) and executed in lock-step across all PIM units.
The model separates two phases:
- Setup (All-Bank mode):
  - The host broadcasts one PEP into every CRF using an AB command.
  - CSRs (mtilem, mtilek, mtilen) and PEP loop counters are programmed by host configuration writes.
- Execution (All-Bank-PIM mode):
  - Each DRAM column command triggers a PEP step in each PIM unit.
  - Each step executes a CRF-fetched 32-bit PIM instruction such as FILL (load from Even-Bank), ADD/MAC/MUL/SUB (compute), or MOV (store to Odd-Bank).
  - Loops with high iteration counts are handled via JUMP imm8 instructions.

Because all loop bounds and sequencing are encapsulated in the PEP, the host only has to issue column commands repeatedly, which decouples host control from data movement.
Pseudocode (paraphrased for clarity):
- ADD-PEP/MUL-PEP: Loads tile blocks, performs element-wise ADD/MUL, stores results.
- SUB-PEP: Initializes negating vector, implements SUB via MUL+ADD sequence.
- MAC-PEP: Implements the outer-product kernel, broadcasting columns and performing row-wise parallel MACs.
All control-flow is internal to the device, with minimal host involvement.
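The SUB-PEP description above can be illustrated with a minimal Python sketch. It assumes the "negating vector" is a vector of -1.0 values, which the text does not state explicitly; the function names `pep_mul`, `pep_add`, and `pep_sub` are hypothetical and only model the arithmetic, not the PIM instruction encoding or bank accesses.

```python
def pep_mul(a, b):
    """Element-wise multiply, standing in for the PIM MUL step."""
    return [x * y for x, y in zip(a, b)]


def pep_add(a, b):
    """Element-wise add, standing in for the PIM ADD step."""
    return [x + y for x, y in zip(a, b)]


def pep_sub(a, b):
    """Compute a - b as a + (b * -1), i.e. a MUL followed by an ADD."""
    # Assumption: the negating vector holds -1.0 in every lane.
    negating = [-1.0] * len(b)
    return pep_add(a, pep_mul(b, negating))
```

For example, `pep_sub([5.0, 2.0, 0.5], [1.0, 4.0, 0.5])` yields `[4.0, -2.0, 0.0]`, the same result a native subtract would give.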
3. Reduction-Free Outer-Product Dataflow
A principal challenge in mapping AME to PIM is the lack of native cross-lane reduction within the memory device. The AME-PIM approach resolves this via a reduction-free outer-product dataflow.
Matrix multiplication is computed as a sum of outer products:

$$C = AB = \sum_{k=1}^{K} a_k \, b_k^{\top}$$

where $a_k$ is the $k$-th column of $A$ and $b_k^{\top}$ is the $k$-th row of $B$.

The MAC-PEP kernel iterates $K$ times, processing one column $a_k$ from $A$ and one row $b_k^{\top}$ from $B$ per step via loading, broadcasting, and parallel MACs. Each lane independently maintains its row-wise partial sum, enabling accumulation entirely in DRAM Odd-Bank buffers (accumulators). Final results are read out by host-initiated DRAM reads.
A key implication is that the dataflow sidesteps the need for explicit on-chip reductions, which current HBM-PIM devices do not natively support.
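The following Python sketch illustrates why an outer-product formulation needs no cross-lane reduction. It is a plain software model, not the MAC-PEP itself: the function name `outer_product_gemm` is hypothetical, and the assignment of one accumulator row per lane is an assumption made for illustration.

```python
def outer_product_gemm(A, B):
    """Multiply A (M x K) by B (K x N) as K rank-1 updates of the accumulator."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    assert len(B) == K, "inner dimensions must match"

    acc = [[0.0] * N for _ in range(M)]  # stands in for an accumulator register
    for k in range(K):
        col = [A[i][k] for i in range(M)]  # column k of A
        row = B[k]                         # row k of B
        # Each i models one lane: it only ever touches its own row of acc,
        # so no value has to be summed across lanes.
        for i in range(M):
            a_ik = col[i]
            for j in range(N):
                acc[i][j] += a_ik * row[j]
    return acc
```

Every accumulator element is updated in place once per step, so the final value is complete after the last step. An inner-product formulation would instead produce per-lane partial products that have to be summed across lanes.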
4. Register Mapping, Data Layout, and Performance
The mapping between AME logical registers and HBM-PIM bank-local data structures is as follows:
- Tile registers (tr0–tr3) occupy Even-Banks of a pseudo-channel, each holding a tile of up to $128 \times 4096$ elements (the maximum tile size used in the measurements of Section 5).
- Accumulator registers (acc0–acc3) reside in Odd-Banks.
- Each pseudo-channel comprises eight PIM units, each with 16 FP16 SIMD lanes, providing 128-wide parallelism.
Theoretical peak throughput per pseudo-channel:
- $8 \times 16 = 128$ MACs/cycle, i.e. $256$ FLOP/cycle (1 MAC = 2 FLOP).
- Measured peak for mfmacc is 59.4 FLOP/cycle (14.9 GFLOP/s at 250 MHz), about 23% of the theoretical limit, due to the balance between data movement and compute.
For element-wise instructions:
| Instruction | Measured FLOP/cycle | GFLOP/s (@250 MHz) |
|---|---|---|
| mfadd | 31.6 | 7.9 |
| mfmul | 25.4 | 6.3 |
| mfsub | 29.9 | 7.5 |
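The throughput figures can be cross-checked with a few lines of Python. This sketch uses only numbers stated in this article (eight PIM units, 16 FP16 lanes each, 2 FLOP per MAC, a 250 MHz clock, and the measured FLOP/cycle values); it assumes each lane completes at most one MAC per cycle, and the helper names are hypothetical.

```python
PIM_UNITS = 8        # PIM units per pseudo-channel
LANES_PER_UNIT = 16  # FP16 SIMD lanes per PIM unit
FLOP_PER_MAC = 2     # one multiply plus one add
CLOCK_MHZ = 250      # clock used in the GFLOP/s figures

# Assumption: one MAC per lane per cycle at peak.
PEAK_MACS_PER_CYCLE = PIM_UNITS * LANES_PER_UNIT
PEAK_FLOP_PER_CYCLE = PEAK_MACS_PER_CYCLE * FLOP_PER_MAC

# Measured FLOP/cycle values quoted in the text and table.
MEASURED = {"mfmacc": 59.4, "mfadd": 31.6, "mfmul": 25.4, "mfsub": 29.9}


def gflops(flop_per_cycle, clock_mhz=CLOCK_MHZ):
    """Convert FLOP/cycle to GFLOP/s at the given clock in MHz."""
    return flop_per_cycle * clock_mhz / 1000.0


def utilization(flop_per_cycle):
    """Fraction of the theoretical peak that a measurement reaches."""
    return flop_per_cycle / PEAK_FLOP_PER_CYCLE
```

With these definitions, `gflops(MEASURED["mfmacc"])` gives 14.85, which matches the quoted 14.9 GFLOP/s after rounding, and `utilization(MEASURED["mfmacc"])` is roughly 0.23.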
Tiling is column-major per bank, and the address of each element within a tile register is determined by its position in that layout.
This mapping enables efficient in-memory MAC computation and data movement aligned with PIM hardware constraints.
5. End-to-End Execution for Element-wise, GEMV, and GEMM
The AME-PIM system supports complete, host-minimized execution for GEMV, GEMM, and element-wise kernels:
- The host loads input tiles into Even- and Odd-Bank layouts via standard DRAM writes.
- The host issues AB commands to broadcast and install PEP micro-kernels.
- The host configures mtilem, mtilek, mtilen CSRs and updates PEP loop counters.
- The host enters AB-PIM mode, sending column commands for each micro-step:
  - GEMV: the MAC-PEP is run in repeated passes that accumulate the matrix-vector product.
  - GEMM: the MAC-PEP steps are repeated for each tile-sized sub-block.
- PIM units handle all in-memory loads, computation, and stores with no further host intervention.
- Upon completion, the host reads the acc tile(s) using standard DRAM reads.
Reductions are performed directly into Odd-Bank accumulators, eliminating the need for external reduction engines or additional host processing. Host involvement is limited to initial/final DRAM and column commands, dramatically reducing off-chip traffic.
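As a rough software analogue of repeating the MAC kernel per sub-block, the sketch below tiles a larger GEMM into blocks and accumulates each block with rank-1 updates. The blocking scheme, the tile parameters `tm`, `tn`, `tk`, and the function names are assumptions for illustration; the text does not specify how AME-PIM orders or sizes its sub-blocks.

```python
def mac_tile(acc, A, B, i0, j0, k0, tm, tn, tk):
    """Accumulate one output tile of at most tm x tn over at most tk steps."""
    M, K, N = len(A), len(B), len(B[0])
    for k in range(k0, min(k0 + tk, K)):
        for i in range(i0, min(i0 + tm, M)):
            a_ik = A[i][k]
            for j in range(j0, min(j0 + tn, N)):
                acc[i][j] += a_ik * B[k][j]  # in-place accumulation


def tiled_gemm(A, B, tm, tn, tk):
    """Compute A @ B by invoking mac_tile once per (row, column, depth) block."""
    M, K, N = len(A), len(B), len(B[0])
    assert len(A[0]) == K, "inner dimensions must match"
    acc = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tm):
        for j0 in range(0, N, tn):
            for k0 in range(0, K, tk):
                mac_tile(acc, A, B, i0, j0, k0, tm, tn, tk)
    return acc
```

Because each call only adds into the accumulator region it owns, the blocks can be processed in any order along the depth dimension without a separate reduction pass.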
Measured end-to-end performance on a single Aquabolt-XL pseudo-channel (128×4096 tile):
- mfmacc: up to 14.9 GFLOP/s (59.4 FLOP/cycle)
- Element-wise: approximately 7–8 GFLOP/s at maximum tile size
6. Significance and Implications
The AME-PIM realization demonstrates that a JEDEC-compatible PIM device, despite ISA and reduction limitations, can natively execute a tile-level matrix ISA such as AME. By mapping register-level semantics onto HBM bank-resident buffers, and by encapsulating matrix instructions into micro-kernels executed entirely in-memory, the architecture exposes a general-purpose, ISA-level tensor accelerator with minimized host management and memory movement overhead.
A plausible implication is that future HBM-PIM devices, with richer instruction sets or native cross-lane reduction, could further close the gap to ideal throughput for matrix and tensor kernels, and efficiently support broader classes of AME instructions such as min/max and non-arithmetic operations (Venieri et al., 30 Apr 2026).