
RISC-V AME: Matrix and Tensor Processing Extension

Updated 5 May 2026
  • RISC-V AME is an ISA extension that introduces explicit matrix and tensor instructions to accelerate linear algebra tasks such as GEMM and GEMV.
  • It employs a tile-based execution model with PEP micro-kernels on HBM-PIM hardware, achieving up to 14.9 GFLOP/s on a single Aquabolt-XL pseudo-channel.
  • The design maps tile and accumulator registers to memory banks and uses a reduction-free outer-product dataflow to streamline computation and reduce host intervention.

The RISC-V Attached Matrix Extension (AME) is an architectural extension to the RISC-V instruction set architecture (ISA) that introduces explicit matrix and tensor processing instructions to offload and accelerate linear algebra workloads. AME is designed around an abstract tile-based execution model, targeting efficient support for general matrix-matrix multiplication (GEMM), matrix-vector multiplication (GEMV), and element-wise operations. When mapped onto modern High Bandwidth Memory with Processing-in-Memory (HBM-PIM) technologies, such as Samsung Aquabolt-XL, AME enables direct computation within memory, minimizing data movement, host CPU intervention, and off-chip memory traffic. The AME-PIM approach realizes the semantics of AME on real PIM hardware via per-pseudo-channel micro-kernels and a reduction-free outer-product dataflow, achieving up to 14.9 GFLOP/s (59.4 FLOP/cycle) on a single pseudo-channel (Venieri et al., 30 Apr 2026).

1. AME Instruction Set Architecture

AME extends the baseline RISC-V ISA with matrix- and tile-oriented registers, control and status registers (CSRs), and fixed-width instructions for configuration, computation, and data movement. The register state comprises:

  • Four tile registers (tr0–tr3), each holding an $M \times K$ or $K \times N$ matrix block.
  • Four accumulator registers (acc0–acc3), each holding an $M \times N$ output block.
  • CSRs mtilem, mtilek, mtilen, which define the active tile dimensions $M$, $K$, and $N$.
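
The register state listed above can be pictured as a small data structure. The following Python sketch is illustrative only: the class name `AmeState`, its field and method names, and the choice to model registers as list slots are assumptions made for this example, not part of the AME specification.

```python
from dataclasses import dataclass, field

NUM_TILE_REGS = 4  # tr0-tr3
NUM_ACC_REGS = 4   # acc0-acc3


@dataclass
class AmeState:
    """Toy model of the AME register state (all names are hypothetical)."""
    mtilem: int = 0  # active M dimension
    mtilek: int = 0  # active K dimension
    mtilen: int = 0  # active N dimension
    tiles: list = field(default_factory=lambda: [None] * NUM_TILE_REGS)
    accs: list = field(default_factory=lambda: [None] * NUM_ACC_REGS)

    def set_dims(self, m: int, k: int, n: int) -> None:
        # Stands in for the msettilem / msettilek / msettilen instructions.
        self.mtilem, self.mtilek, self.mtilen = m, k, n

    def tile_shapes(self) -> tuple:
        # A tile register holds either an M x K or a K x N block.
        return ((self.mtilem, self.mtilek), (self.mtilek, self.mtilen))

    def acc_shape(self) -> tuple:
        # An accumulator register holds an M x N output block.
        return (self.mtilem, self.mtilen)
```

With `set_dims(128, 8, 4)`, the sketch reports tile shapes of 128×8 and 8×4 and an accumulator shape of 128×4.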

The instruction set is organized into several groups:

| Group | Example Instructions | Description |
| --- | --- | --- |
| Configuration | msettilem, msettilek, msettilen, mrelease | Set tile dimensions, retire tile/acc state |
| Matrix Multiply-Accumulate | mfmacc.h.mm, mfmacc.h.mv.i | Tile-tile and tile-vector MAC |
| Element-wise | mfadd.h.mm/mv, mfmul.h.mm/mv, mfsub.h.mm/mv | Per-tile add/mul/sub |
| Load/Store | mldtile, mstadtile | Tile load/store (with transposed variants) |
| Miscellaneous | mbcast, mpack, mslide | Broadcast, packing, slice/tile manipulation |
| Control | JUMP imm8, EXIT | PEP micro-kernel control flow |

Operands select destination registers (rd) and source tiles, scalars, or vectors (rs1/rs2), or are encoded as immediates. Function codes encode the operation mode (tile-tile or tile-vector) and the data type (half-precision or integer). Element-wise min/max are not implementable on current PIM hardware due to the lack of comparison instructions (Venieri et al., 30 Apr 2026).

2. PEP-Based Execution Model

To realize AME semantics on HBM-PIM, each AME instruction is mapped to a PEP (Processing Element Program) micro-kernel. These micro-kernels are loaded into a per-pseudo-channel Command Register File (CRF) and executed in lock-step across all PIM units.

The model separates two phases:

  • Setup (All-Bank mode):
    • The host broadcasts one PEP into every CRF using an AB command.
    • CSRs (mtilem, mtilek, mtilen) and PEP loop counters are programmed by host configuration writes.
  • Execution (All-Bank-PIM mode):
    • Each DRAM column command triggers a PEP step in each PIM unit.
    • Each step executes a CRF-fetched 32-bit PIM instruction such as FILL (load from Even-Bank), ADD/MAC/MUL (compute), or MOV (store to Odd-Bank).
    • High-iteration loops are handled via JUMP imm8 instructions.
    • By encapsulating all loop bounds and sequencing into the PEP, only column commands are issued repetitively by the host, decoupling host control from data movement.
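
The execution phase described above can be sketched as a tiny interpreter in which the host only supplies column commands and the device resolves its own loops. This is a toy model under stated assumptions: the function name `run_pep`, the tuple encoding of instructions, and the `JUMP` semantics (a target index plus a count of how many times the jump is taken) are hypothetical and are not the actual CRF encoding.

```python
def run_pep(crf, column_commands):
    """Run a toy PEP: one data instruction per host column command.

    crf is a list of tuples such as ("FILL",), ("MAC",), ("MOV",),
    ("JUMP", target_index, times_taken) or ("EXIT",); it is assumed
    to end with ("EXIT",). Returns the trace of data instructions.
    """
    trace = []
    pc = 0
    remaining = {}  # per-JUMP loop counters, keyed by instruction index
    for _ in range(column_commands):
        # Control flow is resolved inside the device, without the host.
        while True:
            op = crf[pc]
            if op[0] == "JUMP":
                _, target, count = op
                left = remaining.get(pc, count)
                if left > 0:
                    remaining[pc] = left - 1
                    pc = target          # loop back
                else:
                    remaining.pop(pc, None)
                    pc += 1              # loop finished, fall through
            elif op[0] == "EXIT":
                return trace
            else:
                break
        trace.append(op[0])  # the column command triggers this step
        pc += 1
    return trace
```

For the program `[("FILL",), ("MAC",), ("JUMP", 0, 2), ("MOV",), ("EXIT",)]`, the FILL/MAC body runs three times (once plus two jumps back) before the final MOV, however many surplus column commands the host sends.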

The micro-kernels, paraphrased for clarity:

  • ADD-PEP/MUL-PEP: Loads tile blocks, performs element-wise ADD/MUL, stores results.
  • SUB-PEP: Initializes negating vector, implements SUB via MUL+ADD sequence.
  • MAC-PEP: Implements the outer-product kernel, broadcasting columns and performing row-wise parallel MACs.
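
The SUB-PEP idea, subtraction built from multiply and add, can be illustrated in a few lines. This is a minimal sketch that assumes the negating vector is a vector of -1.0 values; the helper names `pim_mul`, `pim_add`, and `sub_pep` are hypothetical, and plain Python lists stand in for SIMD lanes.

```python
def pim_mul(x, y):
    # Element-wise multiply, standing in for a PIM MUL step.
    return [a * b for a, b in zip(x, y)]


def pim_add(x, y):
    # Element-wise add, standing in for a PIM ADD step.
    return [a + b for a, b in zip(x, y)]


def sub_pep(a, b):
    # a - b computed as a + (b * -1), using only MUL and ADD.
    neg_one = [-1.0] * len(b)  # assumed form of the negating vector
    return pim_add(a, pim_mul(b, neg_one))
```

The extra MUL step is one plausible reason an emulated subtraction would cost more than a native add, though the source does not break down the difference.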

All control flow is internal to the device, with minimal host involvement.

3. Reduction-Free Outer-Product Dataflow

A principal challenge in mapping AME to PIM is the lack of native cross-lane reduction within the memory device. The AME-PIM approach resolves this via a reduction-free outer-product dataflow.

Matrix multiplication $C = A \cdot B$ is computed as:

$$C \gets C + \sum_{k=0}^{K-1} a_k b_k^T$$

  • $A \in \mathbb{R}^{128 \times K}$
  • $B \in \mathbb{R}^{K \times N}$
  • $C \in \mathbb{R}^{128 \times N}$

The MAC-PEP kernel iterates $K$ times, processing one column $a_k$ from $A$ and one row $b_k^T$ from $B$ per step via loading, broadcasting, and parallel MACs. Each lane independently maintains its row-wise partial sum, enabling accumulation entirely in DRAM Odd-Bank buffers (accumulators). Final results are read out by host-initiated DRAM reads.

A key implication is that the dataflow sidesteps the need for explicit on-chip reductions, which current HBM-PIM devices do not support.
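
To make the reduction-free property concrete, the sketch below compares a conventional inner-product GEMM, which sums across $k$ for each output element, with the outer-product form, in which every row index only ever updates its own row of $C$. It is a plain-Python illustration; the function names are hypothetical and nested lists stand in for tiles and lanes.

```python
def matmul_inner(A, B):
    # Reference: each C[i][j] is a dot product, i.e. a reduction over k.
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def matmul_outer(A, B):
    # Outer-product form: C += a_k * b_k^T for k = 0 .. K-1.
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for k in range(K):                     # one MAC step per k
        a_k = [A[i][k] for i in range(M)]  # column k of A
        b_k = B[k]                         # row k of B, broadcast to all rows
        for i in range(M):                 # each i plays the role of one lane
            for j in range(N):
                C[i][j] += a_k[i] * b_k[j]  # lane i touches only row i
    return C
```

Because lane `i` never reads another lane's partial sum, no cross-lane reduction is needed; the accumulation over `k` happens in place in each lane's own row.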

4. Register Mapping, Data Layout, and Performance

The mapping between AME logical registers and HBM-PIM bank-local data structures is as follows:

  • Tile registers (tr0–tr3) occupy Even-Banks of a pseudo-channel; the largest tile in the reported measurements is 128×4096 elements.
  • Accumulator registers (acc0–acc3) reside in Odd-Banks.
  • Each pseudo-channel comprises eight PIM units, each with 16 FP16 SIMD lanes, providing 128-wide parallelism.

Theoretical peak throughput per pseudo-channel:

  • $8 \times 16 = 128$ MACs/cycle $\Rightarrow$ $256$ FLOP/cycle (1 MAC = 2 FLOP).
  • Measured peak for mfmacc is $59.4$ FLOP/cycle ($14.9$ GFLOP/s at $250$ MHz), roughly 23% of the theoretical limit, due to the balance between data movement and compute.

For element-wise instructions:

| Instruction | Measured FLOP/cycle | GFLOP/s (@250 MHz) |
| --- | --- | --- |
| mfadd | 31.6 | 7.9 |
| mfmul | 25.4 | 6.3 |
| mfsub | 29.9 | 7.5 |
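
The GFLOP/s figures follow from the per-cycle figures and the 250 MHz clock. The short check below reproduces that arithmetic; all numbers are taken from this article, while the names `gflops`, `MEASURED`, and `REPORTED` are introduced only for the example. The peak figure assumes one MAC per lane per cycle across eight PIM units with 16 lanes each.

```python
CLOCK_MHZ = 250

# Eight PIM units x 16 FP16 lanes, 1 MAC = 2 FLOP.
PEAK_FLOP_PER_CYCLE = 8 * 16 * 2

MEASURED = {"mfmacc": 59.4, "mfadd": 31.6, "mfmul": 25.4, "mfsub": 29.9}
REPORTED = {"mfmacc": 14.9, "mfadd": 7.9, "mfmul": 6.3, "mfsub": 7.5}


def gflops(flop_per_cycle, clock_mhz=CLOCK_MHZ):
    # FLOP/cycle * cycles/second, expressed in units of 1e9 FLOP/s.
    return flop_per_cycle * clock_mhz / 1000.0


for name, per_cycle in MEASURED.items():
    share = per_cycle / PEAK_FLOP_PER_CYCLE
    print(f"{name}: {gflops(per_cycle):.2f} GFLOP/s, {share:.1%} of peak")
```

Each computed value agrees with the reported one to within rounding at one decimal place.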

Tiling is column-major per bank, so the address of a tile element is determined by its row and column indices within the tile register that holds it.

This mapping enables efficient in-memory MAC computation and data movement aligned with PIM hardware constraints.

5. End-to-End Execution for Element-wise, GEMV, and GEMM

The AME-PIM system supports complete, host-minimized execution for GEMV, GEMM, and element-wise kernels:

  1. The host loads input tiles into Even- and Odd-Bank layouts via standard DRAM writes.
  2. The host issues AB commands to broadcast and install PEP micro-kernels.
  3. The host configures mtilem, mtilek, mtilen CSRs and updates PEP loop counters.
  4. The host enters AB-PIM mode, sending column commands for each micro-step:
    • GEMV: MAC-PEP is run as a series of passes over the input.
    • GEMM: MAC-PEP steps are repeated for each sub-block.
  5. PIM units handle all in-memory loads, computation, and stores with no further host intervention.
  6. Upon completion, the host reads the acc tile(s) using standard DRAM reads.
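
The ordering of the six steps above can be sketched with a mock device that only records what the host does. Everything here is hypothetical: `MockPimChannel`, its methods, and `run_kernel` are not a real driver API, and the mock performs no computation. The point is only to show where host involvement starts and stops.

```python
class MockPimChannel:
    """Hypothetical stand-in for one pseudo-channel; it only logs calls."""

    def __init__(self):
        self.log = []

    def dram_write(self, what):
        self.log.append(("dram_write", what))

    def set_mode(self, mode):
        self.log.append(("set_mode", mode))

    def load_pep(self, name):
        self.log.append(("load_pep", name))

    def configure(self, m, k, n):
        self.log.append(("configure", (m, k, n)))

    def column_command(self):
        self.log.append(("column_command", None))

    def dram_read(self, what):
        self.log.append(("dram_read", what))
        return what


def run_kernel(ch, pep_name, dims, num_column_commands):
    ch.dram_write("input tiles")   # step 1: standard DRAM writes
    ch.set_mode("AB")              # step 2: All-Bank mode, broadcast the PEP
    ch.load_pep(pep_name)
    ch.configure(*dims)            # step 3: mtilem / mtilek / mtilen, counters
    ch.set_mode("AB-PIM")          # step 4: All-Bank-PIM mode
    for _ in range(num_column_commands):
        ch.column_command()        # step 5 runs inside the device per command
    return ch.dram_read("acc")     # step 6: read back the accumulator tile
```

Between the mode switch and the final read, the only host traffic in this model is the stream of column commands.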

Reductions are performed directly into Odd-Bank accumulators, eliminating the need for external reduction engines or additional host processing. Host involvement is limited to initial/final DRAM and column commands, dramatically reducing off-chip traffic.

Measured end-to-end performance on a single Aquabolt-XL pseudo-channel (128×4096 tile):

  • mfmacc: up to 14.9 GFLOP/s (59.4 FLOP/cycle)
  • Element-wise: approximately 7–8 GFLOP/s at maximum tile size

6. Significance and Implications

The AME-PIM realization demonstrates that a JEDEC-compatible PIM device, despite ISA and reduction limitations, can natively execute a tile-level matrix ISA such as AME. By mapping register-level semantics onto HBM bank-resident buffers, and by encapsulating matrix instructions into micro-kernels executed entirely in-memory, the architecture exposes a general-purpose, ISA-level tensor accelerator with minimized host management and memory movement overhead.

A plausible implication is that future HBM-PIM devices, with richer instruction sets or native cross-lane reduction, could further close the gap to ideal throughput for matrix and tensor kernels, and efficiently support broader classes of AME instructions such as min/max and non-arithmetic operations (Venieri et al., 30 Apr 2026).
