RISC-V AME: Matrix and Tensor Processing Extension

Updated 5 May 2026

RISC-V AME is an ISA extension that introduces explicit matrix and tensor instructions to accelerate linear algebra tasks such as GEMM and GEMV.
It employs a tile-based execution model with PEP micro-kernels on HBM-PIM hardware, achieving up to 14.9 GFLOP/s on a single Aquabolt-XL pseudo-channel.
The design maps tile and accumulator registers to memory banks and uses a reduction-free outer-product dataflow to streamline computation and reduce host intervention.

The RISC-V Attached Matrix Extension (AME) is an architectural extension to the RISC-V instruction set architecture (ISA) that introduces explicit matrix and tensor processing instructions to offload and accelerate linear algebra workloads. AME is designed around an abstract tile-based execution model, targeting efficient support for general matrix-matrix multiplication (GEMM), matrix-vector multiplication (GEMV), and element-wise operations. When mapped onto modern High Bandwidth Memory with Processing-in-Memory (HBM-PIM) technologies, such as Samsung Aquabolt-XL, AME enables direct computation within memory, minimizing data movement, host CPU intervention, and off-chip memory traffic. The AME-PIM approach realizes the semantics of AME on real PIM hardware via per-pseudo-channel micro-kernels and a reduction-free outer-product dataflow, achieving up to 14.9 GFLOP/s (59.4 FLOP/cycle) on a single pseudo-channel (Venieri et al., 30 Apr 2026).

1. AME Instruction Set Architecture

AME extends the baseline RISC-V ISA with matrix- and tile-oriented registers, control and status registers (CSRs), and fixed-width instructions for configuration, computation, and data movement. The register state comprises:

Four tile registers (tr0–tr3), each holding an $M \times K$ or $K \times N$ matrix block.
Four accumulator registers (acc0–acc3), each holding an $M \times N$ output block.
CSRs mtilem, mtilek, mtilen, which define the active tile dimensions $M$, $K$, and $N$.
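The register state listed above can be pictured with a small software model. This is a minimal sketch, not the paper's implementation: the class name `AmeState`, the helper `zeros`, and the method `load_tile` are hypothetical names introduced for illustration, and matrices are plain Python lists rather than hardware tiles.

```python
from dataclasses import dataclass, field

NUM_TILE_REGS = 4   # tr0-tr3
NUM_ACC_REGS = 4    # acc0-acc3


def zeros(rows, cols):
    """Return a rows x cols matrix of zeros as a list of lists."""
    return [[0.0] * cols for _ in range(rows)]


@dataclass
class AmeState:
    """Hypothetical software model of the AME register state."""
    mtilem: int  # active M dimension (CSR mtilem)
    mtilek: int  # active K dimension (CSR mtilek)
    mtilen: int  # active N dimension (CSR mtilen)
    tiles: list = field(default_factory=list)
    accs: list = field(default_factory=list)

    def __post_init__(self):
        # Tile registers hold M x K or K x N blocks; they start empty
        # and take their shape from whatever block is loaded.
        self.tiles = [None] * NUM_TILE_REGS
        # Accumulators always hold an M x N output block.
        self.accs = [zeros(self.mtilem, self.mtilen)
                     for _ in range(NUM_ACC_REGS)]

    def load_tile(self, idx, block):
        """Load a block into tile register idx after checking its shape."""
        shape = (len(block), len(block[0]))
        allowed = {(self.mtilem, self.mtilek), (self.mtilek, self.mtilen)}
        if shape not in allowed:
            raise ValueError(f"tile shape {shape} not in {allowed}")
        self.tiles[idx] = block
```

With `AmeState(mtilem=2, mtilek=3, mtilen=4)`, a 2×3 or 3×4 block loads successfully, while a 4×4 block is rejected because it matches neither operand shape.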

The instruction set is organized into several groups:

Group	Example Instructions	Description
Configuration	msettilem, msettilek, msettilen, mrelease	Set tile dimensions, retire tile/acc state
Matrix Multiply-Accumulate	mfmacc.h.mm, mfmacc.h.mv.i	Tile-tile and tile-vector MAC
Element-wise	mfadd.h.mm/mv, mfmul.h.mm/mv, mfsub.h.mm/mv	Per-tile add/mul/sub
Load/Store	mldtile, mstadtile	Tile load/store (with transposed variants)
Miscellaneous	mbcast, mpack, mslide	Broadcast, packing, slice/tile manipulation
Control	JUMP imm8, EXIT	PEP micro-kernel control flow

Operands select destination registers (rd) and source tiles or scalars/vectors (rs1/rs2), or are encoded as immediates. Function codes encode the operation mode (tile-tile, tile-vector, half-precision, integer). Element-wise min/max are not implementable on current PIM hardware due to the lack of comparison instructions (Venieri et al., 30 Apr 2026).

2. PEP-Based Execution Model

To realize AME semantics on HBM-PIM, each AME instruction is mapped to a PEP (Processing Element Program) micro-kernel. These micro-kernels are loaded into a per-pseudo-channel Command Register File (CRF) and executed in lock-step across all PIM units.

The model separates two phases:

Setup (All-Bank mode):
- The host broadcasts one PEP into every CRF using an AB command.
- CSRs (mtilem, mtilek, mtilen) and PEP loop counters are programmed by host configuration writes.
Execution (All-Bank-PIM mode):
- Each DRAM column command triggers a PEP step in each PIM unit.
- Each step executes a CRF-fetched 32-bit PIM instruction such as FILL (load from Even-Bank), ADD/MAC/MUL/SUB (compute), or MOV (store to Odd-Bank).
- High-iteration loops are handled via JUMP imm8 instructions.
- By encapsulating all loop bounds and sequencing into the PEP, only column commands are issued repetitively by the host, decoupling host control from data movement.
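The execution phase described above can be sketched as a toy interpreter in which every host column command advances the PEP by one step. This is an illustration under stated assumptions, not the device's actual behavior: the tuple encoding `("JUMP", target, count)` is hypothetical, `count` is taken to mean the number of backward jumps, and control-flow instructions are assumed not to consume a column command.

```python
def run_pep(crf, num_column_commands):
    """Toy model: each host column command triggers one PEP step.

    crf is a list of tuples whose first field is the opcode. The program
    is expected to end with ("EXIT",). ("JUMP", target, count) is a
    hypothetical encoding: jump back to index `target`, `count` times.
    Returns the trace of data/compute opcodes that were executed.
    """
    pc = 0
    remaining = {}   # per-JUMP loop counters, keyed by instruction index
    trace = []
    for _ in range(num_column_commands):
        # Resolve control flow without consuming a column command
        # (an assumption of this sketch).
        while crf[pc][0] == "JUMP":
            _, target, count = crf[pc]
            left = remaining.get(pc, count)
            if left > 0:
                remaining[pc] = left - 1
                pc = target
            else:
                remaining.pop(pc, None)  # loop finished, fall through
                pc += 1
        if crf[pc][0] == "EXIT":
            break
        trace.append(crf[pc][0])  # FILL, MAC, MOV, ... one step per command
        pc += 1
    return trace


# Hypothetical PEP: a FILL+MAC body executed three times, then one store.
example_crf = [("FILL",), ("MAC",), ("JUMP", 0, 2), ("MOV",), ("EXIT",)]
example_trace = run_pep(example_crf, num_column_commands=100)
```

The host only decides how many column commands to send; the loop structure lives in the CRF, which is the decoupling the text describes.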

Pseudocode (paraphrased for clarity):

ADD-PEP/MUL-PEP: Loads tile blocks, performs element-wise ADD/MUL, stores results.
SUB-PEP: Initializes negating vector, implements SUB via MUL+ADD sequence.
MAC-PEP: Implements the outer-product kernel, broadcasting columns and performing row-wise parallel MACs.
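The SUB-PEP idea, subtraction built from a multiply followed by an add, can be illustrated in a few lines. This is a minimal sketch that assumes the negating vector is filled with -1; the function name `sub_via_mul_add` is hypothetical and the lists stand in for SIMD lanes.

```python
def sub_via_mul_add(a, b):
    """Compute a - b element-wise using only multiply and add steps."""
    neg = [-1.0] * len(b)                     # negating vector (assumed all -1)
    neg_b = [x * n for x, n in zip(b, neg)]   # MUL step: produces -b
    return [x + y for x, y in zip(a, neg_b)]  # ADD step: a + (-b)


result = sub_via_mul_add([3.0, 1.0], [1.0, 4.0])
```

The extra MUL step is one plausible reason a subtraction kernel would not match the throughput of a plain add kernel.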

All control-flow is internal to the device, with minimal host involvement.

3. Reduction-Free Outer-Product Dataflow

A principal challenge in mapping AME to PIM is the lack of native cross-lane reduction within the memory device. The AME-PIM approach resolves this via a reduction-free outer-product dataflow.

Matrix multiplication $C = A \cdot B$ is computed as:

$C \gets C + \sum_{k=0}^{K-1} a_k b_k^T$

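where $A \in \mathbb{R}^{128 \times K}$, $B \in \mathbb{R}^{K \times N}$, and $C \in \mathbb{R}^{128 \times N}$; $a_k$ denotes column $k$ of $A$ and $b_k^T$ denotes row $k$ of $B$.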

The MAC-PEP kernel iterates $K$ times, processing one column $a_k$ from $A$ and one row $b_k^T$ from $B$ per step via loading, broadcasting, and parallel MACs. Each lane independently maintains its row-wise partial sum, enabling accumulation entirely in DRAM Odd-Bank buffers (accumulators). Final results are read out by host-initiated DRAM reads.
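The outer-product formulation can be written as a short reference loop. This is a sketch in plain Python, not the MAC-PEP itself: the function names are hypothetical, each value of `i` plays the role of one lane, and the sketch does not model which operand the hardware broadcasts. The point it shows is that every output element is updated only by its own lane, so no sum across lanes is ever needed.

```python
def gemm_outer_product(A, B, C):
    """Accumulate C <- C + sum_k a_k b_k^T without any cross-lane reduction."""
    M, K, N = len(A), len(B), len(B[0])
    for k in range(K):                      # one rank-1 update per k
        b_row = B[k]                        # row k of B, shared by all lanes
        for i in range(M):                  # lane i owns output row i
            a_ik = A[i][k]                  # element i of column k of A
            for j in range(N):
                C[i][j] += a_ik * b_row[j]  # lane-local multiply-accumulate
    return C


def gemm_inner_product(A, B):
    """Reference dot-product GEMM, used only to check the result."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = gemm_outer_product(A, B, [[0.0, 0.0], [0.0, 0.0]])
```

In the inner-product version each output element needs a sum over `k` of products that would sit in different lanes; in the outer-product version that sum is spread over time instead, one accumulation per iteration.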

A key implication is that the dataflow sidesteps the need for explicit on-chip reductions, a limitation in current HBM-PIM devices.

4. Register Mapping, Data Layout, and Performance

The mapping between AME logical registers and HBM-PIM bank-local data structures is as follows:

Tile registers (tr0–tr3) occupy Even-Banks of a pseudo-channel; the largest tile used in the reported measurements is $128 \times 4096$ elements.
Accumulator registers (acc0–acc3) reside in Odd-Banks.
Each pseudo-channel comprises eight PIM units, each with 16 FP16 SIMD lanes, providing 128-wide parallelism.

Theoretical peak throughput per pseudo-channel:

$8 \times 16 = 128$ MACs/cycle $\Rightarrow$ $256$ FLOP/cycle (1 MAC = 2 FLOP).
Measured peak for mfmacc is $59.4$ FLOP/cycle ($14.9$ GFLOP/s at $250$ MHz), roughly $23\%$ of the theoretical limit, a gap attributed to the balance between data movement and compute.
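The throughput arithmetic can be checked with a few lines. This sketch takes the 8 PIM units and 16 FP16 lanes stated above, the 1 MAC = 2 FLOP convention, and the 250 MHz clock given in the element-wise table header; the function names are hypothetical.

```python
PIM_UNITS_PER_PCH = 8      # PIM units per pseudo-channel
FP16_LANES_PER_UNIT = 16   # FP16 SIMD lanes per PIM unit
FLOP_PER_MAC = 2           # one multiply plus one add
CLOCK_HZ = 250e6           # 250 MHz


def peak_flop_per_cycle():
    """Theoretical peak FLOP/cycle for one pseudo-channel."""
    return PIM_UNITS_PER_PCH * FP16_LANES_PER_UNIT * FLOP_PER_MAC


def to_gflops(flop_per_cycle, clock_hz=CLOCK_HZ):
    """Convert a FLOP/cycle figure to GFLOP/s at the given clock."""
    return flop_per_cycle * clock_hz / 1e9


def efficiency(measured_flop_per_cycle):
    """Fraction of the theoretical peak reached by a measurement."""
    return measured_flop_per_cycle / peak_flop_per_cycle()


print(peak_flop_per_cycle())            # theoretical peak, FLOP/cycle
print(round(to_gflops(59.4), 2))        # mfmacc, GFLOP/s
print(round(efficiency(59.4), 3))       # mfmacc, fraction of peak
```

The same `to_gflops` conversion reproduces the element-wise figures to within rounding, for example 31.6 FLOP/cycle for mfadd corresponds to 7.9 GFLOP/s.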

For element-wise instructions:

Instruction	Measured FLOP/cycle	GFLOP/s (@250 MHz)
mfadd	31.6	7.9
mfmul	25.4	6.3
mfsub	29.9	7.5

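Tiling is column-major per bank: within a tile register, consecutive elements of a column occupy consecutive locations, so the address of element $(i, j)$ follows from the register's placement in the bank and the element's column-major position.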

This mapping enables efficient in-memory MAC computation and data movement aligned with PIM hardware constraints.
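As a rough illustration of column-major addressing, the sketch below computes a byte address from a row index, a column index, and the number of rows. It shows only the generic column-major rule; the per-register base address, the 2-byte FP16 element size, and the function names are assumptions, and the actual bank-level mapping on the device may interleave data differently.

```python
FP16_BYTES = 2  # assumed element size for half-precision tiles


def column_major_offset(i, j, rows):
    """Byte offset of element (i, j) in a column-major tile with `rows` rows."""
    return (j * rows + i) * FP16_BYTES


def element_address(base, i, j, rows):
    """Byte address of element (i, j), given a hypothetical base address."""
    return base + column_major_offset(i, j, rows)


# Moving down a column steps by one element; moving across a row steps
# by a whole column of `rows` elements.
step_down = element_address(0, 1, 0, 128) - element_address(0, 0, 0, 128)
step_across = element_address(0, 0, 1, 128) - element_address(0, 0, 0, 128)
```

With 128 rows, elements of one column are contiguous, which matches the 128-wide parallelism of a pseudo-channel operating on one column at a time.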

5. End-to-End Execution for Element-wise, GEMV, and GEMM

The AME-PIM system supports complete, host-minimized execution for GEMV, GEMM, and element-wise kernels:

The host loads input tiles into Even- and Odd-Bank layouts via standard DRAM writes.
The host issues AB commands to broadcast and install PEP micro-kernels.
The host configures mtilem, mtilek, mtilen CSRs and updates PEP loop counters.
The host enters AB-PIM mode, sending column commands for each micro-step:
- GEMV ($N = 1$): MAC-PEP requires $K$ passes.
- GEMM ($N > 1$): MAC-PEP steps are repeated for each output sub-block.
PIM units handle all in-memory loads, computation, and stores with no further host intervention.
Upon completion, the host reads the acc tile(s) using standard DRAM reads.

Reductions are performed directly into Odd-Bank accumulators, eliminating the need for external reduction engines or additional host processing. Host involvement is limited to the initial DRAM writes, the column commands that drive each PEP step, and the final DRAM reads, which reduces off-chip traffic.
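The host-side sequence can be summarized with a mock that only records what the host does. Everything here is hypothetical: `MockPimChannel`, its methods, and `host_gemv` are illustration names rather than a real driver API, and the sketch assumes a GEMV needs one MAC-PEP pass per value of k, with a configurable number of column commands per pass.

```python
class MockPimChannel:
    """Hypothetical stand-in for one pseudo-channel; records host actions."""

    def __init__(self):
        self.log = []

    def dram_write(self, what):
        self.log.append(("DRAM_WRITE", what))

    def install_pep(self, name):
        self.log.append(("AB_INSTALL_PEP", name))

    def configure(self, m, k, n):
        self.log.append(("SET_CSRS", (m, k, n)))

    def column_command(self):
        self.log.append(("COLUMN_CMD", None))

    def dram_read(self, what):
        self.log.append(("DRAM_READ", what))


def host_gemv(ch, m, k, commands_per_pass=1):
    """Host-side sequence for a GEMV (N = 1), assuming K MAC-PEP passes."""
    ch.dram_write("input tiles")          # load operands with DRAM writes
    ch.install_pep("MAC-PEP")             # broadcast the micro-kernel (AB mode)
    ch.configure(m, k, 1)                 # mtilem, mtilek, mtilen
    for _ in range(k * commands_per_pass):
        ch.column_command()               # each command triggers one PEP step
    ch.dram_read("acc0")                  # read the result tile back
    return ch.log


log = host_gemv(MockPimChannel(), m=128, k=4)
column_cmds = sum(1 for op, _ in log if op == "COLUMN_CMD")
```

The log contains no per-element host work: apart from setup and readback, the only repeated host action is the column command.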

Measured end-to-end performance on a single Aquabolt-XL pseudo-channel (128×4096 tile):

mfmacc: up to 14.9 GFLOP/s (59.4 FLOP/cycle)
Element-wise: approximately 7–8 GFLOP/s at maximum tile size

6. Significance and Implications

The AME-PIM realization demonstrates that a JEDEC-compatible PIM device, despite ISA and reduction limitations, can natively execute a tile-level matrix ISA such as AME. By mapping register-level semantics onto HBM bank-resident buffers, and by encapsulating matrix instructions into micro-kernels executed entirely in-memory, the architecture exposes a general-purpose, ISA-level tensor accelerator with minimized host management and memory movement overhead.

A plausible implication is that future HBM-PIM devices, with richer instruction sets or native cross-lane reduction, could further close the gap to ideal throughput for matrix and tensor kernels, and efficiently support broader classes of AME instructions such as min/max and non-arithmetic operations (Venieri et al., 30 Apr 2026).

References

Venieri et al. (2026). AME-PIM: Can Memory be Your Next Tensor Accelerator?
