Matrix Multiply-Accumulate Instruction
- Matrix Multiply-Accumulate instructions are specialized operations that perform C += A×B using hardware-level primitives to optimize parallel computation.
- They are implemented across scalar, SIMD, matrix-centric, and memory-centric architectures, significantly boosting energy efficiency and throughput.
- These instructions enable advanced tiling, structured sparsity, and mixed-precision operations, crucial for accelerating deep learning and scientific computing.
Matrix Multiply-Accumulate (MAC) instructions are specialized architectural primitives implementing the operation $C \leftarrow C + A \times B$, where $A$ and $B$ are matrices or vectors and $C$ accumulates the result. MAC instructions underlie core computation in numerical linear algebra, machine learning, signal processing, and scientific computing. They are realized across diverse platforms including general-purpose CPUs, custom vector extensions, memory-centric accelerators, and novel nanodevice logic, and are among the most frequently executed kernels in high-performance workloads.
1. Formal Definition and General Properties
Matrix MAC instructions abstract the atomic update $C \leftarrow C + A \times B$ for matrices $A$, $B$, with accumulation in $C$. Formally, for classic arithmetic, the operation is

$$C_{ij} \leftarrow C_{ij} + \sum_{k} A_{ik}\, B_{kj},$$

optionally generalized to other semirings $(\oplus, \otimes)$ as in SIMD²:

$$C_{ij} \leftarrow C_{ij} \oplus \bigoplus_{k} \big( A_{ik} \otimes B_{kj} \big)$$

(Zhang et al., 2022). MAC instructions may also operate elementwise in vector or SIMD formats:

$$vd[i] \leftarrow vd[i] + vs1[i] \cdot vs2[i]$$

(Cavalcante et al., 2022).
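For orientation, a minimal scalar reference of the accumulating update in plain C (purely illustrative, not an ISA primitive; layout and naming are assumptions):

```c
// Reference semantics of C += A * B for row-major MxK and KxN operands.
void gemm_acc(int M, int N, int K,
              const float *A, const float *B, float *C) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float acc = C[i * N + j];                  // running accumulator C_ij
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];    // multiply-accumulate step
            C[i * N + j] = acc;
        }
}
```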
Modern fast algorithms (e.g., Strassen) further restructure the computation as a sequence of accumulating bilinear forms, with linear dependencies among inputs and outputs (Dumas et al., 2023). In-place variants maintain input integrity except for designated output accumulators.
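As an illustration of the bilinear-form view, a schematic (not in-place) sketch of one Strassen level accumulating into a 2×2 block; the explicit temporaries M1..M7 are the seven bilinear products, and the in-place variants of (Dumas et al., 2023) additionally reuse and then restore input/output slots:

```c
// One Strassen level as seven accumulating bilinear products: C += A * B.
// Block elements are scalars here; in practice they would be submatrices.
void strassen2x2_acc(double C[2][2], const double A[2][2], const double B[2][2]) {
    double M1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
    double M2 = (A[1][0] + A[1][1]) *  B[0][0];
    double M3 =  A[0][0]            * (B[0][1] - B[1][1]);
    double M4 =  A[1][1]            * (B[1][0] - B[0][0]);
    double M5 = (A[0][0] + A[0][1]) *  B[1][1];
    double M6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
    double M7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
    C[0][0] += M1 + M4 - M5 + M7;    // accumulation preserves prior C
    C[0][1] += M3 + M5;
    C[1][0] += M2 + M4;
    C[1][1] += M1 - M2 + M3 + M6;
}
```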
2. Architectural Realizations: Scalar, Vector, Matrix, and Memory-centric
2.1. Scalar and SIMD Designs
Scalar MAC (CPU): Implemented as `c += a * b`. SIMD ISA extensions enable parallel MAC across vector lanes. Example: RISC-V's vector MAC,

```asm
vmacc.vv vd, vs1, vs2, vm    # vd[i] += vs1[i] * vs2[i], under mask vm
```
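A hedged usage sketch of the corresponding RVV C intrinsic (`__riscv_vmacc_vv_i32m1`, naming per the RVV intrinsics v1.0 API; older toolchains drop the `__riscv_` prefix):

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// c[i] += a[i] * b[i] over n elements, strip-mined by the hardware vector length.
void vec_mac(int32_t *c, const int32_t *a, const int32_t *b, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);           // elements this pass
        vint32m1_t va = __riscv_vle32_v_i32m1(a + i, vl);  // load a
        vint32m1_t vb = __riscv_vle32_v_i32m1(b + i, vl);  // load b
        vint32m1_t vc = __riscv_vle32_v_i32m1(c + i, vl);  // load accumulator
        vc = __riscv_vmacc_vv_i32m1(vc, va, vb, vl);       // vc += va * vb (vmacc.vv)
        __riscv_vse32_v_i32m1(c + i, vc, vl);              // store back
        i += vl;
    }
}
```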
2.2. Matrix-centric Extensions
Matrix-Multiply-Assist (MMA) (POWER10): Introduces eight 512-bit accumulators, supporting instructions for zeroing, moving, and outer-product updates. For example,

```c
__builtin_mma_xvf32gerpp(acc64 A, v16b X, v16b Y);   // rank-1 update: A += X outer Y
```
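A sketch of how the GCC MMA built-ins compose into a small accumulating tile update (requires `-mcpu=power10 -mmma`; the 4×4 framing, pointer casts, and the software add-back into C are illustrative assumptions, and the VSR ordering from the disassemble step is endian-dependent):

```c
#include <altivec.h>

typedef vector unsigned char vec_t;   // 16-byte operand type passed to the ger built-ins

// C (4x4 fp32, row-major) += sum_k A_col[k] (outer) B_row[k]
void mma_tile_4x4(float C[4][4], const float A_col[4][4], const float B_row[4][4]) {
    __vector_quad acc;                           // one of the eight 512-bit accumulators
    __builtin_mma_xxsetaccz(&acc);               // zero the accumulator
    for (int k = 0; k < 4; k++) {
        vec_t x = *(const vec_t *)A_col[k];      // 4 fp32 values, reinterpreted as bytes
        vec_t y = *(const vec_t *)B_row[k];
        __builtin_mma_xvf32gerpp(&acc, x, y);    // acc += x outer y (positive-positive)
    }
    vector float rows[4];
    __builtin_mma_disassemble_acc(rows, &acc);   // spill accumulator into four VSRs
    for (int i = 0; i < 4; i++)
        *(vector float *)C[i] += rows[i];        // software accumulation into C
}
```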
MTE (vector/matrix ISA): Exposes a geometry-agnostic matrix MAC over vector-register tiles whose shape is configured through CSRs (Santana et al., 4 Jul 2025). For example,

```asm
tfmul vd, vs1, vs2   // C += A*B
```
Vector/matrix hybrid approaches (MX for RISC-V): Utilize the vector register file and FPU units with a near-FPU tile buffer, accumulating results before write-back, minimizing register file traffic and boosting energy efficiency (Perotti et al., 8 Jan 2024).
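A purely software model of the near-FPU tile-buffer dataflow described above (tile size and names are hypothetical; the point is that partial sums stay in a local buffer and the register file/memory is written only once per tile):

```c
// Tile-buffered accumulation: all K partial products for a TxT output tile
// are accumulated locally, then written back in a single pass.
#define T 4   // hypothetical tile edge; real designs size this to the FPU datapath

void mx_tile_mac(float *C, const float *A, const float *B, int N, int K,
                 int ti, int tj) {                // ti, tj: tile origin in C
    float tile[T][T] = {{0}};                     // model of the near-FPU tile buffer
    for (int k = 0; k < K; k++)                   // stream A-column / B-row pairs
        for (int i = 0; i < T; i++)
            for (int j = 0; j < T; j++)
                tile[i][j] += A[(ti + i) * K + k] * B[k * N + (tj + j)];
    for (int i = 0; i < T; i++)                   // single write-back per tile
        for (int j = 0; j < T; j++)
            C[(ti + i) * N + (tj + j)] += tile[i][j];
}
```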
2.3. Memory-centric and Systolic Designs
Computing-in-memory MACs (Count2Multiply): Directly realize multiplication-plus-accumulation as masked, parallel k-ary counter increments in DRAM subarrays, with ECC mechanisms assuring reliability. The sequence of row-clones and majority-based logic implements vector-matrix multiplication entirely inside the memory array (Lima et al., 16 Sep 2024).
Domain Wall-MTJ logic (DW-MTJ): Employs nonvolatile, pipelined MAC units with bitwise multiplication (AND gates) and accumulation (ripple-carry adders), controlled by voltage-induced magnetic anisotropy (Zogbi et al., 2023). DW-MTJ MAC arrays offer radiation hardness, fine-grained pipelining, and nonvolatile MAC storage.
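A behavioral model of the bitwise datapath described above, with AND-gate partial products and explicit ripple-carry accumulation (bit widths and the software framing are illustrative; the actual arrays realize these gates in DW-MTJ devices):

```c
#include <stdint.h>

#define WIDTH 16
// Ripple-carry addition modeled bit by bit: returns (a + b) mod 2^WIDTH.
static uint32_t ripple_add(uint32_t a, uint32_t b) {
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < WIDTH; i++) {
        uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum  |= (ai ^ bi ^ carry) << i;            // full-adder sum bit
        carry = (ai & bi) | (carry & (ai ^ bi));   // full-adder carry-out
    }
    return sum;
}

// Bit-serial multiply: each partial product is b's bit j ANDed with a, shifted.
static uint32_t and_multiply(uint32_t a, uint32_t b, int bits) {
    uint32_t prod = 0;
    for (int j = 0; j < bits; j++)
        if ((b >> j) & 1)                          // AND-gated partial product
            prod = ripple_add(prod, a << j);
    return prod;
}

// MAC: acc += a * b, built only from the gate-level primitives above.
uint32_t dw_mtj_mac(uint32_t acc, uint32_t a, uint32_t b, int bits) {
    return ripple_add(acc, and_multiply(a, b, bits));
}
```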
3. Instruction Set Encodings and Microarchitecture
3.1. Example Encodings
A selection of MAC encodings is summarized:

| ISA / Engine | Mnemonic / Syntax | Operation / MAC Domain |
|---|---|---|
| RISC-V Vector (RVV) | `vmacc.vv vd, vs1, vs2` | SIMD MAC: `vd[i] += vs1[i] * vs2[i]` |
| POWER10 MMA | `xvf32gerpp RA, RX, RY` | Rank-1 (outer-product) update into a 512-bit accumulator |
| MTE (Vector+Matrix) | `tfmul vd, vs1, vs2` | Matrix MAC: C += A×B (CSR-configured tile geometry) |
| MX (RISC-V) | `mxmacc vAcc, vA, vB` | Matrix MAC: subtile accumulation |
| IndexMAC (RVV) | `vindexmac.vx vd, vs2, rs` | Indexed MAC for structured-sparse × dense products |
| SIMD² (TensorCore) | `simd2.mmo C, A, B, op` | Semiring-generalized matrix MAC (configurable ⊕/⊗) |
| Count2Multiply (MEM) | `MACC rd, rs1, rs2, imm` | In-DRAM MAC via high-radix counting |
Instruction encodings typically comprise a custom opcode plus source/destination register fields and immediate/CSR configuration (Moreira et al., 2021, Santana et al., 4 Jul 2025, Titopoulos et al., 2023, Titopoulos et al., 17 Jan 2025).
3.2. Microarchitectural Integration
MAC microarchitectures employ:
- Parallel MAC units per vector lane, chained or pipelined (Cavalcante et al., 2022).
- Dedicated accumulator registers or tile buffers (MMA, MX, DW-MTJ).
- Matrix math engines (e.g., POWER10's MME) with per-core, register-resident local accumulation (Moreira et al., 2021).
- Indirect addressing and data movement optimizations (e.g., vindexmac.vx multiplexes vector register indices) (Titopoulos et al., 2023); an illustrative model follows this list.
- Memory-intrinsic logic (Count2Multiply): subarray-level counters, row-wise ECC, bulk logic (Lima et al., 16 Sep 2024).
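An illustrative scalar model of the indexed-MAC pattern referenced above (the exact `vindexmac.vx` semantics are ISA-specific; here a per-entry column index selects which already-resident dense row participates in the accumulation, so the dense operand is not repeatedly gathered from memory):

```c
// Structured-sparse A (values + column indices, CSR-style) times dense B,
// accumulating into dense C: a software model of index-driven MAC.
void indexed_mac(float *C, int N,
                 const float *A_val, const int *A_col, const int *A_rowptr,
                 int M, const float *B /* K x N, rows assumed register-resident */) {
    for (int i = 0; i < M; i++)
        for (int p = A_rowptr[i]; p < A_rowptr[i + 1]; p++) {
            int k   = A_col[p];                  // index operand: selects B row k
            float a = A_val[p];                  // scalar nonzero of A
            for (int j = 0; j < N; j++)          // vector MAC across the row
                C[i * N + j] += a * B[k * N + j];
        }
}
```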
4. Algorithmic Generalizations and Optimality
Matrix MAC implementations are tightly coupled with algorithmic tiling, dataflow, and semiring generalizations.
- Fast algorithms: In-place bilinear forms (e.g., Strassen-Winograd) can be recursively mapped to MAC sequences with no extra array memory and without changing the asymptotic complexity (Dumas et al., 2023).
- SIMD²: Semiring-generalized MMA supports arbitrary $(\oplus, \otimes)$ pairs, accelerating dynamic-programming, path-finding, and logical operations beyond ordinary GEMM (Zhang et al., 2022); see the min-plus sketch at the end of this section.
- Structured sparsity: Specialized index-MAC primitives (vindexmac.vx) for block-sparse × dense products eliminate repetitive loads/gathers and reduce memory traffic by up to 65% (Titopoulos et al., 17 Jan 2025, Titopoulos et al., 2023).
- Mixed precision: Asymmetric SIMD MAC (VPMADD8x4) processes $8$ outputs per instruction (INT8 × INT4 → INT16), delivering higher throughput at fixed memory bandwidth (Gope et al., 2020).
In-place accumulation, with input restoration constraints, can be applied to arbitrary bilinear forms provided each sub-function is in-place-compatible and does not use output slots as scratch (Dumas et al., 2023).
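As a concrete instance of the semiring generalization noted above, a min-plus ("tropical") matrix MAC in plain C: substituting (min, +) for (+, ×) turns the same loop nest into one relaxation step of all-pairs shortest paths (illustrative sketch; SIMD² exposes the ⊕/⊗ choice as an instruction operand):

```c
// Min-plus matrix "MAC": C_ij <- min(C_ij, min_k (A_ik + B_kj)).
// Same loop nest as C += A*B, with (+, *) replaced by (min, +).
void minplus_mac(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double acc = C[i * n + j];                   // accumulator under min
            for (int k = 0; k < n; k++) {
                double t = A[i * n + k] + B[k * n + j];  // "multiply" is +
                if (t < acc) acc = t;                    // "add" is min
            }
            C[i * n + j] = acc;
        }
}
```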
5. Hardware Efficiency, Scalability, and Reliability
MAC instruction efficiency derives from parallelism, accumulator locality, and architectural data movement reduction.
- POWER10 MMA: Delivers a substantial increase in per-core throughput and a marked drop in energy per MAC compared to POWER9 VSX, with modest area/power overhead (<12%) (Moreira et al., 2021).
- Vector MAC (Spatz): 7.9 pJ/integer MAC in RVV MACU vs. 13.1 pJ/scalar MAC, doubling energy efficiency and reducing instruction fetch bottlenecks (Cavalcante et al., 2022).
- MX (Vector FPU hybrid): +56% performance and +25% energy efficiency using near-FPU tile buffers, with no penalty to FPU utilization (Perotti et al., 8 Jan 2024).
- In-memory MAC (Count2Multiply): Outperforms prior in-DRAM approaches in speedup, GOPS/W, and GOPS/area, via high-radix bulk counting and ECC integration (Lima et al., 16 Sep 2024).
- DW-MTJ MAC: Offers nonvolatile logic, $2$–$15$ TOPS at 4–8-bit precision, and intrinsic radiation hardness with fine-grained pipelining (Zogbi et al., 2023).
- Matrix ISA (MTE): Geometry-agnostic MMA via reinterpreted vector register tiles, sustaining speedups of $1.2\times$ and above and $25$–$30$% energy-to-solution gains vs. fixed-shape ISAs (Santana et al., 4 Jul 2025).
6. Programming Models and System Integration
MAC instructions are exposed to software via:
- Low-level assembly or intrinsic functions (e.g., MMA built-ins in GCC/LLVM) (Moreira et al., 2021).
- PTX/C++ APIs for general semiring operation (SIMD²), supporting flexible algorithm acceleration (Zhang et al., 2022).
- CSR-driven matrix-tile configuration and vector-mask integration for mixed vector/matrix kernels (MTE, MX) (Santana et al., 4 Jul 2025, Perotti et al., 8 Jan 2024).
- Controller microprogramming for memory-centric MAC (Count2Multiply), integrating ECC steps and row-wise operation (Lima et al., 16 Sep 2024).
- End-to-end benchmarks show that employing MAC instructions as native primitives significantly reduces dynamic instruction count and front-end bandwidth pressure while increasing hardware utilization.
7. Applicability, Limitations, and Trade-offs
The efficiency and correctness advantages of MAC instructions depend on problem structure, hardware co-design, and algorithm compatibility.
- In-place accumulating MAC is possible only when output variables are not needed as scratch, and all inputs can be restored post-computation (Dumas et al., 2023).
- Structured-sparse matrix MAC (index-MAC) only applies when the sparsity pattern admits blockwise row/column tiling; unstructured patterns are less amenable to these optimizations (Titopoulos et al., 2023).
- Fixed-shape matrix ISAs lose efficiency on skinny or tall matrices; geometry-agnostic MAC extensions (MTE) mitigate this (Santana et al., 4 Jul 2025).
- Mixed-precision MACs depend on accumulator width and statistical overflow behavior (DNNs tolerate <0.05% overflow) (Gope et al., 2020).
- Memory-centric MAC faces reliability, ECC overhead, and device physics limits; correctness requires parity synchronization and replay protocols (Lima et al., 16 Sep 2024).
- Systolic DW-MTJ MACs offer unique robustness but are constrained by device speed and multiplier pipeline depth (Zogbi et al., 2023).
In all cases, MAC instructions remain indispensable in aligning algorithmic needs with microarchitectural capabilities, optimizing memory bandwidth, compute intensity, and device-level performance across hardware and software boundaries.