Compressed Modular Matrix Multiplication
- Compressed modular matrix multiplication is a family of methods that compress multiple modular entries into single hardware words to reduce arithmetic cost and enhance efficiency.
- Techniques such as Q-adic packing, multiword decompositions, Kronecker substitution, and CRT-based slicing optimize resource usage on modern CPUs and GPUs.
- These methods leverage algebraic, combinatorial, and structural insights to achieve scalable, high-throughput exact computations in modular arithmetic.
Compressed modular matrix multiplication refers to a family of algorithmic techniques that accelerate modular (or exact) matrix multiplication via data packing, modular decomposition, dimensionality reduction, or combinatorial compression. These approaches enable efficient use of hardware resources—especially integer and low-precision units, floating-point SIMD engines, or memory-bounded settings—when multiplying matrices over rings such as finite fields or modular integer rings. The main principles involve reducing arithmetic cost, memory bandwidth, or storage by leveraging representations that compress multiple small modular entries into a single hardware word or arithmetic operation, or that allow high-accuracy floating-point emulation by modular slicing. The field encompasses both practical engineering advances and deep connections to number theory and graph-theoretic structure.
1. Q-adic Packing and SWAR Arithmetic
Fundamental to compressed modular arithmetic is the idea of storing $k$ modular residues in a base-$q$ expansion within a single $b$-bit word, where $q = 2^c$ and $kc \le b$. Each $k$-tuple $(a_0, \dots, a_{k-1})$ is represented as $\sum_{i=0}^{k-1} a_i q^i$, permitting packed modular addition and subtraction by word-wise operations with block masks and conditional subtraction, as well as compressed dot products by a single multi-word integer multiplication and extraction of the relevant digit.
Table: Q-adic Packing Parameters

| Parameter | Symbol | Typical Value / Constraint |
|---|---|---|
| Word size | $b$ | 32, 64 |
| Modulus | $p$ | small prime, $p < q$ |
| Block bits | $c$ | $c = \log_2 q$ |
| Packing factor | $k$ | $k = \lfloor b/c \rfloor$ |
| Base | $q$ | $q = 2^c > k(p-1)^2$ |
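The packed addition with block masks and conditional subtraction can be sketched as follows. The lane layout, modulus, and helper names (`pack`, `unpack`, `swar_mod_add`) are illustrative assumptions, using one spare bit per lane to drive the lane-wise comparison:

```python
import random

# SWAR packed modular addition sketch: eight 8-bit lanes in one "word",
# with a spare top bit per lane driving the conditional subtraction.
c, k, p = 8, 8, 50                    # lane bits, lane count, modulus (2p <= 2^(c-1))
ones = sum(1 << (c * i) for i in range(k))
H = ones << (c - 1)                   # spare top bit of every lane

def pack(v):
    return sum(x << (c * i) for i, x in enumerate(v))

def unpack(w):
    return [(w >> (c * i)) & ((1 << c) - 1) for i in range(k)]

def swar_mod_add(x, y):
    s = x + y                              # lane sums < 2^(c-1): no cross-lane carry
    t = s + ((1 << (c - 1)) - p) * ones    # spare bit set exactly where s_i >= p
    g = (t & H) >> (c - 1)                 # 0/1 per lane: "subtract p here"
    return s - p * g                       # conditional subtraction, all lanes at once

random.seed(0)
a = [random.randrange(p) for _ in range(k)]
b = [random.randrange(p) for _ in range(k)]
print(unpack(swar_mod_add(pack(a), pack(b))) == [(x + y) % p for x, y in zip(a, b)])  # True
```

A production kernel would use fixed-width machine words and SIMD registers rather than Python integers, but the mask logic is the same.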
Given matrices $A$ and $B$ over $\mathbb{Z}_p$, one packs the rows of $A$ and the reversed columns of $B$, then computes each dot product using a single integer multiply, bitshift, and modular reduction. The $k$-fold inner product is thus compressed into a single operation, and total multiplication cost is reduced by a factor of up to $k$ compared to naïve methods, subject to non-overflow constraints, with explicit bounds of the form $k(p-1)^2 < q$ and $q^{2k-1} \le 2^b$ (0803.1975).
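A toy version of this packed dot product (illustrative parameters, not the paper's tuned ones) fits in a few lines:

```python
# Q-adic compressed dot product sketch: one big multiply replaces k multiplies.
p = 7       # modulus
k = 3       # dot-product length packed per word
c = 7       # block bits: q = 2^c = 128 > k*(p-1)^2 = 108, so digits never overflow
q = 1 << c

def pack(digits):
    """Base-q packing of residues (little-endian)."""
    return sum(d << (c * i) for i, d in enumerate(digits))

a = [3, 5, 2]
b = [4, 1, 6]
A = pack(a)                        # row packed directly
B = pack(list(reversed(b)))        # column packed in reverse
P = A * B                          # single integer multiplication
dot = (P >> (c * (k - 1))) & (q - 1)   # extract the q^(k-1) digit = sum a_i*b_i
print(dot % p)                     # 1  (since 3*4 + 5*1 + 2*6 = 29 and 29 mod 7 = 1)
```

Reversing one operand aligns the products $a_i b_i$ on the single digit at $q^{k-1}$, which the non-overflow bound keeps carry-free.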
2. Multiword and Blockwise Decompositions
For large primes or when leveraging floating-point arithmetic for modular matrix multiplication, multiword (blockwise) decompositions generalize Q-adic packing. Each input matrix $A$ (or $B$) with entries in $\mathbb{Z}_p$ is expanded as $A = \sum_i A_i \beta^i$, with the base $\beta$ chosen such that each submatrix $A_i$ contains coefficients bounded to $\log_2 \beta$ bits for exact representation in floating-point mantissas (Berthomieu et al., 12 Jan 2026).
- For moduli $p$ exceeding the half-mantissa threshold (e.g., 26 bits in double precision), two-word decompositions suffice to cover moduli up to the full mantissa range (53 bits).
- Each block multiplication is performed by BLAS GEMM routines, with modular reduction and scaling steps driven by the expansion exponents.
This technique enables high-throughput modular matrix multiplication on both CPUs and GPUs for moderately large $p$, with a controlled tradeoff: the number of blockwise matrix multiplications grows with the required bit size, but the method covers much larger moduli and reaches a high Gflop/s plateau (Berthomieu et al., 12 Jan 2026).
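The two-word case can be sketched with plain numpy, standing in for BLAS GEMM. The modulus, size, and split base below are illustrative choices made so that every float64 product is exact; this is not the cited paper's tuned kernel:

```python
import numpy as np

# Two-word (blockwise) decomposition sketch: exact modular GEMM through
# float64 matrix products.
p = 1048573          # ~20-bit prime: too large for exact single float64 dot products here
n = 64
beta = 1 << 10       # split base: n * beta^2 = 2^26 < 2^53, so each GEMM is exact

rng = np.random.default_rng(0)
A = rng.integers(0, p, (n, n))
B = rng.integers(0, p, (n, n))

A0, A1 = A % beta, A // beta          # A = A1*beta + A0, likewise B
B0, B1 = B % beta, B // beta
f = lambda M: M.astype(np.float64)

hi = (f(A1) @ f(B1)) % p              # four exact float64 GEMMs...
mid = (f(A1) @ f(B0) + f(A0) @ f(B1)) % p
lo = (f(A0) @ f(B0)) % p
# ...recombined with the expansion exponents, everything reduced mod p
C = ((hi * (beta * beta % p)) + (mid * beta) + lo) % p
C = C.astype(np.int64)

assert np.array_equal(C, (A @ B) % p)
```

Reducing each partial product modulo $p$ before recombination keeps the final accumulation below $2^{53}$, which is what makes the floating-point pipeline exact.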
3. Kronecker Substitution and Scalar Packing
Kronecker substitution, also referred to as scalar packing, packs entire vectors of residues into a single high-precision integer via a base-$2^b$ expansion, executes a high-precision (possibly homomorphic) multiplication, and then unpacks the inner products by digit extraction (Ramapragada et al., 20 Apr 2025).
- For modulus $p$ and vector length $n$, the block size $b$ is chosen such that $2^b > n(p-1)^2$ and $nb$ does not exceed the available word size.
- The packing is $\tilde{a} = \sum_{i=0}^{n-1} a_i 2^{bi}$ and $\tilde{b} = \sum_{i=0}^{n-1} b_{n-1-i} 2^{bi}$; their integer product contains the inner product $\sum_i a_i b_i$ as the coefficient at precisely $2^{b(n-1)}$.
- The method avoids CRT and enables dimensionality reduction in the most bandwidth- or arithmetic-bound regimes (0803.1975, Ramapragada et al., 20 Apr 2025).
Practical speedup is observed in modular product code and encrypted implementations, with packing factors of several residues per word commonly usable on 64-bit words for 8–12 bit moduli (Ramapragada et al., 20 Apr 2025).
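The same packing idea, applied to a whole length-$n$ vector inside one arbitrary-precision integer, can be sketched as follows (sizes and the modulus are illustrative, not the cited paper's parameters):

```python
# Kronecker-substitution (scalar packing) sketch: one big-integer multiply
# computes an n-length modular inner product.
p = 257
n = 8
b = (n * (p - 1) ** 2).bit_length()  # block bits: 2^b > n*(p-1)^2, so digits never overlap

a = [i * 31 % p for i in range(n)]
v = [i * 17 % p for i in range(n)]

ta = sum(x << (b * i) for i, x in enumerate(a))
tv = sum(x << (b * i) for i, x in enumerate(reversed(v)))

prod = ta * tv                                # single high-precision multiplication
inner = (prod >> (b * (n - 1))) & ((1 << b) - 1)  # digit at 2^(b*(n-1))
print(inner % p == sum(x * y for x, y in zip(a, v)) % p)  # True
```

In an encrypted setting the big-integer multiply is replaced by one homomorphic multiplication, which is exactly where the dimensionality reduction pays off.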
4. CRT-Based Modular Slicing for High-Accuracy (Ozaki II)
The Ozaki Scheme II leverages the Chinese Remainder Theorem (CRT) for efficient floating-point matrix product emulation by compressing floating-point operands to modular integers, multiplying via multiple GEMM calls over small coprime moduli, and reconstructing the integer result (Ozaki et al., 10 Apr 2025):
- Scaling/Truncation: Inputs $A, B$ (FP64) are scaled to integer matrices via diagonal scaling matrices.
- Residue Decomposition: For pairwise-coprime moduli $m_1, \dots, m_\ell$, compute $A_i = A \bmod m_i$ and $B_i = B \bmod m_i$.
- Parallel GEMMs: Compute $C_i = A_i B_i \bmod m_i$ for each $i$, with exact integer accumulation.
- CRT Reconstruction: Combine results using $C = \sum_i C_i \, M_i \, (M_i^{-1} \bmod m_i) \bmod M$, with $M = \prod_i m_i$ and $M_i = M/m_i$.
- Final Unscaling: the reconstructed integer product is mapped back to floating point by inverting the diagonal scalings.
Key properties include:
- Control of precision through the number $\ell$ of moduli $m_i$, with precision increasing linearly in $\ell$.
- Complexity is reduced relative to standard Ozaki I: instead of a number of GEMM calls quadratic in the slice count, Ozaki II performs one GEMM per modulus, i.e., $\ell$ GEMMs in total.
- Performance on NVIDIA GH200 is competitive with native FP64 GEMM throughput; on CPUs, significant speedups are reported for quadruple-precision emulation (Ozaki et al., 10 Apr 2025).
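The modular core of such CRT slicing (residue GEMMs plus reconstruction, omitting the floating-point scaling steps of the actual scheme) can be sketched as follows; the moduli and sizes are illustrative:

```python
import numpy as np
from math import prod

# CRT-sliced integer matrix product sketch: multiply over several small
# pairwise-coprime moduli, then reconstruct the exact integer result.
m = [251, 253, 255, 256]       # pairwise coprime; M = prod(m) bounds the entries
M = prod(m)

n = 16
rng = np.random.default_rng(1)
A = rng.integers(0, 100, (n, n), dtype=np.int64)
B = rng.integers(0, 100, (n, n), dtype=np.int64)
# Entries of A @ B are < n*100*100 = 160000 < M, so reconstruction is exact.

C = np.zeros((n, n), dtype=np.int64)
for mi in m:
    Ci = (A % mi) @ (B % mi) % mi            # one small-modulus GEMM per residue
    Mi = M // mi
    yi = pow(Mi, -1, mi)                     # CRT coefficient: Mi^{-1} mod mi
    C = (C + Ci * (Mi * yi)) % M

assert np.array_equal(C, A @ B)
```

The per-modulus GEMMs are independent, which is what the scheme exploits to map them onto fast low-precision hardware units in parallel.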
5. Hash-Based Polynomial Compression and Sparse Multiplication
Randomized polynomial sketching with hash and sign functions compresses the product $AB$ into a polynomial modulo $x^b$, with each matrix entry estimated from a small vector of coefficients (Pagh, 2011). The essential steps include:
- Choose $b$ buckets, 2-wise independent hash functions $h_1, h_2 : [n] \to [b]$, and random signs $s_1, s_2 : [n] \to \{-1, +1\}$.
- For each outer product $a_k b_k^{\mathsf T}$, accumulate signed monomials into polynomials $p_a(x) = \sum_i s_1(i)\,(a_k)_i\, x^{h_1(i)}$ and $p_b(x) = \sum_j s_2(j)\,(b_k)_j\, x^{h_2(j)}$, then compute and aggregate their convolutions (via FFT).
- Each entry $(AB)_{ij}$ is then estimated as $s_1(i)\, s_2(j)$ times the coefficient of $x^{(h_1(i)+h_2(j)) \bmod b}$ in the aggregated sketch, with unbiasedness and variance controlled by $b$.
- For sparse products, error-correcting codes and restricted sketches enable exact recovery when the true product is sufficiently sparse, in nearly linear time.
This approach yields a tunable tradeoff between accuracy (additive error) and computation, with arithmetic and space costs near-linear in the number of nonzero entries plus the sketch size, and can be used for fast approximate or sparse-exact matrix multiplication (Pagh, 2011).
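A count-sketch-style toy version of this pipeline (hash and sign choices, sizes, and the queried entry are all illustrative) is:

```python
import numpy as np

# Compressed matrix multiplication sketch (Pagh-style): cyclic convolution
# of signed, hashed polynomials, one outer product at a time.
rng = np.random.default_rng(2)
n, b = 8, 64                        # matrix size, number of buckets
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

h1, h2 = rng.integers(0, b, n), rng.integers(0, b, n)     # hash buckets
s1, s2 = rng.choice([-1, 1], n), rng.choice([-1, 1], n)   # random signs

sketch = np.zeros(b)
for k in range(n):                  # one outer product a_k b_k^T per step
    pa, pb = np.zeros(b), np.zeros(b)
    np.add.at(pa, h1, s1 * A[:, k])
    np.add.at(pb, h2, s2 * B[k, :])
    # cyclic convolution via FFT, aggregated over k
    sketch += np.real(np.fft.ifft(np.fft.fft(pa) * np.fft.fft(pb)))

# unbiased estimate of a single entry, here (AB)[2, 3]
est = s1[2] * s2[3] * sketch[(h1[2] + h2[3]) % b]
print(est, (A @ B)[2, 3])
```

Because the convolution is cyclic, bucket indices add modulo $b$, matching the estimator's $(h_1(i)+h_2(j)) \bmod b$ lookup; accuracy improves as $b$ grows.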
6. Structure-Aware Compression: Twin-width and Tree Decompositions
Structural graph-theoretic compression via twin-width and twin-decomposition encodes matrices as tree-based composites of small bicliques, parameterized by the twin-width (Bonnet et al., 2022). For matrices over finite fields with bounded twin-width:
- A compact tree-like encoding is constructed, with a decomposition width bounded in terms of the twin-width $d$.
- Matrix product is computed via first-order logic with modular counting (FO+MOD), using block-matrix squaring and projection, in time parameterized by the twin-width.
- For inputs given as twin-decompositions, algorithms single-exponential in the twin-width $d$ achieve fast multiplication and ultra-fast entry queries.
The key theorems establish closure under FO+MOD transductions, bounded twin-width of products, and explicit algorithms parameterized by structural complexity, enabling efficient modular computation when the input is exploitably structured (Bonnet et al., 2022).
7. Comparative Tradeoffs and Practical Impact
Compressed modular matrix multiplication achieves dimensionality reduction, improved throughput, and hardware-optimized performance by exploiting algebraic, combinatorial, and architectural constraints:
- Q-adic packing and multiword decompositions provide nearly $k$-fold speedup or improved bit-size scaling when the packing factor $k$ is maximized relative to the modulus and word size (0803.1975, Berthomieu et al., 12 Jan 2026).
- CRT-based modular slicing (Ozaki II) bridges floating-point and integer domains for ultra-fast, accurate matrix multiplication and precision-tunable emulation (Ozaki et al., 10 Apr 2025).
- Polynomial sketching enables approximate and sparse-exact multiplication with accuracy-vs-speed tunability, leveraging fast Fourier transforms (Pagh, 2011).
- Graph-structural algorithms leverage twin-width for subquadratic multiplication within families of structured matrices (Bonnet et al., 2022).
- Scalar packing/Kronecker substitution is used in privacy-preserving and encrypted regimes to compress homomorphic/packed arithmetic (Ramapragada et al., 20 Apr 2025).
These methods are not mutually exclusive—hybrid approaches combining compression, structure, and modular slicing are prominent in state-of-the-art high-performance modular arithmetic and exact linear algebra software. The choice among them depends on modulus size, matrix dimension, sparsity or structure, and target hardware.