Compressed Modular Matrix Multiplication
- Compressed modular matrix multiplication is a family of methods that compress multiple modular entries into single hardware words to reduce arithmetic cost and enhance efficiency.
- Techniques such as Q-adic packing, multiword decompositions, Kronecker substitution, and CRT-based slicing optimize resource usage on modern CPUs and GPUs.
- These methods leverage algebraic, combinatorial, and structural insights to achieve scalable, high-throughput exact computations in modular arithmetic.
Compressed modular matrix multiplication refers to a family of algorithmic techniques that accelerate modular (or exact) matrix multiplication via data packing, modular decomposition, dimensionality reduction, or combinatorial compression. These approaches enable efficient use of hardware resources—especially integer and low-precision units, floating-point SIMD engines, or memory-bounded settings—when multiplying matrices over rings such as finite fields or modular integer rings. The main principles involve reducing arithmetic cost, memory bandwidth, or storage by leveraging representations that compress multiple small modular entries into a single hardware word or arithmetic operation, or that allow high-accuracy floating-point emulation by modular slicing. The field encompasses both practical engineering advances and deep connections to number theory and graph-theoretic structure.
1. Q-adic Packing and SWAR Arithmetic
Fundamental to compressed modular arithmetic is the idea of storing $k$ modular residues in a base-$q$ expansion within a single $b$-bit word, where $q = 2^c$ and $kc \le b$. Each $k$-tuple $(a_0, \dots, a_{k-1})$ is represented as $\sum_{i=0}^{k-1} a_i q^i$, permitting packed modular addition and subtraction by word-wise operations with block masks and conditional subtraction, as well as compressed dot products by a single multi-word integer multiplication and extraction of the relevant digit.
Table: Q-adic Packing Parameters

| Parameter | Symbol | Typical Value / Constraint |
|---|---|---|
| Word size | $b$ | 32, 64 |
| Modulus | $p$ | small prime, $p < q$ |
| Block bits | $c$ | $c = \log_2 q$ |
| Packing factor | $k$ | $k = \lfloor b/c \rfloor$ |
| Base | $q$ | $q = 2^c > k(p-1)^2$ |
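The packed addition with block masks and conditional subtraction can be sketched as follows. The lane layout, modulus, and helper names (`pack`, `unpack`, `swar_mod_add`) are illustrative assumptions, using one spare bit per lane to drive the lane-wise comparison:

```python
import random

# SWAR packed modular addition sketch: eight 8-bit lanes in one "word",
# with a spare top bit per lane driving the conditional subtraction.
c, k, p = 8, 8, 50                    # lane bits, lane count, modulus (2p <= 2^(c-1))
ones = sum(1 << (c * i) for i in range(k))
H = ones << (c - 1)                   # spare top bit of every lane

def pack(v):
    return sum(x << (c * i) for i, x in enumerate(v))

def unpack(w):
    return [(w >> (c * i)) & ((1 << c) - 1) for i in range(k)]

def swar_mod_add(x, y):
    s = x + y                              # lane sums < 2^(c-1): no cross-lane carry
    t = s + ((1 << (c - 1)) - p) * ones    # spare bit set exactly where s_i >= p
    g = (t & H) >> (c - 1)                 # 0/1 per lane: "subtract p here"
    return s - p * g                       # conditional subtraction, all lanes at once

random.seed(0)
a = [random.randrange(p) for _ in range(k)]
b = [random.randrange(p) for _ in range(k)]
print(unpack(swar_mod_add(pack(a), pack(b))) == [(x + y) % p for x, y in zip(a, b)])  # True
```

A production kernel would use fixed-width machine words and SIMD registers rather than Python integers, but the mask logic is the same.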
Given matrices $A$ and $B$ over $\mathbb{Z}_p$, one packs the rows of $A$ and the reversed columns of $B$, then computes each dot product using a single integer multiply, bitshift, and modular reduction. The $k$-fold inner product is thus compressed into a single operation, and total multiplication cost is reduced by a factor of up to $k$ compared to naïve methods, subject to non-overflow constraints, with explicit bounds of the form $k(p-1)^2 < q$ and $q^{2k-1} \le 2^b$ (0803.1975).
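A toy version of this packed dot product (illustrative parameters, not the paper's tuned ones) fits in a few lines:

```python
# Q-adic compressed dot product sketch: one big multiply replaces k multiplies.
p = 7       # modulus
k = 3       # dot-product length packed per word
c = 7       # block bits: q = 2^c = 128 > k*(p-1)^2 = 108, so digits never overflow
q = 1 << c

def pack(digits):
    """Base-q packing of residues (little-endian)."""
    return sum(d << (c * i) for i, d in enumerate(digits))

a = [3, 5, 2]
b = [4, 1, 6]
A = pack(a)                        # row packed directly
B = pack(list(reversed(b)))        # column packed in reverse
P = A * B                          # single integer multiplication
dot = (P >> (c * (k - 1))) & (q - 1)   # extract the q^(k-1) digit = sum a_i*b_i
print(dot % p)                     # 1  (since 3*4 + 5*1 + 2*6 = 29 and 29 mod 7 = 1)
```

Reversing one operand aligns the products $a_i b_i$ on the single digit at $q^{k-1}$, which the non-overflow bound keeps carry-free.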
2. Multiword and Blockwise Decompositions
For large primes or when leveraging floating-point arithmetic for modular matrix multiplication, multiword (blockwise) decompositions generalize Q-adic packing. Each input matrix $A$ (or $B$) with entries in $\mathbb{Z}_p$ is expanded as $A = \sum_i A_i \beta^i$, with the base $\beta$ chosen such that each submatrix $A_i$ contains coefficients bounded to $\log_2 \beta$ bits for exact representation in floating-point mantissas (Berthomieu et al., 12 Jan 2026).
- For moduli $p$ exceeding the half-mantissa threshold (e.g., 26 bits in double precision), two-word decompositions suffice to cover moduli up to the full mantissa range (53 bits).
- Each block multiplication is performed by BLAS GEMM routines, with modular reduction and scaling steps driven by the expansion exponents.
This technique enables high-throughput modular matrix multiplication on both CPUs and GPUs for moderately large $p$, with a controlled tradeoff: the number of blockwise matrix multiplications grows with the required bit size, but the method covers much larger moduli and reaches a high Gflop/s plateau (Berthomieu et al., 12 Jan 2026).
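The two-word case can be sketched with plain numpy, standing in for BLAS GEMM. The modulus, size, and split base below are illustrative choices made so that every float64 product is exact; this is not the cited paper's tuned kernel:

```python
import numpy as np

# Two-word (blockwise) decomposition sketch: exact modular GEMM through
# float64 matrix products.
p = 1048573          # ~20-bit prime: too large for exact single float64 dot products here
n = 64
beta = 1 << 10       # split base: n * beta^2 = 2^26 < 2^53, so each GEMM is exact

rng = np.random.default_rng(0)
A = rng.integers(0, p, (n, n))
B = rng.integers(0, p, (n, n))

A0, A1 = A % beta, A // beta          # A = A1*beta + A0, likewise B
B0, B1 = B % beta, B // beta
f = lambda M: M.astype(np.float64)

hi = (f(A1) @ f(B1)) % p              # four exact float64 GEMMs...
mid = (f(A1) @ f(B0) + f(A0) @ f(B1)) % p
lo = (f(A0) @ f(B0)) % p
# ...recombined with the expansion exponents, everything reduced mod p
C = ((hi * (beta * beta % p)) + (mid * beta) + lo) % p
C = C.astype(np.int64)

assert np.array_equal(C, (A @ B) % p)
```

Reducing each partial product modulo $p$ before recombination keeps the final accumulation below $2^{53}$, which is what makes the floating-point pipeline exact.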
3. Kronecker Substitution and Scalar Packing
Kronecker substitution, also referred to as scalar packing, packs entire vectors of residues into a single high-precision integer via a base-$2^b$ expansion, executes a high-precision (possibly homomorphic) multiplication, and then unpacks the inner products by digit extraction (Ramapragada et al., 20 Apr 2025).
- For modulus $p$ and vector length $n$, the block size $b$ is chosen such that $2^b > n(p-1)^2$ and $nb$ does not exceed the available word size.
- The packing is $\tilde{a} = \sum_{i=0}^{n-1} a_i 2^{bi}$ and $\tilde{b} = \sum_{i=0}^{n-1} b_{n-1-i} 2^{bi}$; their integer product contains the inner product $\sum_i a_i b_i$ as the coefficient at precisely $2^{b(n-1)}$.
- The method avoids CRT and enables dimensionality reduction in the most bandwidth- or arithmetic-bound regimes (0803.1975, Ramapragada et al., 20 Apr 2025).
Practical speedup is observed in modular product code and encrypted implementations, with packing factors of several residues per word commonly usable on 64-bit words for 8–12 bit moduli (Ramapragada et al., 20 Apr 2025).
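The same packing idea, applied to a whole length-$n$ vector inside one arbitrary-precision integer, can be sketched as follows (sizes and the modulus are illustrative, not the cited paper's parameters):

```python
# Kronecker-substitution (scalar packing) sketch: one big-integer multiply
# computes an n-length modular inner product.
p = 257
n = 8
b = (n * (p - 1) ** 2).bit_length()  # block bits: 2^b > n*(p-1)^2, so digits never overlap

a = [i * 31 % p for i in range(n)]
v = [i * 17 % p for i in range(n)]

ta = sum(x << (b * i) for i, x in enumerate(a))
tv = sum(x << (b * i) for i, x in enumerate(reversed(v)))

prod = ta * tv                                # single high-precision multiplication
inner = (prod >> (b * (n - 1))) & ((1 << b) - 1)  # digit at 2^(b*(n-1))
print(inner % p == sum(x * y for x, y in zip(a, v)) % p)  # True
```

In an encrypted setting the big-integer multiply is replaced by one homomorphic multiplication, which is exactly where the dimensionality reduction pays off.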
4. CRT-Based Modular Slicing for High-Accuracy (Ozaki II)
The Ozaki Scheme II leverages the Chinese Remainder Theorem (CRT) for efficient floating-point matrix product emulation by compressing floating-point operands to modular integers, multiplying via multiple GEMM calls over small coprime moduli, and reconstructing the integer result (Ozaki et al., 10 Apr 2025):
- Scaling/Truncation: Inputs $A, B$ (FP64) are scaled to integer matrices via diagonal scaling matrices.
- Residue Decomposition: For pairwise-coprime moduli $m_1, \dots, m_\ell$, compute $A_i = A \bmod m_i$ and $B_i = B \bmod m_i$.
- Parallel GEMMs: Compute $C_i = A_i B_i \bmod m_i$ for each $i$, with exact integer accumulation.
- CRT Reconstruction: Combine results using $C = \sum_i C_i \, M_i \, (M_i^{-1} \bmod m_i) \bmod M$, with $M = \prod_i m_i$ and $M_i = M/m_i$.
- Final Unscaling: the reconstructed integer product is mapped back to floating point by inverting the diagonal scalings.
Key properties include:
- Control of precision through the number $\ell$ of moduli $m_i$, with precision increasing linearly in $\ell$.
- Complexity is reduced relative to standard Ozaki I: instead of a number of GEMM calls quadratic in the slice count, Ozaki II performs one GEMM per modulus, i.e., $\ell$ GEMMs in total.
- Performance on NVIDIA GH200 is competitive with native FP64 GEMM throughput; on CPUs, significant speedups are reported for quadruple-precision emulation (Ozaki et al., 10 Apr 2025).
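The modular core of such CRT slicing (residue GEMMs plus reconstruction, omitting the floating-point scaling steps of the actual scheme) can be sketched as follows; the moduli and sizes are illustrative:

```python
import numpy as np
from math import prod

# CRT-sliced integer matrix product sketch: multiply over several small
# pairwise-coprime moduli, then reconstruct the exact integer result.
m = [251, 253, 255, 256]       # pairwise coprime; M = prod(m) bounds the entries
M = prod(m)

n = 16
rng = np.random.default_rng(1)
A = rng.integers(0, 100, (n, n), dtype=np.int64)
B = rng.integers(0, 100, (n, n), dtype=np.int64)
# Entries of A @ B are < n*100*100 = 160000 < M, so reconstruction is exact.

C = np.zeros((n, n), dtype=np.int64)
for mi in m:
    Ci = (A % mi) @ (B % mi) % mi            # one small-modulus GEMM per residue
    Mi = M // mi
    yi = pow(Mi, -1, mi)                     # CRT coefficient: Mi^{-1} mod mi
    C = (C + Ci * (Mi * yi)) % M

assert np.array_equal(C, A @ B)
```

The per-modulus GEMMs are independent, which is what the scheme exploits to map them onto fast low-precision hardware units in parallel.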
5. Hash-Based Polynomial Compression and Sparse Multiplication
Randomized polynomial sketching with hash and sign functions compresses the product $AB$ into a polynomial modulo $x^b$, with each matrix entry estimated from a small vector of coefficients (Pagh, 2011). The essential steps include:
- Choose $b$ buckets, 2-wise independent hash functions $h_1, h_2 : [n] \to [b]$, and random signs $s_1, s_2 : [n] \to \{-1, +1\}$.
- For each outer product $a_k b_k^{\mathsf T}$, accumulate signed monomials into polynomials $p_a(x) = \sum_i s_1(i)\,(a_k)_i\, x^{h_1(i)}$ and $p_b(x) = \sum_j s_2(j)\,(b_k)_j\, x^{h_2(j)}$, then compute and aggregate their convolutions (via FFT).
- Each entry $(AB)_{ij}$ is then estimated as $s_1(i)\, s_2(j)$ times the coefficient of $x^{(h_1(i)+h_2(j)) \bmod b}$ in the aggregated sketch, with unbiasedness and variance controlled by $b$.
- For sparse products, error-correcting codes and restricted sketches enable exact recovery when the true product is sufficiently sparse, in nearly linear time.
This approach yields a tunable tradeoff between accuracy (additive error) and computation, with arithmetic and space costs near-linear in the number of nonzero entries plus the sketch size, and can be used for fast approximate or sparse-exact matrix multiplication (Pagh, 2011).
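A count-sketch-style toy version of this pipeline (hash and sign choices, sizes, and the queried entry are all illustrative) is:

```python
import numpy as np

# Compressed matrix multiplication sketch (Pagh-style): cyclic convolution
# of signed, hashed polynomials, one outer product at a time.
rng = np.random.default_rng(2)
n, b = 8, 64                        # matrix size, number of buckets
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

h1, h2 = rng.integers(0, b, n), rng.integers(0, b, n)     # hash buckets
s1, s2 = rng.choice([-1, 1], n), rng.choice([-1, 1], n)   # random signs

sketch = np.zeros(b)
for k in range(n):                  # one outer product a_k b_k^T per step
    pa, pb = np.zeros(b), np.zeros(b)
    np.add.at(pa, h1, s1 * A[:, k])
    np.add.at(pb, h2, s2 * B[k, :])
    # cyclic convolution via FFT, aggregated over k
    sketch += np.real(np.fft.ifft(np.fft.fft(pa) * np.fft.fft(pb)))

# unbiased estimate of a single entry, here (AB)[2, 3]
est = s1[2] * s2[3] * sketch[(h1[2] + h2[3]) % b]
print(est, (A @ B)[2, 3])
```

Because the convolution is cyclic, bucket indices add modulo $b$, matching the estimator's $(h_1(i)+h_2(j)) \bmod b$ lookup; accuracy improves as $b$ grows.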
6. Structure-Aware Compression: Twin-width and Tree Decompositions
Structural graph-theoretic compression via twin-width and twin-decomposition encodes matrices as tree-based composites of small bicliques, parameterized by the twin-width (Bonnet et al., 2022). For matrices over finite fields with bounded twin-width:
- A compact tree-like encoding is constructed, with a decomposition width bounded in terms of the twin-width $d$.
- Matrix product is computed via first-order logic with modular counting (FO+MOD), using block-matrix squaring and projection, in time parameterized by the twin-width.
- For inputs given as twin-decompositions, algorithms single-exponential in the twin-width $d$ achieve fast multiplication and ultra-fast entry queries.
The key theorems establish closure under FO+MOD transductions, bounded twin-width of products, and explicit algorithms parameterized by structural complexity, enabling efficient modular computation when the input is exploitably structured (Bonnet et al., 2022).
7. Comparative Tradeoffs and Practical Impact
Compressed modular matrix multiplication achieves dimensionality reduction, improved throughput, and hardware-optimized performance by exploiting algebraic, combinatorial, and architectural constraints:
- Q-adic packing and multiword decompositions provide nearly $k$-fold speedup or improved bit-size scaling when the packing factor $k$ is maximized relative to the modulus and word size (0803.1975, Berthomieu et al., 12 Jan 2026).
- CRT-based modular slicing (Ozaki II) bridges floating-point and integer domains for ultra-fast, accurate matrix multiplication and precision-tunable emulation (Ozaki et al., 10 Apr 2025).
- Polynomial sketching enables approximate and sparse-exact multiplication with accuracy-vs-speed tunability, leveraging fast Fourier transforms (Pagh, 2011).
- Graph-structural algorithms leverage twin-width for subquadratic multiplication within families of structured matrices (Bonnet et al., 2022).
- Scalar packing/Kronecker substitution is used in privacy-preserving and encrypted regimes to compress homomorphic/packed arithmetic (Ramapragada et al., 20 Apr 2025).
These methods are not mutually exclusive—hybrid approaches combining compression, structure, and modular slicing are prominent in state-of-the-art high-performance modular arithmetic and exact linear algebra software. The choice among them depends on modulus size, matrix dimension, sparsity or structure, and target hardware.