
Compressed Modular Matrix Multiplication

Updated 25 February 2026
  • Compressed modular matrix multiplication is a family of methods that compress multiple modular entries into single hardware words, reducing arithmetic cost and memory traffic.
  • Techniques such as Q-adic packing, multiword decompositions, Kronecker substitution, and CRT-based slicing optimize resource usage on modern CPUs and GPUs.
  • These methods leverage algebraic, combinatorial, and structural insights to achieve scalable, high-throughput exact computations in modular arithmetic.

Compressed modular matrix multiplication refers to a family of algorithmic techniques that accelerate modular (or exact) matrix multiplication via data packing, modular decomposition, dimensionality reduction, or combinatorial compression. These approaches enable efficient use of hardware resources—especially integer and low-precision units, floating-point SIMD engines, or memory-bounded settings—when multiplying matrices over rings such as finite fields or modular integer rings. The main principles involve reducing arithmetic cost, memory bandwidth, or storage by leveraging representations that compress multiple small modular entries into a single hardware word or arithmetic operation, or that allow high-accuracy floating-point emulation by modular slicing. The field encompasses both practical engineering advances and deep connections to number theory and graph-theoretic structure.

1. Q-adic Packing and SWAR Arithmetic

Fundamental to compressed modular arithmetic is the idea of storing $k$ modular residues in a base-$Q = 2^b$ expansion within a single $w$-bit word, where $p < Q$ and $k = \lfloor w/b \rfloor$. Each $k$-tuple $(a_0, \ldots, a_{k-1}) \bmod p$ is represented as $A = \sum_{i=0}^{k-1} a_i Q^i$, permitting packed modular addition and subtraction by word-wise operations with block masks and conditional subtraction, as well as compressed dot products by a single multi-word integer multiplication followed by extraction of the relevant digit.

Table: Q-adic Packing Parameters

| Parameter | Symbol | Typical value / constraint |
|---|---|---|
| Word size | $w$ | 32, 64 |
| Modulus | $p$ | $p < 2^w$ |
| Block bits | $b$ | $b \geq \lceil\log_2 p\rceil$, sized so $k(p-1)^2 < 2^b$ |
| Packing factor | $k$ | $\lfloor w/b \rfloor$ |
| Base | $Q$ | $2^b$ |

Given matrices $A$ and $B$, one packs rows of $A$ and reversed columns of $B$, then computes each dot product using a single integer multiply, bitshift, and modular reduction. The $k$-fold inner product is thus compressed into a single operation, and total multiplication cost is reduced by a factor of $k$ compared to naïve methods, subject to the non-overflow constraints $k(p-1)^2 < 2^b$ and $kb \leq w$ (0803.1975).
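The packing scheme above can be sketched in a few lines of Python. The concrete values $p = 251$, $b = 20$, $k = 3$ are illustrative choices satisfying $k(p-1)^2 < 2^b$ and $kb \leq w = 64$, not parameters taken from the cited work:

```python
# Illustrative Q-adic packing: p = 251, b = 20, k = 3 on a 64-bit word.
# Chosen so that k*(p-1)^2 = 187500 < 2^20 (no inter-digit carries).
p = 251           # modulus; entries lie in [0, p)
b = 20            # digit width in bits
Q = 1 << b        # base of the Q-adic expansion
k = 3             # residues packed per word (k*b = 60 <= 64)

def pack(residues):
    """Pack residues a_0..a_{k-1} as A = sum_i a_i * Q^i."""
    acc = 0
    for a in reversed(residues):
        acc = (acc << b) | a
    return acc

def packed_dot(avec, bvec):
    """k-term dot product via one integer multiply, shift, and reduction."""
    A = pack(avec)                        # a_0 + a_1 Q + a_2 Q^2
    B = pack(list(reversed(bvec)))        # b_2 + b_1 Q + b_0 Q^2
    digit = (A * B >> (b * (k - 1))) & (Q - 1)   # coefficient of Q^(k-1)
    return digit % p

d = packed_dot([17, 200, 3], [45, 9, 250])   # (17*45 + 200*9 + 3*250) mod 251
```

Because the digit at $Q^{k-1}$ in the product is exactly the inner product (no carries reach it under the stated bound), one shift-and-mask recovers it before the single reduction modulo $p$.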

2. Multiword and Blockwise Decompositions

For large primes, or when leveraging floating-point arithmetic for modular matrix multiplication, multiword (blockwise) decompositions generalize Q-adic packing. Each input matrix $A$ (or $B$) with entries in $\{0, \dots, p-1\}$ is expanded as $A = \sum_{i=0}^{u-1} 2^{\beta i} A_i$, with $\beta$ chosen so that each submatrix $A_i$ has coefficients bounded by $\beta$ bits, allowing exact representation in floating-point mantissas (Berthomieu et al., 12 Jan 2026).

  • For $p$ exceeding the half-mantissa threshold (e.g., 26 bits in double precision), a two-word decomposition ($u = 2$) suffices to cover moduli up to the full mantissa range ($\sim 52$ bits).
  • Each block multiplication is performed by BLAS GEMM routines, with modular reduction and scaling steps driven by the expansion exponents.

This technique enables high-throughput modular matrix multiplication on both CPUs and GPUs for moderately large $p$, with a controlled tradeoff: the number of blockwise matrix multiplications grows with the required bit size, but the method covers much larger $p$ and sustains a high Gflop/s plateau (Berthomieu et al., 12 Jan 2026).
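A rough Python/NumPy sketch of the two-word scheme follows (not the cited implementation, which uses tuned BLAS kernels); the modulus, $\beta = 20$, and the exactness bound $n(2^\beta - 1)^2 < 2^{53}$ are illustrative:

```python
import numpy as np

p = (1 << 40) - 87        # illustrative ~40-bit modulus
beta = 20                 # split each entry into two beta-bit words

def mod_matmul_2word(A, B):
    """C = A @ B mod p via four float64 GEMMs on beta-bit blocks (u = 2)."""
    n = A.shape[1]
    assert n * (2**beta - 1)**2 < 2**53   # block products exact in float64
    mask = (1 << beta) - 1
    A0, A1 = A & mask, A >> beta          # A = A0 + 2^beta * A1
    B0, B1 = B & mask, B >> beta
    def gemm(X, Y):                       # exact integer product in doubles
        return (X.astype(np.float64) @ Y.astype(np.float64)).astype(np.int64).astype(object)
    t1, t2 = pow(2, beta, p), pow(2, 2 * beta, p)
    C = gemm(A0, B0) + t1 * (gemm(A0, B1) + gemm(A1, B0)) + t2 * gemm(A1, B1)
    return (C % p).astype(np.int64)       # recombine with exact big-int arithmetic
```

The `.astype(object)` step switches recombination to Python big integers so the scaled sums cannot overflow; a production kernel would instead reduce each block product modulo $p$ before scaling.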

3. Kronecker Substitution and Scalar Packing

Kronecker substitution, also referred to as scalar packing, packs entire vectors of residues into a single high-precision integer via base-RR expansion, executes a high-precision (possibly homomorphic) multiplication, and then unpacks the inner products by digit extraction (Ramapragada et al., 20 Apr 2025).

  • For modulus $p < R$, the block size $k$ is chosen so that $k \lceil \log_2 p \rceil$ does not exceed the available word size.
  • The packing is $A = a_1 + a_2 R + \cdots + a_k R^{k-1}$ and $B = b_k + b_{k-1} R + \cdots + b_1 R^{k-1}$; their integer product $C = AB$ contains, as the coefficient of $R^{k-1}$, precisely $\sum_{i=1}^{k} a_i b_i$.
  • The method avoids CRT and enables dimensionality reduction in the most bandwidth- or arithmetic-bound regimes (0803.1975, Ramapragada et al., 20 Apr 2025).

Practical speedups are observed in modular product code and encrypted implementations, with $k = 3, 4, 5$ commonly usable for 64-bit words and 8–12-bit $p$ (Ramapragada et al., 20 Apr 2025).
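A minimal Python sketch of scalar packing for a single inner product; the base $R = 10^6$ is an illustrative choice exceeding $k(p-1)^2$, and it emphasizes that $R$ need not be a power of two:

```python
p, k = 251, 3
R = 10**6                 # base; R > k*(p-1)^2 = 187500, so digits never carry

def kron_dot(avec, bvec):
    """Read sum_i a_i*b_i off the coefficient of R^(k-1) in the product A*B."""
    A = sum(avec[i] * R**i for i in range(k))             # a_1 + a_2 R + a_3 R^2
    B = sum(bvec[k - 1 - i] * R**i for i in range(k))     # b_3 + b_2 R + b_1 R^2
    return (A * B // R**(k - 1)) % R % p                  # extract digit, reduce

d = kron_dot([10, 20, 30], [1, 2, 3])   # (10*1 + 20*2 + 30*3) mod 251
```

Digit extraction here is plain integer division and remainder, which is how the same idea applies inside homomorphic-encryption plaintext spaces where bit tricks are unavailable.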

4. CRT-Based Modular Slicing for High-Accuracy (Ozaki II)

The Ozaki Scheme II leverages the Chinese Remainder Theorem (CRT) for efficient floating-point matrix product emulation by compressing floating-point operands to modular integers, multiplying via multiple GEMM calls over small coprime moduli, and reconstructing the integer result (Ozaki et al., 10 Apr 2025):

  1. Scaling/Truncation: Inputs $A, B$ (FP64) are scaled to integer matrices $A', B'$ via diagonal scaling matrices.
  2. Residue Decomposition: For $s$ pairwise-coprime moduli $m_1, \ldots, m_s$, compute $A'^{(t)} = A' \bmod m_t$ and $B'^{(t)} = B' \bmod m_t$.
  3. Parallel GEMMs: Compute $C^{(t)} = A'^{(t)} B'^{(t)}$ for each $t$, with exact integer accumulation.
  4. CRT Reconstruction: Combine results as $Y = \sum_{t=1}^{s} C^{(t)} M_t y_t \bmod M$, where $M = \prod_{t=1}^{s} m_t$, $M_t = M/m_t$, and $y_t = M_t^{-1} \bmod m_t$.
  5. Final Unscaling: $C = D^{-1} X E^{-1}$, where $X = Y - \mathrm{round}(Y/M)\, M$ recenters the result and $D, E$ are the diagonal scaling matrices from step 1.
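The residue/GEMM/CRT core (steps 2–4) can be sketched in Python/NumPy for integer inputs; the scaling of steps 1 and 5 is omitted, and the moduli below are an illustrative pairwise-coprime set, not those of the cited work:

```python
import numpy as np
from math import prod

def crt_matmul(A, B, moduli=(251, 253, 255, 256)):
    """Exact nonnegative-integer A @ B via one float64 GEMM per modulus + CRT.

    Correct whenever every entry of A @ B lies in [0, M), M = prod(moduli).
    """
    M = prod(moduli)                       # here M = 251*253*255*256, about 2^32
    Y = 0
    for m in moduli:                       # steps 2-3: residues and small GEMMs
        Am = (A % m).astype(np.float64)    # residues < 256 keep the GEMMs exact
        Bm = (B % m).astype(np.float64)
        Cm = (Am @ Bm).astype(np.int64) % m
        Mt = M // m
        yt = pow(Mt, -1, m)                # y_t = M_t^{-1} mod m_t (Python 3.8+)
        Y = Y + Cm.astype(object) * (Mt * yt)   # step 4: CRT accumulation
    return (Y % M).astype(np.int64)
```

Each small-modulus GEMM is an ordinary floating-point matrix product whose entries stay far below $2^{53}$, which is exactly how the scheme rides existing GEMM hardware while remaining bit-exact.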

Key properties include:

  • Control of precision through $s$ and the moduli $m_t$, with precision increasing linearly in $s$.
  • Complexity reduced from $O(k^2 pqr)$ in standard Ozaki I to $O(s\, pqr)$, with $s \ll k^2/2$ for double precision.
  • Performance on NVIDIA GH200 reaches up to $1.3\times$ FP64 GEMM throughput; on CPUs, a $2.29\times$ speedup is reported for quadruple-precision emulation (Ozaki et al., 10 Apr 2025).

5. Hash-Based Polynomial Compression and Sparse Multiplication

Randomized polynomial sketching with hash and sign functions compresses the $n \times n$ product $AB$ into a polynomial modulo $x^b - 1$, with each matrix entry estimated from a small vector of coefficients (Pagh, 2011). The essential steps include:

  1. Choose $b$ buckets, 2-wise independent hash functions $h_1, h_2$, and random sign functions $s_1, s_2$.
  2. For each outer product, accumulate signed monomials into polynomials $P_1(x)$ and $P_2(x)$, then compute and aggregate their convolutions (via FFT).
  3. Each $C_{ij}$ is then estimated as $s_1(i)\, s_2(j)\, c_{(h_1(i)+h_2(j)) \bmod b}$, an unbiased estimator with variance controlled by $\|AB\|_F^2 / b$.
  4. For sparse products, error-correcting codes and restricted sketches enable exact recovery when the true product is sufficiently sparse, in nearly linear time.

This approach yields a tunable tradeoff between accuracy (additive error) and computation, with $\tilde{O}(n^2 + nb)$ arithmetic and $O(b \log n)$ space, and can be used for fast approximate or sparse-exact matrix multiplication (Pagh, 2011).
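A small NumPy sketch of the estimator (steps 1–3 above); for determinism it takes the hash and sign functions as explicit arrays rather than drawing them 2-wise independently, which is an illustrative simplification:

```python
import numpy as np

def compressed_product(A, B, b, h1, h2, s1, s2):
    """Length-b coefficient sketch c of AB, accumulated mod x^b - 1."""
    c = np.zeros(b)
    for k in range(A.shape[1]):              # one rank-1 outer product per k
        p1 = np.zeros(b)
        p2 = np.zeros(b)
        np.add.at(p1, h1, s1 * A[:, k])      # P1(x) = sum_i s1(i) A[i,k] x^h1(i)
        np.add.at(p2, h2, s2 * B[k, :])      # P2(x) = sum_j s2(j) B[k,j] x^h2(j)
        c += np.fft.ifft(np.fft.fft(p1) * np.fft.fft(p2)).real  # circular conv
    return c

def estimate(c, i, j, b, h1, h2, s1, s2):
    """Estimator for (AB)_ij from the sketch."""
    return s1[i] * s2[j] * c[(h1[i] + h2[j]) % b]
```

With collision-free choices such as `h1 = [0, 1, 2]`, `h2 = [0, 3, 6]`, and $b = 9$ for $3 \times 3$ inputs, every bucket holds a single entry and the estimates recover $AB$ exactly (up to FFT roundoff).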

6. Structure-Aware Compression: Twin-width and Tree Decompositions

Structural graph-theoretic compression via twin-width and twin-decompositions encodes matrices as tree-based composites of small bicliques, parameterized by the twin-width $d$ (Bonnet et al., 2022). For matrices over finite fields with bounded twin-width:

  • A compact tree-like encoding is constructed, with a decomposition width bounded in $d$.
  • The matrix product $AB$ is computed via first-order logic with modular counting (FO+MOD), using block-matrix squaring and projection, in time $O_{d,q}(n^2 \log n)$.
  • For $\mathbb{F}_2$ and inputs given as twin-decompositions, single-exponential algorithms in $d$ achieve $4^{d+o(d)} n$ time and ultra-fast entry queries.

The key theorems establish closure under FO+MOD transductions, bounded twin-width of products, and explicit algorithms parameterized by structural complexity, enabling efficient modular computation when the input exhibits exploitable structure (Bonnet et al., 2022).

7. Comparative Tradeoffs and Practical Impact

Compressed modular matrix multiplication achieves dimensionality reduction, improved throughput, and hardware-optimized performance by exploiting algebraic, combinatorial, and architectural constraints:

  • Q-adic packing and multiword decompositions provide nearly $k$-fold speedup or bit-size scaling when $k$ is maximized relative to modulus and word size (0803.1975, Berthomieu et al., 12 Jan 2026).
  • CRT-based modular slicing (Ozaki II) bridges floating-point and integer domains for ultra-fast, accurate matrix multiplication and precision-tunable emulation (Ozaki et al., 10 Apr 2025).
  • Polynomial sketching enables approximate and sparse-exact multiplication with accuracy-vs-speed tunability, leveraging fast Fourier transforms (Pagh, 2011).
  • Graph-structural algorithms leverage twin-width for subquadratic multiplication within families of structured matrices (Bonnet et al., 2022).
  • Scalar packing/Kronecker substitution is used in privacy-preserving and encrypted regimes to compress homomorphic/packed arithmetic (Ramapragada et al., 20 Apr 2025).

These methods are not mutually exclusive—hybrid approaches combining compression, structure, and modular slicing are prominent in state-of-the-art high-performance modular arithmetic and exact linear algebra software. The choice among them depends on modulus size, matrix dimension, sparsity or structure, and target hardware.
