
Pagh's Compressed Matrix Multiplication

Updated 16 January 2026
  • The paper introduces a method that uses Count-Sketch style sketching with FFT/FWHT to approximate outer-product computations, ensuring unbiased estimates with controlled variance.
  • It achieves efficient sparse and heavy-hitter recovery by combining randomized techniques with error-correcting codes to isolate and recover dominant matrix entries.
  • The work also explores deterministic modular packing, leveraging bit-level modular arithmetic to simultaneously process multiple residues, offering exact computations and speedups.

Pagh’s Compressed Matrix Multiplication algorithm refers to a family of techniques—both randomized and deterministic—that leverage compression, sketching, or modular packing to accelerate or memory-optimize matrix multiplication, particularly for applications in numerical linear algebra, data mining, and heavy-hitter recovery. The archetypal variants are randomized methods that sketch each outer product into a fixed-size data structure using random hashing and convolution (Count-Sketch–style), as well as deterministic packing methods operating over modular arithmetic with direct bit-level packing of multiple residues into a machine word. Distinct instantiations achieve unbiased approximate multiplication, fast recovery of dominant entries in sparse or skewed outputs, and in some regimes, exact computation with high probability and subquadratic resources.

1. Randomized Compressed Matrix Multiplication: Sketching by Hashing and FFT

The core approach, as introduced and developed by Pagh, is to approximate the matrix product $C = AB$ by compressing the computation of all rank-1 outer products $A_{*,k} B_{k,*}$ into compact Count-Sketch–like objects, enabling dimensionality reduction while preserving the ability to estimate entries of $C$ (Pagh, 2011, Kutzkov, 2012, Andersson et al., 14 Jan 2026). For two $n \times n$ real matrices $A, B$, choose two 2-wise-independent hash functions $h_1, h_2 : [n] \rightarrow [b]$ and sign functions $s_1, s_2 : [n] \rightarrow \{\pm 1\}$. Each entry pair $(A_{ik}, B_{kj})$ is mapped to a bucket $h(i, j) = (h_1(i) + h_2(j)) \bmod b$ with sign $s(i, j) = s_1(i)\, s_2(j)$. For each fixed $k$, compress the outer product using a polynomial construction and convolution:

$$P_k(x) = \sum_{i=1}^n s_1(i)\, A_{ik}\, x^{h_1(i)}, \qquad Q_k(x) = \sum_{j=1}^n s_2(j)\, B_{kj}\, x^{h_2(j)},$$

$$p(x) = \sum_{k=1}^n P_k(x)\, Q_k(x).$$

FFT-based cyclic convolution computes all $b$ bucket coefficients $c_0, \dots, c_{b-1}$ in $O(nb \log b)$ time per repetition. The estimate for $(AB)_{ij}$ is then

$$C_{ij} = s_1(i)\, s_2(j)\, c_{(h_1(i) + h_2(j)) \bmod b}.$$

This procedure yields an unbiased estimator for each entry: $\mathbb{E}[C_{ij}] = (AB)_{ij}$, with variance $\operatorname{Var}[C_{ij}] \leq \|AB\|_F^2 / b$. Taking $d = O(\log n)$ independent repetitions and aggregating by the median suppresses tail probabilities.
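The scheme above can be sketched compactly in Python with NumPy. This is an illustrative sketch, not the paper's implementation: truly random hashes stand in for the 2-wise-independent families, and the function names are invented for this example.

```python
import numpy as np

def compressed_matmul_sketch(A, B, b, rng):
    """One compressed sketch of AB over b buckets.

    Bucket t accumulates the signed mass of all entries (AB)_{ij}
    with (h1[i] + h2[j]) % b == t.
    """
    n = A.shape[0]
    h1 = rng.integers(0, b, size=n)        # bucket hashes (truly random here,
    h2 = rng.integers(0, b, size=n)        # standing in for 2-wise independence)
    s1 = rng.choice([-1.0, 1.0], size=n)   # random signs
    s2 = rng.choice([-1.0, 1.0], size=n)
    c = np.zeros(b)
    for k in range(n):                     # one outer product A[:,k] B[k,:] per step
        p = np.zeros(b)                    # coefficients of P_k(x)
        q = np.zeros(b)                    # coefficients of Q_k(x)
        np.add.at(p, h1, s1 * A[:, k])
        np.add.at(q, h2, s2 * B[k, :])
        # cyclic convolution P_k * Q_k (reduced mod x^b - 1) via FFT
        c += np.fft.irfft(np.fft.rfft(p) * np.fft.rfft(q), n=b)
    return c, h1, h2, s1, s2

def estimate_entry(sketches, i, j):
    """Median over d independent sketches of the unbiased estimate of (AB)_{ij}."""
    return float(np.median([s1[i] * s2[j] * c[(h1[i] + h2[j]) % len(c)]
                            for c, h1, h2, s1, s2 in sketches]))
```

If $AB$ has a single nonzero entry, every sketch returns it exactly (no other output mass can collide into its bucket); in general each estimate carries variance at most $\|AB\|_F^2 / b$.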

2. Exact and Sparse Recovery: Error-Correcting Codes and Heavy Hitters

When $AB$ is sparse with at most $b$ nonzeros, the same sketching approach recovers all significant entries exactly with high probability in $\tilde O(N + nb)$ time, where $N$ is the number of nonzeros in $A$ and $B$ (Pagh, 2011). The technique can be further optimized to recover only the $b$ largest entries using error-correcting codes: the bits of row and column indices are encoded, and multiple restricted sketches are computed with selected rows or columns zeroed out. Bucket values across sketches encode the approximate locations of large entries; decoding and deduplication yield candidates, and a final decompression step recovers the true values. The total time is $\tilde O(n^2 + nb + b \log n)$.
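The recovery semantics can be illustrated with a hedged toy version (function names are invented here). For brevity the bucket array is formed directly from its definition over the entries of $AB$; Pagh's algorithm produces the identical array from the per-$k$ outer-product convolutions without ever materializing $AB$. Likewise, the decoder below is a naive exhaustive scan, not the paper's error-correcting-code decoder:

```python
import numpy as np

def sketch_of_product(AB, b, rng):
    """Count-Sketch of the entries of AB: bucket t holds the signed sum of all
    (AB)_{ij} with (h1[i] + h2[j]) % b == t.  Formed from AB directly here
    for brevity only; see the lead-in note."""
    n = AB.shape[0]
    h1, h2 = rng.integers(0, b, n), rng.integers(0, b, n)
    s1, s2 = rng.choice([-1.0, 1.0], n), rng.choice([-1.0, 1.0], n)
    c = np.zeros(b)
    np.add.at(c, (h1[:, None] + h2[None, :]) % b,
              (s1[:, None] * s2[None, :]) * AB)
    return c, h1, h2, s1, s2

def recover_heavy(sketches, n, top):
    """Naive decoder: median-estimate every entry and keep the `top` largest
    in magnitude, an O(n^2 d) scan.  The paper's error-correcting-code
    decoder locates the heavy entries without this exhaustive pass."""
    est = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            est[i, j] = np.median([s1[i] * s2[j] * c[(h1[i] + h2[j]) % len(c)]
                                   for c, h1, h2, s1, s2 in sketches])
    order = np.argsort(-np.abs(est), axis=None)[:top]
    return [(int(t) // n, int(t) % n, est[int(t) // n, int(t) % n])
            for t in order]
```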

3. Algorithmic Details and Complexity

Sketch Construction:

  • For each $k$, form sign-weighted polynomial coefficient arrays of length $b$ via $h_1, h_2, s_1, s_2$.
  • Use the FFT (or, as discussed below, the FWHT) for convolution in $O(b \log b)$.
  • Maintain $d$ independent sketches in $O(db)$ space.

Entry Estimation:

  • For each query $(i, j)$, reconstruct the signed bucket index and read the corresponding bucket from each sketch.
  • Output the median of these $d$ values for high-probability correctness.

Computational Trade-offs:

  • Total time for the approximate algorithm is $\tilde O(n^2 + nb)$; for exact recovery with sparse $AB$, $\tilde O(N + nb)$ (Pagh, 2011).
  • Choosing a larger $b$ improves accuracy (by reducing collision-induced variance) but increases time and space linearly.

| Regime | Time Complexity | Space Complexity | Error / Recovery |
|---|---|---|---|
| Approximate | $\tilde O(n^2 + nb)$ | $O(db)$ | Additive error $\leq \|AB\|_F / \sqrt{b}$ |
| Exact for $b$-sparse | $\tilde O(N + nb)$ | $O(db)$ | Exact for all nonzeros |
| Heavy hitters ($b$) | $\tilde O(n^2 + nb)$ | $O(db)$ | Top $b$ entries with high probability |

4. Fast Walsh–Hadamard Transform (FWHT) Accelerated Variant

Recent work introduces an FWHT-based implementation (Andersson et al., 14 Jan 2026), replacing cyclic convolution (modulo $b$) with XOR-convolution. Here $b$ is restricted to powers of two, and the bucket index for entry $(i, j)$ becomes $h(i, j) = h_1(i) \oplus h_2(j)$. The FFT is replaced by the FWHT, which uses only $\pm 1$ multipliers and in-place updates, offering improved memory locality and constant-factor efficiency. All analytical guarantees (unbiasedness, variance bounds, exact sparse recovery) are retained.
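A minimal sketch of the XOR-convolution kernel that replaces the FFT step (function names are illustrative; a production version would transform in place rather than copy):

```python
import numpy as np

def fwht(a):
    """Fast Walsh-Hadamard transform of a length-2^m array.  Each butterfly
    uses only additions and subtractions (the +/-1 multipliers) on
    contiguous halves, which is what improves memory locality over the FFT."""
    a = a.astype(float).copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

def xor_convolve(p, q):
    """XOR-convolution r[t] = sum over u ^ v == t of p[u] * q[v].
    The FWHT diagonalizes it, and the FWHT is its own inverse up to
    division by the length b."""
    return fwht(fwht(p) * fwht(q)) / len(p)
```

In the sketch, each outer product's bucket arrays are combined with `xor_convolve` instead of cyclic convolution, and estimates are read at bucket $h_1(i) \oplus h_2(j)$.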

Empirically, the FWHT variant is up to 4× faster than FFT-based sketching and, for highly skewed product matrices, up to 40× faster than DGEMM in Intel MKL. Under "heavy-hitter" or "light" output conditions, nearly all large entries are accurately recovered; for more challenging output distributions, reliability remains high for the largest entries, even if accuracy on the minor entries can be less than perfect (Andersson et al., 14 Jan 2026).

5. Deterministic Modular Packing: Dumas–Fousse–Salvy’s Scheme

Orthogonal to randomized sketching, compressed modular matrix multiplication stores multiple residues $\bmod\ p$ in a single machine word and exploits integer multiplication to obtain multiple dot products simultaneously (0803.1975). Pack $k$ residues, using $B \geq \lceil \log_2 p \rceil$ bits each, into a word $W = \sum_{i=0}^{k-1} a_i 2^{iB}$. The dot product $\sum_{i=0}^{k-1} a_i b_i$ is obtained as the middle coefficient of the full integer product $P = W \cdot R$, where $R$ packs $b_0, \ldots, b_{k-1}$ in reverse order. Extraction and reduction are performed via bit-masking and modular arithmetic, with precise parameter constraints (to prevent inter-block carries or overflows):

  • No carry propagation: $k(p-1)^2 < 2^B$
  • No word overflow: $kB + \log_2 p \leq w$, for machine word size $w$
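A toy rendering of the packing trick under these constraints (`packed_dot` is an illustrative name, not from the paper; Python's arbitrary-precision integers stand in for the $w$-bit machine word):

```python
def packed_dot(a, b, p, B):
    """Packed modular dot product in the Dumas-Fousse-Salvy style: pack the
    residues of `a` at increasing bit positions and those of `b` in reverse,
    multiply once, and read sum(a[i]*b[i]) mod p off the middle B-bit block.
    Python's big integers stand in for a w-bit machine word."""
    k = len(a)
    assert k * (p - 1) ** 2 < 2 ** B, "carry-freedom constraint violated"
    W = sum(ai << (i * B) for i, ai in enumerate(a))
    R = sum(bi << ((k - 1 - i) * B) for i, bi in enumerate(b))
    P = W * R  # the coefficient of 2^((k-1)B) is exactly sum(a[i]*b[i])
    return ((P >> ((k - 1) * B)) & ((1 << B) - 1)) % p
```

The block at position $k-1$ collects precisely the terms with matching indices, because $a_i$ at position $i$ meets $b_j$ at position $k-1-j$, landing at $i + (k-1-j) = k-1$ exactly when $i = j$; the carry constraint keeps neighboring blocks from contaminating it.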

Packed modular addition, subtraction, and scalar multiplication are implemented via combinations of bitwise operations and modular reductions (often batch-reduced via REDQ), giving substantial speedups and deterministic, exact answers whenever the parameter bounds are respected. Matrix multiplication proceeds by dividing the matrices into $k$-tuples, compressing/padding, and processing via the above routines, for asymptotic speedups of roughly $k = \lfloor w/B \rfloor$ (0803.1975).

6. Practical Considerations, Empirical Results, and Trade-offs

Compressed matrix multiplication sketches are efficient in both time and space, provided $b$ is chosen judiciously to balance error against overhead. In approximate settings, $b = O(\|AB\|_F^2 / \epsilon^2)$ suffices for additive error $\pm \epsilon$ (with high probability over repetitions). For sparse outputs, the techniques naturally recover all nonzeros, or the largest entries, with minimal total work.

FWHT-based implementations improve practical performance via better cache efficiency and fuller use of arithmetic hardware (Andersson et al., 14 Jan 2026). Key optimizations involve memory layout, parallelization (OpenMP), and in-place transforms. Deterministic packing methods, while restricted to arithmetic modulo small primes and to suitable values of $k$ and $p$, offer exact arithmetic, tight memory use, and speedups independent of output structure (0803.1975).

7. Relation to Other Compression and Data Stream Techniques

Pagh’s randomized sketch extends classic Count-Sketch and streaming heavy-hitter methods by introducing polynomial hashing and FFT/FWHT-accelerated convolution at the outer-product level (Pagh, 2011, Kutzkov, 2012). The modular-packing variant, by contrast, uses no randomness or hashing, but leverages low-level machine arithmetic and polynomial convolution. Both methods (randomized polynomial hashing and deterministic bit-packing) view the dot product as a middle coefficient of a product of polynomials, but with distinct choices of base ring: the former in a finite field or the complex roots of unity (FFT), the latter in $\mathbb{Z}$ via base-$2^B$ packing embedded in hardware arithmetic (0803.1975).

Both approaches can be interpreted as compressing the index space to accelerate search for dominant output entries or to process multiple products simultaneously, but they differ in randomness, error tolerance, exactness, and suitability for different matrix or hardware regimes.
