
Pagh's Compressed Matrix Multiplication

Updated 16 January 2026
  • The paper introduces a method that uses Count-Sketch style sketching with FFT/FWHT to approximate outer-product computations, ensuring unbiased estimates with controlled variance.
  • It achieves efficient sparse and heavy-hitter recovery by combining randomized techniques with error-correcting codes to isolate and recover dominant matrix entries.
  • The work also explores deterministic modular packing, leveraging bit-level modular arithmetic to simultaneously process multiple residues, offering exact computations and speedups.

Pagh’s Compressed Matrix Multiplication algorithm refers to a family of techniques—both randomized and deterministic—that leverage compression, sketching, or modular packing to accelerate or memory-optimize matrix multiplication, particularly for applications in numerical linear algebra, data mining, and heavy-hitter recovery. The archetypal variants are randomized methods that sketch each outer product into a fixed-size data structure using random hashing and convolution (Count-Sketch–style), as well as deterministic packing methods operating over modular arithmetic with direct bit-level packing of multiple residues into a machine word. Distinct instantiations achieve unbiased approximate multiplication, fast recovery of dominant entries in sparse or skewed outputs, and in some regimes, exact computation with high probability and subquadratic resources.

1. Randomized Compressed Matrix Multiplication: Sketching by Hashing and FFT

The core approach, as introduced and developed by Pagh, is to approximate the matrix product $C = AB$ by compressing the computation of all rank-1 outer products $A_{*,k} B_{k,*}$ into compact Count-Sketch–like objects, enabling dimensionality reduction while preserving the ability to estimate entries of $C$ (Pagh, 2011, Kutzkov, 2012, Andersson et al., 14 Jan 2026). For two $n \times n$ real matrices $A, B$, choose two 2-wise-independent hash functions $h_1, h_2 : [n] \rightarrow [b]$ and sign functions $s_1, s_2 : [n] \rightarrow \{\pm 1\}$. Each entry pair $(A_{ik}, B_{kj})$ is mapped to a bucket $h(i, j) = (h_1(i) + h_2(j)) \bmod b$ with sign $s(i, j) = s_1(i)\, s_2(j)$. For each fixed $k$, compress the outer product using a polynomial construction and convolution:

$$P_k(x) = \sum_{i=1}^n s_1(i)\, A_{ik}\, x^{h_1(i)}, \qquad Q_k(x) = \sum_{j=1}^n s_2(j)\, B_{kj}\, x^{h_2(j)},$$

$$p(x) = \sum_{k=1}^n P_k(x)\, Q_k(x).$$

FFT-based cyclic convolution computes all $b$ bucket coefficients $c_0, \dots, c_{b-1}$ in $O(nb \log b)$ time per repetition. The estimate for $(AB)_{ij}$ is then

$$C_{ij} = s_1(i)\, s_2(j)\, c_{(h_1(i) + h_2(j)) \bmod b}.$$

This procedure yields an unbiased estimator for each entry: $\mathbb{E}[C_{ij}] = (AB)_{ij}$, with variance $\operatorname{Var}[C_{ij}] \leq \|AB\|_F^2 / b$. Taking $d = O(\log n)$ independent repetitions and aggregating by the median suppresses tail probabilities.
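The scheme above can be sketched compactly in Python with NumPy. This is an illustrative sketch, not the paper's implementation: truly random hashes stand in for the 2-wise-independent families, and the function names are invented for this example.

```python
import numpy as np

def compressed_matmul_sketch(A, B, b, rng):
    """One compressed sketch of AB over b buckets.

    Bucket t accumulates the signed mass of all entries (AB)_{ij}
    with (h1[i] + h2[j]) % b == t.
    """
    n = A.shape[0]
    h1 = rng.integers(0, b, size=n)        # bucket hashes (truly random here,
    h2 = rng.integers(0, b, size=n)        # standing in for 2-wise independence)
    s1 = rng.choice([-1.0, 1.0], size=n)   # random signs
    s2 = rng.choice([-1.0, 1.0], size=n)
    c = np.zeros(b)
    for k in range(n):                     # one outer product A[:,k] B[k,:] per step
        p = np.zeros(b)                    # coefficients of P_k(x)
        q = np.zeros(b)                    # coefficients of Q_k(x)
        np.add.at(p, h1, s1 * A[:, k])
        np.add.at(q, h2, s2 * B[k, :])
        # cyclic convolution P_k * Q_k (reduced mod x^b - 1) via FFT
        c += np.fft.irfft(np.fft.rfft(p) * np.fft.rfft(q), n=b)
    return c, h1, h2, s1, s2

def estimate_entry(sketches, i, j):
    """Median over d independent sketches of the unbiased estimate of (AB)_{ij}."""
    return float(np.median([s1[i] * s2[j] * c[(h1[i] + h2[j]) % len(c)]
                            for c, h1, h2, s1, s2 in sketches]))
```

If $AB$ has a single nonzero entry, every sketch returns it exactly (no other output mass can collide into its bucket); in general each estimate carries variance at most $\|AB\|_F^2 / b$.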

2. Exact and Sparse Recovery: Error-Correcting Codes and Heavy Hitters

When $AB$ is sparse with at most $b$ nonzeros, the same sketching approach recovers all significant entries exactly with high probability in $\tilde O(N + nb)$ time, where $N$ is the number of nonzeros in $A$ and $B$ (Pagh, 2011). The technique can be further optimized to recover only the $b$ largest entries using error-correcting codes: the bits of row and column indices are encoded, and multiple restricted sketches are computed with selected rows or columns zeroed out. Bucket values across sketches encode the approximate locations of large entries; decoding and deduplication yield candidates, and a final decompression step recovers the true values. The total time is $\tilde O(n^2 + nb + b \log n)$.
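The recovery semantics can be illustrated with a hedged toy version (function names are invented here). For brevity the bucket array is formed directly from its definition over the entries of $AB$; Pagh's algorithm produces the identical array from the per-$k$ outer-product convolutions without ever materializing $AB$. Likewise, the decoder below is a naive exhaustive scan, not the paper's error-correcting-code decoder:

```python
import numpy as np

def sketch_of_product(AB, b, rng):
    """Count-Sketch of the entries of AB: bucket t holds the signed sum of all
    (AB)_{ij} with (h1[i] + h2[j]) % b == t.  Formed from AB directly here
    for brevity only; see the lead-in note."""
    n = AB.shape[0]
    h1, h2 = rng.integers(0, b, n), rng.integers(0, b, n)
    s1, s2 = rng.choice([-1.0, 1.0], n), rng.choice([-1.0, 1.0], n)
    c = np.zeros(b)
    np.add.at(c, (h1[:, None] + h2[None, :]) % b,
              (s1[:, None] * s2[None, :]) * AB)
    return c, h1, h2, s1, s2

def recover_heavy(sketches, n, top):
    """Naive decoder: median-estimate every entry and keep the `top` largest
    in magnitude, an O(n^2 d) scan.  The paper's error-correcting-code
    decoder locates the heavy entries without this exhaustive pass."""
    est = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            est[i, j] = np.median([s1[i] * s2[j] * c[(h1[i] + h2[j]) % len(c)]
                                   for c, h1, h2, s1, s2 in sketches])
    order = np.argsort(-np.abs(est), axis=None)[:top]
    return [(int(t) // n, int(t) % n, est[int(t) // n, int(t) % n])
            for t in order]
```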

3. Algorithmic Details and Complexity

Sketch Construction:

  • For each $k$, form sign-weighted polynomial coefficient arrays of length $b$ via $h_1, h_2, s_1, s_2$.
  • Use the FFT (or, as discussed below, the FWHT) for convolution in $O(b \log b)$.
  • Maintain $d$ independent sketches in $O(db)$ space.

Entry Estimation:

  • For each query $(i, j)$, reconstruct the signed bucket index and read the corresponding bucket from each sketch.
  • Output the median of these $d$ values for high-probability correctness.

Computational Trade-offs:

  • Total time for the approximate algorithm is $\tilde O(n^2 + nb)$; for exact recovery with sparse $AB$, $\tilde O(N + nb)$ (Pagh, 2011).
  • Choosing a larger $b$ improves accuracy (by reducing collision-induced variance) but increases time and space linearly.

| Regime | Time Complexity | Space Complexity | Error / Recovery |
|---|---|---|---|
| Approximate | $\tilde O(n^2 + nb)$ | $O(db)$ | Additive error $\leq \|AB\|_F / \sqrt{b}$ |
| Exact for $b$-sparse | $\tilde O(N + nb)$ | $O(db)$ | Exact for all nonzeros |
| Heavy hitters ($b$) | $\tilde O(n^2 + nb)$ | $O(db)$ | Top $b$ entries with high probability |

4. Fast Walsh–Hadamard Transform (FWHT) Accelerated Variant

Recent work introduces an FWHT-based implementation (Andersson et al., 14 Jan 2026), replacing cyclic convolution (modulo $b$) with XOR-convolution. Here $b$ is restricted to powers of two, and the bucket index for entry $(i, j)$ becomes $h(i, j) = h_1(i) \oplus h_2(j)$. The FFT is replaced by the FWHT, which uses only $\pm 1$ multipliers and in-place updates, offering improved memory locality and constant-factor efficiency. All analytical guarantees (unbiasedness, variance bounds, exact sparse recovery) are retained.
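A minimal sketch of the XOR-convolution kernel that replaces the FFT step (function names are illustrative; a production version would transform in place rather than copy):

```python
import numpy as np

def fwht(a):
    """Fast Walsh-Hadamard transform of a length-2^m array.  Each butterfly
    uses only additions and subtractions (the +/-1 multipliers) on
    contiguous halves, which is what improves memory locality over the FFT."""
    a = a.astype(float).copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

def xor_convolve(p, q):
    """XOR-convolution r[t] = sum over u ^ v == t of p[u] * q[v].
    The FWHT diagonalizes it, and the FWHT is its own inverse up to
    division by the length b."""
    return fwht(fwht(p) * fwht(q)) / len(p)
```

In the sketch, each outer product's bucket arrays are combined with `xor_convolve` instead of cyclic convolution, and estimates are read at bucket $h_1(i) \oplus h_2(j)$.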

Empirically, the FWHT variant is up to 4× faster than FFT-based sketching and, for highly skewed product matrices, up to 40× faster than DGEMM in Intel MKL. Under "heavy-hitter" or "light" output conditions, nearly all large entries are accurately recovered; for more challenging output distributions, reliability remains high for the largest entries, even if accuracy on the minor entries can be less than perfect (Andersson et al., 14 Jan 2026).

5. Deterministic Modular Packing: Dumas–Fousse–Salvy’s Scheme

Orthogonal to randomized sketching, compressed modular matrix multiplication stores multiple residues $\bmod\ p$ in a single machine word and exploits integer multiplication to obtain multiple dot products simultaneously (0803.1975). Pack $k$ residues, using $B \geq \lceil \log_2 p \rceil$ bits each, into a word $W = \sum_{i=0}^{k-1} a_i 2^{iB}$. The dot product $\sum_{i=0}^{k-1} a_i b_i$ is obtained as the middle coefficient of the full integer product $P = W \cdot R$, where $R$ packs $b_0, \ldots, b_{k-1}$ in reverse order. Extraction and reduction are performed via bit-masking and modular arithmetic, with precise parameter constraints (to prevent inter-block carries or overflows):

  • No carry propagation: $k(p-1)^2 < 2^B$
  • No word overflow: $kB + \log_2 p \leq w$, for machine word size $w$
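A toy rendering of the packing trick under these constraints (`packed_dot` is an illustrative name, not from the paper; Python's arbitrary-precision integers stand in for the $w$-bit machine word):

```python
def packed_dot(a, b, p, B):
    """Packed modular dot product in the Dumas-Fousse-Salvy style: pack the
    residues of `a` at increasing bit positions and those of `b` in reverse,
    multiply once, and read sum(a[i]*b[i]) mod p off the middle B-bit block.
    Python's big integers stand in for a w-bit machine word."""
    k = len(a)
    assert k * (p - 1) ** 2 < 2 ** B, "carry-freedom constraint violated"
    W = sum(ai << (i * B) for i, ai in enumerate(a))
    R = sum(bi << ((k - 1 - i) * B) for i, bi in enumerate(b))
    P = W * R  # the coefficient of 2^((k-1)B) is exactly sum(a[i]*b[i])
    return ((P >> ((k - 1) * B)) & ((1 << B) - 1)) % p
```

The block at position $k-1$ collects precisely the terms with matching indices, because $a_i$ at position $i$ meets $b_j$ at position $k-1-j$, landing at $i + (k-1-j) = k-1$ exactly when $i = j$; the carry constraint keeps neighboring blocks from contaminating it.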

Packed modular addition, subtraction, and scalar multiplication are implemented via combinations of bitwise operations and modular reductions (often batch-reduced via REDQ), giving substantial speedups and deterministic, exact answers whenever the parameter bounds are respected. Matrix multiplication proceeds by dividing the matrices into $k$-tuples, compressing/padding, and processing via the above routines, for asymptotic speedups of roughly $k = \lfloor w/B \rfloor$ (0803.1975).

6. Practical Considerations, Empirical Results, and Trade-offs

Compressed matrix multiplication sketches are efficient in both time and space, provided $b$ is chosen judiciously to balance error against overhead. In approximate settings, $b = O(\|AB\|_F^2 / \epsilon^2)$ suffices for additive error $\pm \epsilon$ (with high probability over repetitions). For sparse outputs, the techniques naturally recover all nonzeros, or the largest entries, with minimal total work.

FWHT-based implementations improve practical performance via better cache efficiency and fuller use of arithmetic hardware (Andersson et al., 14 Jan 2026). Key optimizations involve memory layout, parallelization (OpenMP), and in-place transforms. Deterministic packing methods, while restricted to arithmetic modulo small primes and to suitable values of $k$ and $p$, offer exact arithmetic, tight memory use, and speedups independent of output structure (0803.1975).

7. Relation to Other Compression and Data Stream Techniques

Pagh’s randomized sketch extends classic Count-Sketch and streaming heavy-hitter methods by introducing polynomial hashing and FFT/FWHT-accelerated convolution at the outer-product level (Pagh, 2011, Kutzkov, 2012). The modular-packing variant, by contrast, uses no randomness or hashing, but leverages low-level machine arithmetic and polynomial convolution. Both methods (randomized polynomial hashing and deterministic bit-packing) view the dot product as a middle coefficient of a product of polynomials, but with distinct choices of base ring: the former in a finite field or the complex roots of unity (FFT), the latter in $\mathbb{Z}$ via base-$2^B$ packing embedded in hardware arithmetic (0803.1975).

Both approaches can be interpreted as compressing the index space to accelerate search for dominant output entries or to process multiple products simultaneously, but they differ in randomness, error tolerance, exactness, and suitability for different matrix or hardware regimes.
