Universal Compression Theorem
- The Universal Compression Theorem is a framework that guarantees near-entropy-optimal compression by using moment matching to capture low-order symmetric statistics.
- It demonstrates that compressed representations can preserve both final outputs and full training dynamics in neural networks through careful moment preservation.
- The theory enables practical applications like memory-efficient deep learning and active data selection, offering explicit error bounds and resource savings.
The Universal Compression Theorem formalizes the asymptotic optimality of compression algorithms that operate without prior knowledge of the source, guaranteeing that such systems can universally achieve performance near the fundamental entropy limits across broad classes of individual sequences, sources, and data models. This principle has been refined and extended by a diverse body of work spanning classical finite‐state compressors, universal code constructions for power‐law and structured sources, advanced settings with side information and distributed systems, and, more recently, permutation‐invariant models that underlie scaling laws in modern neural networks.
1. Symmetry, Sufficient Statistics, and Theoretical Foundations
A central insight underpinning the Universal Compression Theorem is that for many function classes relevant to data compression and learning, especially those that are permutation-invariant, the value of the function (e.g., the loss or compressed length) is determined by low-order symmetric statistics of the objects (such as power sums or empirical moments). In the context of neural networks or dataset compression, this manifests in the following canonical form:
$$F(x_1, \dots, x_n) = \rho\!\left(\frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\right),$$
where each $x_i \in \mathbb{R}^d$, and $\rho$, $\phi$ are smooth, analytic functions.
By leveraging classical results from the theory of symmetric polynomials (for scalar inputs) and their extensions to the multivariate case, any symmetric polynomial can be represented uniquely as a function of the power sums $p_k = \sum_{i=1}^{n} x_i^k$ or, more generally, of the tensor moments
$$M_k = \frac{1}{n}\sum_{i=1}^{n} x_i^{\otimes k} \quad \text{for } k = 1, \dots, K.$$
The order $K$ to which moments must be matched is governed by the function's smoothness (e.g., via the Taylor expansion truncation error).
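As a concrete, minimal illustration of this sufficiency (a standalone sketch, not code from the paper), the snippet below checks Newton's identity $e_2 = (p_1^2 - p_2)/2$ numerically and confirms that the power sums are unchanged under permutation of the inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=6)                      # six scalar "objects"

# Power sums p_k = sum_i x_i^k are the low-order symmetric statistics
p = {k: np.sum(x ** k) for k in (1, 2, 3)}

# Elementary symmetric polynomial e_2 = sum_{i<j} x_i x_j, computed directly ...
e2_direct = sum(x[i] * x[j] for i in range(len(x)) for j in range(i + 1, len(x)))
# ... and recovered from power sums via Newton's identity e_2 = (p_1^2 - p_2) / 2
e2_from_p = (p[1] ** 2 - p[2]) / 2
assert np.isclose(e2_direct, e2_from_p)

# Permuting the objects changes nothing: the power sums are a sufficient statistic
x_perm = rng.permutation(x)
assert all(np.isclose(np.sum(x_perm ** k), p[k]) for k in (1, 2, 3))
```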
Tchakaloff's theorem further guarantees the existence of a reweighted subset (with support size at most the number of matched moments, $m \le \binom{d+K}{K}$) that exactly preserves these low-order moments, thereby enabling compression of the original collection of $n$ objects to $m \ll n$ representatives, while exactly or approximately preserving the value of the target function (e.g., compression ratio, loss).
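The following sketch illustrates this kind of reweighted subset in $d = 2$ (an illustration under assumed setup, not the paper's construction): it uses `scipy.optimize.nnls`, whose active-set solver in practice returns weights supported on no more points than there are moment constraints, and then checks that a permutation-invariant function built from a degree-$\le K$ polynomial feature map is preserved:

```python
import numpy as np
from itertools import product
from scipy.optimize import nnls

rng = np.random.default_rng(1)
n, d, K = 500, 2, 3
X = rng.normal(size=(n, d))                        # original collection of n objects

# All monomial exponents of total degree <= K (the degree-0 row fixes the total weight)
exps = [e for e in product(range(K + 1), repeat=d) if sum(e) <= K]

# A[j, i] = j-th monomial evaluated at X[i]; b = empirical moments (uniform weights)
A = np.array([np.prod(X ** np.array(e), axis=1) for e in exps])
b = A.mean(axis=1)

# Non-negative least squares returns a sparse weight vector: a Tchakaloff-style
# reweighted subset whose support is (at most) the number of moment constraints
w, _ = nnls(A, b)
support = np.flatnonzero(w > 1e-10)
print(f"{n} objects -> {len(support)} weighted representatives "
      f"(<= {len(exps)} moment constraints)")
assert np.allclose(A @ w, b)                       # low-order moments preserved

# A permutation-invariant F built from a degree-<=K feature map is preserved too
def phi(Z):                                        # degree-3 polynomial feature map
    return Z[:, 0] * Z[:, 1] + Z[:, 0] ** 3

F_full = np.tanh(phi(X).mean())
F_comp = np.tanh(w[support] @ phi(X[support]))
assert np.isclose(F_full, F_comp)
```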
2. Dynamical Lottery Ticket Hypothesis
A key application of this structural theory is to the so-called "dynamical lottery ticket hypothesis." The theorem shows that for any permutation-symmetric neural network model (or loss function) and any training map (e.g., a sequence of SGD steps), both the output and the entire learning dynamics are determined by the empirical distribution of the parameters. By applying moment-matching compression to the initial parameters, one constructs a compressed, weighted set of parameters $\{(\tilde\theta_j, w_j)\}_{j=1}^{m}$ (of size $m \le \binom{d+K}{K}$ for $n$ neurons of dimension $d$) that, under an adjusted dynamics, yields an output and trajectory within any prescribed error $\epsilon > 0$ as the matched-moment order $K$ grows:
$$\sup_{t} \big| F(\theta_t) - \tilde F(\tilde\theta_t) \big| \le \epsilon,$$
where $\tilde\theta_t$ denotes the compressed dynamics, which respects the weights of the compressed components (e.g., by scaling gradients appropriately).
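A hedged sketch of such weight-adjusted dynamics (a toy model with scalar "neurons" and an illustrative squared loss, not the paper's experiments): the full and compressed parameter sets evolve under the same velocity field, with weights entering only through the weighted average, and the loss trajectories stay close throughout training.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(2)
n, m, lr, steps, y = 5000, 32, 0.1, 200, 0.3

theta = rng.normal(size=n)                         # full set of scalar "neurons"
centers, labels = kmeans2(theta.reshape(-1, 1), m, minit="++", seed=3)
theta_c = centers.ravel()                          # m compressed representatives
w_c = np.bincount(labels, minlength=m) / n         # cluster fractions as weights
w_full = np.full(n, 1.0 / n)

def step(params, weights):
    """One step of gradient flow on L = (sum_i weights_i * tanh(params_i) - y)^2.
    Each particle moves along the velocity field, i.e. its raw gradient rescaled
    by 1/weight -- the weight-adjusted dynamics described above."""
    pred = weights @ np.tanh(params)
    velocity = 2.0 * (pred - y) * (1.0 - np.tanh(params) ** 2)
    return params - lr * velocity, (pred - y) ** 2

gap = 0.0
for _ in range(steps):
    theta, loss_full = step(theta, w_full)
    theta_c, loss_comp = step(theta_c, w_c)
    gap = max(gap, abs(loss_full - loss_comp))

print(f"n={n} vs m={m}: max loss gap along the whole trajectory = {gap:.2e}")
```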
This result establishes a strong (i.e., dynamical) form of the lottery ticket hypothesis: not only does there exist a subnetwork achieving similar final performance, but the entire training path and all intermediate states are closely preserved under severe width compression.
3. Implications for Neural and Data Scaling Laws
Traditional neural scaling laws posit a slow power-law decay of error as a function of resource size, $\varepsilon(N) \propto N^{-\alpha}$, where $N$ denotes parameter count or dataset size. The universal compression theory demonstrates that, due to permutation symmetry, one can compress $n$ objects to $m \ll n$ and preserve the crucial statistical features (moments) to any prescribed degree, yielding
$$\varepsilon(m) \propto \exp\!\big(-c\, m^{\beta}\big) \quad \text{for some } \beta, c > 0.$$
This boosts the scaling from a slow power law to a stretched exponential, with the error decreasing superpolynomially rather than polynomially. The achievable error rate is controlled by the truncation order $K$ in the moment matching, $\varepsilon \lesssim \delta^{K+1}$, where $\delta$ denotes the maximal cluster diameter following compression.
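To see the practical gap between the two regimes, a small back-of-the-envelope comparison (with illustrative constants $\alpha = \beta = 0.5$, $c = 1$, not fitted values) of the resource size needed to reach a target error of $10^{-3}$:

```python
import numpy as np

alpha, beta, c = 0.5, 0.5, 1.0         # illustrative exponents, not fitted values
target = 1e-3                          # desired error

# Power law  eps = N^(-alpha)              ->  N = target^(-1/alpha)
N_power = target ** (-1.0 / alpha)
# Stretched exponential  eps = exp(-c m^beta)  ->  m = (ln(1/target) / c)^(1/beta)
m_stretched = (np.log(1.0 / target) / c) ** (1.0 / beta)

print(f"power law needs N ~ {N_power:.0f}")                    # ~1,000,000
print(f"stretched exponential needs m ~ {m_stretched:.0f}")    # ~48
```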
A plausible implication is that, when designing large models or datasets for which the output is governed by permutation-invariant objectives, one can achieve substantial resource savings without loss in attainable error rates, provided the compression is carefully constructed to match moment structure.
4. Constructive Compression via Moment Matching
The constructive aspect of the universal compression theorem relies on optimal moment matching—replacing the $n$ objects with a reweighted set of $m$ "virtual points" that match the empirical moments up to order $K$. The size $m$ required is upper bounded by the number of monomials in $d$ variables of degree at most $K$ ($m \le \binom{d+K}{K}$). Tchakaloff's theorem ensures existence, but finding such a set is NP-hard in general; thus, heuristic clustering algorithms (such as k-means, with $k$ tuned to the number of moments) are employed in practice.
The compressed objects are then assigned weights, and both evaluation and gradient-based optimization are performed respecting these weights, e.g., scaling the update to each unique representative $\tilde x_j$ by its weight $w_j$. The error bounds for compression are controlled by the smoothness of $\rho$ and $\phi$ and by the quality of moment matching.
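A minimal sketch of this practical pipeline (assuming SciPy's k-means; the dimensions and moment order are illustrative): cluster the points, take cluster fractions as weights, and measure how well the weighted representatives reproduce the empirical moments. With $k$ equal only to the moment count the match is rough; increasing $k$ or refining the weights tightens it.

```python
import numpy as np
from itertools import product
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(4)
n, d, K = 10_000, 3, 2
X = rng.normal(size=(n, d))

# k tuned to the number of monomials of degree <= K in d variables: binom(d+K, K)
exps = [e for e in product(range(K + 1), repeat=d) if sum(e) <= K]
k = len(exps)                                      # = 10 here

centers, labels = kmeans2(X, k, minit="++", seed=5)
w = np.bincount(labels, minlength=k) / n           # cluster fractions as weights

def weighted_moments(points, weights):
    return np.array([weights @ np.prod(points ** np.array(e), axis=1) for e in exps])

mismatch = np.abs(weighted_moments(X, np.full(n, 1.0 / n))
                  - weighted_moments(centers, w)).max()
print(f"{n} points -> {k} weighted representatives; "
      f"max moment mismatch up to order {K}: {mismatch:.3f}")
```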
Specifically, suppose $F(x_1, \dots, x_n) = \rho\!\left(\frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\right)$; then compress $\{x_i\}_{i=1}^{n}$ to $m$ objects $\{\tilde x_j\}_{j=1}^{m}$ with weights $\{w_j\}_{j=1}^{m}$ such that
$$\sum_{j=1}^{m} w_j\, \tilde x_j^{\otimes k} = \frac{1}{n}\sum_{i=1}^{n} x_i^{\otimes k}, \qquad k = 0, 1, \dots, K.$$
This suffices to make $F(x_1, \dots, x_n)$ and $\tilde F(\tilde x_1, \dots, \tilde x_m) = \rho\!\big(\sum_{j=1}^{m} w_j\, \phi(\tilde x_j)\big)$ agree exactly for all polynomial $\phi$ of degree at most $K$, and for analytic $\phi$ up to the truncation error in its Taylor expansion.
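The reason this condition suffices is a one-line expansion (stated here for a monomial; the general polynomial case follows by linearity). For a monomial $x^{\alpha} = x_1^{\alpha_1}\cdots x_d^{\alpha_d}$ with $|\alpha| = k \le K$, reading the matched $k$-th tensor moment entrywise gives
$$\sum_{j=1}^{m} w_j\, \tilde x_j^{\alpha} = \Big[\sum_{j=1}^{m} w_j\, \tilde x_j^{\otimes k}\Big]_{\alpha} = \Big[\frac{1}{n}\sum_{i=1}^{n} x_i^{\otimes k}\Big]_{\alpha} = \frac{1}{n}\sum_{i=1}^{n} x_i^{\alpha},$$
so any polynomial $\phi$ of degree at most $K$ satisfies $\sum_j w_j \phi(\tilde x_j) = \frac{1}{n}\sum_i \phi(x_i)$; applying $\rho$ to equal arguments yields $\tilde F = F$, while for analytic $\phi$ only the Taylor tail beyond order $K$ contributes to the error.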
5. Comparative Perspective and Limitations
Universal compression under this theory is fundamentally different from classical pruning, quantization, or low-rank approximation strategies. Key features:
- Universality: The theory applies to any permutation-symmetric function, including those underlying neural models, dataset objectives, or loss functions.
- Theoretical Boundaries: It gives explicit, smoothness-controlled error bounds for function approximation after compression.
- Preservation of Learning Dynamics: The full dynamics of optimization (not merely the minimizer) are preserved, a property not guaranteed by post hoc pruning or quantization.
Limitations and practical challenges include:
- Computational Complexity: While existence is guaranteed (constructive via Tchakaloff), finding the minimal moment-matching set is NP-hard; practical methods rely on scalable but possibly suboptimal clustering heuristics such as k-means.
- Curse of Dimensionality: The size $m$ grows rapidly with the number and order of moments required ($m \le \binom{d+K}{K}$), potentially limiting compression in high-dimensional settings or when high accuracy is necessary (see the short computation after this list).
- Required Smoothness: The approach relies on the analytic (or at least ) structure of the underlying functions; for non-smooth objectives, error bounds may not be as sharp.
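A quick computation of the bound $\binom{d+K}{K}$ makes this growth concrete (illustrative values of $d$ and $K$, not figures from the paper):

```python
from math import comb

# Upper bound on support size: number of monomials of degree <= K in d variables
for d in (2, 10, 100):
    for K in (2, 4, 8):
        print(f"d={d:>3}, K={K}: m <= {comb(d + K, K):,}")
```

Even modest moment orders become prohibitive once $d$ reaches the hundreds, which is why low-order, approximate matching is used in practice.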
6. Practical Applications and Experimental Validation
Empirical evidence in the paper corroborates the theoretical predictions. In a teacher–student learning scenario, datasets compressed via moment-matching yield nearly the same test performance as the original, full dataset, whereas random subsets of equal size result in significant performance degradation. This effect is robust to the choice of training method (Adam, SGD, etc.) and extends to practical datasets and models used in deep learning.
Other promising applications include:
- Memory-efficient training of deep networks via initialization with compressed parameters.
- Active data selection strategies that select a representative, highly informative subset by moment-matching.
- Loss landscape preservation for optimization and transfer learning, as the compressed data or model parameters retain the full structure of the functional being minimized.
7. Broader Impact and Future Directions
The universal compression theory reveals a fundamental connection between statistical symmetry, information representation, and scaling in high-dimensional learning systems. By unifying function approximation, dataset compression, and parameter reduction through the lens of symmetric statistics, it implies that tractable and efficient summary representations are possible even in massively overparameterized regimes. Future extensions may explore more efficient approximate algorithms for moment matching, handle more general (non-analytic or non-symmetric) classes, or integrate these insights into the design of scalable, adaptive learning architectures.
A plausible implication is the potential for dramatic reductions in computational and memory requirements for both training and deploying large-scale models by exploiting structure that is universal, rather than problem- or architecture-specific, as rigorously quantified by the universal compression theorem (Wang et al., 1 Oct 2025).