Compression-Aware Scaling Laws

Updated 4 January 2026
  • Compression-aware scaling laws are predictive frameworks that incorporate data, model, and representation compressibility—using metrics like Kolmogorov complexity and explicit compression ratios—to refine performance estimates.
  • They unify diverse compression techniques such as quantization, sparsity, and hybrid representations through intrinsic capacity metrics, enabling direct resource allocation and scaling law acceleration.
  • The framework offers practical strategies for model architecture selection, pruning, and deployment by factoring in hardware constraints, precision effects, and data compressibility measures.

Compression-aware scaling laws generalize classical scaling-law frameworks by making model performance predictions sensitive to data, model, and representation compressibility, or to hardware-induced limitations. Rather than treating the raw number of parameters or data points as the central driver, these laws posit that effective capacity—quantified via measures such as Kolmogorov complexity, representation capacity, precision, or explicit compression ratios—is the true determinant of generalization, optimization dynamics, and resource allocation. Recent advances include polylogarithmic compression regimes, unified scaling expressions applicable to sparse, quantized, and hybrid representations, storage/data compression laws, and multimodal tokenization efficiency formulas. Empirical findings reveal both the steeper decay rates enabled by optimal compression and the practical limits imposed by hardware and deployment constraints.

1. Universal Compression Theory and Dynamical Scaling Acceleration

The work "A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws" (Wang et al., 1 Oct 2025) developed a constructive proof that permutation-invariant functions of dd objects (e.g., neural network layers or datasets) can be losslessly compressed into d=O(logm(d/ω(d)))d' = O(\log^m(d/\omega(d))) weighted objects for any vanishing error ω(d)\omega(d). This establishes:

  • Universal Compression Theorem: Any permutation-invariant, sufficiently smooth function $f: V^d \to \mathbb{R}$ admits a compressed representation on a much smaller set of weighted points $(w_j', c_j)$ such that

$$|f_d(w_1, \dots, w_d) - f_{d'}'(w_1', c_1; \dots; w_{d'}', c_{d'})| \leq \omega(d)$$

with $d' = O(\log^m(d/\omega(d)))$.

  • Dynamical Lottery Ticket Hypothesis: Applying the universal compression algorithm to the parameter set of a neural network shows that the entire learning dynamics—forward passes and gradient updates—can be matched by a polylogarithmically compressed weighted net, preserving trajectory and result up to vanishing error. "Cluster-peeling" and moment-matching for tensor-moments yield constructive compressed initialization supporting full-batch and stochastic training.
  • Dataset Compression and Scaling Law Acceleration: The same mechanism compresses datasets for supervised learning. If the empirical loss scales as $L \sim d^{-\alpha}$ for dataset size $d$, compressing to $d' = O(\log^m d)$ achieves a "boosted" scaling $L \sim \exp(-\alpha' \sqrt[m]{d'})$. The loss landscape is unchanged (uniformly in parameters), enabling stretched-exponential improvements over classical power-law scaling.

These results demonstrate that both network and data scaling laws can be superpolynomially improved under permutation-invariant structure, with simple moment-matching, Tchakaloff-type clustering, and reweighting algorithms providing constructive compressions.
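To make the "boosted" regime concrete, the following minimal sketch compares the classical power law with the stretched-exponential law at a fixed budget of stored objects; the constants `alpha`, `alpha_prime`, and `m` are illustrative assumptions, not values fitted in the paper.

```python
import numpy as np

# Illustrative constants only (not fitted to the paper): power-law exponent
# alpha, boosted-rate constant alpha_prime, and compression order m.
alpha, alpha_prime, m = 0.5, 0.5, 2

# Budget: number of (weighted) objects actually stored and trained on.
n_stored = np.array([50, 100, 200, 500, 1000], dtype=float)

# Classical power law if the stored objects are raw samples: L ~ n^(-alpha).
loss_raw = n_stored ** (-alpha)

# Stretched-exponential ("boosted") law if the stored objects are the
# polylog-compressed representation d' = O(log^m d) of a much larger dataset:
# L ~ exp(-alpha' * d'^(1/m)).
loss_compressed = np.exp(-alpha_prime * n_stored ** (1.0 / m))

for n, lr, lc in zip(n_stored, loss_raw, loss_compressed):
    print(f"stored objects={n:6.0f}  raw-sample loss~{lr:.3e}  compressed loss~{lc:.3e}")
```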

2. Unified Scaling Laws for Compressed Representations

Recent work has unified the prediction of generalization under quantization, sparsity, and hybrid (sparse-quantized, vector-quantized) formats via an intrinsic capacity metric (Panferov et al., 2 Jun 2025, Frantar et al., 23 Feb 2025, Kumar et al., 2024):

  • Unified Scaling Law:

$$\text{Loss}(N, D, R) \simeq A \cdot [N \cdot \rho(R)]^{-\alpha} + B \cdot D^{-\beta} + E$$

where $N$ is the uncompressed parameter count, $R$ denotes the compressed format, and $\rho(R)$ is the representation capacity—measured by the ability of $R$ to fit random Gaussian data with minimal MSE.

  • Effective Parameter Count $N_\text{eff}$:

$$N_\text{eff} = N \cdot \rho(R)$$

allowing direct alignment of scaling predictions for any compressed architecture by computing $\rho(R)$ from a single noise-injection or Monte Carlo GMSE measurement.

  • Composability and Factorization: $\rho$ decomposes multiplicatively across independent compression axes:
    • For sparsity $s$ and quantization $q$:

    $$\rho(R_{q,s}) = \rho_q(q) \cdot \rho_s(s)$$

making Pareto-optimal frontier computation and hyperparameter/model selection tractable.

  • Empirical Multipliers for Quantization and Sparsity (Frantar et al., 23 Feb 2025):
    • Weight-only quantization: $f_w(4) = 0.923$, $f_w(2) = 0.702$, $f_w(1) = 0.466$
    • Full weight+activation quantization: $f_\text{full}(8) = 0.857$, $f_\text{full}(4) = 0.747$, $f_\text{full}(2) = 0.289$, $f_\text{full}(1) = 0.067$
    • 50% sparsity: $0.871$ efficiency

The capacity-aware scaling principle subsumes diverse compression techniques into one parameter-count scaling law, facilitating model comparison, hardware-dependent format selection, and compute-optimal allocation.
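As a worked sketch of how the capacity-aware law can be applied, the snippet below plugs the multipliers quoted above into the unified formula; the Chinchilla-style constants `A`, `B`, `E`, `alpha`, `beta` and the model/data sizes are illustrative assumptions rather than fitted values from the cited papers.

```python
# Capacity multipliers quoted in this section (Frantar et al., 23 Feb 2025);
# the constants A, B, E, alpha, beta are illustrative Chinchilla-style values.
rho_weight_quant = {4: 0.923, 2: 0.702, 1: 0.466}   # weight-only quantization
rho_sparse_50 = 0.871                                # 50% sparsity

A, B, E = 406.4, 410.7, 1.69
alpha, beta = 0.34, 0.28

def predicted_loss(n_params, n_tokens, rho):
    """Unified law: Loss ~= A * (N * rho)^(-alpha) + B * D^(-beta) + E."""
    n_eff = n_params * rho                           # effective parameter count
    return A * n_eff ** (-alpha) + B * n_tokens ** (-beta) + E

N, D = 7e9, 2e12                                     # dense parameters, training tokens

# Multiplicative composition across independent axes: 4-bit weights + 50% sparsity.
formats = {
    "dense (fp16)": 1.0,
    "int4 weights": rho_weight_quant[4],
    "int4 weights + 50% sparse": rho_weight_quant[4] * rho_sparse_50,
}
for name, rho in formats.items():
    print(f"{name:28s} rho={rho:.3f}  predicted loss={predicted_loss(N, D, rho):.3f}")
```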

3. Precision- and Post-Training Quantization-Aware Scaling Laws

Scaling laws incorporating variable training precision and post-training quantization (PTQ) have been developed to model both the additional loss due to lower bit-width and the recovery dynamics (Kumar et al., 2024, Xu et al., 2024, Zhou et al., 26 Aug 2025):

$$L(N, D, b) = A\,[N_\text{eff}(b)]^{-\alpha} + B\,D^{-\beta} + E$$

with $N_\text{eff}(b) = N \cdot (1 - e^{-b/\gamma_w})(1 - e^{-b/\gamma_a})(1 - e^{-b/\gamma_{kv}})$, recovering dense scaling for large $b$.
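The saturation behavior of $N_\text{eff}(b)$ is easy to check numerically; in the sketch below the sensitivity constants $\gamma_w, \gamma_a, \gamma_{kv}$ are placeholder assumptions chosen only to show the shape of the curve, not the fitted values from the cited work.

```python
import numpy as np

# Illustrative sensitivity constants for weights, activations, and KV cache;
# the fitted values in the cited work differ, this only shows the saturation shape.
gamma_w, gamma_a, gamma_kv = 2.5, 2.5, 2.5

def n_eff(n_params, bits):
    """N_eff(b) = N * (1 - e^(-b/gw)) * (1 - e^(-b/ga)) * (1 - e^(-b/gkv))."""
    keep = lambda gamma: 1.0 - np.exp(-bits / gamma)
    return n_params * keep(gamma_w) * keep(gamma_a) * keep(gamma_kv)

N = 1e9
for b in (16, 8, 4, 3, 2):
    print(f"b={b:2d} bits  N_eff/N = {n_eff(N, b) / N:.3f}")  # -> 1.0 for large b
```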

  • Unified Training and PTQ Degradation Law:

$$L(N, D, b_\text{train}, b_\text{post}) = \text{Chinchilla law on } N_\text{eff}(b_\text{train}) + \Delta_\text{PTQ}(N, D, b_\text{train}, b_\text{post})$$

where the PTQ penalty $\Delta_\text{PTQ}$ increases with the amount of pretraining data and falls with training-time quantization width.

  • Task-Level Accuracy Under Quantization:

$$\text{Acc}_\text{task} \approx C_\text{task} \cdot N^{\alpha_\text{task}} \cdot [\log_2 C_b]^{\beta_\text{task}} \cdot G^{\gamma_\text{task}} \cdot [\log_2 B_\text{eff}]^{\delta_\text{task}}$$

showing that knowledge memorization (recall, factual Q&A) is far more sensitive to bit-width and calibration-set size than knowledge utilization (reasoning). This informs compression and quantization strategies for deployment under resource constraints.

  • Landscape-Aware Scaling in PTQ (Xu et al., 2024): Quantized-model loss increments scale as $\Delta \text{NLL} \approx C\,N^{-\alpha} b^{-\beta} \kappa^{\gamma}$. Random-forest predictors using pre-quantization NLL, SQNR, and local curvature yield reliable estimates of post-PTQ performance, guiding the selection of $(N, b)$ along memory-constrained Pareto frontiers (see the sketch below).
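A minimal sketch of such a memory-constrained selection, assuming only the weight-precision factor of $N_\text{eff}$ and illustrative constants throughout (nothing here is fitted from the cited papers), is:

```python
import numpy as np

# All constants are illustrative; in practice they come from fitting the
# precision-aware law above to held-out runs. Only the weight-precision factor
# of N_eff is modeled here for brevity.
A, B, E, alpha, beta = 406.4, 410.7, 1.69, 0.34, 0.28
gamma_w = 2.5
D = 1e12                               # fixed training-token budget
memory_budget_bytes = 4e9              # deployment memory ceiling

def predicted_loss(N, b):
    n_eff = N * (1.0 - np.exp(-b / gamma_w))
    return A * n_eff ** (-alpha) + B * D ** (-beta) + E

# Enumerate (N, b) pairs that fit the memory budget (N weights at b bits each).
candidates = [(N, b)
              for N in (1e9, 3e9, 7e9, 13e9)
              for b in (2, 3, 4, 8, 16)
              if N * b / 8 <= memory_budget_bytes]

best = min(candidates, key=lambda nb: predicted_loss(*nb))
for N, b in sorted(candidates):
    marker = "  <-- best under budget" if (N, b) == best else ""
    print(f"N={N:.0e}  b={b:2d}  predicted loss={predicted_loss(N, b):.3f}{marker}")
```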

4. Storage- and Data-Compressibility Dependent Scaling Laws

Data compressibility directly shifts generalization scaling exponents and optimal resource allocation (Pandey, 2024, Mentzer et al., 2024):

$$L(N, D, H) = E(H) + \frac{A(H)}{N^{\alpha(H)}} + \frac{B(H)}{D^{\beta(H)}}$$

where $H$ is the dataset's gzip-compressibility. Harder-to-compress data (higher-entropy $H$) increases the data exponent $\beta(H)$, shifting optimal scaling toward data (for code data, $\beta \gg \alpha$; for natural language, the reverse).
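A gzip-compressibility score of this kind is simple to compute; the sketch below uses the ratio of compressed to raw bytes on toy corpora, which may differ in detail from the exact metric used in the cited work.

```python
import gzip

def gzip_compressibility(texts):
    """Compressed size / raw size of the concatenated corpus; higher = harder to compress."""
    raw = "\n".join(texts).encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

# Toy corpora (illustrative only): highly repetitive text vs. more varied text.
repetitive = ["for i in range(10):\n    print(i)"] * 500
varied = [f"Sample sentence number {i} describing observation {i * 7 % 13}." for i in range(500)]

print("repetitive corpus H ~", round(gzip_compressibility(repetitive), 3))
print("varied corpus     H ~", round(gzip_compressibility(varied), 3))
```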

A complementary storage-centric law expresses error in terms of sample count $n$ and bits per sample $L$:

$$\text{Err}(n, L) \approx \text{Err}^* + A\,n^{-\alpha} + B\,L^{-\beta}$$

with two independent exponents for sample size and bits per sample, admitting the closed-form minimization $\min_{n,L}\, A n^{-\alpha} + B L^{-\beta}$ subject to $nL = s$ for a fixed total storage budget $s$.

  • Optimal Compression Allocation:

$$n^*(s) \sim s^{\beta/(\alpha+\beta)}, \quad L^*(s) \sim s^{\alpha/(\alpha+\beta)}$$

and

$$\text{Err}(n^*, L^*) \sim s^{-\nu}, \quad \nu = \frac{\alpha \beta}{\alpha+\beta}$$

enabling joint optimization of sample count and compression level under storage constraints.
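The sketch below evaluates this closed-form allocation for a few storage budgets; the exponents and prefactors are placeholder assumptions, and the proportionality constants in $n^*(s)$ and $L^*(s)$ are dropped.

```python
# Closed-form optimum for min_{n,L} A n^(-alpha) + B L^(-beta) subject to n*L = s.
# alpha, beta, A, B are illustrative placeholders; proportionality constants in
# n*(s) and L*(s) are dropped for clarity.
alpha, beta = 0.3, 0.6
A, B = 1.0, 1.0

def optimal_allocation(s):
    n_star = s ** (beta / (alpha + beta))       # optimal number of samples
    L_star = s ** (alpha / (alpha + beta))      # optimal bits per sample
    err = A * n_star ** (-alpha) + B * L_star ** (-beta)
    return n_star, L_star, err

nu = alpha * beta / (alpha + beta)              # overall exponent: Err ~ s^(-nu)
for s in (1e6, 1e8, 1e10):
    n_star, L_star, err = optimal_allocation(s)
    print(f"s={s:.0e}  n*~{n_star:.2e}  L*~{L_star:.2e}  Err~{err:.2e}  (nu={nu:.2f})")
```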

5. Compression Laws for Model Pruning and Structured Compression

Structured model pruning and recovery fine-tuning have been systematically analyzed in terms of cross-entropy and downstream task accuracy under compression (Sengupta et al., 6 Apr 2025, Rosenfeld, 2021):

  • Quadratic Loss Increase, Linear Task Degradation (see the sketch after this list):

$$\mathcal{L}_\text{int}(r) = \mathcal{L}_0^{\alpha_\text{int}} (1+r)^{\beta_\text{int}}, \quad \beta_\text{int} \approx 2.02$$

$$P_\text{ext}(r) = P_0^{\alpha_\text{ext}} (1+r)^{\beta_\text{ext}}, \quad \beta_\text{ext} \approx -1.05$$

  • Recovery Fine-Tuning improves generalization loss by up to $63\%$ and zero-shot accuracy by $14\%$ after compression (given $D$ fine-tuning tokens). Saturation and criticality analysis identifies a safe region for aggressive compression before irrecoverable drops.
  • Empirical Compression Power Laws (Rosenfeld, 2021):

$$L \approx \alpha\,C^{-\beta_C} + \gamma\,D^{-\beta_D} + \delta\,N^{-\beta_N} + \epsilon\,p^{-\beta_p}$$

where $p$ is the fraction of weights remaining after pruning, giving reproducible compression exponents $\gamma \sim 0.45$–$0.6$. The Nyquist-learner conjecture (a data bandwidth-limited hypothesis) suggests arbitrarily steep improvements via function-bandwidth constraints.
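The first two relations above are straightforward to evaluate; in the sketch below only the exponents $\beta_\text{int}$ and $\beta_\text{ext}$ come from the text, while the base loss, base task score, and $\alpha$ exponents are illustrative placeholders.

```python
# Sketch of the pruning compression laws quoted above. The exponents beta_int and
# beta_ext are the values cited in the text; L0, P0, alpha_int, alpha_ext are
# illustrative placeholders.
L0, P0 = 2.0, 0.65
alpha_int, alpha_ext = 1.0, 1.0
beta_int, beta_ext = 2.02, -1.05

def internal_loss(r):
    """Cross-entropy after compression ratio r: L0^alpha_int * (1 + r)^beta_int."""
    return (L0 ** alpha_int) * (1.0 + r) ** beta_int

def external_performance(r):
    """Downstream task score after compression: P0^alpha_ext * (1 + r)^beta_ext."""
    return (P0 ** alpha_ext) * (1.0 + r) ** beta_ext

for r in (0.0, 0.25, 0.5, 1.0):
    print(f"r={r:4.2f}  predicted loss={internal_loss(r):.3f}  "
          f"predicted task score={external_performance(r):.3f}")
```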

6. Kolmogorov- and Information-Theoretic Foundations

Several recent works ("Unifying Two Types of Scaling Laws" (Wan, 12 Jan 2025), "Understanding LLM Behaviors via Compression" (Pan et al., 13 Apr 2025)) explicitly ground scaling laws in the approximation of conditional Kolmogorov complexity, showing that both training and inference optimization act as computable procedures for minimizing joint code length:

  • Unified Scaling via Conditional Kolmogorov Complexity:

$$L(N, D, C, T) = L_\infty + A\,(a N^{\alpha_N} D^{\alpha_D} C^{\alpha_C} + b T)^{-\gamma}$$

with $T$ the inference token budget (steps), both phases approximating the minimal achievable code length (a numerical sketch follows this list).

  • Syntax-Knowledge Model (Pan et al., 13 Apr 2025): Modeling LLM learning as Pitman-Yor-process-driven structure acquisition explains decelerating power-laws for both data and model scaling, phase transition behavior in rare knowledge element acquisition, and hallucination phenomena due to capacity shortfall.
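To illustrate the train/inference complementarity the unified law encodes, the sketch below evaluates it for increasing inference budgets $T$; every constant is an illustrative placeholder, and reading $C$ as a training-compute term is an assumption of this sketch.

```python
# Every constant is an illustrative placeholder; the cited work fits them to data.
# C is read here as a training-compute term (an assumption for this sketch).
L_inf, A, gamma = 1.8, 5.0, 0.3
a, b = 1.0, 100.0
alpha_N, alpha_D, alpha_C = 0.2, 0.2, 0.1

def unified_loss(N, D, C, T):
    """L = L_inf + A * (a * N^aN * D^aD * C^aC + b*T)^(-gamma)."""
    return L_inf + A * (a * N ** alpha_N * D ** alpha_D * C ** alpha_C + b * T) ** (-gamma)

N, D, C = 1e8, 1e10, 1e20          # model size, data, training compute (toy values)
for T in (1, 64, 1024, 16384):     # inference-time token budget
    print(f"T={T:6d}  predicted loss={unified_loss(N, D, C, T):.4f}")
```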

Compression-aware scaling laws thus provide a unifying predictive paradigm across modern ML regimes, bringing together permutation-invariant compression, hardware-attuned representation capacity metrics, data-complexity indicators, precision/quantization constraints, and information-theoretic optimality. The mathematical structure enables automatic trade-off computation—model size, data, format, precision—against desired loss, memory, and compute constraints.

7. Practical Implications and Future Directions

The compression-aware scaling law framework has immediate utility for model-architecture selection, resource allocation, and deployment constraint management:

  • Hardware/Format Design: Directly rank new quantization and sparsity formats via intrinsic GMSE and $\rho(R)$, avoiding retraining.
  • Compression and Quantization: Compute Pareto fronts for $(N, b)$ under memory constraints, predict degradation from PTQ and fine-tuning, and direct calibration efforts.
  • Data Acquisition and Filtering: Employ gzip or alternative compressibility metrics to guide optimal data-model allocation for compute-optimal scaling frontiers, especially for code and multimodal data.
  • Cross-Modality Generalization: Multimodal scaling laws provide effective token-count formulas that integrate compression and tokenization efficiency, guiding balanced data ingestion across modalities.
  • Algorithmic Optimization: Use dynamical lottery ticket theory and moment-matching compressions to dramatically reduce training overhead while preserving learning dynamics.

Ongoing work seeks to extend these laws to non-autoregressive and adaptive architectures, incorporate distributional tokenization/codec effects, and construct model-selection strategies for ever-evolving hardware and deployment targets. The compression-aware scaling paradigm is central to rigorously quantifying and optimizing the interaction between data, representations, and computation in large-scale machine learning.
