Scale-Then-Compress in ML

Updated 28 February 2026

Scale-Then-Compress is a methodology that deliberately over-parameterizes models or data to expose latent redundancies, enabling subsequent efficient compression.
It integrates techniques such as model-based scaling, feature expansion, and distributed training with gradient compression to enhance performance and resource efficiency.
The approach applies unified scaling laws to balance resource allocation, ensuring robust generalization and lower storage or computational requirements in practice.

The Scale-Then-Compress approach describes a class of methodologies in machine learning and signal processing that deliberately over-parameterize or over-represent information during an initial "scaling" stage—encompassing model size, feature expansion, resolution, or raw data fidelity—followed by an explicit "compression" phase that leverages data- or model-driven structure to reduce resource requirements while preserving utility. This paradigm has become a dominant principle in neural network training, communication-efficient learning, model deployment, and signal transmission, underpinning diverse techniques in model pruning/quantization, feature map compression, scalable coding, and distributed optimization.

1. Theoretical Principles and Motivation

Scale-Then-Compress exploits the empirical and theoretical observation that over-parameterization during training or representation exposes latent redundancies or structures that are otherwise inaccessible to compression algorithms. Larger models, higher-resolution data, or broader feature expansions often yield solutions that can be compressed far more aggressively—either due to greater flatness in the loss landscape, higher signal redundancy, or concentration of important capacity in a small subset of parameters or bits—as substantiated in deep networks, mixture-of-experts, and scalable coding. This is formalized in several general laws:

Scaling Law with Compression Factor: For a model of $N$ physical parameters in a compressed representation $R$ (e.g., quantized, sparse, or reduced-rank),

$L(N, D; R) = A\,[N \cdot \rho(R)]^{-\alpha} + B\,D^{-\beta} + E,$

where $\rho(R)$ encodes the effective "capacity" of the representation and $L$ denotes a loss metric. This capacity is defined via the mean-squared error of the compressed representation fitted on Gaussian random data and composes multiplicatively for independent compressions (e.g., sparsity and quantization) (Panferov et al., 2 Jun 2025).
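As an illustration, the law can be evaluated directly once its coefficients are known. The sketch below is only that: the coefficient values and the per-format capacities (0.8 and 0.6) are hypothetical placeholders rather than fitted numbers from the cited work, and the function names are introduced here for illustration.

```python
from math import prod


def effective_params(n_params, capacities):
    """Effective parameter count N * rho(R).

    rho(R) is taken as the product of the capacities of independent
    compressions, following the multiplicative composition described above.
    """
    return n_params * prod(capacities)


def compressed_loss(n_params, n_data, capacities, A, alpha, B, beta, E):
    """Evaluate L(N, D; R) = A * (N * rho)^(-alpha) + B * D^(-beta) + E."""
    n_eff = effective_params(n_params, capacities)
    return A * n_eff ** (-alpha) + B * n_data ** (-beta) + E


# Hypothetical coefficients, chosen only to make the example run.
COEFFS = dict(A=400.0, alpha=0.34, B=410.0, beta=0.28, E=1.7)

# Hypothetical capacities: 0.8 for a sparsity pattern, 0.6 for a quantizer.
dense_small = compressed_loss(1e9, 2e10, [1.0], **COEFFS)
compressed_small = compressed_loss(1e9, 2e10, [0.8, 0.6], **COEFFS)
compressed_large = compressed_loss(4e9, 2e10, [0.8, 0.6], **COEFFS)

print(f"dense 1B:      {dense_small:.4f}")
print(f"compressed 1B: {compressed_small:.4f}")
print(f"compressed 4B: {compressed_large:.4f}")
```

Under these placeholder values the compressed 4B model has more effective parameters (1.92B) than the dense 1B model, so its predicted loss is lower, which is the comparison the law is meant to support.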

Storage Scaling Law: In aggressively data-limited regimes, test error follows a joint power-law:

$\mathrm{Err}(N, L) \approx \mathrm{Err^*} + A N^{-\alpha} + B L^{-\beta},$

subject to a storage constraint $N \cdot L \leq S$ (Mentzer et al., 2024).
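A short worked derivation shows how the exponents fix the split. It assumes the constraint is active ($N \cdot L = S$), that $A, B, \alpha, \beta > 0$, and that $N$ and $L$ can be treated as continuous; reading $N$ as the number of samples and $L$ as bits per sample is an interpretation consistent with the storage constraint. Substituting $L = S/N$ gives

$f(N) = \mathrm{Err^*} + A N^{-\alpha} + B S^{-\beta} N^{\beta}.$

Setting the derivative to zero,

$f'(N) = -\alpha A N^{-\alpha-1} + \beta B S^{-\beta} N^{\beta-1} = 0 \;\Longrightarrow\; N^{*} = \left(\frac{\alpha A}{\beta B}\, S^{\beta}\right)^{1/(\alpha+\beta)}, \qquad L^{*} = \frac{S}{N^{*}}.$

Multiplying the stationarity condition by $N$ yields the equivalent balance condition

$\alpha A\,(N^{*})^{-\alpha} = \beta B\,(L^{*})^{-\beta},$

so at the optimum the two reducible error terms are in the fixed ratio $\beta : \alpha$. Because $f$ is a sum of exponentials in $\log N$, it is convex in $\log N$ and this stationary point is the unique minimum.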

In practice, the best operating point is usually reached by first scaling model size or data coverage beyond the immediate resource budget, then compressing post hoc with empirically or theoretically justified methods.

2. Methodological Taxonomy

The Scale-Then-Compress paradigm occurs in multiple methodological forms, with key variations determined by the mechanism of scaling, the compression strategies, and the analysis or optimization principles:

Model-based Scaling: Training large neural architectures (Transformers, SMoEs) for fewer steps then compressing through quantization, pruning, or expert merging (Li et al., 2020, Li et al., 2023, Na et al., 2022).
Data/Feature Expansion: Using high precision, spatial resolution, or multi-scale data, followed by adaptive feature map compression or scalable coding (Yao et al., 2023, Park et al., 2023, Valsesia et al., 2013).
Communication-efficient Distributed Training: Scaling up gradient, batch, or worker counts during distributed training, then applying compression on gradients or parameter updates, often with tailored signal pre-processing (e.g., residual low-pass filtering) (Chen et al., 2021).
Storage-efficient Dataset Curation: Adjusting the number of data points and bits-per-sample jointly to minimize generalization error under storage limitations, allocating bits and samples according to scaling laws (Mentzer et al., 2024).

3. Model Compression and Robustness to Compression

Empirical studies consistently show that larger neural networks are intrinsically more robust to post-training compression. The mean and variance of quantization and pruning errors per layer both decrease with increasing depth and width. Consequently, heavily compressed large models attain higher accuracy than lightly compressed small models under equivalently strict memory or compute budgets, and Pareto frontiers for (size, accuracy) are achieved by "scale then compress" strategies (Li et al., 2020).

When fine-tuning with Sharpness-Aware Minimization (SAM), networks converge to flatter minima, which yields greater tolerance to magnitude pruning and quantization: at fixed sparsity or bitwidth, the drop in downstream performance is 1–3 points smaller than when compressing baselines that settle in sharper minima (Na et al., 2022).

In mixture-of-experts models, MC-SMoE demonstrates that merging experts using routing statistics, followed by low-rank and sparse decompositions of the merged weights, achieves up to 80% memory and 20% FLOPs reductions with negligible accuracy loss, outperforming standard one-shot model pruning or merging (Li et al., 2023).
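The merge-then-decompose idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the MC-SMoE procedure itself: it assumes the experts are already grouped and aligned, weights each expert by a hypothetical routing count, and shows only a truncated-SVD low-rank step (the sparse component mentioned above is omitted). The names `merge_experts` and `low_rank_factors` are introduced here for illustration.

```python
import numpy as np


def merge_experts(weights, usage_counts):
    """Average expert weight matrices, weighted by how often each was routed to."""
    counts = np.asarray(usage_counts, dtype=float)
    coeffs = counts / counts.sum()
    # Contract the expert axis: sum_e coeffs[e] * weights[e]
    return np.tensordot(coeffs, np.stack(weights), axes=1)


def low_rank_factors(matrix, rank):
    """Return (left, right) such that left @ right is the best rank-`rank` approximation."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]


rng = np.random.default_rng(0)
experts = [rng.standard_normal((64, 32)) for _ in range(4)]
routing_counts = [120, 30, 40, 10]  # hypothetical routing statistics

merged = merge_experts(experts, routing_counts)
left, right = low_rank_factors(merged, rank=8)

original_params = sum(w.size for w in experts)
compressed_params = left.size + right.size
rel_error = np.linalg.norm(merged - left @ right) / np.linalg.norm(merged)
print(original_params, compressed_params, round(float(rel_error), 3))
```

Random matrices have little low-rank structure, so the approximation error here is large; the sketch shows the mechanics and the parameter accounting, not the accuracy a trained model would retain.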

4. Feature Map and Data Compression Architectures

Scale-Then-Compress principles are operationalized in hardware and signal coding as follows:

Adaptive Scale Feature Map Compression (ASC): Input feature maps are partitioned into roughly cubic blocks. For each block, two endpoints are computed to cover the range of values. Intermediate interpolation points are chosen adaptively (linear or log-linear scaling) to best match the local distribution, and each value is mapped by thresholding to a 3-bit index, yielding fixed-rate or variable-rate codes. Channels are indexed independently, which exploits the weak inter-channel correlation in DNN features (Yao et al., 2023).
NN-based Scalable Image Compression (COMPASS): The image is coded layerwise: a base layer encodes a low-resolution image, subsequent enhancement layers upsample using an implicit neural field predictor (LIFF) and compress only the residual, not the full image, enabling arbitrary-scale spatial scalability within a single codec (Park et al., 2023).
Compressed Sensing with Inter-layer Prediction: Hybrid transform sensing provides a low-res base preview via structured random projections, with enhancement layers acquiring and coding only the residuals relative to base-layer predictions, consistently achieving 1–3 dB PSNR improvements over non-scalable or non-predictive schemes (Valsesia et al., 2013).
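The endpoint-plus-index scheme described for ASC can be sketched as follows. This is a minimal illustration under stated assumptions, not the published encoder: it implements only the linear interpolation mode, uses the block minimum and maximum as endpoints, and the function names are hypothetical.

```python
import numpy as np


def encode_block(block, bits=3):
    """Encode one block as two endpoints plus a `bits`-bit index per value."""
    lo, hi = float(block.min()), float(block.max())
    # 2**bits evenly spaced reconstruction levels between the endpoints.
    levels = np.linspace(lo, hi, 2 ** bits)
    # Thresholds are midpoints between neighbouring levels, so each value
    # maps to its nearest level.
    thresholds = (levels[:-1] + levels[1:]) / 2
    indexes = np.searchsorted(thresholds, block.ravel()).astype(np.uint8)
    return lo, hi, indexes.reshape(block.shape)


def decode_block(lo, hi, indexes, bits=3):
    """Rebuild the block by looking up each index in the level table."""
    levels = np.linspace(lo, hi, 2 ** bits)
    return levels[indexes]


rng = np.random.default_rng(0)
block = rng.random((4, 4, 4))
lo, hi, indexes = encode_block(block)
restored = decode_block(lo, hi, indexes)
print("max abs error:", float(np.max(np.abs(restored - block))))
```

With linear levels, the reconstruction error of any value is at most half a step, that is, `(hi - lo) / 14` for 3-bit indexes. A log-linear level table would replace `np.linspace` with levels spaced more densely near one endpoint.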

5. Unified Scaling Laws and Quantitative Trade-offs

Recent work formalizes unified scaling laws for compressed representations. It defines a capacity metric $\rho$ for quantized, sparse, or vector-quantized (VQ) representations through their mean-squared error on random Gaussian data, and shows that the effective number of parameters is $N \cdot \rho$. Empirical fits predict model loss across various formats to within 1% of the measured value. Closed-form comparison of different formats (e.g., INT8, INT4, 2:4 sparsity, vector quantization) enables optimization of compute or memory budgets at design time (Panferov et al., 2 Jun 2025).
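The Gaussian-data measurement that underlies the capacity metric can be sketched as below. The code only measures the mean-squared error of each format on standard normal samples; the mapping from that error to $\rho$ is defined in the cited work and is not reproduced here. The per-tensor absmax scaling and the helper names are simplifying assumptions made for this illustration.

```python
import numpy as np


def quantize_uniform(x, bits):
    """Symmetric uniform quantization with a single absmax scale (an assumption)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def sparsify_2_of_4(x):
    """2:4 sparsity: in each group of 4 values, zero the 2 smallest magnitudes."""
    groups = x.reshape(-1, 4)
    order = np.argsort(np.abs(groups), axis=1)
    pruned = groups.copy()
    np.put_along_axis(pruned, order[:, :2], 0.0, axis=1)
    return pruned.reshape(x.shape)


def gaussian_mse(compress, n=1 << 16, seed=0):
    """Mean-squared error of `compress` on n standard normal samples."""
    x = np.random.default_rng(seed).standard_normal(n)
    return float(np.mean((x - compress(x)) ** 2))


formats = {
    "INT8": lambda x: quantize_uniform(x, 8),
    "INT4": lambda x: quantize_uniform(x, 4),
    "2:4 sparsity": sparsify_2_of_4,
}
for name, fn in formats.items():
    print(f"{name:>12}: MSE = {gaussian_mse(fn):.5f}")
```

Practical quantizers usually use group-wise scales, which lowers the error relative to the per-tensor scale used here.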

For storage-limited dataset construction, joint law optimization shows it is optimal to allocate more samples at lower bits-per-sample, determined explicitly by the exponents in the error law. On Food101 and Cityscapes, this yields 15–20% lower test error than non-compressed or non-optimally compressed baselines under matched storage (Mentzer et al., 2024).
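When only a few bit depths are available, the allocation can be found by direct search over the storage law. The sketch below assumes the law's coefficients have already been estimated; the values used, the storage budget, and the candidate bits-per-sample settings are all hypothetical.

```python
def predicted_error(n_samples, bits, err_floor, A, alpha, B, beta):
    """Err(N, L) = Err* + A * N^(-alpha) + B * L^(-beta)."""
    return err_floor + A * n_samples ** (-alpha) + B * bits ** (-beta)


def best_allocation(storage_bits, candidate_bits, **law):
    """Pick the bits-per-sample setting with the lowest predicted error.

    For each candidate L, the sample count is the largest N with N * L <= S.
    Returns (n_samples, bits, predicted_error), or None if nothing fits.
    """
    best = None
    for bits in candidate_bits:
        n_samples = storage_bits // bits
        if n_samples < 1:
            continue
        err = predicted_error(n_samples, bits, **law)
        if best is None or err < best[2]:
            best = (n_samples, bits, err)
    return best


# Hypothetical law coefficients and budget, for illustration only.
LAW = dict(err_floor=0.05, A=2.0, alpha=0.3, B=0.5, beta=0.8)
STORAGE_BITS = 8_000_000_000
CANDIDATE_BITS = [2_000, 8_000, 32_000, 128_000, 512_000]

n_samples, bits, err = best_allocation(STORAGE_BITS, CANDIDATE_BITS, **LAW)
print(f"N = {n_samples}, L = {bits} bits/sample, predicted error = {err:.4f}")
```

The search makes the trade-off explicit: lowering bits per sample buys more samples under the same budget, and the exponents decide where the two error terms balance.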

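Regime	Scaling Law	Optimal Allocation (given a fixed budget)
Model/compute-limited	$L(N, D; R) = A\,[N \cdot \rho(R)]^{-\alpha} + B\,D^{-\beta} + E$	Pick $N$, $R$ to maximize $N \cdot \rho(R)$
Storage-limited (data)	$\mathrm{Err}(N, L) \approx \mathrm{Err^*} + A N^{-\alpha} + B L^{-\beta}$	Pick $N$, $L$ with $N \cdot L \leq S$ to minimize $\mathrm{Err}(N, L)$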

6. Hardware and Systems-Level Implementations

Hardware-conscious design within Scale-Then-Compress leverages both compressed data formats and logic optimization:

ASC hardware demonstrates area-efficient endpoint search, interpolation, and threshold-based mapping, achieving 32× throughput increase (to DDR5-6400 bandwidth levels) with only 7.65× area growth. Gate count is ~6135 for the 8-bit and ~17,196 for the 16-bit encoder/decoder, reflecting strong scalability (Yao et al., 2023).
Distributed Training with ScaleCom applies a scaling operation (low-pass residual filtering) before gradient sparsification and compression, maintaining communication overhead proportional to top-k indices (O(k)), permitting 65–400× gradient compression at scale with ≤0.3% accuracy degradation and matching idealized uncompressed SGD convergence (Chen et al., 2021).
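The combination of top-k sparsification with a filtered residual can be sketched for a single worker as follows. This is an illustrative simplification and not the ScaleCom algorithm: it omits the multi-worker index selection entirely, and the particular filter form and the parameter name `beta` are assumptions made for this example.

```python
import numpy as np


def top_k_sparsify(vector, k):
    """Keep the k largest-magnitude entries and zero the rest."""
    idx = np.argpartition(np.abs(vector), -k)[-k:]
    sparse = np.zeros_like(vector)
    sparse[idx] = vector[idx]
    return sparse, idx


def compress_step(grad, memory, k, beta=0.1):
    """One worker step: add residual memory, send top-k, filter what is left.

    The memory update is a low-pass filter over the unsent residual;
    beta = 1.0 reduces it to ordinary error feedback.
    """
    accumulated = memory + grad
    sent, idx = top_k_sparsify(accumulated, k)
    residual = accumulated - sent
    new_memory = (1.0 - beta) * memory + beta * residual
    return sent, idx, new_memory


rng = np.random.default_rng(0)
dim, k = 10_000, 50
memory = np.zeros(dim)
for step in range(3):
    grad = rng.standard_normal(dim)  # stand-in for a real gradient
    sent, idx, memory = compress_step(grad, memory, k)
    print(step, np.count_nonzero(sent), f"compression = {dim / k:.0f}x")
```

Only the `k` selected values and their indices need to be communicated per step, which is where the O(k) communication cost comes from.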

7. Practical Guidelines, Limitations, and Open Challenges

Implementation of the Scale-Then-Compress paradigm follows general guidelines:

Scale model, data, or communication bandwidth as much as resources permit.
Halt training upon reaching the desired loss/validation criterion; compress aggressively for deployment or storage.
In data-constrained learning, optimize bits-per-sample versus sample count against exponents estimated on pilot data.
For model compression, select the compression format with the highest effective capacity (via Gaussian fitting or capacity laws).
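Estimating an exponent from pilot data, as the third guideline requires, can be sketched with a log-log least-squares fit. The sketch assumes the irreducible error is known (or estimated separately) and that the other term of the law is held fixed, so it is absorbed into `err_floor`; the pilot measurements are synthetic, and `fit_power_law` is a name introduced here.

```python
import numpy as np


def fit_power_law(x, err, err_floor=0.0):
    """Fit err ~ err_floor + C * x**(-gamma) by least squares in log-log space.

    Returns (C, gamma). Requires err > err_floor for every measurement.
    """
    log_x = np.log(np.asarray(x, dtype=float))
    log_excess = np.log(np.asarray(err, dtype=float) - err_floor)
    slope, intercept = np.polyfit(log_x, log_excess, 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic pilot runs generated from a known law, to check the fit recovers it.
sample_counts = np.array([1e3, 2e3, 4e3, 8e3, 16e3])
pilot_errors = 0.05 + 2.0 * sample_counts ** (-0.3)

C, gamma = fit_power_law(sample_counts, pilot_errors, err_floor=0.05)
print(f"C = {C:.3f}, gamma = {gamma:.3f}")
```

When the floor is unknown, or both terms vary across pilot runs, a joint nonlinear least-squares fit is needed instead; noisy exponent estimates translate directly into a shifted allocation.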

Limitations include diminishing returns beyond a certain degree of over-parameterization, limited hardware and software support for compressed or low-rank inference, the need for precise estimation of loss exponents, potential modality-dependent effects (e.g., SMoE merging with diverse experts), and the need for compression methods aligned with the learned structure (Li et al., 2023, Panferov et al., 2 Jun 2025, Mentzer et al., 2024).

The paradigm continues to be a central organizing principle for efficient learning, deployment, and transmission in large-scale neural computation and remains an active area of methodological extension.