
Model Compression: Methods and Optimization

Updated 26 February 2026
  • Model compression is a set of techniques that reduce deep learning model sizes and computational demand while maintaining high predictive accuracy.
  • It employs methods such as pruning, quantization, low-rank decompositions, and teacher-student distillation, formalizing compression as constrained optimization problems.
  • These techniques enable efficient deployment on edge devices and enhance robustness and fairness by integrating advanced optimization and rate–distortion theories.

Model compression refers to a family of methods and theoretical frameworks for reducing the inference-time memory, storage, and compute costs of large machine learning models—principally deep neural networks—while maintaining minimal loss in predictive accuracy or other downstream performance targets. Compression is motivated by the increasing parameter counts and computational demands of state-of-the-art models, which challenge deployment in edge, embedded, and production settings. Mechanisms include, but are not limited to, parameter pruning, quantization, low-rank and structured decompositions, teacher–student distillation, weight sharing via hashing or clustering, and lossless entropy coding. Modern work formalizes compression as constrained or multi-objective optimization, often leveraging advances in optimization theory, rate–distortion analysis, information theory, and robust statistics.

1. Foundational Optimization and Theoretical Frameworks

Compression is fundamentally a constrained optimization problem: minimize task loss (e.g., cross-entropy) subject to nontrivial structure or coding constraints on the parameterization (Carreira-Perpiñán, 2017). Let $w \in \mathbb{R}^p$ be the full model parameters, $\theta \in \mathbb{R}^q$ the compressed representation, and $\Theta(\theta)$ the decompression mapping. The canonical problem is

$$\min_{w,\,\theta}\; L(w) \quad \text{s.t.}\quad w = \Theta(\theta)$$

where $L(w)$ is the task loss. This includes as special cases quantization (discrete codebook constraints), low-rank decomposition (manifold constraints), pruning ($\ell_0$ constraints), and combinations thereof. Optimization is typically realized via the augmented Lagrangian and an alternating scheme: (i) "learning" steps updating $w$ under a quadratic penalty toward $\Theta(\theta)$; (ii) projection/compression steps updating $\theta$ to minimize $\|w - \Theta(\theta)\|$ (Carreira-Perpiñán, 2017).
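The alternating scheme above can be sketched in a few lines of NumPy for the pruning special case, where the compression step is projection onto $k$-sparse vectors. This is an illustrative toy on a quadratic loss, not the full learning-compression algorithm; the function name and hyperparameters are assumptions.

```python
import numpy as np

def lc_prune(loss_grad, w0, k, mu=1e-2, steps=50, lr=1e-2, inner=20):
    """Learning-compression alternation (sketch): a quadratic penalty
    couples the free weights w to a k-sparse compressed copy theta."""
    w = w0.copy()
    theta = w0.copy()
    for _ in range(steps):
        # L-step: gradient descent on L(w) + (mu/2)||w - theta||^2
        for _ in range(inner):
            w -= lr * (loss_grad(w) + mu * (w - theta))
        # C-step: project w onto k-sparse vectors (keep largest magnitudes)
        theta = np.zeros_like(w)
        idx = np.argsort(np.abs(w))[-k:]
        theta[idx] = w[idx]
    return theta

# toy quadratic loss L(w) = 0.5*||w - t||^2 with a dense target t
t = np.array([3.0, -0.1, 2.0, 0.05])
theta = lc_prune(lambda w: w - t, w0=np.zeros(4), k=2)
```

The alternation drives `theta` toward a 2-sparse vector keeping the two dominant coordinates of the target, illustrating how the penalty parameter mediates between fitting the loss and satisfying the compression constraint.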

In the rate–distortion framework, one may treat compression as minimizing the mutual information or coding rate $I(w;\hat w)$ for a given allowable distortion $D$, typically measured as accuracy drop or function deviation (Bu et al., 2019, Kampeas et al., 2023). The trade-off between increased empirical risk (distortion) and decreased generalization error (due to reduced overfitting and mutual information) governs the optimal achievable population risk under compression (Bu et al., 2019).

Information-theoretic and rate–distortion optimality has been established, for instance, for single-parameter, rotation-invariant quantization schemes (Rotation-Invariant Quantization, RIQ), where a single global quantizer parameter achieves per-layer optimal mixed-precision, leveraging the spherical symmetry of the quantization error (Kampeas et al., 2023).
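The rate–distortion trade-off can be made concrete by sweeping the bit-width of a simple uniform quantizer and measuring distortion as mean squared reconstruction error. This is an illustration of the trade-off only, not the RIQ scheme; the quantizer and variable names are assumptions.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Uniform scalar quantizer: map w onto 2**bits evenly spaced levels."""
    levels = 2 ** bits
    scale = (w.max() - w.min()) / (levels - 1)
    q = np.round((w - w.min()) / scale)
    return q * scale + w.min()

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)

# Empirical rate-distortion curve: distortion (MSE) falls as rate (bits) grows
curve = {b: float(np.mean((w - quantize_uniform(w, b)) ** 2)) for b in (2, 4, 8)}
```

Plotting such curves per layer is one way to motivate mixed-precision allocation: layers whose curves flatten early tolerate lower bit-widths.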

2. Principal Techniques and Methodological Taxonomy

Pruning

Pruning sets a subset of weights to zero based on importance criteria such as magnitude or Hessian sensitivity. Both unstructured (elementwise) and structured (filter/channel-wise) sparsity patterns arise. Typical masking is defined by a sparsity level $s$ (fraction of zeros) and implemented via thresholding and fine-tuning. Structured multi-hashing further replaces individual weights with hash-based or global low-rank-reduced indices over a much smaller shared parameter pool (Eban et al., 2019).
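A minimal sketch of unstructured magnitude pruning as described above; the function name and threshold rule are illustrative, and real pipelines follow this with fine-tuning.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Unstructured magnitude pruning: zero the smallest-|w| fraction."""
    k = int(np.ceil(sparsity * w.size))
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > thresh
    return w * mask

w = np.array([[0.9, -0.01], [0.02, -1.5]])
pruned = magnitude_prune(w, sparsity=0.5)  # zeros the two smallest weights
```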

Quantization

Quantization reduces parameter bit-width, mapping floating-point weights to a finite codebook, often via uniform, non-uniform, or block-based $k$-means vector quantization. Post-Training Quantization (PTQ) operates on the trained model post hoc, using calibration data to set scale and zero-point. Quantization-Aware Training (QAT) incorporates quantization into training via the straight-through estimator (STE). Rotation-invariant and rate–distortion-optimal quantization methods automatically yield mixed precision per layer, maximizing compression subject to a global performance budget (Kampeas et al., 2023).
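A sketch of the PTQ scale/zero-point mapping, assuming 8-bit asymmetric (affine) quantization to unsigned integers; the function names are illustrative and the calibration range is taken directly from the tensor rather than from held-out data.

```python
import numpy as np

def affine_quantize(w, bits=8):
    """Asymmetric (affine) quantization: q = round(w/scale) + zero_point,
    with scale and zero_point chosen to cover [w.min(), w.max()]."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = int(round(qmin - w.min() / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.array([-0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, s, z = affine_quantize(w)
w_hat = dequantize(q, s, z)   # reconstruction error bounded by scale/2
```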

Low-Rank and Structured Decompositions

Factorizing weight matrices as products of lower-rank factors (e.g., $UV$, or higher-order tensor decompositions) allows for parameter and computational reduction, often with negligible accuracy loss (Carreira-Perpiñán, 2017). Structured multi-hashing unifies hashing and low-rank reductions, globally controlling the model size directly (Eban et al., 2019).
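A truncated-SVD factorization illustrates the parameter saving: storing $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{r \times n}$ costs $r(m+n)$ parameters instead of $mn$. Here the matrix is constructed to be exactly rank 4, so the factorization happens to be lossless; real weight matrices are only approximately low-rank.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Replace W (m x n) with U_r @ V_r, U_r: m x r, V_r: r x n."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]        # absorb singular values into U
    V_r = Vt[:rank]
    return U_r, V_r

rng = np.random.default_rng(0)
# a matrix that is exactly rank 4, so rank-4 factorization is exact
W = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 128))
U_r, V_r = low_rank_factorize(W, rank=4)
params_before = W.size                  # 64 * 128 = 8192
params_after = U_r.size + V_r.size      # 4 * (64 + 128) = 768
```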

Teacher–Student Distillation

Knowledge distillation trains a smaller “student” model to match the outputs (responses, intermediate features, or their relations) of a larger “teacher” network using (softened) MSE, KL-divergence, or optimal transport-based losses (Lohit et al., 2020). For time-series and model ensembling, student–teacher distillation compresses dynamic, heterogeneous ensembles into single surrogate models, drastically reducing memory and compute—at times with improved generalization (Cerqueira et al., 2021).
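The softened-output matching can be written as the standard temperature-scaled KL distillation loss. This is a sketch of that one term; real pipelines typically combine it with a hard-label cross-entropy term, and the temperature value here is an arbitrary choice.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened outputs,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T ** 2 * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

t = np.array([[5.0, 1.0, -2.0]])
loss_match = distillation_loss(t, t)    # identical logits -> zero loss
loss_far = distillation_loss(-t, t)     # disagreeing student -> positive loss
```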

Hashing, Clustering, and Codebook Methods

Advanced techniques use block-wise weight clustering and codebooks (e.g., ClusComp) (Liao et al., 17 Mar 2025) or product quantization with gradient-weighted $k$-means (Sakthi et al., 2022) to push compression below 2 bits/parameter while supporting downstream (LoRA-style) parameter-efficient finetuning. Hyper-Compression replaces weight storage entirely with low-dimensional dynamical system codes (hyperfunctions), achieving high compression ratios without architectural alteration or retraining (Fan et al., 2024).
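A toy version of codebook-based weight sharing via plain $k$-means on scalar weights: the model stores a small codebook plus a low-bit index per weight. The gradient weighting and blockwise calibration of the cited methods are omitted, and this simple Lloyd-style loop is an assumption for illustration.

```python
import numpy as np

def kmeans_codebook(w, k, iters=25, seed=0):
    """Weight sharing via k-means: store a k-entry codebook plus
    per-weight indices (log2(k) bits each) instead of full floats."""
    rng = np.random.default_rng(seed)
    codebook = rng.choice(w, size=k, replace=False)
    for _ in range(iters):
        # assign each weight to its nearest codeword, then update centroids
        assign = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                codebook[j] = w[assign == j].mean()
    return codebook, assign

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
codebook, idx = kmeans_codebook(w, k=16)   # 4-bit index per weight
w_hat = codebook[idx]                      # decompressed weights
```

With a 16-entry codebook, each weight costs 4 index bits plus an amortized share of the codebook, versus 32 bits for a float, roughly an 8× reduction before entropy coding.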

3. Unified and Automated Compression Pipelines

Recent advances emphasize joint or automated compression pipelines rather than isolated techniques. Notably:

  • Unified Constrained Optimization: ATMC enforces pruning, quantization, and factorization jointly under a global budget, solving the resulting min–max objective (accuracy + adversarial robustness) via an ADMM-based method alternating between adversarial training (PGD), sparsity projection, quantized $k$-means clustering, and dual updates (Gui et al., 2019).
  • Automated Policy Search: AMC frames compression as sequential decision-making (per-layer ratio selection) via reinforcement learning, efficiently exploring the per-layer sparsity/precision trade space subject to global accuracy or resource constraints (He et al., 2018).
  • Regularization via Intermittent Compression: DeepTwist inserts periodic weight distortion (pruning/quantization/low-rank projections) as structured “noise,” improving both compression and generalization (Lee et al., 2018).

Recent frameworks enable practitioner-tunable trade-offs via a small number of global hyperparameters (e.g., total sparsity, global codebook size, a distortion interval), with practitioners sweeping these to locate desired size/accuracy/budget frontiers (Lee et al., 2018, Kampeas et al., 2023).
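Such a sweep can be mimicked on a toy least-squares model, treating global sparsity as the single knob and recording the resulting size/accuracy frontier. This is entirely illustrative; the data, model, and sparsity grid are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = rng.normal(size=50)
y = X @ w_true

# Fit once, then sweep a single global sparsity hyperparameter and
# record (sparsity, nonzero count, error) points along the frontier.
w = np.linalg.lstsq(X, y, rcond=None)[0]
frontier = []
for s in (0.0, 0.5, 0.8, 0.95):
    k = int(np.ceil(s * w.size))
    w_s = w.copy()
    if k:
        drop = np.argsort(np.abs(w))[:k]   # magnitude pruning at level s
        w_s[drop] = 0.0
    mse = float(np.mean((y - X @ w_s) ** 2))
    frontier.append((s, int(np.count_nonzero(w_s)), mse))
```

Sorting by error reveals the expected monotone trade-off: more aggressive global sparsity buys smaller models at increasing accuracy cost, and a practitioner picks the knee of this curve.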

4. Practical Impact, Benchmarks, and Compression Ratios

Model compression achieves practical reductions of ~10×–50× in model size for vision models and LLMs with ≤1% accuracy loss using combinations of pruning, quantization, and low-rank factorization (Ishtiaq et al., 2021, Kampeas et al., 2023). For example:

  • VGG-16 on ImageNet: 19.4× compression with <0.4% accuracy drop under RIQ (Kampeas et al., 2023).
  • ResNet-32 (CIFAR-10): 75% variable reduction with no loss in test accuracy using structured multi-hashing (Eban et al., 2019).
  • LLaMA2-7B: 2.6×–32× compression (Hyper-Compression, ClusComp) with ~1% drop or less, without retraining (Liao et al., 17 Mar 2025, Fan et al., 2024).
  • On-device/MCU deployment: 12×–20× reduction in model size and memory footprint and ~2× inference speedup with pruned and quantized sparse representations (Dogan et al., 2021).

Empirical results underscore the necessity of post-compression fine-tuning in many regimes, particularly at high compression ratios or when accuracy loss must be tightly bounded. Hybrid pipelines (distillation → pruning, codebook-based quantization + blockwise finetuning, double compression with lossless encoding) are state of the art in LLMs and mobile deployment (Wang et al., 21 Feb 2025, Liao et al., 17 Mar 2025, Fan et al., 2024).

5. Compression, Robustness, and Fairness

Compression can impact adversarial robustness, fairness, and per-class accuracy. Robust-optimized frameworks such as ATMC demonstrate that simultaneous adversarial training and compression yields substantially better size–robustness trade-offs than sequential or greedy approaches, regaining >90% adversarial accuracy at 16× compression on ResNet34/CIFAR-10 (Gui et al., 2019).

Fairness concerns are domain-dependent: pruning may disproportionately harm under-represented subgroups unless sensitivity is controlled groupwise; quantization and QAT are generally more equitable if calibration data are balanced; KD can propagate or mitigate biases depending on teacher properties (Caldeira et al., 2024).

6. Future Directions and Open Problems

Key directions identified in contemporary research include:

  • Data-Driven and Learnable Quantization: End-to-end optimization of quantizer thresholds or nonuniform codebooks, learning bit-allocation per layer or per block (Liao et al., 17 Mar 2025, Sakthi et al., 2022).
  • Automated Mixed-Precision and Neural Architecture Search: NAS and RL for allocating sparsity/bitwidth under global resource–accuracy constraints (He et al., 2018, Kampeas et al., 2023).
  • Compression and Robustness Co-Optimization: Unified pipelines considering adversarial accuracy, not only clean metrics (Gui et al., 2019).
  • Hyperfunction and Implicit Compression: Dynamic-system–based parameter coding and its scaling to even larger, multimodal networks (Fan et al., 2024).
  • Fairness-Constrained Compression: Incorporation of equity metrics and bias mitigation into hyperparameter search and pruning/quantization criteria (Caldeira et al., 2024).
  • Plug-and-Play/Zero-Retraining Methods: Train-free mechanisms (e.g., hyperfunction coding, rotation-invariant quantization, blockwise clustering with blockwise calibration) that deliver high compression with negligible or no accuracy loss and minimal pipeline friction (Kampeas et al., 2023, Fan et al., 2024, Liao et al., 17 Mar 2025).

7. Representative Summary Table

| Technique/Class | Typical Compression Ratio | Retraining Needed? | Comments/Scope |
| --- | --- | --- | --- |
| Unstructured pruning | 2–30× | Yes | Magnitude-, Hessian-, or sensitivity-based (Carreira-Perpiñán, 2017) |
| Structured pruning | 2–10× | Yes | Hardware friendly; filter/channel level; may drop accuracy |
| Uniform quantization | 4–32× | Light fine-tune | Per-layer or mixed precision; post- or pre-training |
| Teacher–student KD | 1–10× | Yes | Output/feature matching; can improve student accuracy |
| Low-rank factorization | 2–10× | Partial | SVD/Tucker/tensor; best for large FC/conv matrices |
| Codebook/clustering | 8–40× | Blockwise | Product, k-means, blockwise clustering and recovery (Liao et al., 17 Mar 2025) |
| Hashing/SMH | 4–20× | Yes | Entire model in low-rank factorizations; global control |
| Hyperfunction coding | 3–30× | No | Dynamic-system code; retraining-free plug-in (Fan et al., 2024) |
| Robust ADMM (ATMC) | 4–20× | Yes (joint) | Unified robust, compressed min–max objective (Gui et al., 2019) |

Margin selection, sensitivity budgeting, and post-compression assessment should be guided by empirical trade-off curves and, where available, new composite quality metrics that jointly score accuracy, size, and resource use (Ishtiaq et al., 2021).


Compression is thus a highly active area at the intersection of optimization, information theory, algorithm engineering, and deployment-aware modeling. Its methodologies anchor the practical translation of modern ML research into robust, efficient, and fair systems.
