
Pruning & Quantization in Neural Networks

Updated 9 March 2026
  • Parameter pruning and quantization are key compression techniques that reduce model size by removing redundant parameters and lowering numerical precision.
  • These methods improve efficiency by decreasing memory usage and computational demands, enabling faster and more energy-efficient model deployment.
  • Advanced strategies combine differentiable end-to-end optimization, Bayesian methods, and hardware-aware planning to balance accuracy with resource savings.

Parameter pruning and quantization are two foundational approaches for neural network model compression, targeting reductions in memory footprint, computational cost, and energy consumption while preserving task accuracy. Pruning removes redundant or less important weights or structural elements, whereas quantization replaces high-precision numerical representations with lower-bit equivalents or discrete codebooks. These techniques can be applied independently or jointly, and recent research has focused on their integrated use, theoretical guarantees, principled trade-offs, optimization schemes, and their specific implications for both classical and modern deep architectures.

1. Fundamental Concepts and Techniques

Pruning in neural networks refers to the removal of parameters (weights, neurons, filters, attention heads, or channels) deemed unnecessary for the task. Classic magnitude-based thresholding prunes weights whose absolute values are below a layer- or element-specific threshold $\tau$, often discovered via sensitivity scans or as part of an iterative retrain-prune loop. Structured pruning removes entire channels or filters, while unstructured pruning operates at the fine-grained weight level (Paupamah et al., 2020, Guerra et al., 2020, Zhou et al., 2024).
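The two granularities can be illustrated with a minimal NumPy sketch. The function names, the fixed threshold, and the use of per-filter L1 norms as the importance score are assumptions made for illustration; the cited works use their own criteria and schedules.

```python
import numpy as np

def magnitude_prune(weights, tau):
    """Unstructured pruning: zero every weight whose magnitude is below tau."""
    mask = np.abs(weights) >= tau
    return weights * mask, mask

def prune_filters(conv_weights, keep_ratio):
    """Structured pruning: keep the output filters with the largest L1 norm.

    conv_weights has shape (out_channels, in_channels, kh, kw).
    The L1-norm score is an illustrative assumption, not a cited criterion.
    """
    norms = np.abs(conv_weights).sum(axis=(1, 2, 3))
    n_keep = max(1, int(round(keep_ratio * conv_weights.shape[0])))
    keep = np.sort(np.argsort(norms)[-n_keep:])  # indices of surviving filters
    return conv_weights[keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))

pruned, mask = magnitude_prune(w, tau=0.5)   # same shape, many zeros
sparsity = 1.0 - mask.mean()                 # fraction of weights removed

smaller, kept = prune_filters(w, keep_ratio=0.5)  # physically smaller tensor
```

Unstructured pruning keeps the tensor shape and relies on sparse storage or kernels to realize savings, whereas the structured variant returns a smaller dense tensor that standard dense kernels can use directly.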

Quantization replaces floating-point weights and/or activations with discrete values (e.g., 8-bit integers, 2-bit values, powers of two). Quantizers can be uniform, nonuniform, scalar, vector, or even power-of-two (to exploit shift-add hardware) (Ardakani et al., 2022, Zeinali et al., 28 Jan 2026, Makenali et al., 4 Sep 2025). Quantization-aware training (QAT) integrates quantization effects during training via straight-through estimators, while post-training quantization applies discretization after floating-point training has converged.
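As a rough illustration of two of the quantizer families mentioned above, the sketch below implements a symmetric uniform quantizer and a power-of-two quantizer in NumPy. The per-tensor max-abs scaling, the exponent range, and all function names are assumptions chosen for brevity; this is a post-training-style example and does not show QAT or straight-through gradients.

```python
import numpy as np

def quantize_uniform(x, num_bits=8):
    """Symmetric uniform quantizer: returns integer codes and the scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return codes.astype(np.float64) * scale

def quantize_power_of_two(x, min_exp=-8, max_exp=0):
    """Round each nonzero magnitude to the nearest power of two (in the log domain)."""
    sign = np.sign(x)
    mag = np.abs(x)
    safe = np.where(mag > 0, mag, 1.0)  # avoid log2(0); zeros are restored below
    exp = np.clip(np.round(np.log2(safe)), min_exp, max_exp)
    return np.where(mag > 0, sign * 2.0 ** exp, 0.0)

x = np.array([0.0, 0.3, -0.7, 1.0])
codes, scale = quantize_uniform(x, num_bits=8)
recon = dequantize(codes, scale)   # within scale / 2 of x
pow2 = quantize_power_of_two(x)    # values are 0 or +/- 2**k
```

With power-of-two values, a multiplication by a weight can be replaced by a bit shift, which is the hardware motivation mentioned in the text.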

Modern frameworks increasingly blend pruning and quantization, leveraging their complementary benefits. Joint schemes include differentiable formulations in which both sparsity (pruning mask or regularizer) and quantization parameters (bit-widths, codebooks, step sizes) are learned or optimized end-to-end with task loss (Wang et al., 2020, Wenshøj et al., 15 Dec 2025, Zandonati et al., 2023).

2. Joint Pruning and Quantization: Algorithmic Architectures

Recent works highlight several principled methods for combining pruning and quantization.

  • Analytic (One-Shot) Approaches: "OPQ" (Hu et al., 2022) analytically solves for optimal per-layer pruning thresholds and quantization steps directly from full-precision pretrained weights, assuming parametric weight distributions (e.g., Laplacian). No iterative mask/codebook search is needed; all decisions are made pre-finetuning. This approach enables rapid deployment and efficient resource allocation across layers.
  • Differentiable End-to-End Compression: "CoDeQ" (Wenshøj et al., 15 Dec 2025) parameterizes the dead-zone width of a scalar quantizer, directly mapping low-magnitude values to zero (thus implementing pruning). Both the dead-zone and quantization parameters are learned via backpropagation. This eliminates the need for outer-loop hyperparameter search and supports both fixed and mixed-precision regimes.
  • Bayesian and Information-Geometric Schemes: "QPruner" (Zhou et al., 2024) and "FITCompress" (Zandonati et al., 2023) employ Bayesian optimization and Fisher Information Trace metrics, respectively, to identify layer-wise pruning and quantization allocations under memory or BOP constraints. These methods optimize a task-specific objective subject to compression budgets by exploring the Pareto front.
  • Variational Pruning and Mixed-Bit Quantization: "DJPQ" (Wang et al., 2020) frames the compression process as a joint gradient-based optimization problem, incorporating a variational information bottleneck for structured pruning and a parameterized quantization complexity regularizer (e.g., bit-operations per layer).
  • Stochastic Path-Following and Theoretical Error Guarantees: Unified frameworks cast post-training quantization and pruning as a stochastic path-following problem, providing rigorous $O(\sqrt{\log N})$ error bounds for both operations and their composition (Zhang et al., 2024, Li et al., 2020).

These approaches diverge in aspects such as granularity (unstructured vs. structured), optimization targets (fixed vs. mixed-precision), and regularization (explicit $\ell_0$, information bottleneck, or dead-zone penalties).
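The idea that a single quantizer can also prune, as in the dead-zone formulation described above, can be sketched as follows. This is a hand-written toy with fixed parameters, not the CoDeQ implementation: in the differentiable setting both `step` and `dead_zone` (hypothetical names) would be learned by backpropagation rather than set by hand.

```python
import numpy as np

def dead_zone_quantize(w, step, dead_zone):
    """Scalar quantizer with a dead zone.

    Weights with |w| <= dead_zone map to exactly zero (pruning);
    all other weights are rounded to the nearest multiple of step.
    """
    rounded = np.round(w / step) * step
    return np.where(np.abs(w) <= dead_zone, 0.0, rounded)

w = np.array([-0.9, -0.2, 0.05, 0.3, 0.74])
q = dead_zone_quantize(w, step=0.25, dead_zone=0.25)
sparsity = float(np.mean(q == 0.0))  # fraction of weights pruned by the dead zone
```

Widening the dead zone raises sparsity without changing the grid of representable nonzero values, which is why one parameter can control pruning while another controls precision.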

3. Empirical Trade-offs and Performance Metrics

The application of pruning and quantization yields significant compression and acceleration, but the resulting accuracy-resource trade-offs are architecture-dependent. Table 1 summarizes exemplar results from the literature:

| Method | Architecture | Compression Ratio | Bit-width | Top-1 Drop | Reference |
|---|---|---|---|---|---|
| OPQ | ResNet-50 (ImageNet) | 38× | 3.25b | –0.40% | (Hu et al., 2022) |
| CoDeQ | ResNet-18 (ImageNet) | 20–23× | 4b/mixed | –0.5% | (Wenshøj et al., 15 Dec 2025) |
| DJPQ | ResNet-18 (ImageNet) | 53× (BOP) | mixed | –0.47% | (Wang et al., 2020) |
| CompSRT | SwinIR-light | 9.4× | 4b + 40% | –0.04dB* | (Zeinali et al., 28 Jan 2026) |
| QPruner | LLaMA-7B | ~30% (mem) | 4/8b | +6pp** | (Zhou et al., 2024) |

* For PSNR in super resolution; ** Over previous baseline in benchmark accuracy.
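To make the relationship between sparsity, bit-width, and compression ratio concrete, here is an idealized back-of-the-envelope calculation. The formula and the `index_bits` overhead parameter are simplifying assumptions for illustration; the ratios reported in the table were measured by the respective authors under their own accounting (for example, BOPs rather than storage) and are not reproduced by this function.

```python
def compression_ratio(sparsity, bits, baseline_bits=32, index_bits=0):
    """Idealized storage ratio of a dense baseline to a pruned, quantized model.

    sparsity   : fraction of weights removed, in [0, 1)
    bits       : bits per surviving weight
    index_bits : assumed extra bits per surviving weight to store its position
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    kept_fraction = 1.0 - sparsity
    return baseline_bits / (kept_fraction * (bits + index_bits))

dense_int8 = compression_ratio(sparsity=0.0, bits=8)                   # 4x
sparse_int4 = compression_ratio(sparsity=0.5, bits=4)                  # 16x
with_indices = compression_ratio(sparsity=0.5, bits=4, index_bits=4)   # 8x
```

The last case shows why unstructured sparsity often delivers less than its nominal ratio: position metadata for surviving weights eats into the savings.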

Key findings include:

4. Theoretical Analysis and Error Guarantees

Theoretical advances address the worst-case and average-case error incurred by pruning and quantization.

  • Worst-Case SDP Certificates: Robust semi-definite programming approaches yield network- and input-dependent upper bounds on output error as a quadratic form in inputs, parameterized by network depth, activation slopes, quantizer step, and pruning pattern (Li et al., 2020). These are practically useful for certifying models in safety-critical settings.
  • Strong Lottery Ticket Hypothesis (SLTH): SLTH theory extends to finite-precision networks, showing that with sufficient overparameterization (width $O(d \log \frac{1}{\delta})$, where $\delta$ is the target bit precision), any target quantized network can, in principle, be realized exactly by pruning a larger quantized network, with no further gradient-based tuning required (Kumar et al., 14 Aug 2025).
  • Stochastic Path-Following: For post-training compression, the error between the output of the original and pruned-quantized model can be bounded polylogarithmically in model size, with the scaling constants determined by quantization range, sparsity, and an error-correction scaling factor (Zhang et al., 2024).

Together, these results formalize the trade-offs between resource reduction and representation error for modern compression schemes.
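Stochastic analyses of this kind typically build on randomized rounding that is unbiased in expectation. The sketch below shows only that building block, under the assumption of a uniform grid with spacing `step`; it is not the path-following algorithm from the cited work, and the function name is hypothetical.

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to a multiple of step, rounding up with probability equal to
    the fractional position between the two neighbouring grid points.

    The result is unbiased: its expectation equals x.
    """
    scaled = np.asarray(x, dtype=np.float64) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower
    round_up = rng.random(scaled.shape) < prob_up
    return (lower + round_up) * step

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(200_000, 0.3), step=1.0, rng=rng)
empirical_mean = samples.mean()  # close to 0.3, although every sample is 0 or 1
```

Because individual rounding errors have zero mean, they tend to cancel when accumulated across many weights, which is the intuition behind error bounds that grow slowly with model size.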

5. Optimization Schedules, Non-Commutativity, and Best Practices

Empirical studies indicate that the order and integration of pruning and quantization strongly affect performance. The paper "Training Deep Neural Networks with Joint Quantization and Pruning" (Zhang et al., 2021) formulates the "Non-Commutativity Hypothesis": the order in which pruning and quantization are introduced (prune-then-quantize vs. quantize-then-prune) leads to different local minima and therefore to different accuracy-compression trade-offs.

  • For discriminative tasks (classification, detection), prune-then-quantize generally yields the best accuracy per memory footprint.
  • For generative tasks (super-resolution, GANs), quantize-then-prune is superior.
  • Jointly optimizing pruning thresholds and quantizer step-sizes, preferably in an end-to-end differentiable loop, consistently achieves better compression-accuracy Pareto trade-offs than pipelined/sequential schemes (Wenshøj et al., 15 Dec 2025, Zandonati et al., 2023).
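A toy example shows why the two operators do not commute even outside of training. This is a post-hoc illustration on a fixed weight vector with hand-picked `tau` and `step` values (both assumptions); the hypothesis in the cited paper concerns the order in which the operations are introduced during training, which this sketch does not model.

```python
import numpy as np

def prune(w, tau):
    """Zero every weight with magnitude below tau."""
    return np.where(np.abs(w) < tau, 0.0, w)

def quantize(w, step):
    """Round every weight to the nearest multiple of step."""
    return np.round(w / step) * step

w = np.array([0.28, -0.26, 0.10, 0.80])
tau, step = 0.3, 0.5

prune_then_quantize = quantize(prune(w, tau), step)
quantize_then_prune = prune(quantize(w, step), tau)

orders_differ = not np.array_equal(prune_then_quantize, quantize_then_prune)
```

Here the weights 0.28 and -0.26 fall below the pruning threshold when pruned first, but quantizing first rounds them up to magnitude 0.5, which then survives pruning. The two orders therefore produce different sparsity patterns from identical inputs.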

Critical practices include:

  • Calibration and sensitivity scanning for per-layer pruning thresholds (Paupamah et al., 2020).
  • Use of adaptive, data-driven quantization intervals—such as standard deviation-based scaling and dynamic clipping (Ardakani et al., 2022).
  • Bayesian or Fisher-geometric planning for layer-wise allocation in large, diverse networks (Zandonati et al., 2023, Zhou et al., 2024).
  • For high-sparsity regimes or extremely low-precision (e.g., binarization), bias correction and special handling of pathological cases (e.g., batchnorm foldout, zero-variance channels) are necessary for stability (Nishikawa et al., 2020).
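The first practice, sensitivity scanning for a per-layer threshold, can be sketched for a single linear layer as follows. The relative output error on a calibration batch is used as a stand-in sensitivity metric, and the helper names and error budget are assumptions; in practice the scan would usually measure task accuracy on held-out data.

```python
import numpy as np

def sensitivity_scan(weights, calib_inputs, thresholds):
    """For each tau, prune |w| < tau and record (tau, sparsity, relative error)."""
    reference = calib_inputs @ weights.T
    ref_norm = np.linalg.norm(reference)
    results = []
    for tau in thresholds:
        pruned = np.where(np.abs(weights) < tau, 0.0, weights)
        err = np.linalg.norm(calib_inputs @ pruned.T - reference) / ref_norm
        sparsity = float(np.mean(pruned == 0.0))
        results.append((float(tau), sparsity, float(err)))
    return results

def pick_threshold(results, max_error):
    """Return the entry with the largest tau whose error stays within budget."""
    within_budget = [r for r in results if r[2] <= max_error]
    return max(within_budget, key=lambda r: r[0]) if within_budget else None

rng = np.random.default_rng(0)
layer_weights = rng.normal(size=(16, 32))   # (out_features, in_features)
calib_batch = rng.normal(size=(64, 32))     # calibration inputs

scan = sensitivity_scan(layer_weights, calib_batch, [0.0, 0.1, 0.5, 1.0, 5.0])
choice = pick_threshold(scan, max_error=0.2)
```

Running the scan independently per layer yields a threshold for each layer, so that sensitive layers are pruned less aggressively than robust ones.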

6. Hardware, Energy, and Deployment Implications

Model compression via pruning and quantization has direct implications for system deployment.

7. Open Challenges and Future Directions

  • Adaptive, architecture-aware allocation of pruning and quantization (e.g., in ViTs, transformers, and LLMs) remains a topic of ongoing research (Zhou et al., 2024, Zheng et al., 25 Jan 2026).
  • Theoretical questions include the development of tighter error bounds, characterization of the optimality of path-planning heuristics, and closed-form solutions for structured pruning in quantized regimes (Kumar et al., 14 Aug 2025, Zhang et al., 2024).
  • Extensions to activation pruning/quantization, dynamic runtime compression, task adaptation (PEFT/LoRA), and deployment under strict latency, accuracy, or fairness constraints are ongoing (Zhou et al., 2024, Schaefer et al., 2023).
  • Practical challenges remain in bridging the gap between idealized sparsity or bit-width reductions and realized speedups on real hardware, due to limitations of memory, bandwidth, and software support (Hacene et al., 2018, Schaefer et al., 2023).

In summary, parameter pruning and quantization, when designed and tuned jointly, offer highly effective neural network compression while maintaining competitive performance. This field continues to advance along both theoretical and engineering axes, yielding frameworks that approach fundamental efficiency limits without substantial accuracy compromise (Wenshøj et al., 15 Dec 2025, Zandonati et al., 2023, Hu et al., 2022, Zhou et al., 2024).
