One-Shot Global Pruning Strategy
- One-Shot Global Pruning is a method that removes a fraction of network parameters globally in a single step based on importance scores.
- It utilizes diverse criteria—magnitude, first-order, and second-order—to assess parameter importance and allocate sparsity adaptively across layers.
- The strategy offers computational efficiency with minimal retraining by incorporating recovery techniques like fine-tuning and BatchNorm recalibration.
A one-shot global pruning strategy refers to the systematic removal of a fraction of a deep neural network's parameters or structures (such as weights, filters, or channels) in a single global step, using a data- or model-informed global criterion, followed by an optional recovery phase (such as fine-tuning or reconstruction). Unlike traditional iterative layer-wise pruning, one-shot global approaches make all pruning decisions simultaneously based on global importance scores. This achieves substantial reductions in model size or computation with minimal retraining, and often with smaller accuracy degradation at moderate sparsity levels. The paradigm is supported by a diverse body of work spanning criteria design, optimization frameworks, recovery techniques, and empirical performance benchmarking (Wang et al., 2019, Li et al., 2019, Janusz et al., 19 Aug 2025, Lucas et al., 27 Nov 2024).
1. Formal Definition and Problem Setting
In one-shot global pruning, the central goal is to obtain a compact, accelerated version of a (typically pretrained) deep neural network by eliminating a user-specified fraction of the overall parameters, channels, or FLOPs while preserving task accuracy. Formally, given a network with $N$ parameters $\{w_i\}_{i=1}^{N}$ (weights, filters, or other structural units), one aims to retain the top $1-s$ fraction according to a global importance score $I_i$, where $s \in (0,1)$ is the target sparsity (Janusz et al., 19 Aug 2025). The threshold $\tau$ is computed across the entire network such that exactly $s \cdot N$ scores fall below $\tau$, and a global binary mask $m \in \{0,1\}^N$ with $m_i = \mathbf{1}[I_i \ge \tau]$ is applied in one step to induce sparsity.
The global nature of the threshold means that pruning does not enforce fixed per-layer ratios a priori; instead, the relative importance of parameters across all layers is considered, potentially yielding adaptive, data-driven layer-wise prune rates (Wang et al., 2019).
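To make this concrete, the following is a minimal sketch (PyTorch, assuming plain magnitude scores; the helper name `one_shot_global_prune` and the restriction to conv/linear weights are illustrative choices, not a specific paper's implementation) of computing one network-wide threshold and applying the resulting binary mask in a single step:

```python
import torch
import torch.nn as nn

def one_shot_global_prune(model: nn.Module, sparsity: float):
    """Zero out the globally lowest-scoring weights in a single step.

    Plain weight magnitudes serve as the importance score I_i here; any
    global score could be substituted. Returns the binary masks so they
    can be re-applied during an optional recovery phase.
    """
    # Collect every prunable weight tensor (conv / linear layers only).
    params = [m.weight for m in model.modules()
              if isinstance(m, (nn.Conv2d, nn.Linear))]
    all_scores = torch.cat([p.detach().abs().flatten() for p in params])

    # Single network-wide threshold tau at the s-quantile of all scores.
    tau = torch.quantile(all_scores, sparsity)

    masks = []
    with torch.no_grad():
        for p in params:
            mask = (p.detach().abs() >= tau).float()  # keep top (1 - s) fraction
            p.mul_(mask)                              # one-step mask application
            masks.append(mask)
    return masks
```

The masks are returned so they can be re-applied during recovery; for very large networks the quantile can be approximated on a subsample of the pooled scores.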
2. Pruning Criteria and Score Computation
One-shot global strategies rely on a range of scoring mechanisms to assign importance:
- Magnitude-based: $|w_i|$ (or the $\ell_1$/$\ell_2$ norm for filters/channels), widely used for computational efficiency (Li et al., 2019, Janusz et al., 19 Aug 2025).
- First-order Taylor: $|w_i \cdot \partial \mathcal{L} / \partial w_i|$ (weight times gradient of the loss), reflecting sensitivity (Janusz et al., 19 Aug 2025).
- Second-order (Hessian/OBS): $\tfrac{1}{2} H_{ii} w_i^2$ (or $w_i^2 / (2[H^{-1}]_{ii})$ in OBS-style formulations), where $H_{ii}$ is the Hessian diagonal; robust but more expensive (Lucas et al., 27 Nov 2024).
- Learned importance: Auxiliary parameters (e.g., scaling vectors) optimized via an auxiliary sparsity-regularized loss to encode global significance (Wang et al., 2019).
- Data- and task-aware enhancements: Use of discriminative data patches, knowledge distillation, or cross-lingual activation statistics for multilingual models to bias pruning toward functionally important subnets (Yang et al., 2022, Alim et al., 20 Nov 2025, Choenni et al., 27 May 2025).
A single scoring run, usually on a pretrained or well-initialized model, is performed prior to pruning; a universal threshold $\tau$ is then determined such that the desired global sparsity is achieved.
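For illustration, a hedged sketch of how two of these criteria could be computed in a single scoring pass (PyTorch; the classification loss, calibration batch, and helper name `compute_scores` are assumptions rather than a published implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def compute_scores(model: nn.Module, batch, criterion: str = "taylor"):
    """Per-parameter importance scores for each weight tensor.

    "magnitude" uses |w|; "taylor" uses |w * dL/dw|, with the gradient
    estimated from a single calibration batch (classification assumed).
    """
    inputs, targets = batch
    if criterion == "taylor":
        model.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()

    scores = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                       # skip biases / norm parameters
            continue
        if criterion == "magnitude":
            scores[name] = p.detach().abs()
        else:                                 # first-order Taylor
            scores[name] = (p.detach() * p.grad.detach()).abs()
    return scores
```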
3. Global Thresholding, Allocation, and Pruning Procedures
Following score computation, the global threshold $\tau$ enforcing the target sparsity $s$ is selected. For unstructured schemes, this is done via quantile selection across all parameters (Janusz et al., 19 Aug 2025). For (semi-)structured strategies (e.g., filter/channel, block-wise, or group-wise sparsity), the threshold is applied collectively to group scores, or via group-level importance measures (Li et al., 2019, Chen et al., 2021, Lim et al., 6 Feb 2025).
The pruning operation is typically executed as a mask operation (setting $w_i \leftarrow 0$ for every parameter with $I_i < \tau$) or by explicitly removing filters/groups from the architecture (Wang et al., 2019).
Adaptive per-layer pruning rates arise naturally from global thresholding: layers whose parameters are less important globally lose more channels or filters, while more critical layers are largely preserved. Some approaches, such as ADMM-based methods or sensitivity-guided pruning, enforce further constraints to distribute sparsity either strictly (fixed per-layer budgets) or adaptively according to sensitivity metrics (Li et al., 2019, Irigoyen et al., 11 Nov 2025).
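A corresponding sketch for the structured (filter-level) case, under the same illustrative assumptions: each convolutional filter is scored by its $\ell_1$ norm, one threshold is taken over all filter scores in the network, and the per-layer prune rates implied by that single global decision are reported.

```python
import torch
import torch.nn as nn

def global_filter_prune_rates(model: nn.Module, sparsity: float):
    """Score each conv filter by its L1 norm, threshold all filter scores
    jointly, and report the per-layer prune rates implied by the single
    global decision."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    # One score per output filter (group) in every conv layer.
    group_scores = [m.weight.detach().abs().sum(dim=(1, 2, 3)) for m in convs]

    # Global threshold over the pooled filter scores.
    tau = torch.quantile(torch.cat(group_scores), sparsity)

    # Adaptive per-layer rates emerge from the shared threshold.
    return {f"conv_{i}": float((s < tau).float().mean())
            for i, s in enumerate(group_scores)}
```

No per-layer budget is set by hand; a layer whose filters score low relative to the rest of the network simply ends up with a higher prune rate.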
4. Post-Pruning Recovery and Fine-Tuning
While some methods achieve acceptable accuracy with zero retraining (e.g., when using zero-invariant parameter groups or sensitivity-informed allocations (Chen et al., 2021, Irigoyen et al., 11 Nov 2025)), most one-shot global pruning frameworks incorporate a brief recovery phase. This can include:
- Standard fine-tuning: Short cross-entropy retraining with mask enforcement; early stopping or patience-based retraining is strongly advocated for robust recovery (Janusz et al., 19 Aug 2025). A minimal sketch of this option follows the list.
- Global or layer-wise reconstruction: Specialized objectives targeting the restoration of intermediate representations (e.g., KL- or JS-divergence in critical layers) or global nonlinear reconstruction using Hessian-free Newton updates (Wang et al., 2019, Lucas et al., 27 Nov 2024).
- BatchNorm recalibration: REFLOW's strategy of updating post-pruning activation statistics yields substantial recovery at high sparsity, countering signal collapse (Saikumar et al., 18 Feb 2025); a generic recalibration sketch also follows the list.
- Knowledge distillation: Teacher-guided loss signals for both importance ranking and retraining, especially at extreme sparsities (Alim et al., 20 Nov 2025).
- No fine-tuning: Certain structured regimes (e.g., Only-Train-Once, sensitivity-aware allocations) guarantee functional equivalence and require no retraining (Chen et al., 2021, Irigoyen et al., 11 Nov 2025).
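Below is a minimal sketch of the standard fine-tuning option, assuming the masks returned by the `one_shot_global_prune` sketch above, a classification loss, and placeholder hyperparameters:

```python
import torch
import torch.nn as nn

def finetune_with_masks(model, masks, loader, epochs=1, lr=1e-3):
    """Brief recovery fine-tuning that keeps pruned weights at zero by
    re-applying the one-shot masks after every optimizer step.

    `masks` is the list returned by the one_shot_global_prune sketch
    (same module ordering); `loader` yields (inputs, targets) pairs.
    """
    params = [m.weight for m in model.modules()
              if isinstance(m, (nn.Conv2d, nn.Linear))]
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss_fn(model(inputs), targets).backward()
            optimizer.step()
            # Mask enforcement: pruned weights stay exactly zero.
            with torch.no_grad():
                for p, mask in zip(params, masks):
                    p.mul_(mask)
    return model
```

And a generic BatchNorm recalibration sketch (a plain statistics refresh on a calibration set, not the specific REFLOW procedure):

```python
import torch

@torch.no_grad()
def recalibrate_batchnorm(model, calib_loader, num_batches=100):
    """Refresh BatchNorm running statistics on the pruned network by
    forwarding a small calibration set in training mode, without any
    weight updates."""
    model.train()
    for i, (inputs, _) in enumerate(calib_loader):
        if i >= num_batches:
            break
        model(inputs)   # BN layers update running mean/var in train mode
    model.eval()
    return model
```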
5. Theoretical and Practical Motivations for the Global Strategy
Empirical and theoretical analyses suggest distinct advantages for the one-shot global paradigm:
- Captures cross-layer dependencies: A global score and threshold incorporate cross-layer interactions, avoiding suboptimal local decisions that can impair downstream information flow (Wang et al., 2019, Lucas et al., 27 Nov 2024).
- Efficiency: Requires only a single importance computation and single prune-recover cycle, dramatically reducing wall-time and computational resource requirements compared to multi-step iterative strategies (Janusz et al., 19 Aug 2025, Li et al., 2019).
- Sparsity/accuracy trade-off: Global strategies achieve minimal loss up to moderate sparsities (∼80%) and are competitive or superior at lower ratios; at very high sparsity (≥90%), iterative schemes tend to better preserve accuracy (Janusz et al., 19 Aug 2025, Alim et al., 20 Nov 2025).
- Flexibility: Compatible with a range of architectures and granularities, including convolutional networks, transformers, BERT-like LMs, and diffusion models, as long as global importance can be estimated (Lucas et al., 27 Nov 2024, Lim et al., 6 Feb 2025, Zhu et al., 8 Oct 2025).
6. Comparative Empirical Results
The effectiveness and broad applicability of global one-shot pruning are substantiated across diverse model families and benchmarks:
| Model | Baseline Acc. | Sparsity/Speedup | Pruning Method | Pruned Acc. (Δ) | Reference |
|---|---|---|---|---|---|
| VGG-16/ImageNet | Top-5: 90.38% | 4.4× speedup (77.3% FLOPs reduced) | One-shot global (Wang et al., 2019) | 88.84% (–1.54%) | (Wang et al., 2019) |
| ResNet-50 | Top-5: 92.9% | 2.8× speedup (64.6% FLOPs reduced) | One-shot global (Wang et al., 2019) | 91.64% (–1.23%) | (Wang et al., 2019) |
| BERT-Base | - | ∼5× parameter reduction | PGB one-shot semi-structured | Δ < 2% | (Lim et al., 6 Feb 2025) |
| Whisper-Small | WER: 11.64% | 40.8% sparsity | Sensitivity-aware one-shot | 11.84% (+0.20) | (Irigoyen et al., 11 Nov 2025) |
| ResNeXt-101 | Top-1: 79% | 80% sparse | REFLOW (BatchNorm recalibration) | 78.9% (–0.1%; >75 pt recovery vs. no recalibration) | (Saikumar et al., 18 Feb 2025) |
| ViT-L/16 | Top-1: 84.2% | 2:4 N:M | SNOWS (global Newton reconstruction) | 77.2% | (Lucas et al., 27 Nov 2024) |
At moderate sparsity targets, one-shot global pruning consistently closes most of the gap to the unpruned baseline; at deeper compression, accuracy can remain strong with advanced recovery (e.g., Newton/Hessian-based reconstruction, KD-augmented retraining, or BatchNorm recalibration). In LMs, simulation studies confirm the importance of global allocation and language-adaptive scores for maintaining multilinguality (see M-Wanda (Choenni et al., 27 May 2025)).
7. Limitations, Caveats, and Best Practices
- Sparsity regime: While one-shot global pruning excels up to ∼80% sparsity, its reliability decreases in ultra-high sparsity scenarios, where gradual iterative procedures or structured recovery can excel (Janusz et al., 19 Aug 2025).
- Criterion selection: Magnitude-based scores offer speed but may underperform second-order or data-aware scores at more extreme compression ratios; the latter, however, incur higher upfront computation (Lucas et al., 27 Nov 2024).
- Recovery phase: Lack of fine-tuning or inadequate BN/statistics recalibration can lead to catastrophic signal collapse (Saikumar et al., 18 Feb 2025).
- Allocation mechanisms: When layer-wise sensitivity is highly nonuniform, explicit sensitivity-aware or correlation-weighted allocations yield the best outcomes (Irigoyen et al., 11 Nov 2025, Choenni et al., 27 May 2025).
- Calibration data: For data-aware pruning or adaptive BN recalibration, the choice and size of the calibration set significantly affect accuracy (Saikumar et al., 18 Feb 2025, Choenni et al., 27 May 2025).
- Hyperparameter tuning: Penalty weights for regularization (e.g., λ for sparsity/VAR), patience for recovery, and adaptive thresholds for group sparsity should be validated per architecture/benchmark (Yun et al., 18 Nov 2025, Janusz et al., 19 Aug 2025).
In summary, one-shot global pruning embodies a paradigm shift in neural network compression away from slow, layerwise, or iterative regimes, by leveraging principled global scoring, adaptive resource allocation, and efficient recovery. Its computational efficiency and broad applicability make it a reference point for scalable, practical model compression (Wang et al., 2019, Janusz et al., 19 Aug 2025, Lucas et al., 27 Nov 2024).