Probabilistic Magnitude Pruning
- Probabilistic Magnitude Pruning (PMP) is a neural network compression method that prunes low-magnitude weights using statistical calibration and probabilistic modeling.
- It integrates magnitude-based pruning with rigorous uncertainty quantification to guarantee expressive power and controlled predictive risk.
- PMP has practical applications across fully connected, convolutional, and graph neural networks, enabling efficient deployment in computer vision and recognition tasks.
Probabilistic Magnitude Pruning (PMP) encompasses a family of methodologies for neural network compression that systematically prune low-magnitude weights with guarantees on network expressive power, generalization performance, and, in recent advances, calibrated uncertainty on performance loss under finite data. Approaches to PMP unify magnitude-based pruning—removal of connections with smallest absolute values—with probabilistic modeling, variational optimization, and statistical calibration to deliver well-controlled tradeoffs between sparsity and predictive risk. PMP therefore stands as a cornerstone of uncertainty-aware neural network deployment, with applications spanning fully connected, convolutional, and graph neural networks, and tasks ranging from computer vision to skeleton-based recognition.
1. Formal Definitions and Problem Setting
Let $f_w$ denote a pre-trained neural network with real-valued weights $w \in \mathbb{R}^d$. PMP seeks to produce a sparse variant $f_{w_\lambda}$ by zeroing out the fraction $\lambda \in [0,1]$ of weights with smallest absolute value, defined by the quantile threshold $\tau_\lambda$ (the $\lambda$-quantile of $\{|w[i]|\}$) and the rule
$$w_\lambda[i] = w[i]\,\mathbf{1}\{|w[i]| > \tau_\lambda\}.$$
A loss function $\ell$ measuring degradation, such as the 0–1 loss $\ell(\hat y, y) = \mathbf{1}\{\hat y \neq y\}$ for classification, induces the risk $R(\lambda) = \mathbb{E}\big[\ell(f_{w_\lambda}(X), Y)\big]$ and the empirical calibration risk $\hat R_n(\lambda) = \frac{1}{n}\sum_{i=1}^{n} \ell(f_{w_\lambda}(x_i), y_i)$ over $n$ i.i.d. calibration samples $(x_i, y_i)$.
The fundamental objective is: for a given tolerance $\alpha$ and error budget $\delta$, select the largest sparsity $\lambda$ such that
$$\mathbb{P}\big(R(\lambda) \le \alpha\big) \ge 1 - \delta,$$
which ensures with high confidence that the pruned network remains within tolerable risk (Alvarez, 8 Aug 2024).
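The quantile-threshold rule can be sketched in a few lines of NumPy (a minimal illustration assuming a flat weight vector; layer-wise variants apply the same quantile per layer):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, lam: float) -> np.ndarray:
    """Zero out the fraction `lam` of weights with smallest |w|."""
    # tau is the lam-quantile of the absolute weight values
    tau = np.quantile(np.abs(w), lam)
    # keep weights strictly above the threshold, zero the rest
    return np.where(np.abs(w) > tau, w, 0.0)

w = np.array([0.5, -0.01, 2.0, 0.003, -1.2])
w_sparse = magnitude_prune(w, lam=0.4)
# the two smallest-magnitude entries (-0.01 and 0.003) are zeroed
```

The same routine serves as the pruning step inside the calibration loop described below.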
2. Statistical Calibration and Distribution-Free Guarantees
PMP methods in (Alvarez, 8 Aug 2024) leverage distribution-free uncertainty quantification via the Learn–then–Test (LTT) framework. The range $[0, 1]$ is discretized into candidate sparsity levels $\lambda_1 < \cdots < \lambda_k$. For each $\lambda_j$, the null hypothesis $H_j : R(\lambda_j) > \alpha$ is tested with super-uniform p-values:
- Binomial-tail: $p_j^{\mathrm{bin}} = \mathbb{P}\big(\mathrm{Bin}(n, \alpha) \le n\,\hat R_n(\lambda_j)\big)$, for losses in $\{0, 1\}$
- Hoeffding–Bentkus: $p_j^{\mathrm{HB}} = \min\big(e\,\mathbb{P}\big(\mathrm{Bin}(n, \alpha) \le \lceil n\,\hat R_n(\lambda_j)\rceil\big),\; \exp\!\big(-n\,h_1(\hat R_n(\lambda_j) \wedge \alpha,\, \alpha)\big)\big)$, with $h_1(a, b) = a \log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b}$
- PRW p-values for general bounded losses
A family-wise error rate (FWER) controlling procedure rejects the nulls in a subset $\Lambda \subseteq \{\lambda_1, \dots, \lambda_k\}$. Selecting $\lambda^* = \max \Lambda$, one achieves
$$\mathbb{P}\big(R(\lambda^*) \le \alpha\big) \ge 1 - \delta$$
with no distributional assumptions beyond i.i.d. calibration draws (Theorem 3.1 of (Alvarez, 8 Aug 2024)). Monotonicity of risk in magnitude pruning supports fixed-sequence testing, terminating upon the first non-rejection.
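As an illustration, the Hoeffding–Bentkus p-value can be computed in a few lines (a sketch following the standard LTT construction; `h1` is the KL divergence between Bernoulli distributions):

```python
import math
from scipy.stats import binom

def h1(a: float, b: float) -> float:
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    if a <= 0:
        return -math.log(1 - b)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def hb_pvalue(rhat: float, n: int, alpha: float) -> float:
    """Hoeffding-Bentkus p-value for the null H: R > alpha."""
    hoeffding = math.exp(-n * h1(min(rhat, alpha), alpha))
    bentkus = math.e * binom.cdf(math.ceil(n * rhat), n, alpha)
    return min(hoeffding, bentkus, 1.0)
```

With a low empirical risk (e.g., `hb_pvalue(0.01, 1000, 0.05)`) the null is rejected at any reasonable `delta`; once the empirical risk exceeds `alpha`, the p-value saturates at 1 and fixed-sequence testing stops.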
3. Algorithmic Procedures and Variational PMP
PMP is instantiated by several algorithmic paradigms:
a. Calibrated Magnitude Pruning (Fixed-sequence Testing) (Alvarez, 8 Aug 2024):
- Sort $\{|w[i]|\}$ to determine the thresholds $\tau_{\lambda_1} \le \cdots \le \tau_{\lambda_k}$.
- Sequentially prune to build $w_{\lambda_j}$, compute $\hat R_n(\lambda_j)$, and the corresponding p-value $p_j$.
- Accumulate $\lambda_j$ in $\Lambda$ until the first $p_j > \delta$.
- Output $\lambda^* = \max \Lambda$, where $\Lambda$ is the set of certified sparsity levels.
\begin{algorithm}[H]
\caption{Probabilistic Magnitude Pruning (fixed-sequence)}
\begin{algorithmic}[1]
\Require Calibration data $\{(x_i, y_i)\}_{i=1}^{n}$, full network $f_w$, loss $\ell$, grid $\lambda_1 < \cdots < \lambda_k$, tolerance $\alpha$, error budget $\delta$
\State Sort $\{|w[i]|\}$ to obtain thresholds $\tau_{\lambda_1} \le \cdots \le \tau_{\lambda_k}$
\State $\Lambda \gets \emptyset$
\For{$j = 1, \dots, k$}
\State Build $w_{\lambda_j}$ by zeroing all $w[i]$ with $|w[i]| \le \tau_{\lambda_j}$
\State $\hat R_n(\lambda_j) \gets \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_{w_{\lambda_j}}(x_i), y_i\big)$
\State Compute p-value $p_j$ for $H_j : R(\lambda_j) > \alpha$
\If{$p_j \le \delta$}
\State $\Lambda \gets \Lambda \cup \{\lambda_j\}$
\Else
\State \textbf{break}
\EndIf
\EndFor
\State \Return $\lambda^* = \max \Lambda$ and pruned network $f_{w_{\lambda^*}}$
\end{algorithmic}
\end{algorithm}
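In Python, the fixed-sequence procedure above can be sketched as follows, using the binomial-tail p-value and a user-supplied `model_risk` callable (a hypothetical function returning the empirical 0/1 risk of a pruned weight vector on the calibration set):

```python
import numpy as np
from scipy.stats import binom

def calibrated_pmp(w, model_risk, lambdas, n, alpha, delta):
    """Fixed-sequence PMP: return the largest certified sparsity.

    model_risk(w_pruned) -> empirical 0/1 risk over the n calibration
    points (hypothetical callable supplied by the user).
    """
    abs_w = np.abs(w)
    lam_star, w_star = 0.0, w
    for lam in lambdas:  # scan the sparsity grid in increasing order
        tau = np.quantile(abs_w, lam)
        w_lam = np.where(abs_w > tau, w, 0.0)
        rhat = model_risk(w_lam)
        p = binom.cdf(int(n * rhat), n, alpha)  # binomial-tail p-value
        if p > delta:
            break  # first non-rejection terminates the scan
        lam_star, w_star = lam, w_lam
    return lam_star, w_star
```

If no candidate level is certified, the routine falls back to the unpruned network (`lam_star = 0.0`), which is the conservative default implied by the guarantee.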
b. Variational PMP in GCNs (Sahbi, 2023):
- Introduces a continuous “band-stop” parameterization: a KL-divergence term aligns the empirical distribution of the latent weights with a target prior (e.g., Gaussian or Laplace), and an exact pruning budget $r$ is achieved via a quantile mapping from the soft masks to a hard threshold.
- Loss function: $\mathcal{L} = \mathcal{L}_{\mathrm{pred}} + \beta\,\mathrm{KL}\big(\hat P_w \,\|\, P\big)$, where $\mathcal{L}_{\mathrm{pred}}$ is the prediction loss, $\hat P_w$ the empirical latent-weight distribution, $P$ the prior, and $\beta$ a mixing coefficient.
- End-to-end joint optimization of masks and network weights obviates explicit hard-masking or retraining steps.
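The band-stop idea can be illustrated with a simple smooth gate (an illustrative surrogate, not the exact parameterization of (Sahbi, 2023)); the quantile step shows how an exact budget `r` is met:

```python
import numpy as np

def band_stop_mask(w: np.ndarray, a: float = 50.0) -> np.ndarray:
    """Smooth gate in [0, 1): ~0 near w = 0, ~1 for large |w|.
    Differentiable, so masks and weights can be trained jointly."""
    return 1.0 - np.exp(-a * w**2)

def exact_budget_threshold(w: np.ndarray, r: float) -> np.ndarray:
    """Meet an exact pruning budget r by thresholding the soft mask
    at its r-quantile (the quantile-mapping step)."""
    m = band_stop_mask(w)
    tau = np.quantile(m, r)
    return np.where(m > tau, w, 0.0)
```

Because the gate is monotone in $|w|$, thresholding it at the $r$-quantile removes exactly the fraction $r$ of smallest-magnitude weights while keeping the mask differentiable during training.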
4. Theoretical Guarantees and Generalization Bounds
PMP delivers quantifiable theoretical assurances:
- In fully connected and convolutional networks, magnitude-based pruning of a fixed fraction of weights per layer preserves uniform approximation error $\epsilon$ with probability at least $1 - \delta$, provided the layers are sufficiently wide, as established in (Qian et al., 2021). Layer widths must satisfy polynomial lower bounds in $1/\epsilon$, $1/\delta$, and the network depth to guarantee $\sup_x \|f_{\mathrm{pruned}}(x) - f(x)\| \le \epsilon$.
- In stochastic mask optimization (Hayou et al., 2021), minimization of empirical Gibbs risk induces data-adaptive regularization and preferential retention of weights best aligned to label features. Extensions to PAC-Bayes pruning provide explicit self-bounded generalization error via data-dependent priors and joint optimization of weights and stochastic mask parameters.
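The Gibbs risk of a stochastic mask can be estimated by Monte-Carlo sampling of Bernoulli keep-probabilities (a schematic sketch; `risk_fn` is a hypothetical callable returning the risk of a masked weight vector):

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_risk(w, probs, risk_fn, n_samples=100):
    """Monte-Carlo estimate of the Gibbs risk: the average risk of
    networks whose weights are kept independently with per-weight
    probability `probs` (a Bernoulli stochastic mask)."""
    total = 0.0
    for _ in range(n_samples):
        mask = rng.random(w.shape) < probs  # sample one Bernoulli mask
        total += risk_fn(w * mask)
    return total / n_samples
```

Optimizing `probs` to minimize this quantity is what drives the data-adaptive retention of label-aligned weights described above; PAC-Bayes variants additionally penalize the divergence of `probs` from a data-dependent prior.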
5. Computational Complexity and Implementation Considerations
Key operations for PMP include:
- Sorting weights: $O(d \log d)$ for $d$ parameters
- Forward passes for empirical risk: $O(k \cdot n)$ network evaluations over $k$ candidate levels and $n$ calibration points, often reduced by incremental masking
- Per-candidate p-value calculation: $O(1)$ given $\hat R_n(\lambda_j)$
Best practices highlighted:
- Use sparse tensor representations post-pruning for accelerated inference
- Precompute and cache layer-wise masks for repeated risk evaluation
- Parallelize batch risk computation
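The sparse-representation recommendation can be illustrated with SciPy (a generic sketch, independent of any particular framework):

```python
import numpy as np
from scipy.sparse import csr_matrix

# dense pruned weight matrix: most entries are exactly zero
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
W[np.abs(W) < 1.0] = 0.0           # ~68% sparsity after magnitude pruning

W_sparse = csr_matrix(W)           # CSR stores only the nonzeros
x = rng.standard_normal(128)

y_dense = W @ x
y_sparse = W_sparse @ x            # same result, fewer multiply-adds
assert np.allclose(y_dense, y_sparse)
```

CSR (compressed sparse row) is a natural choice for row-major matrix–vector products; the memory footprint scales with the number of surviving weights rather than the dense dimensions.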
Variational PMP settings (e.g., for GCNs) select the smoothing parameters of the band-stop functions (e.g., gate temperature) and the histogram binning used for quantile estimation to suit the desired exact budget attainment. Training runs are efficiently supported on commodity GPUs (Sahbi, 2023).
6. Experimental Evaluations and Practical Tradeoffs
Experiments demonstrate the efficacy of PMP:
| Dataset & Architecture | Calibration/Test Split | Baseline(s) | α-tolerance, δ-budget | Achieved Sparsity | Notes |
|---|---|---|---|---|---|
| MNIST / FCN (118k params) | 9k calibration / 1k test | Naive MP, calibrated MP | α=0.03, δ=0.1 | λ*=0.68–0.78 | PMP respects α, naive method violates |
| PolypGen / U-Net (13.8M) | 465 calibration / 50 test images | Global MP | α=0.05, δ=0.05 | λ*=0.06 | Significantly lower compression required |
| FPHA / GCN | 575 test sequences | Classical MP, PMP+Gaussian | Fixed r | r=55%–99% | PMP+Gaussian outperforms MP at high sparsity |
PMP empirically achieves coverage of at least $1 - \delta$, with selective calibration strategies enabling further control over confidence thresholds and abstention rates in prediction. Results indicate higher robustness of PMP with Laplace priors at extreme sparsities and improved generalization under Gaussian priors even without pruning (Sahbi, 2023).
7. Limitations and Extensions
PMP in its strongest form assumes monotonic risk increase with sparsity under one-shot magnitude pruning. Iterative or structured schemes may require more general multiple testing corrections (e.g., Holm–Bonferroni). PMP requires a held-out calibration set, rendering the method sensitive to calibration error in data-scarce regimes; bootstrap or cross-conformal variants offer mitigation at increased computation. Proposed extensions include:
- Joint calibration of quantile and confidence threshold
- Adapting PMP to iterative or structured pruning settings
- Incorporating second-order relevance metrics (e.g., Hessian-based scores)
- Layer-wise, architecture-aware tuning of the $(\alpha, \delta)$ budgets
Practical selection of $\alpha$ should reflect the domain-specific tolerable performance drop, typically a few percent for vision classification. Empirical evidence confirms that the selected sparsity $\lambda^*$ is more sensitive to $\alpha$ than to $\delta$ (Alvarez, 8 Aug 2024).
Probabilistic Magnitude Pruning thus merges rigorous statistical calibration, variational and probabilistic modeling of sparsity, and practical algorithmic strategies to reliably compress deep neural networks under explicit risk control and budget constraints (Alvarez, 8 Aug 2024, Sahbi, 2023, Qian et al., 2021, Hayou et al., 2021).