
Probabilistic Magnitude Pruning

Updated 15 December 2025
  • Probabilistic Magnitude Pruning (PMP) is a neural network compression method that prunes low-magnitude weights using statistical calibration and probabilistic modeling.
  • It integrates magnitude-based pruning with rigorous uncertainty quantification to guarantee expressive power and controlled predictive risk.
  • PMP has practical applications across fully connected, convolutional, and graph neural networks, enabling efficient deployment in computer vision and recognition tasks.

Probabilistic Magnitude Pruning (PMP) encompasses a family of neural network compression methods that prune low-magnitude weights with guarantees on expressive power, generalization performance, and, in recent work, calibrated uncertainty on the performance loss under finite data. These approaches combine magnitude-based pruning (removing the connections with the smallest absolute values) with probabilistic modeling, variational optimization, and statistical calibration to control the tradeoff between sparsity and predictive risk. PMP has been applied to fully connected, convolutional, and graph neural networks, in tasks including computer vision and skeleton-based recognition.

1. Formal Definitions and Problem Setting

Let $f: \mathcal{X} \to [0,1]^{M \times N}$ denote a pre-trained neural network with $K$ real-valued weights $W = \{w_i\}_{i=1}^K$. PMP seeks to produce a sparse variant $f_\lambda$ by zeroing out the fraction $\lambda \in [0,1)$ of weights with smallest absolute value, defined by the quantile threshold $q_\lambda := \text{quantile}(|w_i|; \lambda)$ and the rule

$$w_{i,\lambda} = \begin{cases} w_i, & \text{if } |w_i| > q_\lambda \\ 0, & \text{otherwise} \end{cases}$$
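The pruning rule can be illustrated with a minimal sketch, assuming the weights are held in a flat Python list and the quantile is taken as the $\lfloor \lambda K \rfloor$-th smallest magnitude (one of several possible quantile conventions). The function names `quantile_threshold` and `magnitude_prune` are hypothetical and do not come from the cited papers.

```python
import math

def quantile_threshold(weights, lam):
    """Return q_lam: the floor(lam * K)-th smallest magnitude (assumed convention)."""
    mags = sorted(abs(w) for w in weights)
    k = math.floor(lam * len(mags))  # number of weights to zero out
    if k == 0:
        return -1.0  # below every magnitude, so nothing is pruned
    return mags[k - 1]

def magnitude_prune(weights, lam):
    """Keep w_i when |w_i| > q_lam, otherwise set it to zero."""
    q = quantile_threshold(weights, lam)
    return [w if abs(w) > q else 0.0 for w in weights]

# Toy weights, made up for illustration only.
weights = [0.5, -0.1, 0.03, -2.0, 0.7, -0.04, 1.2, 0.25, -0.9, 0.06]
pruned = magnitude_prune(weights, 0.4)
print(pruned)  # [0.5, 0.0, 0.0, -2.0, 0.7, 0.0, 1.2, 0.25, -0.9, 0.0]
```

With tied magnitudes at the threshold, this convention may zero slightly more than a fraction $\lambda$ of the weights.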

A loss function $\ell$ measuring degradation, such as $\ell(Y, \hat{Y}) = \mathbb{I}\{Y \neq \hat{Y}\}$ for classification, induces the “risk” $R(\lambda) = \mathbb{E}_{(X,Y) \sim P}[\ell(\cdot)]$ and empirical calibration risk $\widehat{R}_n(\lambda)$ over $n$ i.i.d. calibration samples.

The fundamental objective is: for given tolerance $\alpha \in (0,1)$ and error budget $\delta \in (0,1)$, select the largest $\lambda$ such that

$$\mathbb{P}\left(R(\lambda) \leq \alpha\right) \geq 1-\delta,$$

which ensures with high confidence that the pruned network remains within tolerable risk (Alvarez, 8 Aug 2024).

2. Statistical Calibration and Distribution-Free Guarantees

PMP methods in (Alvarez, 8 Aug 2024) leverage distribution-free uncertainty quantification via the Learn–then–Test (LTT) framework. The range $\Lambda = [0,1)$ is discretized into $Q$ candidate sparsity levels $\tilde{\Lambda} = \{\lambda_j = j/Q\}_{j=0}^{Q-1}$. For each $\lambda_j$, the null hypothesis $H_{0,j}: R(\lambda_j) > \alpha$ is tested with super-uniform p-values:

  • Binomial-tail: $p_j = \mathbb{P}_{B \sim \text{Binom}(n, \alpha)}[ B \leq n \widehat{R}_n(\lambda_j) ]$
  • Hoeffding–Bentkus: $p_j = \min\left\{ e \cdot \mathbb{P}_{B}[ B \leq \lceil n \widehat{R}_n(\lambda_j) \rceil ], \exp[ -n h(\min\{\widehat{R}_n(\lambda_j),\alpha\}, \alpha) ] \right\}$
  • PRW p-values for general bounded loss
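The first two p-values can be computed with the standard library alone. The sketch below assumes a 0-1 loss, so that $n \widehat{R}_n(\lambda_j)$ is an integer error count passed in directly, and it takes $h(a, b) = a \log(a/b) + (1-a) \log((1-a)/(1-b))$ to be the Bernoulli KL divergence, as is usual in the Hoeffding–Bentkus bound. All function names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P[B <= k] for B ~ Binomial(n, p), summed in log space to avoid overflow."""
    if k < 0:
        return 0.0
    k = min(k, n)
    def log_pmf(i):
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(p) + (n - i) * math.log(1 - p))
    return min(1.0, sum(math.exp(log_pmf(i)) for i in range(k + 1)))

def kl_bernoulli(a, b):
    """h(a, b) with the convention 0 * log(0) = 0; requires 0 < b < 1."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / b)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - b))
    return total

def binomial_tail_pvalue(errors, n, alpha):
    """p_j = P[Binom(n, alpha) <= errors], with errors = n * empirical risk."""
    return binom_cdf(errors, n, alpha)

def hb_pvalue(errors, n, alpha):
    """Hoeffding-Bentkus p-value: the smaller of the two tail bounds, capped at 1."""
    risk_hat = errors / n
    bentkus = math.e * binom_cdf(errors, n, alpha)
    hoeffding = math.exp(-n * kl_bernoulli(min(risk_hat, alpha), alpha))
    return min(1.0, bentkus, hoeffding)

# Hypothetical calibration outcome: 30 errors on n = 1000 samples, alpha = 0.05.
p_binom = binomial_tail_pvalue(30, 1000, 0.05)
p_hb = hb_pvalue(30, 1000, 0.05)
print(p_binom, p_hb)
```

A small p-value is evidence against $H_{0,j}$, that is, evidence that the pruned network's risk is at most $\alpha$. For a 0-1 loss the binomial-tail p-value is never larger than the Hoeffding–Bentkus one, which is why it is preferred when it applies.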

A family-wise error rate (FWER) controlling procedure $\mathcal{A}(p_0, \dots, p_{Q-1}; \delta)$ rejects nulls at a subset $\Gamma \subset \tilde{\Lambda}$. Selecting $\lambda^* = \max \Gamma$, one achieves

$$\mathbb{P}(R(\lambda^*) \leq \alpha) \geq 1-\delta,$$

with no distributional assumptions beyond i.i.d. calibration draws (Theorem 3.1 of (Alvarez, 8 Aug 2024)). Monotonicity of risk in magnitude pruning supports fixed-sequence testing, terminating upon the first non-rejection.

3. Algorithmic Procedures and Variational PMP

PMP is instantiated by several algorithmic paradigms:

a. Calibrated Magnitude Pruning (Fixed-sequence Testing) (Alvarez, 8 Aug 2024):

  1. Sort $|w_i|$ to determine $q_{\lambda_j}$.
  2. Sequentially prune to build $f_{\lambda_j}$, compute $\widehat{R}_j$, and corresponding p-value $p_j$.
  3. Accumulate $\lambda_j$ in $\Gamma$ until $p_j > \delta$.
  4. Output $f_{\lambda^*}$, where $\lambda^* = \max \Gamma$.

\begin{algorithm}[H]
\caption{Probabilistic Magnitude Pruning (fixed-sequence)}
\begin{algorithmic}[1]
\Require Calibration data $\{(X_i, Y_i)\}_{i=1}^n$, full network $f$, loss $\ell$, grid $\tilde{\Lambda}$, tolerance $\alpha$, error-budget $\delta$
\State Sort $|w_i|$ to obtain thresholds $q_{\lambda_j}$
\State $\Gamma \gets \emptyset$
\For{$j = 0, \dots, Q-1$}
    \State Build $f_{\lambda_j}$ by zeroing all $w_i$ with $|w_i| \leq q_{\lambda_j}$
    \State $\widehat{R}_j \gets \widehat{R}_n(\lambda_j)$
    \State Compute $p_j$ for $H_{0,j}$
    \If{$p_j \leq \delta$}
        \State $\Gamma \gets \Gamma \cup \{\lambda_j\}$
    \Else
        \State \textbf{break}
    \EndIf
\EndFor
\State \Return $\lambda^* = \max \Gamma$ and pruned network $f_{\lambda^*}$
\end{algorithmic}
\end{algorithm}
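The fixed-sequence loop can be sketched in a few lines, assuming a 0-1 loss with the binomial-tail p-value and assuming the calibration error count at each grid level has already been measured. The error counts below are invented for illustration, and `fixed_sequence_select` is a hypothetical name rather than the authors' implementation.

```python
import math

def binomial_tail_pvalue(errors, n, alpha):
    """P[Binom(n, alpha) <= errors], summed in log space."""
    def log_pmf(i):
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(alpha) + (n - i) * math.log(1 - alpha))
    return min(1.0, sum(math.exp(log_pmf(i)) for i in range(min(errors, n) + 1)))

def fixed_sequence_select(grid, error_counts, n, alpha, delta):
    """Walk the grid from least to most sparse and stop at the first non-rejection.

    Returns the largest certified sparsity level, or None if even the first
    level fails the test.
    """
    certified = []
    for lam, errors in zip(grid, error_counts):
        if binomial_tail_pvalue(errors, n, alpha) > delta:
            break  # first non-rejection ends the sequence
        certified.append(lam)
    return max(certified) if certified else None

Q, n = 10, 1000
grid = [j / Q for j in range(Q)]
# Made-up calibration error counts, one per sparsity level in the grid.
error_counts = [10, 10, 11, 12, 15, 20, 28, 45, 90, 300]
lam_star = fixed_sequence_select(grid, error_counts, n, alpha=0.05, delta=0.1)
print(lam_star)  # 0.6
```

In this toy run the level $\lambda = 0.7$ has 45 errors out of 1000, which is too close to $\alpha n = 50$ to reject its null at $\delta = 0.1$, so the procedure stops and returns $\lambda^* = 0.6$.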

b. Variational PMP in GCNs (Sahbi, 2023):

  • Introduces a continuous “band-stop” parameterization: $\omega^\ell = \hat{\omega}^\ell \odot \psi_{a,\sigma}(\hat{\omega}^\ell)$, with $\psi_{a,\sigma}(w) = \frac{1}{1 + \sigma \exp(a^2 - w^2)}$. A KL divergence term pushes the empirical latent weight distribution toward a prior $P$, so that an exact pruning budget is achieved via the quantile mapping $a = F_P^{-1}(r)$.
  • Loss function: $\mathcal{L} = \mathcal{L}_e + \lambda D_{\mathrm{KL}}(P \| Q)$, where $\mathcal{L}_e$ is the prediction loss and $\lambda$ here is a weighting coefficient on the KL term, not the sparsity level of Section 1.
  • End-to-end joint optimization of masks and network weights obviates explicit hard-masking or retraining steps.
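To see how the band-stop parameterization acts on individual weights, the following sketch evaluates $\psi_{a,\sigma}$ exactly as written above and applies it elementwise to a list of latent weights. The threshold $a$ and the value $\sigma = 1$ are supplied by hand here as assumptions; the quantile mapping and the KL term are not reproduced, and the function names are hypothetical.

```python
import math

def band_stop(w, a, sigma):
    """psi_{a,sigma}(w) = 1 / (1 + sigma * exp(a^2 - w^2))."""
    return 1.0 / (1.0 + sigma * math.exp(a * a - w * w))

def soft_prune(latent, a, sigma):
    """Elementwise product of the latent weights with their band-stop gate."""
    return [w * band_stop(w, a, sigma) for w in latent]

# Toy latent weights and hand-picked parameters, for illustration only.
latent = [0.1, -0.5, 1.0, -2.0, 3.0]
gated = soft_prune(latent, a=1.0, sigma=1.0)
for w, g in zip(latent, gated):
    print(f"latent {w:+.2f} -> effective {g:+.4f}")
```

The gate depends only on $w^2$, so it is symmetric in the sign of the weight. It is close to 1 for magnitudes well above $a$ and shrinks weights near zero, and because it is smooth the masking can be trained jointly with the weights by gradient descent.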

4. Theoretical Guarantees and Generalization Bounds

PMP delivers quantifiable theoretical assurances:

  • In fully connected and convolutional networks, magnitude-based pruning of $O(D_k^{1-\alpha})$ weights per layer preserves uniform approximation error $\epsilon$ with probability at least $1-\delta$ provided sufficient width $d_k$, as established in (Qian et al., 2021). Layer widths must satisfy polynomial lower bounds in $1/\epsilon$, $1/\delta$, and $1/\alpha$ to guarantee

$$\Pr\left[ \sup_{\|x\|_2 \leq 1} \|f(x) - F(x)\|_2 \leq \epsilon \right] \geq 1-\delta$$

  • In stochastic mask optimization (Hayou et al., 2021), minimization of empirical Gibbs risk induces data-adaptive $L_1$ regularization and preferential retention of weights best aligned to label features. Extensions to PAC-Bayes pruning provide explicit self-bounded generalization error via data-dependent priors and joint optimization of weights and stochastic mask parameters.

5. Computational Complexity and Implementation Considerations

Key operations for PMP include:

  • Sorting $K$ weights: $O(K \log K)$
  • Forward passes for empirical risk: $O(Q n \cdot \text{cost}_{\text{forward}})$, often reduced by incremental masking
  • Per-candidate p-value calculation: $O(1)$

Best practices highlighted:

  • Use sparse tensor representations post-pruning for accelerated inference
  • Precompute and cache layer-wise masks for repeated risk evaluation
  • Parallelize batch risk computation
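Incremental masking works because the masks for increasing sparsity levels are nested: every weight pruned at level $\lambda_j$ is also pruned at $\lambda_{j+1}$. A minimal sketch, assuming a flat weight list and the grid $\lambda_j = j/Q$, sorts the magnitudes once and then zeroes only the newly pruned indices at each level. The function name `incremental_prune` is hypothetical.

```python
def incremental_prune(weights, Q):
    """Yield (lam_j, pruned copy) for j = 0..Q-1, reusing work between levels."""
    K = len(weights)
    # One O(K log K) sort gives the pruning order for every level.
    order = sorted(range(K), key=lambda i: abs(weights[i]))
    current = list(weights)
    done = 0  # how many weights have been zeroed so far
    for j in range(Q):
        target = (j * K) // Q  # integer count avoids float rounding in lam * K
        for i in order[done:target]:
            current[i] = 0.0  # only the newly pruned weights are touched
        done = target
        yield j / Q, list(current)

# Toy weights, made up for illustration only.
weights = [0.5, -0.1, 0.03, -2.0, 0.7, -0.04, 1.2, 0.25, -0.9, 0.06]
levels = list(incremental_prune(weights, Q=5))
for lam, net in levels:
    print(lam, net)
```

Each level costs time proportional to the number of newly pruned weights plus the copy, rather than a fresh sort or threshold scan. In a real framework the copy would typically be replaced by an in-place mask update followed by the risk evaluation.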

Variational PMP settings (e.g., for GCNs) select the smoothing parameter ($\sigma$) of the band-stop function and the histogram binning ($K$) so that the desired pruning budget is attained exactly. Training runs are efficiently supported on commodity GPUs (Sahbi, 2023).

6. Experimental Evaluations and Practical Tradeoffs

Experiments outline the efficacy of PMP:

| Dataset & Architecture | Calibration/Test Split | Baseline(s) | α-tolerance, δ-budget | Achieved Sparsity | Notes |
|---|---|---|---|---|---|
| MNIST / FCN (118k params) | 9k calibration / 1k test | Naive MP, calibrated MP | α=0.03, δ=0.1 | λ*=0.68–0.78 | PMP respects α, naive method violates |
| PolypGen / U-Net (13.8M) | 465 calibration / 50 test images | Global MP | α=0.05, δ=0.05 | λ*=0.06 | Significantly lower compression required |
| FPHA / GCN | 575 test sequences | Classical MP, PMP+Gaussian | Fixed r | r=55%–99% | PMP+Gaussian outperforms MP at high sparsity |

PMP achieves coverage at least $1-\delta$ empirically, with selective calibration strategies enabling further control over confidence thresholds and abstention rates in prediction. Results indicate higher robustness of PMP with Laplace priors at extreme sparsities and improved generalization under Gaussian priors even without pruning (Sahbi, 2023).

7. Limitations and Extensions

PMP in its strongest form assumes monotonic risk increase with sparsity under one-shot magnitude pruning. Iterative or structured schemes may require more general multiple testing corrections (e.g., Holm–Bonferroni). PMP requires a held-out calibration set, rendering the method sensitive to calibration error in data-scarce regimes; bootstrap or cross-conformal variants offer mitigation at increased computation. Proposed extensions include:

  • Joint calibration of quantile $\lambda$ and confidence threshold $\tau$
  • Adapting PMP to iterative or structured pruning settings
  • Incorporating second-order relevance metrics (e.g., Hessian-based scores)
  • Layer-wise, architecture-aware tuning of $\alpha$, $\delta$ budgets
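When risk is not monotone in sparsity, as can happen with iterative or structured pruning, fixed-sequence testing may stop too early. The Holm–Bonferroni step-down procedure mentioned above controls the FWER under arbitrary dependence between p-values. The sketch below is a generic implementation of that procedure, not code from the cited papers, and the p-values are invented for illustration.

```python
def holm_rejections(pvalues, delta):
    """Return the indices rejected by Holm's step-down procedure at level delta."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])  # ascending p-values
    rejected = set()
    for rank, j in enumerate(order):
        # The (rank+1)-th smallest p-value is compared with delta / (m - rank).
        if pvalues[j] > delta / (m - rank):
            break  # stop at the first failure; all larger p-values are retained
        rejected.add(j)
    return rejected

# Made-up p-values for four sparsity levels where risk is not monotone.
grid = [0.0, 0.25, 0.5, 0.75]
pvalues = [0.001, 0.2, 0.004, 0.03]
certified = holm_rejections(pvalues, delta=0.1)
lam_star = max(grid[j] for j in certified) if certified else None
print(sorted(certified), lam_star)  # [0, 2, 3] 0.75
```

In this toy case fixed-sequence testing would stop at the second level, whose p-value of 0.2 exceeds $\delta$, while Holm's procedure still certifies the two sparser levels. The price is a stricter threshold per test, so Holm is usually less powerful when monotonicity does hold.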

Practical selection of $\alpha$ should reflect domain-specific tolerable drops, typically in $[0.01, 0.05]$ for vision classification. Empirical evidence confirms that the sparsity parameter $\lambda^*$ is more sensitive to $\alpha$ than to $\delta$ (Alvarez, 8 Aug 2024).


Probabilistic Magnitude Pruning thus merges rigorous statistical calibration, variational and probabilistic modeling of sparsity, and practical algorithmic strategies to reliably compress deep neural networks under explicit risk control and budget constraints (Alvarez, 8 Aug 2024, Sahbi, 2023, Qian et al., 2021, Hayou et al., 2021).
