Probabilistic Magnitude Pruning

Updated 15 December 2025

Probabilistic Magnitude Pruning (PMP) is a neural network compression method that prunes low-magnitude weights using statistical calibration and probabilistic modeling.
It integrates magnitude-based pruning with rigorous uncertainty quantification to guarantee expressive power and controlled predictive risk.
PMP has practical applications across fully connected, convolutional, and graph neural networks, enabling efficient deployment in computer vision and recognition tasks.

Probabilistic Magnitude Pruning (PMP) encompasses a family of methodologies for neural network compression that systematically prune low-magnitude weights with guarantees on network expressive power, generalization performance, and, in recent advances, calibrated uncertainty on performance loss under finite data. PMP approaches combine magnitude-based pruning (removal of the connections with smallest absolute values) with probabilistic modeling, variational optimization, and statistical calibration to control the tradeoff between sparsity and predictive risk. The methods have been applied to fully connected, convolutional, and graph neural networks, with use cases in computer vision and skeleton-based recognition.

1. Formal Definitions and Problem Setting

Let $f: \mathcal{X} \to [0,1]^{M \times N}$ denote a pre-trained neural network with $K$ real-valued weights $W = \{w_i\}_{i=1}^K$. PMP seeks to produce a sparse variant $f_\lambda$ by zeroing out the fraction $\lambda \in [0,1)$ of weights with smallest absolute value, defined by the quantile threshold $q_\lambda := \text{quantile}(|w_i|; \lambda)$ and the rule

$w_{i,\lambda} = \begin{cases} w_i, & \text{if } |w_i| > q_\lambda \\ 0, & \text{otherwise} \end{cases}$
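
As a minimal sketch of the rule above, the following pure-Python function zeroes the fraction $\lambda$ of weights with smallest magnitude. The name `magnitude_prune` and the choice of the empirical quantile (the `round(lam * K)`-th smallest magnitude) are illustrative assumptions, not a convention taken from the cited work.

```python
def magnitude_prune(weights, lam):
    """Zero the fraction `lam` of weights with smallest absolute value.

    Assumption: q_lambda is taken as the k-th smallest |w_i| with
    k = round(lam * K). A weight survives only if |w_i| > q_lambda,
    so ties at the threshold are pruned as well.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    k = round(lam * len(weights))          # number of weights to remove
    if k == 0:
        return list(weights)               # nothing to prune
    q = sorted(abs(w) for w in weights)[k - 1]
    return [w if abs(w) > q else 0.0 for w in weights]


pruned = magnitude_prune([0.5, -0.1, 0.03, -2.0, 0.8], lam=0.4)
print(pruned)
```

With five weights and $\lambda = 0.4$, the two smallest magnitudes (0.03 and 0.1) are removed and the rest are kept unchanged.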

A loss function $\ell$ measuring degradation, such as $\ell(Y, \hat{Y}) = \mathbb{I}\{Y \neq \hat{Y}\}$ for classification, induces the risk $R(\lambda) = \mathbb{E}_{(X,Y) \sim P}[\ell(Y, \hat{Y}_\lambda)]$, where $\hat{Y}_\lambda$ is the prediction of the pruned network $f_\lambda$ on $X$, and the empirical calibration risk $\widehat{R}_n(\lambda)$, the average of the same loss over $n$ i.i.d. calibration samples.

The fundamental objective is: for given tolerance $\alpha \in (0,1)$ and error budget $\delta \in (0,1)$, select the largest $\lambda$ such that

$\mathbb{P}\left(R(\lambda) \leq \alpha\right) \geq 1-\delta,$

which ensures with high confidence that the pruned network remains within tolerable risk (Alvarez, 8 Aug 2024).

2. Statistical Calibration and Distribution-Free Guarantees

The calibrated PMP method (Alvarez, 8 Aug 2024) uses distribution-free uncertainty quantification via the Learn–then–Test (LTT) framework. The range $\Lambda = [0,1)$ is discretized into $Q$ candidate sparsity levels $\tilde{\Lambda} = \{\lambda_j = j/Q\}_{j=0}^{Q-1}$. For each $\lambda_j$, the null hypothesis $H_{0,j}: R(\lambda_j) > \alpha$ is tested with a super-uniform (valid) p-value, for example:

Binomial-tail: $p_j = \mathbb{P}_{B \sim \text{Binom}(n, \alpha)}[ B \leq n \widehat{R}_n(\lambda_j) ]$
Hoeffding–Bentkus: $p_j = \min\left\{ e \cdot \mathbb{P}_{B \sim \text{Binom}(n, \alpha)}[ B \leq \lceil n \widehat{R}_n(\lambda_j) \rceil ], \exp[ -n h(\min\{\widehat{R}_n(\lambda_j),\alpha\}, \alpha) ] \right\}$, where $h(a,b) = a \log(a/b) + (1-a) \log((1-a)/(1-b))$
PRW p-values for general bounded loss
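
The first two p-values listed above can be sketched in a few lines of standard-library Python. The function names are hypothetical, the binomial CDF is summed in log space to avoid overflow for large $n$, and the final clipping to 1 is a harmless addition that is not part of the formulas as stated.

```python
import math


def binom_cdf(k, n, p):
    """P[B <= k] for B ~ Binom(n, p), with 0 < p < 1."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def binomial_tail_p(risk_hat, n, alpha):
    """Binomial-tail p-value; assumes a 0/1 loss so n * risk_hat is a count."""
    return binom_cdf(round(n * risk_hat), n, alpha)


def kl_bernoulli(a, b):
    """h(a, b) = a log(a/b) + (1-a) log((1-a)/(1-b)), with 0 log 0 = 0."""
    out = 0.0
    if a > 0.0:
        out += a * math.log(a / b)
    if a < 1.0:
        out += (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
    return out


def hoeffding_bentkus_p(risk_hat, n, alpha):
    """Hoeffding-Bentkus p-value for a loss bounded in [0, 1]."""
    # Small tolerance so floating-point noise does not bump the ceiling up.
    k = math.ceil(n * risk_hat - 1e-9)
    bentkus = math.e * binom_cdf(k, n, alpha)
    hoeffding = math.exp(-n * kl_bernoulli(min(risk_hat, alpha), alpha))
    return min(bentkus, hoeffding, 1.0)
```

Both functions return small values when the empirical risk sits well below $\alpha$ and approach 1 as it nears or exceeds $\alpha$, which is what lets a low p-value certify a sparsity level.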

A family-wise error rate (FWER) controlling procedure $\mathcal{A}(p_0, \dots, p_{Q-1}; \delta)$ rejects nulls at a subset $\Gamma \subset \tilde{\Lambda}$. Selecting $\lambda^* = \max \Gamma$, one achieves

$\mathbb{P}(R(\lambda^*) \leq \alpha) \geq 1-\delta,$

with no distributional assumptions beyond i.i.d. calibration draws (Theorem 3.1 of Alvarez, 8 Aug 2024). Because the risk is assumed to be nondecreasing in $\lambda$ under magnitude pruning, the hypotheses can be tested in a fixed order of increasing sparsity, each at level $\delta$, terminating at the first non-rejection.

3. Algorithmic Procedures and Variational PMP

PMP is instantiated by several algorithmic paradigms:

a. Calibrated Magnitude Pruning (Fixed-sequence Testing) (Alvarez, 8 Aug 2024):

Sort $|w_i|$ to determine the thresholds $q_{\lambda_j}$.
For increasing $j$, prune to build $f_{\lambda_j}$, then compute $\widehat{R}_n(\lambda_j)$ and the corresponding p-value $p_j$.
Add $\lambda_j$ to $\Gamma$ while $p_j \leq \delta$; stop at the first $p_j > \delta$.
Output $f_{\lambda^*}$, where $\lambda^* = \max \Gamma$.

\begin{algorithm}[H]
\caption{Probabilistic Magnitude Pruning (fixed‐sequence)}
\begin{algorithmic}[1]
\Require Calibration data $\{(X_i, Y_i)\}_{i=1}^n$, full network $f$, loss $\ell$, grid $\tilde{\Lambda} = \{\lambda_j\}_{j=0}^{Q-1}$, tolerance $\alpha$, error-budget $\delta$
\State Sort $\{|w_i|\}_{i=1}^K$ to obtain thresholds $q_{\lambda_j}$
\State $\Gamma \gets \emptyset$
\For{$j = 0, \dots, Q-1$}
    \State Build $f_{\lambda_j}$ by zeroing all $w_i$ with $|w_i| \leq q_{\lambda_j}$
    \State $\widehat{R}_j \gets \widehat{R}_n(\lambda_j)$
    \State Compute $p_j$ for $H_{0,j}: R(\lambda_j) > \alpha$
    \If{$p_j \leq \delta$}
        \State $\Gamma \gets \Gamma \cup \{\lambda_j\}$
    \Else
        \State \textbf{break}
    \EndIf
\EndFor
\State \Return $\lambda^* = \max \Gamma$ and pruned network $f_{\lambda^*}$
\end{algorithmic}
\end{algorithm}
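
The fixed-sequence procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the reference implementation: `calibrated_sparsity` and `hoeffding_p` are hypothetical names, the caller supplies `empirical_risk(lam)` (the calibration risk of $f_\lambda$), and a plain Hoeffding p-value is swapped in for brevity where the binomial-tail or Hoeffding–Bentkus p-values could be used instead.

```python
import math


def hoeffding_p(risk_hat, n, alpha):
    """Simple valid p-value for H0: R > alpha, for losses in [0, 1]."""
    gap = max(alpha - risk_hat, 0.0)
    return math.exp(-2.0 * n * gap * gap)


def calibrated_sparsity(empirical_risk, n, Q, alpha, delta, p_value=hoeffding_p):
    """Fixed-sequence testing over the grid lambda_j = j / Q.

    Returns lambda* = max Gamma, or None when even lambda_0 fails the test.
    """
    certified = []                          # the set Gamma
    for j in range(Q):
        lam = j / Q
        p_j = p_value(empirical_risk(lam), n, alpha)
        if p_j <= delta:
            certified.append(lam)           # reject H0_j: lambda_j is certified
        else:
            break                           # first non-rejection ends the scan
    return max(certified) if certified else None


# Toy, made-up risk curve that increases with sparsity (illustration only).
toy_risk = lambda lam: 0.005 + 0.05 * lam ** 3
lam_star = calibrated_sparsity(toy_risk, n=9000, Q=10, alpha=0.03, delta=0.1)
print(lam_star)
```

Stopping at the first non-rejection is what allows every test to run at the full level $\delta$; it relies on the assumption that risk does not decrease as sparsity grows.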

b. Variational PMP in GCNs (Sahbi, 2023):

Introduces a continuous “band-stop” parameterization of each layer's weights, $\omega^\ell = \hat{\omega}^\ell \odot \psi_{a,\sigma}(\hat{\omega}^\ell)$ with $\psi_{a,\sigma}(w) = \frac{1}{1 + \sigma \exp(a^2 - w^2)}$, which attenuates latent weights whose magnitude falls below $a$. A KL-divergence term pushes the empirical latent weight distribution $Q$ toward a prior $P$, so that a target pruning rate $r$ is met exactly through the quantile mapping $a = F_P^{-1}(r)$.
Loss function: $\mathcal{L} = \mathcal{L}_e + \lambda D_{\mathrm{KL}}(P \| Q)$, where $\mathcal{L}_e$ is the prediction loss and $\lambda$ is here a regularization weight, distinct from the sparsity level of Section 1.
End-to-end joint optimization of masks and network weights obviates explicit hard-masking or retraining steps.
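
To make the band-stop gate concrete, the sketch below evaluates $\psi_{a,\sigma}$ exactly as written above and applies it to a few latent weights. The function names and the values $a = 2$, $\sigma = 1$ are illustrative assumptions; the KL term and the training loop are omitted.

```python
import math


def band_stop(w, a, sigma):
    """psi_{a,sigma}(w) = 1 / (1 + sigma * exp(a^2 - w^2))."""
    return 1.0 / (1.0 + sigma * math.exp(a * a - w * w))


def soft_masked(latent, a, sigma):
    """Effective weights: each latent weight times its band-stop gate."""
    return [w * band_stop(w, a, sigma) for w in latent]


for w in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(w, round(band_stop(w, a=2.0, sigma=1.0), 4))
```

The gate is symmetric in $w$, stays close to 0 for $|w| \ll a$, equals $1/(1+\sigma)$ at $|w| = a$, and approaches 1 for $|w| \gg a$. Because it is differentiable, the mask can be learned jointly with the weights by gradient descent.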

4. Theoretical Guarantees and Generalization Bounds

PMP delivers quantifiable theoretical assurances:

In fully connected and convolutional networks, magnitude-based pruning of $O(D_k^{1-\alpha})$ weights per layer preserves uniform approximation error $\epsilon$ with probability $\geq 1-\delta$ provided sufficient width $d_k$, as established in (Qian et al., 2021). Layer widths must satisfy polynomial lower bounds in $1/\epsilon$, $1/\delta$, and $1/\alpha$ to guarantee

$\Pr\left[ \sup_{\|x\|_2 \leq 1} \|f(x) - F(x)\|_2 \leq \epsilon \right] \geq 1-\delta$

In stochastic mask optimization (Hayou et al., 2021), minimization of empirical Gibbs risk induces data-adaptive $L_1$ regularization and preferential retention of weights best aligned to label features. Extensions to PAC-Bayes pruning provide explicit self-bounded generalization error via data-dependent priors and joint optimization of weights and stochastic mask parameters.

5. Computational Complexity and Implementation Considerations

Key operations for PMP include:

Sorting $K$ weights: $O(K \log K)$
Forward passes for empirical risk: $O(Q n \cdot \text{cost}_{\text{forward}})$, often reduced by incremental masking
Per-candidate p-value calculation: $O(1)$

Best practices highlighted:

Use sparse tensor representations post-pruning for accelerated inference
Precompute and cache layer-wise masks for repeated risk evaluation
Parallelize batch risk computation
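
Incremental masking can be sketched as follows: because the masks for increasing $\lambda_j$ are nested, the weights are sorted once and each step zeroes only the newly pruned entries. The generator name `incremental_masks` is hypothetical, and the count of pruned weights is taken as $\lfloor \lambda_j K \rfloor$, an assumed convention that ignores ties at the threshold.

```python
def incremental_masks(weights, Q):
    """Yield (lambda_j, pruned copy) for j = 0..Q-1, sorting only once."""
    K = len(weights)
    # One O(K log K) sort of indices by magnitude, reused for every level.
    order = sorted(range(K), key=lambda i: abs(weights[i]))
    current = list(weights)
    removed = 0
    for j in range(Q):
        target = (j * K) // Q              # weights pruned at lambda_j
        for i in order[removed:target]:
            current[i] = 0.0               # only newly pruned weights change
        removed = target
        # A copy is yielded for safety; in-place reuse avoids the O(K) copy.
        yield j / Q, list(current)


w = [0.5, -0.1, 0.03, -2.0, 0.8]
levels = list(incremental_masks(w, Q=5))
print(levels[2])
```

Across the whole grid, each weight is written at most once, so the masking cost is $O(K)$ on top of the single sort.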

Variational PMP settings (e.g., for GCNs) select smoothing parameters ($\sigma$) for band-stop functions and histogram binning ($K$) to suit desired exact budget attainment. Training runs are efficiently supported on commodity GPUs (Sahbi, 2023).

6. Experimental Evaluations and Practical Tradeoffs

Reported experimental results are summarized below:

Dataset & Architecture	Calibration/Test Split	Baseline(s)	α-tolerance, δ-budget	Achieved Sparsity	Notes
MNIST / FCN (118k params)	9k calibration / 1k test	Naive MP, calibrated MP	α=0.03, δ=0.1	λ*=0.68–0.78	PMP respects α, naive method violates
PolypGen / U-Net (13.8M)	465 calibration / 50 test images	Global MP	α=0.05, δ=0.05	λ*=0.06	Much lower sparsity certified than on MNIST
FPHA / GCN	575 test sequences	Classical MP, PMP+Gaussian	Fixed r	r=55%–99%	PMP+Gaussian outperforms MP at high sparsity

PMP achieves coverage at least $1-\delta$ empirically, with selective calibration strategies enabling further control over confidence thresholds and abstention rates in prediction. Results indicate higher robustness of PMP with Laplace priors at extreme sparsities and improved generalization under Gaussian priors even without pruning (Sahbi, 2023).

7. Limitations and Extensions

PMP in its strongest form assumes monotonic risk increase with sparsity under one-shot magnitude pruning. Iterative or structured schemes may require more general multiple testing corrections (e.g., Holm–Bonferroni). PMP requires a held-out calibration set, rendering the method sensitive to calibration error in data-scarce regimes; bootstrap or cross-conformal variants offer mitigation at increased computation. Proposed extensions include:

Joint calibration of sparsity level $\lambda$ and confidence threshold $\tau$
Adapting PMP to iterative or structured pruning settings
Incorporating second-order relevance metrics (e.g., Hessian-based scores)
Layer-wise, architecture-aware tuning of $\alpha$, $\delta$ budgets
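
For settings where risk is not monotone in sparsity, the Holm–Bonferroni correction mentioned above can replace fixed-sequence testing. The sketch below is a generic Holm step-down procedure, not code from the cited work; `holm_rejections` is a hypothetical name and the p-values are made up for illustration.

```python
def holm_rejections(p_values, delta):
    """Indices of hypotheses rejected by Holm's step-down procedure.

    Controls the family-wise error rate at level delta without assuming
    any ordering or monotonicity among the hypotheses.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    rejected = []
    for rank, j in enumerate(order):        # rank 0 is the smallest p-value
        if p_values[j] <= delta / (m - rank):
            rejected.append(j)
        else:
            break                           # stop at the first failure
    return sorted(rejected)


p = [0.001, 0.04, 0.03, 0.5]                # one p-value per lambda_j = j / 4
gamma = holm_rejections(p, delta=0.1)
lam_star = max(gamma) / len(p) if gamma else None
print(gamma, lam_star)
```

The price of dropping the monotonicity assumption is a stricter per-test threshold, which generally certifies a smaller $\lambda^*$ than fixed-sequence testing on the same p-values.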

Practical selection of $\alpha$ should reflect domain-specific tolerable drops, typically in $[0.01, 0.05]$ for vision classification. Empirical evidence confirms that the sparsity parameter $\lambda^*$ is more sensitive to $\alpha$ than to $\delta$ (Alvarez, 8 Aug 2024).

Probabilistic Magnitude Pruning thus merges rigorous statistical calibration, variational and probabilistic modeling of sparsity, and practical algorithmic strategies to reliably compress deep neural networks under explicit risk control and budget constraints (Alvarez, 8 Aug 2024; Sahbi, 2023; Qian et al., 2021; Hayou et al., 2021).