Contribution-Based Pruning in Neural Networks
- Contribution-based pruning is a technique that quantifies each component’s impact on a model’s loss function, enabling precise model sparsification.
- It employs methods such as Taylor expansion, Shapley value attribution, and mutual information to assess the importance of weights, neurons, or filters.
- This approach supports efficient deployment in resource-constrained settings by preserving accuracy while outperforming simpler heuristic pruning methods.
Contribution-based pruning is a paradigm in the compression and acceleration of neural networks and probabilistic models in which model parameters, structures, or features are pruned according to quantitative estimates of their individual or group contribution to predictive task performance or objective function value. In contrast to simple magnitude, randomness, or heuristic approaches, contribution-based pruning seeks to directly measure the impact of removing a component—weight, neuron, filter, parent set, or embedding—on critical loss functions, generalization, or practical downstream criteria. This approach has achieved prominence in both deep learning and Bayesian structure learning, yielding methods that demonstrate superior retention of accuracy at high sparsity, adaptability to low-data or transfer learning regimes, and transparent, theoretically justified attribution of importance.
1. Foundational Principles and Mathematical Criteria
Contribution-based pruning methods quantify the saliency or importance of parameters by estimating the change in a cost function induced by their removal, masking, or perturbation. Fundamental to this paradigm are Taylor-expansion–based criteria, Shapley value attributions, sensitivity gradients, and mutual information:
- Taylor Expansion Criteria: The cost increase induced by pruning a parameter can be approximated by a first-order Taylor expansion around the current network state. The saliency is
$$\Theta_{\mathrm{TE}}(h_i) = \left|\Delta C(h_i)\right| \approx \left|\frac{\partial C}{\partial h_i}\, h_i\right|,$$
where $C$ is the cost (loss) and $h_i$ is the output or activation to prune (Molchanov et al., 2016). For multi-dimensional units (e.g., feature maps), averages or accumulations of this value are used (a minimal sketch appears after this list).
- Shapley Value Attribution: For structured pruning (neurons, filters, embeddings), the marginal contribution of a unit $i$ is formalized as its Shapley value:
$$\varphi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),$$
where $N$ is the set of prunable units and $v(S)$ is the model loss or performance metric evaluated with only the units in $S$ active (Ancona et al., 2020, Tran et al., 20 Nov 2024, Sun et al., 3 May 2025). A permutation-sampling sketch appears after this list.
- Gradient Sensitivity and Mask Gradients: Methods such as SNIP (Lee et al., 2018) introduce binary connection masks $c_j$ at initialization and measure the absolute gradient of the loss with respect to each mask. The normalized score $s_j = |\partial L / \partial c_j| \,/\, \sum_k |\partial L / \partial c_k|$ quantifies the sensitivity of the loss to each connection.
- Mutual Information: Layerwise pruning can also be formulated by maximizing the mutual information between the retained activations of a layer and those of the next layer, i.e., choosing the retained set $S$ to maximize $I\bigl(h^{(l)}_S;\, h^{(l+1)}\bigr)$.
This metric favors retaining units that provide maximal predictive content to the subsequent layer (Fan et al., 2021).
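As a concrete illustration of the first-order Taylor criterion above, the following minimal PyTorch sketch ranks the channels of a convolutional layer by the batch-averaged value of |activation × gradient|; `model`, `conv_layer`, `data_loader`, and `loss_fn` are hypothetical placeholders, not names from the cited work.

```python
import torch

def taylor_saliency(model, conv_layer, data_loader, loss_fn, device="cpu"):
    """Accumulate per-channel first-order Taylor scores |mean_spatial(a * dC/da)|."""
    acts = {}

    def hook(module, inputs, output):
        output.retain_grad()        # keep d(loss)/d(feature map) after backward
        acts["out"] = output

    handle = conv_layer.register_forward_hook(hook)
    scores = None
    for x, y in data_loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        a, g = acts["out"], acts["out"].grad            # shapes (N, C, H, W)
        s = (a * g).mean(dim=(2, 3)).abs().mean(dim=0)  # per-channel saliency
        scores = s.detach() if scores is None else scores + s.detach()
    handle.remove()
    return scores   # lowest-scoring channels are the natural pruning candidates
```

In practice the scores are re-estimated and the network fine-tuned between pruning steps, as discussed in Section 4.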
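The Shapley values above are rarely computed exactly; a common approximation, shown here only as a sketch, is Monte Carlo permutation sampling over the prunable units. `evaluate` is a hypothetical callable that returns model performance for a given keep-mask.

```python
import numpy as np

def shapley_permutation(evaluate, num_units, num_permutations=100, seed=0):
    """Monte Carlo estimate of per-unit Shapley values via random orderings."""
    rng = np.random.default_rng(seed)
    values = np.zeros(num_units)
    for _ in range(num_permutations):
        order = rng.permutation(num_units)
        mask = np.zeros(num_units, dtype=bool)
        prev = evaluate(mask)              # performance with every unit pruned
        for i in order:
            mask[i] = True                 # re-enable unit i
            cur = evaluate(mask)
            values[i] += cur - prev        # marginal contribution in this ordering
            prev = cur
    return values / num_permutations
```

Averaging marginal contributions over many random orderings converges to the exact Shapley values while requiring only a user-chosen number of model evaluations.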
Contribution estimation can further incorporate operator norms (measuring a filter's maximal output effect) (Singh et al., 2023), influence functions (loss derivatives after perturbation) (Lai et al., 2021), or channel independence metrics (the unique explanatory power of a filter given other filters) (Sui et al., 2021).
2. Comparison to Classical and Heuristic Pruning Criteria
Contribution-based pruning methods differ fundamentally from classical magnitude-based or random pruning by aligning the pruning process with actual or surrogate changes in global objectives:
Criterion | What Is Measured | Data-Dependence | Captures Interactions | Computation |
---|---|---|---|---|
Weight Norm | $\ell_1$/$\ell_2$ magnitude of weights | No | No | Low |
Activation-Based | Mean/variance of activations | Yes | No | Low |
Taylor (1st/2nd) | $\left|\frac{\partial C}{\partial h}\, h\right|$, Hessian terms | Yes | No (1st), Partial (2nd) | Moderate |
Shapley Value | Marginal contribution over all subsets | Yes | Yes | High (approx.) |
Channel Independence | Nuclear-norm drop when the channel is removed | Yes | Inter-channel | Moderate |
Influence Function | Loss gradient for small mask perturbations | Yes | Yes (batch-robust) | Moderate |
Mutual Information | Entropy reduction by inclusion | Yes | Inter-layer | Moderate |
Contribution-based methods (Taylor, Shapley, Influence, MI, Independence) consistently outperform magnitude or activation-based approaches in scenarios where preserving structure and accuracy under high sparsity is required, especially when retraining is limited (Molchanov et al., 2016, Ancona et al., 2020, Lai et al., 2021, Fan et al., 2021, Sui et al., 2021, Singh et al., 2023). Notably, Shapley value and Jacobian-based criteria uniquely satisfy “null player,” symmetry, and efficiency properties, properly attributing importance even in the presence of interactions and redundancy.
3. Application Domains and Experimental Validation
Contribution-based pruning has demonstrated effectiveness across several neural and probabilistic models:
- Convolutional Neural Networks: Taylor expansion-based policies yield superior pruning–accuracy tradeoffs over weight-magnitude or activation-sparsity approaches, supporting theoretical 10× or practical 5× accelerations on transfer learning tasks and ImageNet-scale networks (Molchanov et al., 2016).
- Structured Pruning with Shapley Value: Removal of filters or neurons ranked by Shapley value produces the smallest performance degradation, with graceful loss curves, especially in low-data or retrain-limited regimes (Ancona et al., 2020). Embedding pruning using Shapley values enables single-shot, retrain-free reductions for recommender systems, utilizing a field-aware codebook to mitigate information loss (Tran et al., 20 Nov 2024).
- Transformer Block and Layer Pruning: Strategies such as MultiPruner combine block, channel, and head pruning guided by blockwise contribution, and Shapley Value–based non-uniform pruning assigns tailored sparsity levels to each Transformer layer based on per-layer SV, improving perplexity and average task accuracy on LLaMA, OPT, and Baichuan2 (Muñoz et al., 17 Jan 2025, Sun et al., 3 May 2025).
- Autoencoders and Evolutionary Contexts: Activation-guided mutation operators (variance, boolean logical conjunction) demonstrate initial superiority in canonical training, whereas purely random pruning is sometimes favored in coevolutionary frameworks with large population diversity (Jorgensen et al., 8 May 2025).
- Score-Based Graphical Models: Bayesian network structure learning benefits from contribution-based pruning of parent sets using tight, theoretically sound upper bounds on BDeu (and related) scores: only candidate sets whose best-case contribution could surpass the best subset are retained, enabling orders-of-magnitude reductions in search space (Correia et al., 2019).
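The parent-set pruning logic in the last item can be sketched as follows, assuming hypothetical `score` and `score_upper_bound` functions for a decomposable (e.g., BDeu-style) local score; this illustrates the retain-only-if-the-bound-can-beat-a-subset rule, not the cited algorithm itself.

```python
from itertools import combinations

def prune_parent_sets(variable, candidates, max_size, score, score_upper_bound):
    """Keep a candidate parent set only if its score upper bound can beat
    the best score already achieved by one of its subsets."""
    kept, best = [], {}
    for size in range(max_size + 1):
        for parents in combinations(candidates, size):
            parents = frozenset(parents)
            # best score among strict subsets evaluated so far
            best_subset = max(
                (best[s] for s in best if s < parents), default=float("-inf")
            )
            if size > 0 and score_upper_bound(variable, parents) <= best_subset:
                continue                      # cannot beat a simpler parent set
            s = score(variable, parents)
            best[parents] = max(s, best_subset)
            kept.append((parents, s))
    return kept
```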
Experimental metrics consistently show strong Spearman correlations between contribution-based criteria and oracle "true" importance (Molchanov et al., 2019). Accuracy reductions at fixed sparsity are consistently less severe than for weight-magnitude baselines (Molchanov et al., 2016, Ancona et al., 2020, Tran et al., 20 Nov 2024, Sun et al., 3 May 2025, Muñoz et al., 17 Jan 2025).
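Such oracle comparisons can be reproduced in outline by measuring the true loss increase from removing each unit individually and rank-correlating it with the cheap criterion; `estimate_scores` and `loss_without_unit` below are hypothetical helpers.

```python
from scipy.stats import spearmanr

def criterion_vs_oracle(estimate_scores, loss_without_unit, base_loss, num_units):
    """Spearman rank correlation between a cheap criterion and oracle importance."""
    estimated = estimate_scores()                                   # cheap criterion scores
    oracle = [loss_without_unit(i) - base_loss for i in range(num_units)]
    rho, _ = spearmanr(estimated, oracle)
    return rho   # close to 1.0 means the criterion ranks units like the oracle
```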
4. Theoretical Guarantees, Implementation, and Computational Considerations
Contribution-based pruning offers theoretical guarantees such as:
- Satisfaction of fairness axioms for SHAP-pruning (null player, symmetry, efficiency, linearity) (Ancona et al., 2020, Tran et al., 20 Nov 2024, Sun et al., 3 May 2025), providing interpretability and optimality in contribution attribution.
- First-order accuracy for Taylor and influence-based metrics: gradient information suffices because, after training, product terms like activation × gradient retain informative variance even if individual gradients are small (Molchanov et al., 2016, Molchanov et al., 2019, Lai et al., 2021).
- Blockwise and inter-layer dependence capture via the Jacobian Criterion, which directly quantifies intra-component and inter-layer correlations, surpassing diagonal or per-parameter criteria (e.g., Fisher, Taylor, weight norm) (Chen et al., 7 Aug 2025). Formally, the squared change of the empirical loss under a pruning perturbation $\Delta\theta$ is approximated by the quadratic form $\Delta\theta^{\top} J^{\top} J\, \Delta\theta$, where $J$ stacks per-sample loss gradients with respect to the affected parameters (see the sketch after this list).
- Global versus local estimation tradeoffs: Exact Shapley value or tight mutual information computations scale exponentially, but efficient approximations (permutation sampling (Ancona et al., 2020), sliding windows (Sun et al., 3 May 2025), field grouping (Tran et al., 20 Nov 2024)) yield tractable implementations with controlled bias.
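A minimal sketch of the $J^{\top}J$ quadratic form mentioned above, assuming $J$ stacks per-batch loss gradients with respect to a candidate parameter group and that pruning corresponds to zeroing the group ($\Delta\theta = -\theta$); this illustrates the structure of such a criterion rather than the exact procedure of the cited work.

```python
import torch

def jacobian_group_score(model, group_params, data_loader, loss_fn, device="cpu"):
    """Approximate squared loss change for zeroing a parameter group via delta^T J^T J delta."""
    rows = []
    for x, y in data_loader:
        x, y = x.to(device), y.to(device)
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, group_params)   # loss gradients for this batch
        rows.append(torch.cat([g.reshape(-1) for g in grads]).detach())
    J = torch.stack(rows)                                  # (num_batches, num_group_params)
    delta = -torch.cat([p.detach().reshape(-1) for p in group_params])  # zero the group
    Jd = J @ delta
    return float(Jd @ Jd)                                  # delta^T (J^T J) delta
```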
Most approaches interleave pruning with fine-tuning and may employ special mechanisms to retain the information carried by pruned weights during recovery (e.g., the Equivalent Pruning autoencoder mechanism (Chen et al., 7 Aug 2025)). In practice, regularization (e.g., via L1-activation penalties (Hussien et al., 21 Oct 2024)) or staged rewinding (Zhao et al., 2023) can facilitate precise, contribution-driven sparsification.
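As a hedged example of such regularization, an L1 penalty on a layer's activations can be folded into an ordinary PyTorch training step so that weakly contributing units shrink toward zero before pruning; `model`, `layer`, `loss_fn`, and `optimizer` are placeholders.

```python
import torch

def train_step(model, layer, x, y, loss_fn, optimizer, l1_weight=1e-4):
    """One training step with an added L1 penalty on `layer`'s activations."""
    acts = {}
    handle = layer.register_forward_hook(lambda m, inp, out: acts.update(out=out))
    optimizer.zero_grad()
    task_loss = loss_fn(model(x), y)
    penalty = l1_weight * acts["out"].abs().mean()   # drives low-contribution activations to zero
    (task_loss + penalty).backward()
    optimizer.step()
    handle.remove()
    return task_loss.item(), penalty.item()
```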
5. Impact, Limitations, and Application Scenarios
Contribution-based pruning is particularly impactful in the following regimes:
- Transfer Learning and Specialization: Pruning large, pre-trained backbones adapted to specialized datasets enables up to 10× reductions in floating-point operations with minor accuracy loss (Molchanov et al., 2016).
- Hardware-Efficient Deployment: Structured pruning (entire filters, heads, layers) guided by contribution metrics maintains dense matrix structure, enabling practical acceleration on CPUs, GPUs, and specialized edge hardware (Zhao et al., 2023, Sui et al., 2021).
- Resource-Constrained and On-Device Scenarios: Methods that support single-shot, retraining-free or dynamic pruning (e.g., with Shapley-based grouping in recommender systems) facilitate rapid adaptation without repeated heavy computation (Tran et al., 20 Nov 2024).
- Low-Data or No-Refine Contexts: When fine-tuning is infeasible, theoretically principled metrics (especially Shapley value or blockwise Jacobian) result in the most graceful accuracy degradation relative to random, magnitude, or supervised gradient heuristics (Ancona et al., 2020, Chen et al., 7 Aug 2025).
- Explainability and Attribution: Input-specific or per-layer contribution-based pruning provides fine-grained attribution maps for interpretability or debugging (Khakzar et al., 2019).
However, limitations persist in computational cost for exact SV/MI computation, dependency on accurate importance estimation (necessitating appropriate accumulation or averaging (Molchanov et al., 2019, Lai et al., 2021)), and the trade-off between aggressive sparsity and risk of over-pruning—especially when interactions or redundant features dominate.
6. Prospective Directions and Synergies
Recent work identifies several avenues for advancing contribution-based pruning:
- Integration with Quantization and Other Compression: Combining contribution-based sparsification with quantization or low-rank decomposition for further computational reductions (Yue et al., 2019, Hussien et al., 21 Oct 2024).
- Benefits of Retained Contribution during Fine-Tuning: Approaches that use autoencoder-like mechanisms to temporarily preserve the information flow from pruned parameters during recovery outperform instantaneous zeroing (Chen et al., 7 Aug 2025).
- Dynamic and Adaptive Scheduling: Adaptive pruning, retraining, or threshold selection ensures robust performance across tasks and architectures (Zhao et al., 2023).
- Extension to Attention, Embeddings, and Non-Convex Models: Contribution-guided pruning is being actively researched in transformers (MLP heads/blocks (Muñoz et al., 17 Jan 2025)), attention maps (Zhao et al., 2023), and probabilistic graphical models (Correia et al., 2019).
- Theoretical Analysis of Contribution Estimators: Improved estimation for mutual information and influence in deep, highly non-linear (or overparameterized) models remains an open area (Hussien et al., 21 Oct 2024, Fan et al., 2021).
This evolution continues to position contribution-based pruning as an increasingly rigorous and effective approach in the optimization and deployment of complex machine learning models for both academic and industrial applications.