Structured Pruning of MLPs

Updated 14 December 2025
  • Structured pruning of MLPs is a method that removes entire neurons, rows/columns, or blocks to systematically compress network architectures.
  • It leverages strategies like submodular optimization, gradient-based saliency, and group regularization to balance significant compression with minimal performance degradation.
  • Empirical studies show that structured pruning achieves major reductions in FLOPs, memory, and inference latency, making models more hardware efficient for scalable deployment.

Structured pruning of multilayer perceptrons (MLPs) is a systematic approach to reducing network complexity by removing entire neurons (units), rows/columns of weight matrices, or contiguous blocks in a way that preserves dense submatrices and is hardware-efficient. Unlike unstructured pruning, which zeroes individual weights and often leads to irregular sparsity patterns, structured pruning yields models with clear architectural reductions—e.g., thinner or shorter dense layers—improving memory footprint, inference latency, and sometimes generalization. The design, theoretical justification, and empirical behavior of structured-pruned MLPs are active topics in network compression and scalable deployment, with state-of-the-art methods leveraging optimization, submodular theory, gradient- or activation-based heuristics, and regularized training.

1. Formalism and Objectives of Structured Pruning

Structured pruning removes entire groups of parameters within an MLP: typically full neurons (rows/columns of weight matrices), contiguous “blocks,” or patterns inducing row- and/or column-wise sparsity. Formally, given an m-layer MLP

$$\Phi(x) = \phi_m\bigl(W_m\,\phi_{m-1}(\dots \phi_1(W_1 x)\dots)\bigr),$$

structured pruning targets subsets $S_\ell \subseteq \{1,\dots,n_\ell\}$ at each hidden layer $\ell$, retaining only the neurons outside $S_\ell$ and physically excising the rows and columns of $W_\ell$ and $W_{\ell+1}$ corresponding to $S_\ell$. This can be succinctly described by gate matrices

$$G(z;S) = I + (z-1)\,\operatorname{diag}(\mathbf{1}_{i\in S}),$$

so that $W_\ell \to G(0;S_\ell)\,W_\ell$ removes the rows or columns indexed by $S_\ell$ (Cheairi et al., 6 Dec 2025).
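
A minimal NumPy sketch of this excision step is given below; the shapes and names are illustrative assumptions (a single hidden layer with incoming weights $W_1$ and outgoing weights $W_2$), not code from the cited work.

```python
import numpy as np

def prune_hidden_layer(W1, b1, W2, prune_idx):
    """Structurally remove the hidden units in `prune_idx`.

    W1: (n_hidden, n_in) incoming weights, b1: (n_hidden,) biases,
    W2: (n_out, n_hidden) outgoing weights. The result stays dense.
    """
    keep = np.setdiff1d(np.arange(W1.shape[0]), prune_idx)
    return W1[keep, :], b1[keep], W2[:, keep]

# Example: a 4-16-3 MLP pruned down to 4-10-3.
rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(16, 4)), rng.normal(size=16), rng.normal(size=(3, 16))
W1_p, b1_p, W2_p = prune_hidden_layer(W1, b1, W2, prune_idx=[0, 3, 5, 7, 9, 11])
print(W1_p.shape, W2_p.shape)  # (10, 4) (3, 10)
```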

Pruning is performed while balancing three (often competing) desiderata:

  • Maximize compression: Remove as many neurons or blocks as possible.
  • Minimize performance loss: Bound the change in output, loss, or task accuracy.
  • Meet hardware or implementation constraints: Ensure the remaining architecture is dense (no irregular sparsity) for fast inference.

This balance distinguishes structured pruning from both unstructured pruning and architecture search.

2. Algorithmic Approaches to Structured MLP Pruning

Structured pruning methodologies can be organized into several broad strategies:

Greedy and Submodular Optimization

Data-efficient post hoc pruning can be cast as a subset selection problem. For a given layer $\ell$ with $n_\ell$ neurons, the optimal $k$-neuron subset $S$ minimizes the perturbation to the next layer's input:

$$\min_{S:\,|S|\leq k}\;\min_{\tilde W}\;\bigl\|A^\ell W^{\ell+1}-A^\ell_S\tilde W\bigr\|_F^2,$$

where $A^\ell$ is the activation matrix and $A^\ell_S$ zeroes the columns not in $S$ (Halabi et al., 2022). The corresponding objective $F(S)$ is weakly submodular, admitting an efficient greedy maximization with provable approximation guarantees:

$$F(\hat S) \geq \bigl(1-e^{-\gamma_{\hat S,k}}\bigr)\max_{|S|\leq k}F(S).$$

Closed-form least-squares reweighting on the surviving activations yields the optimal next-layer weights, and under mild assumptions the output error decays exponentially in $k$, the number of neurons kept.
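
The naive greedy procedure implied by this objective can be sketched in a few lines of NumPy; the function below is an illustrative, unoptimized rendering under the notation above (activation matrix `A`, next-layer weights `W_next`), not the cited paper's implementation.

```python
import numpy as np

def greedy_prune(A, W_next, k):
    """Greedily select k columns of A that best reconstruct A @ W_next (Frobenius norm),
    reweighting the surviving activations by closed-form least squares at each step."""
    target = A @ W_next
    selected = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(A.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            W_tilde, *_ = np.linalg.lstsq(A[:, cols], target, rcond=None)
            err = np.linalg.norm(target - A[:, cols] @ W_tilde)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
    # final reweighted next-layer weights for the retained neurons
    W_tilde, *_ = np.linalg.lstsq(A[:, selected], target, rcond=None)
    return selected, W_tilde
```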

Gradient-based Saliency and Pre-training Pruning

Pre-training or “single-shot” methods (e.g., SNIP/3SP) rank units by their effect on the change in loss under a first-order Taylor approximation. Letting $r_{\ell,j} = \bigl|\partial L / \partial m_{\ell,j}\bigr| / \bar c_\ell$, where $m_{\ell,j}$ is a binary mask on unit $j$ of layer $\ell$ and $\bar c_\ell$ is its compute cost, the lowest-scoring units can be excised before training begins (Amersfoort et al., 2020). This approach ensures that the remaining MLP is both computationally efficient and quick to train, with a typical accuracy loss of $\leq 1\%$ even under aggressive (50%) pruning.
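
The following PyTorch sketch illustrates this kind of single-shot, mask-gradient scoring for the output units of one linear layer; the hook-based masking and the scalar `cost` normalizer are illustrative assumptions rather than the exact SNIP/3SP procedure.

```python
import torch

def unit_saliency(model, layer, x, y, loss_fn, cost=1.0):
    """Score each output unit of `layer` by |dL/dm| / cost, where m is a
    multiplicative all-ones mask on the layer's activations."""
    mask = torch.ones(layer.out_features, device=x.device, requires_grad=True)
    handle = layer.register_forward_hook(lambda mod, inp, out: out * mask)
    loss = loss_fn(model(x), y)
    loss.backward()
    handle.remove()
    return mask.grad.abs() / cost  # prune the lowest-scoring units before training
```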

Structured Regularization during Training

Penalizing group norms (e.g., group Lasso) on columns or rows directly induces neuron-level sparsity:

$$\mathcal{L}_{\rm total} = \mathcal{L}_{\rm task} + \lambda \sum_{i=1}^{L}\sum_{j=1}^{m_{i+1}} \bigl\|W_i[:,j]\bigr\|_2,$$

followed by thresholding near-zero columns/rows to prune weight blocks (Hubens et al., 2023, Cacciola et al., 2023). In these approaches, the group-norm penalty is scheduled to reach a desired global sparsity, and iterative thresholding with fine-tuning restores accuracy post-pruning.
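
In PyTorch terms (where `nn.Linear` stores its weight as an (out, in) matrix, so the column groups in the formula above correspond to weight rows), a minimal sketch of the penalty and the subsequent thresholding might look as follows; the threshold `tau` and the layer container are illustrative assumptions.

```python
import torch
import torch.nn as nn

def group_lasso_penalty(linears):
    """Sum of L2 norms over per-neuron weight groups (one group per output unit)."""
    return sum(layer.weight.norm(dim=1).sum() for layer in linears)

def prunable_units(layer: nn.Linear, tau: float = 1e-3):
    """Indices of output units whose incoming weight group has collapsed below tau."""
    return (layer.weight.norm(dim=1) < tau).nonzero(as_tuple=True)[0]

# During training:  loss = task_loss + lam * group_lasso_penalty(model_linears)
# After training:   excise the units returned by prunable_units(...) and fine-tune.
```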

Enhancements such as perspective regularization (SPR) and structured-pattern regularizers further promote block-based removal and co-clustering of rows and columns, leading to highly regular patterns and improved hardware utilization (Cacciola et al., 2023, Park et al., 2021).

Probabilistic and Information-Theoretic Approaches

Dynamic Probabilistic Pruning (DPP) leverages trainable, differentiable $k$-out-of-$n$ sampling per output neuron, using Gumbel-softmax relaxation and entropy regularization to enforce hardware-friendly masks:

$$\sum_{i=1}^{n} M_{i,j} = k, \qquad \tilde W_{i,j} = W_{i,j} M_{i,j}.$$

Backpropagation is performed through a relaxed soft-top-$k$ mask, and mask entropy and diversity are tracked during training (Gonzalez-Carabarin et al., 2021).
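
A generic straight-through sketch of such a relaxed k-out-of-n mask is shown below (Gumbel-perturbed logits, hard top-k forward pass, softmax surrogate gradient); it illustrates the mechanism rather than the exact sampler of the cited work, and the temperature `tau` is an assumed hyperparameter.

```python
import torch

def relaxed_topk_mask(logits, k, tau=0.5):
    """Sample a k-hot mask over the last dimension with a differentiable surrogate."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))  # Gumbel(0, 1) noise
    scores = (logits + gumbel) / tau
    soft = torch.softmax(scores, dim=-1)                       # gradient path
    hard = torch.zeros_like(soft)
    hard.scatter_(-1, scores.topk(k, dim=-1).indices, 1.0)     # exact k-hot mask
    return hard + soft - soft.detach()                         # straight-through estimator

# Usage: W_tilde = W * relaxed_topk_mask(mask_logits, k=8), with learnable mask_logits.
```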

Variational methods employ dropout-style, unit-wise Bernoulli random variables with learnable parameters and hyper-priors to stochastically drive unit-retention probabilities to zero or one. Theoretical analysis establishes convergence to deterministic, pruned subnetworks (Guenter et al., 2022).

Functional Network Preservation and Structural Dropout

Recent methods for large-scale transformers leverage statistical dependency structure or treat the MLP layer as a “digital brain,” performing independent component analysis (ICA) across activation time series to reveal functional networks. Only neurons critical to the identified subnetworks are preserved, and all others are pruned en bloc (Liu et al., 7 Aug 2025). Structural Dropout (SD) instead randomly truncates hidden-state channels during training, inducing a natural importance ordering such that the pruned model at any width $k$ attains near-optimal performance using a contiguous set of the most informative neurons (Knodt, 2022).
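
A minimal sketch of the Structural Dropout idea is given below; the width distribution and the `min_width` floor are illustrative assumptions.

```python
import torch

def structural_dropout(h, training=True, min_width=8):
    """h: (batch, n_hidden) activations. During training, keep only the first k
    channels for a randomly drawn k, so low-index channels learn to carry the most
    information and any prefix width yields a usable pruned model."""
    if not training:
        return h
    n = h.shape[-1]
    k = int(torch.randint(min_width, n + 1, (1,)))
    mask = torch.zeros(n, device=h.device, dtype=h.dtype)
    mask[:k] = 1.0
    return h * mask
```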

3. Theoretical Guarantees and Compressibility of Wide MLPs

Compression bounds for structured pruning in wide MLPs have received substantial attention. Under broad conditions (Lipschitz activations, bounded weight norms), randomized greedy pruning can remove a fixed fraction $p$ of neurons per wide layer while preserving the average squared error up to $O\!\bigl(\sigma^{2m}(1+\xi)^m \xi\bigr)$, where $m$ is the depth (Cheairi et al., 6 Dec 2025). This theoretical framework generalizes Optimal Brain Damage (OBD) by canceling first-order loss terms in expectation and controlling higher-order remainders via Lindeberg interpolation.

Pruning flexibility versus width is formalized: highly overparameterized layers (with large $n_\ell$) admit high compressibility, while bottleneck and narrow layers constrain pruning. These analyses inform architectural design for a target sparsity.

4. Empirical Behavior, Hardware, and Implementation

Practical studies repeatedly demonstrate that structured pruning can achieve dramatic reductions in MLP parameter counts, FLOPs, and memory footprint, often on the order of $2\times$ to $10\times$, while incurring sub-percent accuracy loss on benchmarks such as MNIST and CIFAR-10 as well as on LLMs (Halabi et al., 2022, Amersfoort et al., 2020, Hubens et al., 2023, Zhu et al., 10 Jun 2025, Liu et al., 7 Aug 2025). Aggressive pruning ratios above 75% typically yield diminishing accuracy returns, marking the accepted compression frontier.

Key empirical observations include:

  • Closed-form weight reweighting after pruning (via least-squares or layer-wise scaling) ameliorates feature-map mismatches and further boosts downstream task performance (Halabi et al., 2022, Nova et al., 2023).
  • Fine-tuning after structured pruning is almost always beneficial, especially at high compression—restoring most degradation in “one-shot” pruned settings (Halabi et al., 2022, Zhu et al., 10 Jun 2025).
  • Hardware benefits accrue only if whole rows/columns (neurons) are removed; block-structured layouts (a fixed $k$ per group) simplify deployment (Gonzalez-Carabarin et al., 2021).
  • Advanced regularization (patterning or group-sparsity) further enhances hardware-friendliness by concentrating surviving weights in contiguous blocks, optimizing for BLAS engines and dense accelerators (Park et al., 2021, Liu et al., 7 Aug 2025).

5. Specialized Structured Pruning in Transformer MLPs and LLMs

Transformer-style MLP submodules (feed-forward networks/FFNs) are a major focus for structured pruning in large-scale models due to their high parameter density. Key specialized advances include:

  • Gradient-free structured pruning using representative and activation-driven rankings, without labels or gradients, efficiently scaling to BERT and DistilBERT (Nova et al., 2023).
  • Self-distillation MLP pruning (SDMPrune): Taylor-based importance ranking with a distillation-based loss, focusing on MLPs (rather than attention heads) to yield superior zero-shot accuracy/perplexity trade-offs on LLaMA, ChatGLM, and Vicuna models. Pruning to $\sim 80\%$ of the original width yields a 20–30% reduction in parameters and inference cost with an acceptably small drop in performance (Zhu et al., 10 Jun 2025).
  • Functional network preservation via ICA: for each layer, functional subnetworks are identified by statistical independence across neuron activations, and pruning is restricted to retain at least one representative from each such module, achieving performance and perplexity advantages over magnitude-based and depth-pruning baselines (Liu et al., 7 Aug 2025). A schematic sketch of this selection step follows this list.
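
As a schematic illustration of the ICA-based selection step, the sketch below uses scikit-learn's FastICA on a layer's activation matrix and keeps the strongest-loading neurons per independent component; the component count, per-component quota, and selection rule are illustrative assumptions, not the cited method's exact procedure.

```python
import numpy as np
from sklearn.decomposition import FastICA

def functional_network_keep_set(acts, n_components=16, per_component=32):
    """acts: (n_tokens, n_neurons) activations of one MLP layer.
    Returns indices of neurons to retain; everything else is pruned en bloc."""
    ica = FastICA(n_components=n_components, random_state=0)
    ica.fit(acts)                      # components_ has shape (n_components, n_neurons)
    keep = set()
    for comp in np.abs(ica.components_):
        # retain the neurons that load most strongly on this functional network
        keep.update(np.argsort(comp)[-per_component:].tolist())
    return sorted(keep)
```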

6. Applications, Limitations, and Open Problems

Structured pruning of MLPs is established as a leading technique for:

  • Model compression for edge and resource-constrained inference.
  • Constraint learning and embedding into mixed-integer programs (MIP), with dramatic reductions in MIP solver time and resource usage (Cacciola et al., 2023).
  • Optimizing training and inference latency in very large neural networks.

Key limitations and ongoing challenges include:

  • First-order approximation artifacts in ranking-based methods, with potential approximation errors at extreme pruning ratios or in architectures with pronounced inter-unit dependencies (Amersfoort et al., 2020, Halabi et al., 2022).
  • Layerwise sparsity patterns remain an open area of study, especially in conjunction with automatic mixed precision or quantization.
  • Iterative (multi-shot) pruning versus one-shot: “greedy” or “simultaneous” schemes may exhibit differing local optima, and their practical utility depends on dataset, architecture, and target compression (Park et al., 2021, Halabi et al., 2022).

Future work is expected to unify submodular and information-theoretic analyses further, automate hyperparameter selection for regularization-based methods, and more deeply integrate pruning with architectural search and neural reparameterization.

