Structured Pruning of MLPs
- Structured pruning of MLPs is a method that removes entire neurons, rows/columns, or blocks to systematically compress network architectures.
- It leverages strategies like submodular optimization, gradient-based saliency, and group regularization to balance significant compression with minimal performance degradation.
- Empirical studies show that structured pruning achieves major reductions in FLOPs, memory, and inference latency, making models more hardware-efficient for scalable deployment.
Structured pruning of multilayer perceptrons (MLPs) is a systematic approach to reducing network complexity by removing entire neurons (units), rows/columns of weight matrices, or contiguous blocks in a way that preserves dense submatrices and is hardware-efficient. Unlike unstructured pruning, which zeroes individual weights and often leads to irregular sparsity patterns, structured pruning yields models with clear architectural reductions—e.g., thinner or shorter dense layers—improving memory footprint, inference latency, and sometimes generalization. The design, theoretical justification, and empirical behavior of structured-pruned MLPs are active topics in network compression and scalable deployment, with state-of-the-art methods leveraging optimization, submodular theory, gradient- or activation-based heuristics, and regularized training.
1. Formalism and Objectives of Structured Pruning
Structured pruning removes entire groups of parameters within an MLP: typically full neurons (rows/columns of weight matrices), contiguous “blocks,” or patterns inducing row- and/or column-wise sparsity. Formally, given an $m$-layer MLP

$$f(x) = W_m\,\sigma\!\big(W_{m-1}\,\sigma(\cdots\sigma(W_1 x + b_1)\cdots) + b_{m-1}\big) + b_m,$$

structured pruning targets a subset $S_\ell$ of units at each hidden layer $\ell$, retaining only the neurons indexed by $S_\ell$ and physically excising the corresponding rows of $W_\ell$, $b_\ell$ and columns of $W_{\ell+1}$. This can be succinctly described by gate matrices

$$D_{S_\ell} = \mathrm{diag}\big(\mathbf{1}[1\in S_\ell],\dots,\mathbf{1}[d_\ell\in S_\ell]\big),$$

so that $D_{S_\ell} W_\ell$ (equivalently, $W_{\ell+1} D_{S_\ell}$) zeroes the rows or columns whose indices lie outside $S_\ell$ (Cheairi et al., 6 Dec 2025).
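The following minimal sketch (NumPy, with illustrative dimensions and indices) shows how excising a set $S$ of hidden neurons reduces to slicing rows of one weight matrix and columns of the next, and that this is equivalent to applying the diagonal gate $D_S$ to the full model.

```python
# Minimal sketch (assumed shapes and indices): physically removing hidden
# neurons from a two-layer MLP block, equivalent to applying a binary gate D_S
# and then dropping the zeroed rows/columns so surviving matrices stay dense.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4
W1 = rng.normal(size=(d_hidden, d_in))   # layer l weights (rows = hidden units)
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))  # layer l+1 weights (columns = hidden units)

keep = np.array([0, 2, 3, 7, 9, 12])     # indices S of retained hidden neurons

# Structured pruning: excise rows of W1/b1 and the matching columns of W2.
W1_pruned, b1_pruned = W1[keep], b1[keep]
W2_pruned = W2[:, keep]

x = rng.normal(size=d_in)
relu = lambda z: np.maximum(z, 0.0)

# The pruned forward pass uses only dense, smaller matrices.
y_pruned = W2_pruned @ relu(W1_pruned @ x + b1_pruned)

# Equivalent gated computation with D_S = diag(1[i in S]) on the full model.
gate = np.zeros(d_hidden)
gate[keep] = 1.0
y_gated = W2 @ (gate * relu(W1 @ x + b1))
assert np.allclose(y_pruned, y_gated)
```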
Pruning is performed while balancing three (often competing) desiderata:
- Maximize compression: Remove as many neurons or blocks as possible.
- Minimize performance loss: Bound the change in output, loss, or task accuracy.
- Meet hardware or implementation constraints: Ensure the remaining architecture is dense (no irregular sparsity) for fast inference.
This balance distinguishes structured pruning from both unstructured pruning and architecture search.
2. Algorithmic Approaches to Structured MLP Pruning
Structured pruning methodologies can be organized into several broad strategies:
Greedy and Submodular Optimization
Data-efficient post hoc pruning can be cast as a subset selection problem. For a given layer with $n$ neurons, the optimal $k$-neuron subset $S$ minimizes the perturbation to the input of the next layer,

$$\min_{|S|\le k}\ \min_{\widehat W}\ \big\lVert A W^{\top} - A_S \widehat W^{\top}\big\rVert_F^2,$$

where $A$ is the activation matrix of the layer, $W$ the next-layer weights, and $A_S$ zeroes the columns not in $S$ (Halabi et al., 2022). The associated set function is weakly submodular, so greedy maximization is efficient and carries a provable approximation guarantee of the form $f(S_{\mathrm{greedy}}) \ge (1 - e^{-\gamma})\, f(S^\star)$, with $\gamma$ the submodularity ratio. Closed-form least-squares reweighting on the surviving activations yields optimal next-layer weights, and under mild assumptions the output error decays exponentially in $k$, the number of neurons kept.
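The sketch below illustrates this subset-selection view under simplifying assumptions: a greedy loop that, at each step, adds the neuron whose reweighted activations best reconstruct the original next-layer input, with the reweighting obtained in closed form by least squares. Names such as `greedy_prune`, `A`, and `W_next` are illustrative, not the reference implementation.

```python
# Hedged sketch of data-efficient structured pruning as subset selection,
# in the spirit of the weakly submodular objective of Halabi et al. (2022).
import numpy as np

def greedy_prune(A, W_next, k):
    """A: (n_samples, n_neurons) activations; W_next: (d_out, n_neurons)."""
    target = A @ W_next.T                      # original input to the next layer
    selected, remaining = [], list(range(A.shape[1]))
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in remaining:
            cols = selected + [j]
            # Closed-form least-squares reweighting on the surviving activations.
            W_new, *_ = np.linalg.lstsq(A[:, cols], target, rcond=None)
            err = np.linalg.norm(A[:, cols] @ W_new - target)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        remaining.remove(best_j)
    W_new, *_ = np.linalg.lstsq(A[:, selected], target, rcond=None)
    return selected, W_new.T                   # new next-layer weights: (d_out, k)

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 32))
W_next = rng.normal(size=(10, 32))
keep, W_reweighted = greedy_prune(A, W_next, k=8)
print(sorted(keep), W_reweighted.shape)
```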
Gradient-based Saliency and Pre-training Pruning
Pre-training or “single-shot” methods (e.g., SNIP/3SP) rank units by their estimated effect on the loss under a first-order Taylor approximation: attaching a binary gate $c_j\in\{0,1\}$ to each unit, the saliency is $s_j = \big|\partial\mathcal{L}/\partial c_j\big|$, optionally normalized by the unit’s compute cost, and the lowest-scoring units are excised before training begins (Amersfoort et al., 2020). This approach ensures that the remaining MLP is both computationally efficient and quick to train, with small accuracy loss even under aggressive (50%) pruning.
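A hedged sketch of this idea, for a toy one-hidden-layer MLP with squared loss so the gate gradients can be written analytically, is given below; the cost normalization and variable names are assumptions rather than the exact SNIP/3SP scoring rule.

```python
# Hedged sketch of single-shot, unit-level saliency: score each hidden unit by
# the first-order effect of its binary gate c_j on the loss, |dL/dc_j|.
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hidden, d_out, n = 16, 64, 4, 128
W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
W2 = rng.normal(size=(d_out, d_hidden)) / np.sqrt(d_hidden)
X = rng.normal(size=(n, d_in))
Y = rng.normal(size=(n, d_out))

A = np.maximum(X @ W1.T, 0.0)          # hidden activations, all gates c_j = 1
resid = A @ W2.T - Y                   # prediction error under squared loss
# dL/dc_j = sum over samples of a_j * (resid @ W2)_j  (first-order Taylor term)
grad_c = np.einsum('nj,nj->j', A, resid @ W2)

cost = np.full(d_hidden, W1.shape[1] + W2.shape[0])  # per-unit compute proxy
saliency = np.abs(grad_c) / cost       # optionally normalize by compute cost

k = d_hidden // 2                      # prune 50% of units before training
keep = np.argsort(-saliency)[:k]
W1, W2 = W1[keep], W2[:, keep]
print("new hidden width:", W1.shape[0])
```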
Structured Regularization during Training
Penalizing group norms (e.g., group Lasso) on the rows or columns associated with each neuron directly induces neuron-level sparsity,

$$\min_{W}\ \mathcal{L}(W) + \lambda \sum_{g}\lVert W_g\rVert_2,$$

where each group $g$ collects the weights of one hidden unit; thresholding near-zero columns/rows then prunes whole weight blocks (Hubens et al., 2023, Cacciola et al., 2023). In these approaches, group-norm minimization is scheduled to reach a desired global sparsity, and iterative thresholding with fine-tuning restores accuracy post-pruning.
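A minimal sketch of this pattern, assuming one group per hidden unit (its incoming row and outgoing column) and an illustrative threshold, is given below.

```python
# Minimal sketch (assumed setup) of group-Lasso style structured regularization:
# penalize the l2 norm of each hidden unit's weight group during training, then
# threshold near-zero groups to physically remove the corresponding neurons.
import numpy as np

def group_lasso_penalty(W1, W2):
    # One group per hidden unit: its incoming row in W1 and outgoing column in W2.
    group_norms = np.sqrt((W1 ** 2).sum(axis=1) + (W2 ** 2).sum(axis=0))
    return group_norms.sum(), group_norms

def prune_small_groups(W1, b1, W2, group_norms, tau=1e-2):
    keep = np.where(group_norms > tau)[0]   # drop groups driven near zero
    return W1[keep], b1[keep], W2[:, keep]

rng = np.random.default_rng(3)
W1, b1, W2 = rng.normal(size=(32, 8)), rng.normal(size=32), rng.normal(size=(4, 32))
W1[:10] *= 1e-4
W2[:, :10] *= 1e-4                          # pretend regularization shrank 10 groups

penalty, norms = group_lasso_penalty(W1, W2)
W1, b1, W2 = prune_small_groups(W1, b1, W2, norms)
print("penalty:", round(penalty, 3), "surviving hidden width:", W1.shape[0])
```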
Enhancements such as perspective regularization (SPR) and structured-pattern regularizers further promote block-based removal and co-clustering of rows and columns, leading to highly regular patterns and improved hardware utilization (Cacciola et al., 2023, Park et al., 2021).
Probabilistic and Information-Theoretic Approaches
Dynamic Probabilistic Pruning (DPP) leverages trainable, differentiable $k$-out-of-$n$ sampling per output neuron, using a Gumbel-softmax relaxation and entropy regularization to enforce hardware-friendly masks. Backpropagation is performed through a relaxed soft-top-$k$ mask, and mask entropy and diversity are tracked during training (Gonzalez-Carabarin et al., 2021).
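The sketch below illustrates one way such a relaxed $k$-out-of-$n$ mask can be built, using Gumbel-perturbed logits and successive softmaxes; the temperature, update rule, and function names are assumptions, not the DPP reference implementation.

```python
# Hedged sketch of a differentiable soft top-k mask: perturb trainable logits
# with Gumbel noise and accumulate k successive softmaxes, down-weighting
# already-selected units at each step (a standard relaxed subset-sampling trick).
import numpy as np

def soft_top_k_mask(logits, k, temperature=0.5, rng=None):
    rng = rng or np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    working = (logits + gumbel) / temperature
    mask = np.zeros_like(working)
    for _ in range(k):
        p = np.exp(working - working.max())
        p /= p.sum()                           # softmax over remaining candidates
        mask += p                              # accumulate a soft "selection"
        working = working + np.log1p(-np.clip(p, 0.0, 1 - 1e-6))  # suppress picked units
    return np.clip(mask, 0.0, 1.0)

logits = np.array([2.0, -1.0, 0.5, 3.0, -2.0, 1.0])
mask = soft_top_k_mask(logits, k=2, rng=np.random.default_rng(4))
print(mask.round(3), "~2 units active:", mask.sum().round(2))
```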
Variational methods employ dropout-style unit-wise Bernoulli random variables with learnable parameters and hyper-priors that stochastically drive unit retention probabilities to zero or one. Theoretical analysis establishes convergence to deterministic, pruned subnetworks (Guenter et al., 2022).
Functional Network Preservation and Structural Dropout
Recent methods for large-scale transformers leverage statistical dependency structure or treat the MLP layer as a “digital brain,” performing independent component analysis (ICA) across activation time series to reveal functional networks. Only neurons critical to the identified subnetworks are preserved, with all others pruned en bloc (Liu et al., 7 Aug 2025). Structural Dropout (SD) instead randomly truncates hidden-state channels during training, inducing a natural importance ordering such that a pruned model of any width attains near-optimal performance using a contiguous prefix of the most informative neurons (Knodt, 2022).
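A minimal sketch of the Structural Dropout idea, with an assumed rescaling and illustrative shapes, is shown below: each training step truncates the hidden layer to a random prefix width, so any prefix later serves as a pruned model.

```python
# Hedged sketch of Structural Dropout-style width truncation: keep only the
# first `w` hidden channels for a randomly drawn width w at each step, so the
# earliest channels learn to carry the most information.
import numpy as np

def forward_with_structural_dropout(x, W1, b1, W2, rng, min_width=4):
    full = W1.shape[0]
    w = rng.integers(min_width, full + 1)     # sample a width for this step
    h = np.maximum(W1[:w] @ x + b1[:w], 0.0)  # truncate to the first w units
    scale = full / w                          # assumed rescaling of the truncated layer
    return W2[:, :w] @ (h * scale), w

rng = np.random.default_rng(5)
W1, b1, W2 = rng.normal(size=(64, 16)), np.zeros(64), rng.normal(size=(8, 64))
x = rng.normal(size=16)
y, width_used = forward_with_structural_dropout(x, W1, b1, W2, rng)
print("sampled width:", width_used, "output shape:", y.shape)

# After training, pruning to any target width is just slicing a prefix:
target = 16
W1_p, b1_p, W2_p = W1[:target], b1[:target], W2[:, :target]
```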
3. Theoretical Guarantees and Compressibility of Wide MLPs
Compression bounds for structured pruning in wide MLPs have received substantial attention. Under broad conditions (Lipschitz activations, bounded weight norms), randomized greedy pruning can remove a fixed fraction of neurons per wide layer while preserving the average squared output error up to an additive term controlled by the network depth (Cheairi et al., 6 Dec 2025). This theoretical framework generalizes Optimal Brain Damage (OBD) by canceling first-order loss terms in expectation and controlling higher-order remainders via Lindeberg interpolation.
Pruning flexibility versus width is formalized: highly overparameterized layers (with large width) admit high compressibility, while bottleneck and narrow layers constrain pruning. These analyses inform architectural design for a target sparsity.
4. Empirical Behavior, Hardware, and Implementation
Practical studies repeatedly demonstrate that structured pruning can achieve dramatic reductions in MLP parameter counts, FLOPs, and memory footprint while incurring sub-percent accuracy loss on benchmarks such as MNIST and CIFAR-10 and on LLM evaluation suites (Halabi et al., 2022, Amersfoort et al., 2020, Hubens et al., 2023, Zhu et al., 10 Jun 2025, Liu et al., 7 Aug 2025). Pruning ratios above roughly 75% typically incur rapidly growing accuracy loss, marking the commonly accepted compression frontier.
Key empirical observations include:
- Closed-form weight reweighting after pruning (via least-squares or layer-wise scaling) ameliorates feature-map mismatches and further boosts downstream task performance (Halabi et al., 2022, Nova et al., 2023).
- Fine-tuning after structured pruning is almost always beneficial, especially at high compression—restoring most degradation in “one-shot” pruned settings (Halabi et al., 2022, Zhu et al., 10 Jun 2025).
- Hardware benefits accrue only if whole rows/columns (neurons) are removed; block-structured layouts with a fixed number of surviving units per group simplify deployment (Gonzalez-Carabarin et al., 2021).
- Advanced regularization (patterning or group-sparsity) further enhances hardware-friendliness by concentrating surviving weights in contiguous blocks, optimizing for BLAS engines and dense accelerators (Park et al., 2021, Liu et al., 7 Aug 2025).
5. Specialized Structured Pruning in Transformer MLPs and LLMs
Transformer-style MLP submodules (feed-forward networks/FFNs) are a major focus for structured pruning in large-scale models due to their high parameter density. Key specialized advances include:
- Gradient-free structured pruning using representative and activation-driven rankings, without labels or gradients, efficiently scaling to BERT and DistilBERT (Nova et al., 2023).
- Self-distillation MLP pruning (SDMPrune): Taylor-based importance ranking combined with a distillation-based loss, focusing on MLPs (rather than attention heads) to yield superior zero-shot accuracy/perplexity trade-offs on LLaMA, ChatGLM, and Vicuna models. Reducing the MLP width yields parameter and speed reductions on the order of 20% or more with an acceptably small drop in performance (Zhu et al., 10 Jun 2025); a first-order importance sketch follows this list.
- Functional network preservation via ICA: for each layer, functional subnetworks are identified by statistical independence across neuron activations, and pruning is restricted to retain at least one representative from each such module, achieving performance and perplexity advantages over magnitude-based and depth-pruning baselines (Liu et al., 7 Aug 2025).
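As a rough illustration of the Taylor-based ranking used in MLP-focused LLM pruning, the sketch below scores FFN hidden units by the magnitude of the first-order term $|\sum_w w\cdot\partial\mathcal{L}/\partial w|$ over a toy FFN with a stand-in squared loss; the actual SDMPrune objective uses a distillation-based loss and operates on full transformer blocks, so every name and shape here is illustrative.

```python
# Hedged sketch of first-order (Taylor) importance scoring for FFN neurons:
# importance of hidden unit j ~ |sum over its weights of w * dL/dw|, computed
# analytically for a toy two-matrix FFN with squared loss.
import numpy as np

rng = np.random.default_rng(6)
d_model, d_ff, n = 32, 128, 64
W_in = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_ff)
X = rng.normal(size=(n, d_model))
Y = rng.normal(size=(n, d_model))

H = np.maximum(X @ W_in.T, 0.0)              # FFN hidden activations (ReLU)
resid = H @ W_out.T - Y                      # error under the stand-in loss

# dL/dW_out = resid^T H; dL/dW_in = (ReLU-masked upstream gradient)^T X
grad_out = resid.T @ H                       # (d_model, d_ff)
grad_in = ((resid @ W_out) * (H > 0)).T @ X  # (d_ff, d_model)

taylor_importance = np.abs((W_out * grad_out).sum(axis=0)
                           + (W_in * grad_in).sum(axis=1))

k = int(0.8 * d_ff)                          # e.g. keep 80% of FFN neurons
keep = np.sort(np.argsort(-taylor_importance)[:k])
W_in, W_out = W_in[keep], W_out[:, keep]
print("pruned FFN width:", W_in.shape[0])
```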
6. Applications, Limitations, and Open Problems
Structured pruning of MLPs is established as a leading technique for:
- Model compression for edge and resource-constrained inference.
- Constraint learning and mixed-integer programming (MIP) embedding, with dramatic reductions in MIP solver time and resource usage (Cacciola et al., 2023).
- Optimizing training and inference latency in very large neural networks.
Key limitations and ongoing challenges include:
- First-order approximation artifacts in ranking-based methods; potential approximation errors at extreme pruning or in architectures with pronounced inter-unit dependencies (Amersfoort et al., 2020, Halabi et al., 2022).
- Layerwise sparsity patterns remain an open area of research, especially in conjunction with automatic mixed precision or quantization.
- Iterative (multi-shot) pruning versus one-shot: “greedy” or “simultaneous” schemes may exhibit differing local optima, and their practical utility depends on dataset, architecture, and target compression (Park et al., 2021, Halabi et al., 2022).
Future work is expected to unify submodular and information-theoretic analyses further, automate hyperparameter selection for regularization-based methods, and more deeply integrate pruning with architectural search and neural reparameterization.
References
- "Data-Efficient Structured Pruning via Submodular Optimization" (Halabi et al., 2022)
- "Single Shot Structured Pruning Before Training" (Amersfoort et al., 2020)
- "Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities" (Gonzalez-Carabarin et al., 2021)
- "Induced Feature Selection by Structured Pruning" (Hubens et al., 2023)
- "Theoretical Compression Bounds for Wide Multilayer Perceptrons" (Cheairi et al., 6 Dec 2025)
- "Structured Pattern Pruning Using Regularization" (Park et al., 2021)
- "Structured Pruning of Neural Networks for Constraints Learning" (Cacciola et al., 2023)
- "Robust Learning of Parsimonious Deep Neural Networks" (Guenter et al., 2022)
- "Structural Dropout for Model Width Compression" (Knodt, 2022)
- "Gradient-Free Structured Pruning with Unlabeled Data" (Nova et al., 2023)
- "SDMPrune: Self-Distillation MLP Pruning for Efficient LLMs" (Zhu et al., 10 Jun 2025)
- "Pruning LLMs by Identifying and Preserving Functional Networks" (Liu et al., 7 Aug 2025)