Greedy Pruning: Theory and Applications
- Greedy pruning is a method that iteratively selects or removes elements based on locally optimal, task-specific importance criteria.
- It is widely used in model compression, sparse recovery, and submodular maximization, with applications in deep neural network pruning and compressed sensing.
- Enhanced variants like tree pruning, beam search, and hardware-aware clustering improve efficiency while addressing the limitations of myopic, local decision-making.
Greedy pruning refers to a broad class of algorithms for model compression, subset selection, and sparse optimization that iteratively construct a reduced model or subset by making a sequence of locally optimal decisions. At each iteration, a greedy pruning method evaluates the available elements—such as weights, neurons, channels, heads, or operations—according to a task-specific importance criterion and eliminates, keeps, or selects the locally best choice, often without considering global optimality or future interactions. Greedy pruning methods are widely adopted due to their computational efficiency, ease of implementation, and strong empirical performance across deep learning, signal processing, submodular maximization, symbolic regression, and adversarial robustness contexts.
1. Greedy Pruning in Sparse Recovery and Combinatorial Optimization
Greedy methods have a rich history in sparse signal recovery, convex optimization, and submodular function maximization. In compressed sensing, a canonical formulation is the minimization of the residual norm subject to a sparsity constraint, $\min_{x} \|y - \Phi x\|_2$ subject to $\|x\|_0 \le K$, where $y$ is the measurement vector, $\Phi$ the sensing matrix, and $K$ the sparsity level. Greedy algorithms such as Orthogonal Matching Pursuit (OMP) build the support set iteratively by selecting at each step the index of the sensing-matrix column most correlated with the current residual. Enhanced greedy techniques, such as Matching Pursuit with Tree Pruning (TMP), generalize this by maintaining multiple support candidates in a branching tree, “completing” partial supports via noncausal selection, and aggressively pruning branches whose residuals exceed a global threshold. This avoids classic greedy myopia while controlling complexity. Under restricted isometry property (RIP) conditions on the sensing matrix, exact sparse recovery can be guaranteed, and empirical results confirm high accuracy with a dramatic reduction in search complexity (Lee et al., 2014).
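The OMP selection rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the TMP algorithm of Lee et al.; the function name `omp`, the variable names, and the toy problem sizes are assumptions made for the example.

```python
import numpy as np

def omp(Phi, y, k):
    """Greedy sparse recovery sketch: pick k columns of Phi to explain y."""
    y = np.asarray(y, dtype=float)
    support = []            # indices selected so far
    coef = np.zeros(0)      # least-squares coefficients on the support
    residual = y.copy()
    for _ in range(k):
        # Greedy step: score every column by |correlation with the residual|.
        scores = np.abs(Phi.T @ residual)
        if support:
            scores[support] = -np.inf   # never reselect an index
        support.append(int(np.argmax(scores)))
        # Orthogonal step: refit all selected coefficients, update the residual.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[support] = coef
    return x_hat, sorted(support)

# Toy problem (hypothetical sizes): 3-sparse signal, 40 random measurements.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 100))
Phi /= np.linalg.norm(Phi, axis=0)      # unit-norm columns
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [1.5, -2.0, 1.0]
x_hat, support = omp(Phi, Phi @ x_true, k=3)
print(support, np.linalg.norm(Phi @ (x_hat - x_true)))
```

Each iteration commits to a single index and never revisits it, which is the myopia that tree-based variants try to relax by keeping several candidate supports alive.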
In submodular maximization, the classical greedy algorithm proceeds by adding, at each step, the element with the largest marginal gain. Recent work extends this framework to arbitrary (possibly negative and non-monotone) submodular functions via greedy-pruning, where after each selection, any element whose marginal gain becomes non-positive is pruned from the active set. This yields a generalized multiplicative approximation guarantee, parameterized by the trajectory-restricted curvature of the objective, even in non-monotone, sign-indefinite regimes (Chen et al., 8 May 2026). Multilinear extension variants further generalize this approach to matroid and combinatorial constraints.
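A minimal sketch of the select-then-prune loop follows. It is an illustration under stated assumptions rather than the exact procedure of Chen et al.: the function name `greedy_with_pruning`, the cardinality budget, and the toy coverage-minus-cost objective are all hypothetical.

```python
def greedy_with_pruning(f, ground_set, budget):
    """Greedy selection that drops elements with non-positive marginal gain."""
    selected = set()
    active = set(ground_set)
    while active and len(selected) < budget:
        base = f(selected)
        gains = {e: f(selected | {e}) - base for e in active}
        # Pruning step: for submodular f, gains only shrink as the selection
        # grows, so an element with gain <= 0 can be discarded for good.
        active = {e for e in active if gains[e] > 0}
        if not active:
            break
        best = max(sorted(active), key=gains.get)   # sorted() fixes tie-breaks
        selected.add(best)
        active.discard(best)
    return selected

# Toy objective: items covered minus a per-element cost. It is submodular,
# non-monotone, and can be negative.
covers = {0: {1, 2, 3}, 1: {3, 4}, 2: {1, 2}, 3: {5}}
COST = 1.5

def coverage_minus_cost(S):
    covered = set().union(*(covers[e] for e in S))
    return len(covered) - COST * len(S)

chosen = greedy_with_pruning(coverage_minus_cost, covers.keys(), budget=3)
print(chosen, coverage_minus_cost(chosen))
```

In this toy run the loop stops before the budget is exhausted, because every remaining element would lower the objective once the first set has been chosen.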
2. Algorithms and Methodological Variants
Greedy pruning manifests in both forward selection and backward elimination. Forward greedy selection—often termed Greedy Forward Selection (GFS)—starts from an empty subnetwork and iteratively adds the neuron or filter that results in the lowest loss when included, potentially with adaptive re-ranking at each step (Ye et al., 2020). Backward elimination starts from the full model and removes at each step the parameter, neuron, or channel whose elimination least increases the loss.
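The two directions can be written against the same loss oracle. The sketch below is generic and hypothetical: `loss` stands in for whatever evaluation a real method uses (for example, validation loss of the subnetwork), and the toy objective simply tries to match a target sum of unit contributions.

```python
def forward_selection(loss, units, n_keep):
    """Start empty; repeatedly add the unit that gives the lowest loss."""
    kept, remaining = [], list(units)
    while len(kept) < n_keep and remaining:
        best = min(remaining, key=lambda u: loss(kept + [u]))
        kept.append(best)
        remaining.remove(best)
    return kept

def backward_elimination(loss, units, n_keep):
    """Start full; repeatedly drop the unit whose removal hurts the least."""
    kept = list(units)
    while len(kept) > n_keep:
        victim = min(kept, key=lambda u: loss([v for v in kept if v != u]))
        kept.remove(victim)
    return kept

# Toy loss (an assumption for illustration): squared gap between the summed
# contributions of the kept units and the contribution of the full model.
weights = {"a": 5.0, "b": 3.0, "c": 2.0, "d": 0.1}
target = sum(weights.values())

def toy_loss(kept):
    return (target - sum(weights[u] for u in kept)) ** 2

fwd = forward_selection(toy_loss, weights, n_keep=2)
bwd = backward_elimination(toy_loss, weights, n_keep=2)
print(fwd, bwd)
```

With an additive objective like this one both directions agree; they can diverge when units interact, which is where the choice between them starts to matter.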
Structured model pruning in deep learning leverages greedy criteria to eliminate channels, feature maps, attention heads, or whole layers. Examples include:
- Taylor-expansion-based criteria in CNNs, where the first-order estimate $\Theta_{TE}(h_i) = \left| \frac{\partial C}{\partial h_i} h_i \right|$ of the change in cost $C$ caused by removing feature map $h_i$ guides feature map/channel removal, with interleaved fine-tuning to mitigate accuracy loss. This approach outperforms norm-based and activation-based metrics in structured pruning, correlates strongly with the exact loss impact, and yields practical speedups even on hardware-constrained platforms (Molchanov et al., 2016).
- Hierarchical approaches in large-scale CNNs that combine layer-wise and global selection: e.g., hierarchical backward greedy search and matching pursuit (OMP)-based sparse approximations, with backward elimination using closed-form loss estimates for efficient filter elimination—substantially improving both compression rate and computational cost (Purohit et al., 2024).
- Cluster pruning, where groups of filters (clusters) are pruned together to align with hardware constraints such as SIMD/SIMT or VPUs. Groups are greedily formed by sorting according to aggregate importance scores and are removed in hardware-friendly multiples, yielding robust latency and throughput gains in edge-AI deployments (Gamanayake et al., 2020).
- Attention and operation pruning in transformers. Methods like Greedy-Gnorm dynamically rescore attention heads at each greedy step via gradient norms of the Q/K/V projections, outperforming static entropy-based metrics and yielding smooth accuracy-vs-pruning curves even at extreme compression (Guo et al., 4 Feb 2026). Layer-wise greedy pruning (e.g., Greedy-layer Pruning, GLP) repeatedly removes the single transformer layer whose elimination causes the smallest drop in validation performance after fine-tuning (Peer et al., 2021).
- Symbolic networks for regression: greedy edge pruning with beam search evaluates, at each step, the edge whose removal least increases the (MSE + sparsity-penalty) loss, with occasional beam search to maintain alternative candidates and improve search diversity (Wu et al., 2024).
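As one concrete instance from the list above, a first-order Taylor score for channels can be computed from activations and their gradients. The sketch below assumes both are already available as arrays of shape (batch, channels, height, width), for example exported from an autograd framework; the function names are hypothetical and the averaging scheme is one reasonable choice, not necessarily the exact one used by Molchanov et al.

```python
import numpy as np

def taylor_channel_scores(activations, gradients):
    """First-order estimate of |loss change| if a channel were zeroed out."""
    # Average activation * gradient over spatial positions, take the absolute
    # value per example, then average over the batch -> one score per channel.
    per_example = np.abs((activations * gradients).mean(axis=(2, 3)))
    return per_example.mean(axis=0)

def channels_to_prune(scores, n_prune):
    """Indices of the n_prune channels with the smallest estimated impact."""
    return np.argsort(scores)[:n_prune].tolist()

# Toy data (made up for the example): constant activations, per-channel gradients.
acts = np.ones((2, 3, 2, 2))
grads = np.zeros((2, 3, 2, 2))
grads[:, 0] = 0.5
grads[:, 1] = -0.1
grads[:, 2] = 2.0

scores = taylor_channel_scores(acts, grads)
print(scores, channels_to_prune(scores, 1))
```

In an iterative pipeline the scores would be recomputed after each removal and short fine-tuning phase, since removing one channel changes the gradients seen by the others.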
Table 1 summarizes several instantiations:
| Domain / Model | Greedy Criterion | Selection / Pruning Unit |
|---|---|---|
| Compressed sensing | Correlation with residual | Support index |
| CNNs | Taylor approx. of loss change | Channel / feature map / filter |
| Vision-language models (VLMs) | Marginal accuracy drop | Operation tuple (token, module, etc.) |
| Transformers | Gradient L2 product | Attention head |
| Multitask CNNs | Norm-based importance / task oracle | Channel (global / per-layer) |
| Symbolic networks | Loss-increase on edge removal | Edge |
3. Theoretical Guarantees and Complexity
Standard greedy selection for monotone submodular functions under a cardinality constraint admits a $(1 - 1/e)$ approximation ratio. Recent advances generalize multiplicative guarantees to arbitrary submodular objectives using trajectory-restricted curvature, even amid negativity and non-monotonicity (Chen et al., 8 May 2026). In sparse recovery, TMP’s tree-pruning strategy provably achieves exact support recovery if the RIP constants of the sensing matrix satisfy explicit bounds.
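For reference, the classical argument behind the $(1 - 1/e)$ ratio can be sketched as follows. The notation is introduced here for illustration: $f$ is monotone submodular with $f(\emptyset) = 0$, $k$ is the cardinality budget, $S_i$ is the greedy set after $i$ steps, and $S^*$ is an optimal set of size at most $k$.

```latex
% Monotonicity, then submodularity, then the greedy choice of the next element:
f(S^*) \le f(S_i \cup S^*)
       \le f(S_i) + \sum_{e \in S^* \setminus S_i} \bigl[ f(S_i \cup \{e\}) - f(S_i) \bigr]
       \le f(S_i) + k \, \bigl[ f(S_{i+1}) - f(S_i) \bigr]

% Rearranging gives a geometric contraction of the optimality gap:
f(S^*) - f(S_{i+1}) \le \Bigl( 1 - \tfrac{1}{k} \Bigr) \bigl[ f(S^*) - f(S_i) \bigr]

% Unrolling over k steps with f(S_0) = 0 and using (1 - 1/k)^k \le 1/e:
f(S_k) \ge \Bigl( 1 - \bigl( 1 - \tfrac{1}{k} \bigr)^{k} \Bigr) f(S^*)
       \ge \Bigl( 1 - \tfrac{1}{e} \Bigr) f(S^*)
```

The first chain is where monotonicity and non-negativity are used, which is why the guarantee does not carry over unchanged to non-monotone or sign-indefinite objectives.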
For DNN pruning:
- Greedy Forward Selection admits provable convergence rates in the overparameterized setting under mild convexity and coverage assumptions, whereas backward elimination generally lacks such guarantees (Ye et al., 2020, Ye et al., 2020).
- Pruning with smooth group Lasso regularization in matrix sensing provides explicit thresholds for safe greedy removal, after which (under further conditions) gradient-based fine-tuning converges globally to the target low-rank solution (Rajaraman et al., 2023).
- In modular objectives balancing saliency and diversity (LVLM token pruning), theoretical submodularity or UBQP-approximation results yield constant-factor guarantees; extensions for approximate submodularity via curvature also apply (Pei et al., 16 Jun 2025).
Greedy pruning’s complexity, absent augmentations such as tree pruning or dynamic groupwise selection, scales with the product of the number of candidate units and the number of removal/addition cycles, since every remaining candidate is rescored in each cycle. It can be reduced by early pruning of unpromising paths, grouping (clusters), or adaptive recomputation scheduling (Lee et al., 2014, Purohit et al., 2024, Liu et al., 24 Jun 2025).
4. Extensions, Practical Enhancements, and Calibration
Modern greedy pruning systems build in additional mechanisms for efficiency and robustness:
- Pre-selection and filtering: Restricting the candidate set to a small pre-selected subset (e.g., via gOMP or saliency-based heuristics) can drastically reduce the search base with negligible performance loss (Lee et al., 2014, Liu et al., 24 Jun 2025).
- Tree Pruning and Branch-and-Bound: In signal recovery, TMP completes each partial solution with the best possible noncausal “fill” and prunes the branch if the resulting full residual exceeds the global threshold, thereby avoiding exhaustive enumeration while escaping greedy myopia (Lee et al., 2014).
- Batched/Grouped operations: Batched Greedy Pruning (e.g., SlimGPT) leverages grouped Cholesky decompositions for structured units (attention heads, FFN groups), reducing overhead while maintaining near-OBS optimality (Ling et al., 2024). Cluster pruning exploits hardware alignment for edge AI (Gamanayake et al., 2020).
- Adaptive score recomputation: In operation pruning for VLMs, GSOP reuses previously computed marginal scores unless cumulative pruning causes validation accuracy to drop below scheduled thresholds, diminishing redundant score recomputation (Liu et al., 24 Jun 2025).
- Beam search and stochasticity: To avoid deterministic greedy traps, beam search is integrated (e.g., PruneSymNet), maintaining a fixed number (the beam width) of alternative minimal-loss subnetworks at each removal depth (Wu et al., 2024).
- Performance-aware oracles: For multitask CNNs, the pruning step is only allowed if no single task’s loss-lift exceeds preset tolerance, dynamically adapted across layers (Ye et al., 2023).
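The beam-search enhancement from the list above can be illustrated with a small sketch. It is a generic, hypothetical version rather than the PruneSymNet implementation: `loss` is any function of the set of kept edges, and the toy loss table is made up to show a case where plain greedy removal goes wrong.

```python
def beam_prune(loss, edges, n_remove, beam_width):
    """Remove n_remove edges, keeping the beam_width best kept-sets per depth."""
    beam = [frozenset(edges)]
    for _ in range(n_remove):
        # Expand every kept-set in the beam by removing one more edge.
        candidates = {kept - {e} for kept in beam for e in kept}
        # Keep the lowest-loss candidates; sorted(s) makes ties deterministic.
        beam = sorted(candidates, key=lambda s: (loss(s), sorted(s)))[:beam_width]
    return beam[0]

# Toy loss table over kept-edge sets (values invented for illustration).
table = {
    frozenset("abc"): 0.0,
    frozenset("bc"): 1.0, frozenset("ac"): 1.1, frozenset("ab"): 5.0,
    frozenset("a"): 1.2, frozenset("b"): 3.0, frozenset("c"): 2.0,
}
loss = table.__getitem__

greedy_result = beam_prune(loss, "abc", n_remove=2, beam_width=1)
beam_result = beam_prune(loss, "abc", n_remove=2, beam_width=2)
print(sorted(greedy_result), loss(greedy_result))
print(sorted(beam_result), loss(beam_result))
```

With `beam_width=1` the procedure reduces to plain backward elimination. In this toy table the slightly worse first removal leads to a better final subnetwork, and a beam of width 2 is enough to find it.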
5. Empirical Performance and Applications
Greedy pruning is consistently validated across benchmarks:
- In ImageNet-scale models (ResNet/MobileNet/ProxylessNet), greedy forward selection matches or exceeds prior art in compression-accuracy trade-off, especially when fine-tuning the pruned subnetwork rather than retraining from scratch (Ye et al., 2020, Ye et al., 2020).
- Accuracy drops with Taylor criterion in CNNs are notably lower than norm-based or activation-based pruning, maintaining 87–89% top-5 on pruned VGG-16/ImageNet with only 30% of original FLOPs (Molchanov et al., 2016).
- Operation pruning in VLMs via GSOP enables up to 70% TFLOPs reduction with only ~4% performance drop and seamless cross-task/model transfer (Liu et al., 24 Jun 2025).
- LVLMs pruned with GreedyPrune can discard 89% of visual tokens while losing less than 1% accuracy in critical benchmarks, with prefilling speedup up to 3× on H20 GPUs (Pei et al., 16 Jun 2025).
- Greedy head pruning in transformers (Greedy-Gnorm) allows up to 80% attention head removal with modest accuracy penalty, outperforming static methods and yielding up to 22.5% parameter reduction at minor performance loss (Guo et al., 4 Feb 2026).
- In multitask pruning, performance-aware global schemes retain as little as 40% of original FLOPs or 15% of parameters with sub-1–2% loss on mAP or mIoU across detectors and segmentation models (Ye et al., 2023).
- For symbolic regression, greedy loss-based edge pruning with beam search recovers exact ground-truth formulas or achieves MSE significantly lower than global-magnitude or threshold methods in >90% of public tasks (Wu et al., 2024).
- Greedy adversarial pruning (GAP) hardens compressed models against transfer attacks while sustaining clean accuracy and quantization robustness, showing transfer attack accuracy gains of ~10–15pp over magnitude pruning at fixed compression (Weiss et al., 2022).
6. Limitations and Alternative Approaches
Despite its scalability and effectiveness, greedy pruning is inherently myopic: each local decision may be globally suboptimal, the procedure cannot backtrack, and it may irreversibly remove units that are critical for complex, correlated effects. This is exacerbated in settings with strong parameter interactions or non-redundant representations. Convex relaxation (e.g., per-layer mask optimization via Frank–Wolfe) can globally minimize quadratic pruning objectives, explicitly accounting for weight-weight interactions and admitting approximation bounds; empirically, it outperforms one-by-one greedy heuristics and closes most of the gap to the combinatorial optimum on LLMs (Roux et al., 15 Oct 2025). Further, “don’t-be-greedy” relaxation approaches scale to the largest GPT-class models with tractable per-layer compute. In signal processing, tree pruning augments classic greedy selection to avoid the support misidentification to which single-path greedy methods are prone.
The absence of a global or modularity guarantee in the presence of strong parameter correlation or functional redundancy complicates analysis. Recent submodularity theory generalizations partially address this, but for highly non-modular, non-additive objectives, greedy approximation ratios degrade (Chen et al., 8 May 2026).
Greedy pruning does not by itself adapt the remaining parameters after large-scale removal. This can be partially mitigated by interleaved retraining (fine-tuning) or post-hoc recalibration, but caution is warranted in regimes where performance trade-offs are sharply nonlinear.
7. Summary and Emerging Directions
Greedy pruning is a central paradigm in practical model compression, subset selection, and sparse inference. It leverages task-aware, efficient, one-by-one (or groupwise) decisions to construct effective, compact representations, with strong empirical performance and, for a wide range of objectives, theoretical guarantees. Recent innovations extend the classical framework to structured modules (transformer heads/layers, VLM operations), integrate beam search or multi-candidate paths to avoid strict myopia, and combine performance-aware oracles and hardware-aligned clustering for real-world deployment. Complementary relaxation and dynamic programming paradigms provide principled alternatives where global optimality or higher-order parameter interactions are critical (Roux et al., 15 Oct 2025).
Open problems include the development of hybrid forward–backward schemes, more expressive submodular surrogates for complex pruning objectives, and analysis extending fast greedy convergence rates to deep, nonlinear architectures. Continued empirical and theoretical synthesis in scaling greedy pruning to exascale networks, multi-task, multi-modal domains, and adversarial settings remains a foundational direction for efficient modern machine learning.