Depth Pruning in Neural Networks

Updated 2 June 2026

Depth pruning is a technique that removes entire layers or blocks from deep neural networks to reduce computational load while maintaining task performance.
Methodologies include training-free approaches, dynamic input-conditioned masking, and layer merging guided by similarity metrics such as cosine similarity and CKA.
Empirical results demonstrate that moderate depth pruning (20–35%) can yield significant latency reductions and memory savings with minimal degradation in accuracy.

Depth pruning refers to the process of removing whole layers or blocks from a neural network with the goal of reducing its computational burden, memory footprint, and inference latency, while maintaining as much task performance (e.g., accuracy or perplexity) as possible. This structural form of model compression is especially significant for deep architectures such as transformers and convolutional neural networks, where sequential layerwise computation dominates latency and energy usage. Depth pruning is distinguished from width pruning—which removes attention heads, filters, or individual neurons—by its coarse operational granularity, targeting entire sequential computational units.

1. Formal Problem Definition and Theoretical Foundations

In transformer models, depth pruning removes a contiguous or non-contiguous subset of layers. In the contiguous case, given network blocks $B_1, B_2, \dots, B_L$ and writing $X_j$ for the hidden state produced by block $B_j$, depth pruning removes blocks $B_{i+1}$ through $B_{i+n}$ and approximates the composite mapping $X_{i} \rightarrow X_{i+n}$ with a simpler operation (such as the identity or a single linear map) at the cut point. The challenge is to achieve computational savings proportional to the fraction of removed layers while minimizing degradation in accuracy or generalization.
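The contiguous case can be illustrated with a minimal sketch. The toy residual blocks, the helper names `make_block` and `forward`, and all sizes are assumptions made for illustration; they do not come from any of the cited methods. Removed blocks are simply replaced by the identity.

```python
import numpy as np

def make_block(rng, d, scale=0.1):
    """Toy residual block: x + tanh(x @ W). Stands in for a transformer block."""
    W = rng.normal(scale=scale, size=(d, d))
    return lambda x: x + np.tanh(x @ W)

def forward(blocks, x, skip=frozenset()):
    """Apply blocks B_1..B_L in order; 1-based indices in `skip` act as identity."""
    for idx, block in enumerate(blocks, start=1):
        if idx not in skip:
            x = block(x)
    return x

rng = np.random.default_rng(0)
d, L = 16, 8                      # hidden size and depth (toy values)
blocks = [make_block(rng, d) for _ in range(L)]
x = rng.normal(size=(4, d))

i, n = 3, 2                       # prune blocks B_{i+1} .. B_{i+n}
pruned = set(range(i + 1, i + n + 1))

full = forward(blocks, x)
short = forward(blocks, x, skip=pruned)

# Relative output change caused by the cut, and the fraction of depth kept.
rel_err = np.linalg.norm(full - short) / np.linalg.norm(full)
kept_fraction = (L - len(pruned)) / L
```

Here `kept_fraction` is the quantity that compute savings are hoped to track, and `rel_err` is a crude proxy for the degradation that the methods below try to minimize.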

Various theoretical frameworks have been proposed:

Importance Scoring: Blocks are ranked using criteria such as sensitivity-based Taylor approximations, loss increase upon deletion, or output similarity metrics (e.g., cosine similarity, centered kernel alignment) (Kim et al., 2024, Chen et al., 21 Apr 2026, Shopkhoev et al., 5 May 2025).
Functional Redundancy Perspective: The relevance of a layer is not intrinsic, but depends on the calibration objective, e.g. language modeling perplexity versus reasoning task accuracy, leading to different optimal sets for removal (Kim et al., 27 Apr 2026).
Structural Constraints: In CNNs, depth pruning is complicated by normalization, residual, and activation layers, necessitating compatibility in parameter shapes after block removal (Liu et al., 2024).
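Of the similarity metrics named above, linear centered kernel alignment (CKA) has a compact closed form. The following is a minimal NumPy sketch of the standard linear CKA formula applied to two activation matrices of shape (samples, features); the function name and the random test data are illustrative assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2)."""
    # Center each feature over the sample axis.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix

same = linear_cka(X, X)
rotated = linear_cka(X, 2.0 * X @ Q)           # rotation plus isotropic scaling
unrelated = linear_cka(X, rng.normal(size=X.shape))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `same` and `rotated` both evaluate to 1 up to floating-point error, while independent activations score lower. Two adjacent layers with CKA near 1 are candidates for removal or merging under a similarity criterion.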

2. Representative Depth Pruning Methodologies

Depth pruning methodologies differ in their criteria for block selection, their recovery strategies post-pruning, and the architectural manipulations they employ.

a) Training-Free Depth Pruning

ReplaceMe fits a single linear transformation on a small calibration set to approximate the effect of the discarded blocks, then merges it into the remaining network, so no retraining is required. The best cut index is chosen by minimizing an activation distance (often cosine). The transformation can be estimated by closed-form least squares or by cosine-distance minimization. The method achieves up to 25% block removal with ~90% performance retention in minutes, with computational cost dominated by the least-squares solve, $O(d^2NS + d^3)$ (Shopkhoev et al., 5 May 2025).
One-Shot Importance-Based Pruning assesses block contribution via single-block deletion effect (change in perplexity or task margin), then removes blocks with least estimated impact. Recovery is typically achieved via post-pruning fine-tuning (e.g. LoRA) or by continued pretraining. Speedups of 23–35% are reported at 20–35% pruning for LLMs (Kim et al., 2024).
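The least-squares variant of the linear replacement idea can be sketched as follows. This is a generic least-squares fit on synthetic activations, not the exact formulation of the ReplaceMe paper; the names `fit_linear_replacement`, `T_true`, and `W_next`, and all sizes, are assumptions for illustration.

```python
import numpy as np

def fit_linear_replacement(X, Y):
    """Return T minimizing ||X @ T - Y||_F.

    X: activations entering the pruned span, shape (tokens, d).
    Y: activations leaving the pruned span, shape (tokens, d).
    """
    T, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return T

rng = np.random.default_rng(0)
num_tokens, d = 512, 32

# Synthetic calibration data: the pruned span is "almost linear".
X = rng.normal(size=(num_tokens, d))
T_true = np.eye(d) + 0.05 * rng.normal(size=(d, d))
Y = X @ T_true + 0.01 * rng.normal(size=(num_tokens, d))

T = fit_linear_replacement(X, Y)
rel_residual = np.linalg.norm(X @ T - Y) / np.linalg.norm(Y)

# Folding T into the next linear layer adds no inference cost:
# (X @ T) @ W_next == X @ (T @ W_next).
W_next = rng.normal(size=(d, d))
W_merged = T @ W_next
```

Because `T` is folded into an existing weight matrix, the pruned model has the same layer types as the original, only fewer of them.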

b) Similarity and Difference Metrics

SimDiff assigns each layer an importance score that combines representational similarity (cosine distance of hidden states) and transformation difference (mean absolute/squared deviation in output magnitude), controlled by a mixing hyperparameter. This twofold criterion mitigates catastrophic accuracy collapse, outperforming solely similarity-driven methods across multiple LLM architectures and yielding 1.49× inference speedup at moderate depth reduction (Chen et al., 21 Apr 2026).
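A two-term score of this kind can be sketched as below. The exact SimDiff formula is not given here, so the particular combination (cosine dissimilarity plus a relative mean absolute difference, mixed by `alpha`) is an assumption chosen only to show why a magnitude term adds information.

```python
import numpy as np

def layer_score(h_in, h_out, alpha=0.5, eps=1e-12):
    """Hypothetical importance score for one layer.

    h_in, h_out: hidden states before and after the layer, shape (tokens, d).
    alpha weights the cosine term; (1 - alpha) weights the magnitude term.
    A low score suggests the layer changes its input little.
    """
    num = np.sum(h_in * h_out, axis=-1)
    den = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1) + eps
    dissimilarity = 1.0 - np.mean(num / den)
    difference = np.mean(np.abs(h_out - h_in)) / (np.mean(np.abs(h_in)) + eps)
    return alpha * dissimilarity + (1.0 - alpha) * difference

rng = np.random.default_rng(0)
h = rng.normal(size=(64, 16))

identity_score = layer_score(h, h)
# A layer that only rescales its input is invisible to cosine similarity ...
cosine_only = layer_score(h, 2.0 * h, alpha=1.0)
# ... but is picked up by the magnitude term.
magnitude_only = layer_score(h, 2.0 * h, alpha=0.0)
```

The rescaling case shows the failure mode of a similarity-only criterion: direction is unchanged, so the layer looks redundant even though removing it would change the scale of every downstream input.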

c) Dynamic, Input-Conditioned Pruning

IG-Pruning clusters a calibration corpus to discover input-specific masks via L0 optimization, with mask selection at inference guided by semantic similarity between input and cluster centroids. This enables dynamic, prompt-sensitive skipping of blocks, outperforming static masks, especially at moderate sparsity (15–30%) (Qiao et al., 4 Nov 2025).
PuDDing employs a learned router to select among data-driven candidate omission sets based on the input prompt, enabling prompt-conditional depth configurations with minimal routing overhead (Wee et al., 4 Feb 2025).
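The centroid-based form of input-conditioned selection can be sketched as a nearest-centroid lookup. The centroids, masks, and query embedding below are made-up values, and `select_mask` is a hypothetical helper; a learned router would replace the similarity lookup with a trained classifier.

```python
import numpy as np

def select_mask(embedding, centroids, masks):
    """Pick the block mask of the centroid most cosine-similar to the input."""
    e = embedding / np.linalg.norm(embedding)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    best = int(np.argmax(c @ e))
    return best, masks[best]

# Two clusters found offline on a calibration corpus (illustrative values).
centroids = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
# One mask per cluster over 6 blocks; True means "run this block".
masks = np.array([[1, 1, 0, 1, 0, 1],
                  [1, 0, 1, 1, 1, 0]], dtype=bool)

query = np.array([0.9, 0.2, 0.1])        # embedding of the incoming prompt
cluster, mask = select_mask(query, centroids, masks)
```

At inference the selected `mask` decides which blocks are skipped for that prompt, so the routing cost is one embedding and one small matrix-vector product.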

d) Layer Merging and Advanced Fusion Schemes

Sliding Layer Merging (SLM) iteratively merges consecutive layers with high representational similarity (quantified, e.g., by CKA or cosine similarity), collapsing their weights to preserve information rather than discarding full blocks. This yields a smoother performance curve with up to 1.65% accuracy gain over baseline pruning at 35% removal (Ding et al., 26 Feb 2025).
LayerMerge targets both convolution and activation layers, jointly pruning and merging them to avoid kernel size blowup that undermines latency gains in conventional merging. This is solved via a dynamic programming approach over all valid segmentations, achieving globally optimal trade-offs under a latency constraint (Kim et al., 2024).
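The kernel size blowup mentioned above follows directly from composing convolutions. A minimal one-dimensional sketch with random kernels (illustrative values) shows that two stacked convolutions without an activation in between equal one convolution whose kernel has size k1 + k2 - 1. Note that `np.convolve` flips the kernel, unlike the cross-correlation used in most deep learning frameworks, but the size argument is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)          # 1-D input signal
k1 = rng.normal(size=3)          # kernel of the first layer
k2 = rng.normal(size=5)          # kernel of the second layer

# Two linear conv layers applied in sequence ("full" mode by default).
two_layers = np.convolve(np.convolve(x, k1), k2)

# The same map as a single layer with a merged kernel.
merged_kernel = np.convolve(k1, k2)
one_layer = np.convolve(x, merged_kernel)
```

In two dimensions the effect is stronger: merging two 3×3 convolutions gives a 5×5 kernel, which costs 25 multiply-accumulates per output per channel pair instead of 18 for the two original layers, so a merge that removes a layer can still increase compute.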

e) Special Architectures and Progressive/Joint Strategies

Entropy-Guided Pruning (EGP) leverages low activation entropy to identify and remove layers that have lost nonlinearity (i.e., always ON or OFF neurons), allowing strong unstructured pruning schemes to achieve true depth reduction (Liao et al., 2023).
UPDP introduces block-wise progressive interpolation between kept and pruned blocks (annealing with a gating coefficient), coupled with supernet and sandwich training. This enables robust pruning in models with non-trivial block structures, outperforming prior methods in both CNN and Vision Transformer backbones (Liu et al., 2024).
Joint Multi-Dimension Pruning encodes depth, channel, and spatial pruning as components of a continuous pruning vector, with depth controlled by a normalized variable optimized by gradient estimation methods (e.g., finite difference, Gaussian-smoothing), demonstrating measurable additive gains over channel-only pruning (Liu et al., 2020).
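The entropy criterion behind entropy-guided pruning can be illustrated with a short sketch that measures, per neuron, the entropy of the ReLU on/off state over a calibration batch. The function name and the toy pre-activations are assumptions; the published method may aggregate or estimate entropy differently.

```python
import numpy as np

def relu_state_entropy(pre_activations, eps=1e-12):
    """Binary entropy (in bits) of each neuron's ReLU on/off state.

    pre_activations: shape (samples, neurons), values before the ReLU.
    """
    p_on = (pre_activations > 0).mean(axis=0)
    p = np.clip(p_on, eps, 1.0 - eps)     # avoid log(0)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

# Toy layer with three neurons over 8 samples:
# neuron 0 is always ON, neuron 1 always OFF, neuron 2 is ON half the time.
pre = np.stack([
    np.ones(8),
    -np.ones(8),
    np.array([1.0, -1.0] * 4),
], axis=1)

entropy = relu_state_entropy(pre)
layer_entropy = entropy.mean()
```

A neuron that is always ON passes its input through unchanged and one that is always OFF outputs zero, so a layer whose neurons all have near-zero entropy acts as a linear map on the calibration data and its nonlinearity can be dropped.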

3. Calibration Objectives, Search Algorithms, and Practical Trade-offs

The choice of calibration objective (the loss function used to evaluate candidate prunings) is crucial. Empirical studies show that perplexity and downstream accuracy can produce uncorrelated rankings of redundant layers, and that the calibration objective influences pruned performance more than the search algorithm does (greedy, beam, genetic, Bayesian, or binary optimization). One-shot or greedy search suffices once the objective is specified (Kim et al., 27 Apr 2026).
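A greedy search with a pluggable objective can be sketched in a few lines. The per-layer cost vectors below are made-up toy numbers, not measurements, and the additive objectives are a simplifying assumption; in practice the objective would evaluate the pruned model on calibration data.

```python
from typing import Callable, FrozenSet, List

def greedy_depth_prune(num_layers: int, budget: int,
                       objective: Callable[[FrozenSet[int]], float]) -> List[int]:
    """Remove `budget` layers one at a time, each time picking the layer
    whose removal gives the lowest objective value (lower is better)."""
    removed: set = set()
    for _ in range(budget):
        candidates = [l for l in range(num_layers) if l not in removed]
        best = min(candidates, key=lambda l: objective(frozenset(removed | {l})))
        removed.add(best)
    return sorted(removed)

# Toy per-layer damage under two calibration objectives (illustrative only).
ppl_cost = [5.0, 0.2, 0.1, 0.4, 3.0, 0.3]   # perplexity increase
acc_cost = [4.0, 0.9, 0.8, 0.1, 0.2, 2.0]   # task accuracy drop

by_perplexity = greedy_depth_prune(6, 2, lambda s: sum(ppl_cost[l] for l in s))
by_accuracy = greedy_depth_prune(6, 2, lambda s: sum(acc_cost[l] for l in s))
```

With these toy numbers the same search removes layers 1 and 2 under the perplexity objective but layers 3 and 4 under the accuracy objective, which is the kind of disagreement the paragraph above describes.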

For pruning budget allocation and redundancy detection:

Locality-Aware Redundancy Pruning (LoRP) quantifies the global distribution of redundancy using Representation Locality Score (RLS) from inter-layer cosine similarity and utilizes spectral clustering to assign pruning budgets within clusters of redundant layers. This approach adapts depth-pruning patterns to architecture-specific representational structure and consistently improves perplexity and task accuracy over fixed-criterion one-shot methods (Yun et al., 27 May 2026).

4. Empirical Results, Efficiency Gains, and Recovery

Multiple studies report that moderate depth pruning (20–35% of blocks) yields 20–35% latency reductions and 2–6 GB memory savings in LLMs (e.g., LLaMA-7B) with accuracy loss within 5–10%. The best methods retain roughly 90% or more of task performance and often match or surpass width-pruned baselines under memory-bound inference, owing to more favorable hardware and throughput properties (Kim et al., 2024, Shopkhoev et al., 5 May 2025, Chen et al., 21 Apr 2026).
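As a rough guide to how pruning ratios relate to latency figures, consider an idealized model in which latency is proportional to the number of blocks and everything outside the blocks (embeddings, output head, framework overhead) is ignored. This is an assumption for illustration, not a result from the cited papers. With $r$ the fraction of blocks removed:

$$
\text{latency reduction} = r, \qquad \text{speedup} = \frac{1}{1-r}
$$

$$
r = 0.20 \Rightarrow 1.25\times, \qquad r = 0.25 \Rightarrow 1.33\times, \qquad r = 0.35 \Rightarrow 1.54\times
$$

Under this model a latency reduction stated in percent and a speedup stated as a multiplier are different quantities for the same $r$, which matters when comparing figures across papers. Measured values depend on hardware, batch size, and the share of time spent outside the pruned blocks.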

Hybrid approaches that combine depth and width pruning, using criteria such as Centered Kernel Alignment (CKA) for selection and tie-breaking, can yield even higher compression ratios (e.g., up to 86% FLOPs reduction on ResNet-56) and boost adversarial and out-of-distribution robustness (Nascimento et al., 4 Jun 2025).

Table: Illustrative Outcomes of Depth Pruning on LLMs

Method	Prune Ratio	Relative Accuracy	Latency Gain (speedup × or reduction %)	Recovery / Notes
ReplaceMe	25%	89.9–92.5%	up to 1.3×	train-free, LS/Cosine solvers
SimDiff	25%	91.4%	up to 1.49×	closed-form, with LoRA optional
IG-Pruning	25%	86.4–87.2%	10–25%	input-dynamic, no retraining
Shortened-LLM	20–35%	92–96%	23–35%	LoRA fine-tune, memory saved
SLM	35% (Vicuna)	+1.65% gain	best-in-class	layer merging, LoRA recovery

Aggressive depth reduction may require a post-pruning "healing" phase of fine-tuning (e.g., low-rank adaptation or continued pretraining), although some training-free methods (e.g., ReplaceMe) maintain competitive accuracy without any retraining (Shopkhoev et al., 5 May 2025).

5. Specializations, Hardware Considerations, and Limitations

Depth pruning in convolutional and TinyML contexts is adapted for compatibility with resource-constrained devices. Auxiliary networks attached at the truncation point (i.e., new lightweight heads) can restore accuracy and enable extreme parameter reduction (e.g., 93% on MLPerfTiny VWW) while remaining hardware-agnostic, i.e., requiring no sparse-matrix support (Leon et al., 2022).

Merging-based and depth-2 pruning schemes are favored for further hardware efficiency, because the pruned models preserve structural regularity and map well to dense acceleration hardware (Tensor Cores, vector units). However, kernel-size blowup in naive merging can cancel these benefits, which motivates selecting jointly which activation layers and which convolutional layers to prune (Kim et al., 2024).

Trade-offs and limitations include:

Performance recovery at high pruning ratios (>30–40%) is not always achievable without retraining.
Calibration set size and type influence accuracy, with instruction-tuned text and sufficient examples (e.g., 1,000–16,000) yielding more reliable transformations (Shopkhoev et al., 5 May 2025).
Dynamic, prompt-conditional depth pruning requires effective routing architectures and can be sensitive to clustering quality or out-of-distribution inputs (Qiao et al., 4 Nov 2025, Wee et al., 4 Feb 2025).
Automating hyperparameter tuning (e.g., for merging thresholds, progressive annealing schedules) remains an open direction for robustness (Liu et al., 2024, Ding et al., 26 Feb 2025).

6. Recent Directions and Future Prospects

Depth pruning technology has expanded into neural ODEs (continuous-depth models), where iterative magnitude-based pruning improves generalization and alleviates mode collapse, achieving up to 98% parameter reduction without loss of density modeling accuracy (Liebenwein et al., 2021).

For explicit latency control and real deployment, joint layer and activation pruning formulated as dynamic-programming search provides globally optimal depth-merge configurations for a user-specified speedup budget (Kim et al., 2024).
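The flavor of a budgeted dynamic-programming search can be conveyed with a deliberately simplified stand-in: a 0/1 knapsack that chooses which layers to keep so that total importance is maximal under an integer latency budget. This is not the LayerMerge algorithm, which searches over merge segmentations; the importance scores, latency units, and function name below are hypothetical.

```python
from typing import List, Tuple

def select_layers(importance: List[float], latency: List[int],
                  budget: int) -> Tuple[float, List[int]]:
    """0/1 knapsack: keep the subset of layers with maximal total importance
    whose summed latency does not exceed `budget`."""
    # best[b] = (total importance, kept layer indices) using latency <= b
    best: List[Tuple[float, List[int]]] = [(0.0, [])] * (budget + 1)
    for i, (imp, lat) in enumerate(zip(importance, latency)):
        new = list(best)                      # results without layer i
        for b in range(lat, budget + 1):
            value = best[b - lat][0] + imp    # results with layer i added
            if value > new[b][0]:
                new[b] = (value, best[b - lat][1] + [i])
        best = new
    return best[budget]

importance = [3.0, 1.0, 4.0, 2.0]   # hypothetical per-layer importance
latency = [2, 1, 3, 2]              # hypothetical per-layer latency units
value, kept = select_layers(importance, latency, budget=5)
```

For this toy instance the optimum keeps layers 0 and 2 with total importance 7.0. Unlike a greedy ranking, the table is exact for the stated budget, which is the property that makes dynamic programming attractive when a hard latency target must be met.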

Emerging frontiers include hybrid depth/width pruning at both coarse (layer) and fine (neuron/head) scales, dynamic input-conditional block skipping, and theoretical grounding of functional redundancy within and across architectures (Nascimento et al., 4 Jun 2025, Qiao et al., 4 Nov 2025, Kim et al., 27 Apr 2026).

For additional technical detail and implementation recipes across paradigms, see "ReplaceMe" (Shopkhoev et al., 5 May 2025), "SimDiff" (Chen et al., 21 Apr 2026), "IG-Pruning" (Qiao et al., 4 Nov 2025), "Shortened LLaMA" (Kim et al., 2024), "Sliding Layer Merging" (Ding et al., 26 Feb 2025), and the comprehensive treatment in "Rethinking Layer Redundancy" (Kim et al., 27 Apr 2026).