Layer Pruning & Linear Transformations
- Layer pruning and linear transformations are methods used in neural network compression to remove redundant layers and dimensions, thus reducing complexity and computational cost.
- Techniques such as sparsity enforcement, importance scoring, and similarity metrics help pruned models retain high accuracy and robustness.
- Adaptive strategies and post-pruning linearization enable efficient model deployment on resource-limited devices while lowering energy consumption and retraining needs.
Layer pruning and linear transformations are central topics in neural network compression, jointly targeting the reduction of model complexity and computational cost while preserving predictive performance. Layer pruning refers to the elimination of whole layers or substantial substructures (e.g., linear projection dimensions, attention or MLP sublayers, filters), and linear transformations permeate neural network design as the dominant operations in fully connected, convolutional, and transformer architectures. Contemporary research intertwines these topics, using linear algebraic principles and data-driven criteria to inform which layers or transformations are redundant and can be safely pruned. Methods span from regularization-driven sparsification of linear projections, to explicit manipulation and reparameterization of linear mappings, to information-theoretic or representational similarity metrics guiding removal or merging of layers.
1. Pruning Dimensions and Subspaces in Linear Projections
Several works explicitly exploit the linear structure of neural architectures by introducing sparsity or importance scoring for the dimensions within each linear projection matrix. In vision transformers, critical dimensions in the projections (e.g., $W_Q$, $W_K$, $W_V$, and the MLP weights) are identified by training with an $\ell_1$-regularized importance score for each projection channel. After training with the regularizer $\lambda \lVert \mathbf{a} \rVert_1$, where the entries of $\mathbf{a}$ are continuous relaxations of binary activity masks, a hard thresholding step $a_i^\star = \mathbb{1}[a_i \geq \tau]$ is performed. Pruning is then conducted by keeping only the active dimensions in subsequent linear mappings, reducing the projected feature space, parameter count, and FLOPs. Applied to DeiT and related ViTs, this approach demonstrated over 40% FLOPs reduction with minimal degradation of classification accuracy (Zhu et al., 2021).
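A minimal sketch of this masking-and-thresholding scheme is shown below, assuming one soft mask entry per output dimension of a projection and an illustrative threshold; the names (`MaskedLinear`, `hard_prune`) are hypothetical and not taken from the cited work.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear projection with a learnable soft mask over output dimensions."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.mask = nn.Parameter(torch.ones(out_dim))  # continuous relaxation a_i

    def forward(self, x):
        return self.proj(x) * self.mask  # soft masking during training

    def l1_penalty(self):
        return self.mask.abs().sum()  # add lambda * penalty to the training loss

    def hard_prune(self, tau=0.5):
        """Threshold the mask and slice the weight matrix to the surviving dims."""
        keep = (self.mask.detach() >= tau).nonzero(as_tuple=True)[0]
        pruned = nn.Linear(self.proj.in_features, len(keep))
        # Fold the surviving mask values into the kept rows of the weight/bias
        pruned.weight.data = self.proj.weight.data[keep] * self.mask.data[keep, None]
        pruned.bias.data = self.proj.bias.data[keep] * self.mask.data[keep]
        return pruned, keep  # `keep` also selects the matching input dims of the next layer
```

In practice the regularization weight and the threshold jointly control the accuracy/FLOPs trade-off.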
In CNNs, SVD-based decompositions factor the (reshaped) weight tensor as $W = U S V^\top$, separating it into orthonormal bases $U$, scaling factors $S = \mathrm{diag}(s_1, \dots, s_r)$, and recombination matrices $V$, allowing pruning of entire singular-vector components (basis functions aligned with directions of highest variance) while preserving architectural compatibility. A first-order Taylor approximation of the loss with respect to each scaling factor $s_i$ enables efficient importance assessment: unimportant bases with low Taylor sensitivity are removed, and a further "double pruning" of output channels may be layered atop this basis pruning (Wong et al., 2021).
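The following sketch illustrates the basis-pruning idea on a single convolutional weight, assuming the Taylor sensitivity is available as the gradient of the loss with respect to the singular values; the function name and the magnitude-based fallback are illustrative rather than the cited method's exact procedure.

```python
import torch

def svd_basis_prune(conv_weight, grad_of_scales=None, keep_ratio=0.5):
    """Decompose a conv weight into singular components and drop low-importance bases.

    conv_weight:    tensor of shape (out_c, in_c, kh, kw)
    grad_of_scales: optional gradient of the loss w.r.t. the singular values;
                    if None, fall back to magnitude-based importance.
    """
    out_c, in_c, kh, kw = conv_weight.shape
    W = conv_weight.reshape(out_c, -1)                 # (out_c, in_c*kh*kw)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    # First-order Taylor importance |s_i * dL/ds_i|, or |s_i| without a gradient
    if grad_of_scales is not None:
        importance = (S * grad_of_scales).abs()
    else:
        importance = S.abs()

    k = max(1, int(keep_ratio * S.numel()))
    keep = importance.argsort(descending=True)[:k]

    # Recombine only the kept basis functions into a low-rank weight
    W_pruned = (U[:, keep] * S[keep]) @ Vh[keep, :]
    return W_pruned.reshape(out_c, in_c, kh, kw)
```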
2. Layer-Wise and Global Pruning Strategies
Layer pruning commonly targets either whole-layer removal or a hybrid of pruning structures (layers and neurons/filters). Greedy-layer pruning (GLP) iteratively removes the most dispensable layer at each step, as measured by validation accuracy, based on the assumption of pruning locality, i.e., that the optimal set of $n$ layers to prune contains the optimal set of $n-1$ layers. This reduces the search from exponential to linear in the number of layers, offering dynamic model size/speed trade-offs with minimal retraining (Peer et al., 2021).
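A generic sketch of the greedy search follows; `evaluate` is an assumed callable that scores the sub-model induced by a set of kept layers, and the loop structure is illustrative rather than the cited implementation.

```python
def greedy_layer_prune(layers, evaluate, n_to_prune):
    """Greedy layer pruning: repeatedly drop the layer whose removal hurts
    validation accuracy the least.

    layers:     list of layer identifiers (e.g., indices into a model)
    evaluate:   callable taking the list of *kept* layers and returning
                validation accuracy of the corresponding sub-model
    n_to_prune: number of layers to remove
    """
    kept = list(layers)
    removed = []
    for _ in range(n_to_prune):
        # Try removing each remaining layer; keep the best-scoring candidate removal
        best = max(kept, key=lambda l: evaluate([x for x in kept if x != l]))
        kept.remove(best)
        removed.append(best)
    return kept, removed
```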
Alternatively, structured methods such as FinerCut and "Pruning Everything, Everywhere, All at Once" evaluate pruning not only at the layer/block level but also at finer granularity (e.g., attention vs. FFN sub-layers) or with simultaneous neuron and layer removal. In the latter, each iteration generates two candidate prunings, one via layer removal and one via neuron/filter removal, and the candidate maintaining the highest internal representation similarity (measured by Centered Kernel Alignment, CKA) is chosen for the next pruning step (Nascimento et al., 4 Jun 2025). This approach yields highly compressed models with maintained generalization and robustness.
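A minimal linear-CKA scorer of the kind used to compare candidate prunings might look as follows; the selection helper is an illustrative stand-in, not the cited implementation.

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices of
    shape (n_samples, features); higher means more similar representations."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm(p="fro") ** 2
    return (hsic / ((X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro"))).item()

def choose_candidate(reference_acts, candidate_acts_list):
    """Pick the candidate pruning whose activations stay closest to the
    unpruned reference, as measured by CKA (sketch of the selection rule)."""
    scores = [linear_cka(reference_acts, acts) for acts in candidate_acts_list]
    return max(range(len(scores)), key=scores.__getitem__), scores
```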
3. Information-Theoretic and Similarity-Based Criteria
Global, principled selection of what to prune leverages mutual information or representation similarity:
- Layer-wise pruning by mutual information (MI) maximizes, at each layer, the MI between the preserved features and the upstream (output) features. The selection $S^\star = \arg\max_{|S| = k} I(\mathbf{h}_S;\, \mathbf{h}^{\mathrm{out}})$ is carried out greedily, adding the dimensions that best preserve the information flow from output to input; the surviving dimensions form smaller, dense, faster-to-compute linear transformations that replace sparse, irregular masks (Fan et al., 2021). A greedy selection sketch follows this list.
- Consensus-based pruning (Mugnaini et al., 21 Nov 2024) combines multiple similarity measures (CKA, Procrustes, Bures, interpolated metrics) for robust layer importance ranking. Each pruning candidate's impact is assessed by how little it alters the feature representations on a calibration set, as ranked by these metrics. Consensus aggregation mitigates the pitfalls of relying on any single metric, yielding pruned networks that achieve substantial FLOPs reductions (up to 78.8%) and energy savings (up to 66.99%), improved robustness, and high accuracy retention.
- RKHS analysis, as in sliding layer merging (Ding et al., 26 Feb 2025), uses centered kernel alignment to identify and merge consecutive layers with nearly identical representational content—further evidence that many network transformations are redundant from a kernel similarity perspective.
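As a rough illustration of the MI-guided selection in the first bullet above, the sketch below greedily picks dimensions using a joint-Gaussian approximation of mutual information; the actual estimator in (Fan et al., 2021) may differ, and the helper names are hypothetical.

```python
import torch

def gaussian_mi(X, Y, eps=1e-5):
    """Rough MI proxy under a joint-Gaussian assumption:
    I(X;Y) = 0.5 * (logdet cov(X) + logdet cov(Y) - logdet cov([X, Y]))."""
    Z = torch.cat([X, Y], dim=1)
    def logdet_cov(A):
        A = A - A.mean(dim=0, keepdim=True)
        cov = A.T @ A / (A.shape[0] - 1) + eps * torch.eye(A.shape[1], dtype=A.dtype)
        return torch.logdet(cov)
    return 0.5 * (logdet_cov(X) + logdet_cov(Y) - logdet_cov(Z))

def greedy_mi_select(H, H_out, k):
    """Greedily pick k feature dimensions of H that retain the most mutual
    information with the downstream features H_out (both (n_samples, dim))."""
    selected = []
    remaining = list(range(H.shape[1]))
    for _ in range(k):
        best = max(remaining,
                   key=lambda j: gaussian_mi(H[:, selected + [j]], H_out))
        selected.append(best)
        remaining.remove(best)
    return selected
```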
4. Linearization, Layer Collapsing, and Post-Pruning Healing
Recent work harnesses the near-linear structure emergent in deep models—most notably transformers:
- LayerCollapse (Shabgahi et al., 2023) employs regularization to drive the PReLU activation parameter $\alpha$ toward linearity between MLP layers. When the negative slope $\alpha \to 1$, the nonlinearity can be effectively eliminated and two consecutive linear transformations $W_1$ and $W_2$ can be collapsed into the single matrix $W_2 W_1$, with the measured error upper-bounded proportionally to $|1 - \alpha|$.
- "Your Transformer is Secretly Linear" (Razzhigaev et al., 19 May 2024) empirically establishes that, in decoder-only architectures, the layerwise embedding transformations can be linearly mapped—Procrustes similarity scores as high as 0.99. Pruning (or approximating) these near-linear blocks incurs minimal loss. Furthermore, cosine similarity-based regularization during pretraining can explicitly control the degree of inter-layer linearity and improve downstream model performance.
- Methods such as ReplaceMe (Shopkhoev et al., 5 May 2025) replace a pruned block or contiguous sequence of transformer layers with an estimated optimal linear transform $T$, computed from pre/post-activation pairs on a calibration set. Least-squares or cosine-distance minimization yields a mapping that can be folded into existing weights (e.g., into the down-projection of the preceding block) to maintain performance without retraining; a least-squares fitting sketch follows this list. Similarly, LinearPatch (Chen et al., 30 May 2025) bridges activation mismatches created by pruning via a symmetric patch matrix that combines a Hadamard transform $H$ (mitigating token-specific outliers) with a channel-wise scaling matrix $S$ learned from calibration data, enabling plug-and-play restoration of accuracy.
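A minimal sketch of fitting such a linear replacement on calibration activations is shown below; the shapes, names, and merging comment are illustrative, not the exact ReplaceMe procedure.

```python
import torch

def fit_linear_replacement(X_in, X_out):
    """Fit a linear map T minimizing ||X_in @ T - X_out||_F on calibration
    activations, as a stand-in for a pruned block of layers.

    X_in:  activations entering the pruned block, shape (n_tokens, d)
    X_out: activations the block originally produced, shape (n_tokens, d)
    """
    # Least-squares solution; lstsq solves X_in @ T = X_out column-wise
    T = torch.linalg.lstsq(X_in, X_out).solution  # (d, d)
    return T

# The fitted T can then be folded into an adjacent projection matrix
# (ordering and transposes depend on the architecture's layout conventions).
```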
5. Implications for Model Performance, Efficiency, and Robustness
Layer and dimension pruning informed by these approaches produces highly compressed models, with parameter and FLOPs reductions ranging from roughly 40% to 90% depending on the method and target accuracy. The pruned models, particularly those using MI-based selection or similarity-driven greedy strategies, demonstrate:
- Minimal accuracy loss, usually less than a 1% drop in top-1 accuracy on large validation sets.
- Substantial reduction in inference runtime and memory requirements, enabling practical deployment on mobile and resource-constrained devices.
- Enhanced adversarial and out-of-distribution robustness, especially when multi-criteria (consensus) methods are used (Mugnaini et al., 21 Nov 2024, Nascimento et al., 4 Jun 2025).
- Lower environmental impact—measured as energy and CO₂ emission reductions up to 68.75% in aggressive compression settings (Mugnaini et al., 21 Nov 2024).
When linearization is incorporated (e.g., via regularizers or downstream patching), the requirement for retraining ("healing") is often removed or greatly reduced, lowering the operational cost of compression. Employing a combination of depthwise and widthwise pruning—either iteratively or in tandem—can yield further efficiency without additional accuracy penalty (Ding et al., 26 Feb 2025, Nascimento et al., 4 Jun 2025).
6. Theoretical and Mechanistic Insights
The dominant role of linear transformations is further clarified by mechanistic analyses. For transformers, pruning the representation (model) dimension (as in SliceGPT (Xu et al., 6 Mar 2025)) leads to performance loss that can be described analytically as a function of the pruned proportion. Such results confirm that the representation dimension, and by extension the space of linear transformations, governs both expressivity and accuracy after pruning.
Post-pruning methods like LinDeps (Henry et al., 29 Jul 2025) extend this perspective to CNNs by detecting and eliminating linear dependencies among feature maps using pivoted QR decomposition, then adjusting downstream kernels to preserve compatible input spaces—thereby optimizing not just for redundancy removal, but also for seamless signal propagation without requiring further retraining.
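A rough sketch of the dependency-detection step via pivoted QR is given below; the channel layout and tolerance are illustrative, and the downstream kernel adjustment performed by LinDeps is not shown.

```python
import numpy as np
from scipy.linalg import qr

def find_independent_feature_maps(F, tol=1e-6):
    """Detect linearly dependent feature maps via pivoted QR.

    F: matrix of flattened feature maps, shape (n_channels, n_spatial * n_samples);
       each row holds one channel's responses on calibration data.
    Returns indices of a maximal linearly independent subset of channels; the
    remaining channels can be expressed as linear combinations of the kept ones.
    """
    # QR with column pivoting on F^T orders channels by decreasing contribution
    _, R, piv = qr(F.T, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    rank = int((diag > tol * diag[0]).sum())
    kept = np.sort(piv[:rank])
    dropped = np.sort(piv[rank:])
    return kept, dropped
```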
7. Future Directions
Current methodologies suggest several avenues for innovation:
- Adaptive, hybrid pruning strategies that combine layer, filter, dimension, and attention head pruning, guided by unified information, similarity, or linear algebraic metrics.
- Deeper investigation into non-uniform, heterogeneous architectures, motivated by empirical observations (e.g., FinerCut's differential pruning of FFN versus attention layers (Zhang et al., 28 May 2024)).
- Development of linearity-aware regularization, merging, or activation manipulation techniques for further compression, ideally characterized by analytic bounds.
- Advanced post-processing and healing techniques (e.g., knowledge distillation-based fine-tuning of patch matrices (Chen et al., 30 May 2025)) to mitigate residual performance drops.
These directions underscore the centrality of linear transformations—both as the primary computation in deep nets and as the mathematical structure underlying advanced compression and pruning strategies.