Stabilized LayerNorm Pruning (SLNP)

Updated 21 December 2025
  • The paper introduces SLNP, which leverages Fisher information and L₂-preserving re-initialization to enable efficient fine-tuning with minimal accuracy loss.
  • SLNP stabilizes LayerNorm by compensating for scale disruptions after aggressive pruning, ensuring consistent activation statistics.
  • Empirical results show that SLNP dramatically reduces trainable parameters—down to as low as 0.015% for LayerNorm—while maintaining near-baseline performance.

Stabilized LayerNorm Pruning (SLNP) is a principled methodology for ultra-efficient fine-tuning and aggressive model compression in large-scale transformers, addressing both theoretical sparsity bounds and empirical stability constraints. SLNP exploits the particular sensitivity of LayerNorm (or RMSNorm) affine parameters to pruning events, combining Fisher information-based masking with L₂-magnitude-preserving re-initialization to enable aggressive pruning with minimal loss in downstream accuracy or convergence behavior.

1. Motivation and Scope

LLMs and encoder architectures such as BERT-large possess hundreds of millions of parameters, rendering full fine-tuning both computationally prohibitive and prone to overfitting, especially under data scarcity. Structured pruning—removing hidden channels or entire attention heads—offers multiplicative acceleration but induces instability in normalization layers when corresponding dimensions of the scale vector γ are excised. Empirically, output LayerNorm parameters shift substantially during task adaptation, suggesting their disproportionate impact on generalization when compared to other block constituents (ValizadehAslani et al., 2024). SLNP specifically targets this normalization bottleneck: it enables parameter-efficient fine-tuning by selecting and re-scaling only a sparse subset of LayerNorm or RMSNorm parameters while maintaining calibration of output distributions following aggressive axis-wise pruning (Chen et al., 26 May 2025).

2. Mathematical Formulation

SLNP encompasses two tightly linked procedures:

2.1. Fisher Information Ranking for Masking

Given the network loss $\mathcal{L}_n$ for example $n$, the LayerNorm at layer $l$, coordinate $j$, maintains a gain $\omega_{l,j}$ and a bias $b_{l,j}$.

The diagonal Fisher score is

$$F(\theta_i) = \frac{1}{N} \sum_{n=1}^N \left( \frac{\partial \mathcal{L}_n}{\partial \theta_i} \right)^2$$

for parameter $\theta_i \in \{\omega_{l,j}, b_{l,j}\}$.

Backpropagation yields

$$\frac{\partial \mathcal{L}_n}{\partial \omega_{l,j}} = g_{n,l,j}\, \hat{x}_{n,l,j}, \qquad \frac{\partial \mathcal{L}_n}{\partial b_{l,j}} = g_{n,l,j},$$

where $g_{n,l,j} = \partial \mathcal{L}_n / \partial y_{n,l,j}$ and $y_{n,l,j} = \omega_{l,j}\, \hat{x}_{n,l,j} + b_{l,j}$.

The corresponding Fisher informations are

$$F(\omega_{l,j}) = \frac{1}{N} \sum_{n=1}^N (g_{n,l,j}\, \hat{x}_{n,l,j})^2,$$

$$F(b_{l,j}) = \frac{1}{N} \sum_{n=1}^N (g_{n,l,j})^2.$$
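
These scores can be estimated with standard autograd. Below is a minimal PyTorch sketch, assuming hypothetical `model`, `calib_loader` (yielding `(inputs, labels)` batches), and `loss_fn` objects, and using squared batch-averaged gradients as a practical stand-in for the per-example expectation:

```python
import torch

def layernorm_fisher_scores(model, calib_loader, loss_fn):
    """Accumulate diagonal Fisher scores (mean squared gradients)
    for every LayerNorm gain/bias in `model` over a calibration set."""
    # Heuristic: pick out LayerNorm affine parameters by name.
    ln_params = {n: p for n, p in model.named_parameters()
                 if "layernorm" in n.lower() or "layer_norm" in n.lower()}
    scores = {n: torch.zeros_like(p) for n, p in ln_params.items()}

    model.eval()
    n_batches = 0
    for inputs, labels in calib_loader:
        model.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        for n, p in ln_params.items():
            if p.grad is not None:
                # Squared (batch-averaged) gradient as the diagonal Fisher estimate.
                scores[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: s / max(n_batches, 1) for n, s in scores.items()}
```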

2.2. Scale Restoration after Channel Pruning

For RMSNorm with scale $\gamma \in \mathbb{R}^d$, pruning hidden channels leaves a retained scale $\gamma^{\rm pruned} \in \mathbb{R}^{d'}$ with $d' < d$. To restore the normalization output statistics, SLNP computes a compensation scalar $c$:

$$c = \frac{\|\gamma^{\rm orig}\|_2}{\|\gamma^{\rm pruned}\|_2},$$

where $\gamma^{\rm orig}$ is the original (unpruned) scale vector and $\gamma^{\rm pruned}$ carries its values at the retained indices. The final scale vector

$$\gamma^{\rm new} = c \cdot \gamma^{\rm pruned}$$

thus guarantees $\|\gamma^{\rm new}\|_2 = \|\gamma^{\rm orig}\|_2$, restoring the affine magnitude post-pruning (Chen et al., 26 May 2025).
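
A minimal sketch of this re-initialization, assuming `gamma_orig` is the full pre-pruning scale vector and `keep_idx` indexes the retained channels (both names illustrative); the optional `max_c` clamp anticipates the robustness measure discussed in Section 4:

```python
import torch

def restore_scale(gamma_orig, keep_idx, max_c=None):
    """L2-preserving re-initialization of a pruned RMSNorm/LayerNorm scale."""
    gamma_pruned = gamma_orig[keep_idx]                 # retained entries
    c = gamma_orig.norm(p=2) / gamma_pruned.norm(p=2)   # compensation scalar
    if max_c is not None:
        c = torch.clamp(c, max=max_c)                   # optional safeguard under very aggressive pruning
    return c * gamma_pruned                             # ||gamma_new||_2 == ||gamma_orig||_2 when unclamped

# Example: prune a 1024-dim scale down to 768 retained channels.
gamma = torch.ones(1024)
keep = torch.arange(768)
gamma_new = restore_scale(gamma, keep, max_c=2.0)
```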

3. Algorithmic Workflow

The SLNP workflow proceeds sequentially as follows (a minimal sketch of the masking step appears after the list):

  • Compute Fisher scores for all LayerNorm parameters using a single forward-backward pass over the calibration set.
  • Sort and select the top fraction $f$ of parameters per layer ($k = \lfloor f \cdot N_{\text{LN}} \rfloor$).
  • Mask all non-selected parameters, freezing their values at pre-trained initialization.
  • Following activation-based structured pruning (width, depth, attention), apply scale compensation to pruned normalization vectors as described above.
  • Proceed to fine-tuning or distillation, updating only unmasked parameters with a moderately higher learning rate (e.g., up to $10^{-3}$), capitalizing on LayerNorm's intrinsic stabilizing effects (ValizadehAslani et al., 2024; Chen et al., 26 May 2025).
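
A minimal sketch of the selection and freezing steps, assuming `scores` is the dictionary produced by the Fisher sketch in Section 2.1 (helper names are illustrative, not the authors' implementation):

```python
import torch

def build_ln_masks(scores, fraction):
    """Keep the top `fraction` of LayerNorm parameters per tensor by Fisher score."""
    masks = {}
    for name, s in scores.items():
        flat = s.flatten()
        k = max(1, int(fraction * flat.numel()))   # k = floor(f * N_LN), at least one entry
        top_idx = torch.topk(flat, k).indices
        mask = torch.zeros_like(flat)
        mask[top_idx] = 1.0
        masks[name] = mask.view_as(s)
    return masks

def apply_slnp_masks(model, masks):
    """Make only the selected LayerNorm entries trainable.

    Parameters outside the masks are frozen outright; within each masked
    tensor, gradients of non-selected coordinates are zeroed by a hook,
    so those entries stay at their pre-trained values."""
    handles = []
    for name, p in model.named_parameters():
        if name in masks:
            p.requires_grad_(True)
            # Assumes the mask lives on the same device/dtype as the parameter.
            handles.append(p.register_hook(lambda g, m=masks[name]: g * m))
        else:
            p.requires_grad_(False)
    return handles
```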

4. Stability Considerations

Stability in SLNP arises from three interlocking mechanisms:

  • Masked LayerNorm/RMSNorm parameters retain their pre-trained values; no re-normalization across the model occurs.
  • The Fisher-score selection preferentially retains parameters with maximal expected impact on loss reduction, minimizing instability from low-importance parameter updates.
  • The scale restoration correction ensures LayerNorm/RMSNorm output magnitudes remain consistent pre- and post-pruning, mitigating drift in activation statistics that typically arises from aggressive channel/width pruning.
  • If extremely aggressive pruning is performed (e.g., $>80\%$ channel removal), the compensation factor $c$ may amplify noise; in these regimes, $c$ can be clamped (e.g., $c \leq 2$) or combined with mild L₂ regularization for robustness (Chen et al., 26 May 2025).

5. Empirical Results

SLNP has been benchmarked across parameter-efficient adaptation and LLM acceleration:

| Scenario (GLUE / Pangu) | Trainable Fraction | Performance Margin | Notes |
|---|---|---|---|
| Full fine-tuning (BERT-large) | 100% (333.6M) | Baseline | |
| BitFit (bias-only) | 0.082% | 1–2 pt below baseline | |
| SLNP (LayerNorm-only) | 0.015% (51,202) | <1 pt | Matches BitFit |
| SLNP (10% LN) | 0.0015% (~5.1K) | ~1–2 pts | Beats random subset |
| Pangu Light w/ CLAP | N/A | +2.9 over baseline | Depth/attention pruning |
| +SLNP re-initialization | N/A | +0.7 over CLAP | Additional improvement |

Statistical tests (e.g., Kruskal–Wallis, $p \approx 0.55$ between BitFit and SLNP) indicate no significant difference between the LN-masking and bias-only strategies (ValizadehAslani et al., 2024), and ablations show SLNP yields a $+0.7$ average improvement when integrated with depth pruning (CLAP) in the Pangu Light framework (Chen et al., 26 May 2025). The close match of the mean and standard deviation of the $\gamma$ distributions pre- and post-SLNP further validates preservation of scaling statistics.

6. Practical Implementation Notes

SLNP is modular, introducing negligible computational overhead in both training and inference. It has no intrinsic learning-rate or regularization hyperparameters, with the only control being the global top-K masking ratio post-Fisher sorting. Calibration-set importance scores can be aggregated across tasks to generate cross-task masks or used in leave-one-out fashion for zero-shot adaptation. The rescaling step applies per layer and integrates directly into any structured-pruning pipeline that leverages normalization layers.
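
As one possible reading of the cross-task aggregation mentioned above, per-task Fisher score dictionaries (as produced by the sketch in Section 2.1) could simply be averaged before thresholding; the names below are illustrative:

```python
import torch

def aggregate_fisher_scores(per_task_scores):
    """Average a list of per-task Fisher score dictionaries into one cross-task dictionary."""
    agg = {}
    for task_scores in per_task_scores:
        for name, s in task_scores.items():
            agg[name] = agg.get(name, torch.zeros_like(s)) + s / len(per_task_scores)
    return agg

# Leave-one-out variant: aggregate scores from all tasks except the held-out one,
# then build masks (Section 3 sketch) to adapt zero-shot to that task.
```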

In Pangu Light, SLNP synergizes with the CLAP module: whereas CLAP maintains critical cross-layer attention head features in depth-pruned networks, SLNP preserves activation scale, collectively enabling joint width/depth pruning while retaining up to $98.9\%$ of original accuracy at a $2.1\times$ speed-up (Chen et al., 26 May 2025).

7. Limitations and Implications

SLNP's efficacy is contingent on the degree of pruning: excessive pruning can inflate $c$ substantially, introducing instability, so moderation is recommended. In low-resource settings, SLNP's L₂-preserving re-initialization is expected to yield more robust starting points for distillation and adaptation. Its negligible overhead and compatibility with all LayerNorm/RMSNorm architectures distinguish it from alternative normalization- or attention-focused pruning strategies. A plausible implication is that Fisher information-based normalization parameter selection may generalize to other affine-invariant architectures, further reducing the cost of domain adaptation and model deployment.
