
The Curse of Depth in Large Language Models (2502.05795v2)

Published 9 Feb 2025 in cs.LG and cs.AI

Abstract: In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern LLMs where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at \href{https://github.com/lmsdss/LayerNorm-Scaling}{LayerNorm-Scaling}.

Summary

  • The paper introduces the Curse of Depth, showing that deeper layers in LLMs contribute less to learning.
  • It identifies Pre-LN normalization as the cause of exponential variance growth and proposes LayerNorm Scaling to counteract it.
  • Empirical tests on models from 130M to 1B parameters demonstrate that LayerNorm Scaling reduces perplexity and enhances training efficiency.

The paper introduces the concept of the Curse of Depth (CoD) in LLMs, where deeper layers contribute less to learning compared to earlier layers. The authors confirm the existence of this phenomenon across popular LLM families like Llama, Mistral, DeepSeek, and Qwen. They attribute the root cause of CoD to the widespread use of Pre-Layer Normalization (Pre-LN), which leads to an exponential growth of output variance with model depth, causing the derivatives of deep Transformer blocks to approach an identity matrix.

To address this issue, the authors propose LayerNorm Scaling (LNS), which scales the output of the layer normalization at layer $\ell$ by the inverse square root of its depth, $\frac{1}{\sqrt{\ell}}$. This modification mitigates the output variance explosion in deeper layers, improving their contribution.

$$\mathbf{h}'^{(\ell)} = \mathrm{LayerNorm}\bigl(\mathbf{h}^{(\ell)}\bigr) \times \frac{1}{\sqrt{\ell}}$$

Where:

  • $\mathbf{h}'^{(\ell)}$ is the scaled (modified) output
  • $\mathbf{h}^{(\ell)}$ denotes the input to Layer Normalization at layer $\ell$
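
To make the scaling step concrete, the following is a minimal PyTorch-style sketch (the module name `ScaledLayerNorm` and its arguments are illustrative assumptions, not the authors' implementation; their code is in the linked repository):

```python
import math

import torch
import torch.nn as nn


class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is multiplied by 1/sqrt(layer_index), i.e. LayerNorm Scaling."""

    def __init__(self, hidden_dim: int, layer_index: int):
        super().__init__()
        assert layer_index >= 1, "layer indices are assumed to be 1-based"
        self.norm = nn.LayerNorm(hidden_dim)
        self.scale = 1.0 / math.sqrt(layer_index)  # the 1/sqrt(l) factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard layer normalization followed by the depth-dependent down-scaling.
        return self.norm(x) * self.scale
```

In a Pre-LN block at depth $\ell$, such a module would simply replace the layer norms in front of the attention and feed-forward sub-layers.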

The paper includes theoretical and empirical evidence to support these claims. Experiments on models ranging from 130M to 1B parameters show that LayerNorm Scaling enhances LLM pre-training performance compared to Pre-LN, and this improvement carries over to supervised fine-tuning.

Empirical evidence for CoD comes from layer pruning experiments on LLMs. The performance drop, $\Delta P^{(\ell)}$, is quantified as:

$$\Delta P^{(\ell)} = P^{(\ell)}_{\text{pruned}} - P_{\text{original}}$$

Where:

  • $P_{\text{original}}$ represents the performance of the unpruned model
  • $P^{(\ell)}_{\text{pruned}}$ denotes the performance after removing layer $\ell$
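
A hedged sketch of how such a pruning probe could be scripted is shown below; the `evaluate` scoring function, the `eval_data` argument, and the `model.layers` attribute are assumptions about the evaluation harness, not the paper's code:

```python
import copy


def performance_drop(model, layer_index, evaluate, eval_data):
    """Return Delta P^(l): score after removing layer `layer_index` minus the original score.

    `evaluate(model, eval_data)` is a hypothetical scoring function (e.g. ARC-e accuracy),
    and `model.layers` is assumed to be the list of Transformer blocks.
    """
    p_original = evaluate(model, eval_data)

    pruned = copy.deepcopy(model)
    del pruned.layers[layer_index]        # drop the l-th Transformer block
    p_pruned = evaluate(pruned, eval_data)

    # A drop close to zero means the removed layer was contributing little.
    return p_pruned - p_original
```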

The findings indicate that models with Pre-LN are robust to the removal of deeper layers, whereas BERT with Post-LN shows the opposite trend.

The paper analyzes Pre-LN Transformers, where the output $y_\ell$ of the $\ell$-th layer is calculated as:

$$y_\ell = x_{\ell+1} = x'_\ell + \mathrm{FFN}(\mathrm{LN}(x'_\ell))$$

$$x'_\ell = x_\ell + \mathrm{Attn}(\mathrm{LN}(x_\ell))$$

Where:

  • $x_\ell \in \mathbb{R}^d$ is the input vector at the $\ell$-th layer
  • $d$ denotes the feature dimension of each layer
  • LN denotes the layer normalization function
  • FFN is the feed-forward network
  • Attn is the multi-head self-attention
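
For reference, here is a sketch of one Pre-LN block that mirrors the two equations above (attention masking, dropout, and other details are omitted for brevity):

```python
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    """One Pre-LN Transformer layer: x' = x + Attn(LN(x)); y = x' + FFN(LN(x'))."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x_prime = x + self.attn(h, h, h, need_weights=False)[0]  # x' = x + Attn(LN(x))
        y = x_prime + self.ffn(self.ln2(x_prime))                # y = x' + FFN(LN(x'))
        return y
```

Under LayerNorm Scaling, `self.ln1` and `self.ln2` would additionally scale their outputs by $1/\sqrt{\ell}$, as in the earlier sketch.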

The derivatives of Pre-LN Transformers are analyzed, and the variance $\sigma^2_{x_\ell}$ of the vector $x_\ell$ is computed. Under Assumption 1, which posits that $x_\ell$, $x'_\ell$, and $W_\ell$ (the model parameter matrix at the $\ell$-th layer) follow normal, independent distributions with mean $\mu = 0$, Lemma 1 states that the variances $\sigma^2_{x'_\ell}$ and $\sigma^2_{x_\ell}$ exhibit the same overall growth trend:

$$\sigma^2_{x_{\ell}} = \sigma_{x_1}^2\, \Theta\!\Bigl(\prod_{k=1}^{\ell-1} \Bigl( 1 + \frac{1}{\sigma_{x_k}} \Bigr) \Bigr)$$
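
Treating the $\Theta$-constants as one, the growth trend of Lemma 1 can be iterated numerically; this is a toy calculation of the stated recurrence, not a simulation of a trained model:

```python
import math

# Iterate the Lemma 1 trend sigma_{l+1}^2 ≈ sigma_l^2 * (1 + 1/sigma_l), starting from sigma_1^2 = 1.
variance = 1.0
for layer in range(1, 65):
    variance *= 1.0 + 1.0 / math.sqrt(variance)
    if layer % 16 == 0:
        print(f"after layer {layer:3d}: variance ≈ {variance:8.1f}")
# The variance keeps compounding with depth, so inputs reaching deep layers have a
# much larger scale than those reaching early layers.
```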

Theorem 1 indicates that for a Pre-LN Transformer with $L$ layers, the partial derivative $\frac{\partial y_L}{\partial x_1}$ can be written as:

$$\frac{\partial y_L}{\partial x_1} = \prod_{\ell=1}^{L-1} \left( \frac{\partial y_\ell}{\partial x'_\ell} \cdot \frac{\partial x'_\ell}{\partial x_\ell} \right)$$

The Euclidean norm of $\frac{\partial y_L}{\partial x_1}$ is bounded by:

$$\left\| \frac{\partial y_L}{\partial x_1} \right\|_2 \leq \prod_{\ell=1}^{L-1} \left( 1 + \frac{1}{\sigma_{x_\ell}} A + \frac{1}{\sigma_{x_\ell}^2} B \right)$$

Where:

  • $A$ and $B$ are constants of the Transformer network
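
To see why this bound implies near-identity behaviour in deep blocks, one can plug the growing variance sequence into the per-layer factors; $A$ and $B$ are set to illustrative placeholder values here, so only the trend of the numbers is meaningful:

```python
import math

A, B = 1.0, 1.0   # placeholder constants; Theorem 1 only requires them to be fixed
variance = 1.0    # sigma_{x_1}^2
for layer in range(1, 33):
    sigma = math.sqrt(variance)
    factor = 1.0 + A / sigma + B / sigma ** 2  # per-layer term of the Theorem 1 bound
    if layer % 8 == 0:
        print(f"layer {layer:3d}: bound factor ≈ {factor:.3f}")
    variance *= 1.0 + 1.0 / sigma              # Lemma 1 growth of the variance
# As sigma_{x_l} grows with depth, each factor approaches 1, so deep blocks alter the
# signal (and its gradient) less and less, behaving almost like identity maps.
```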

The authors also present a theoretical analysis of LayerNorm Scaling, showing that it effectively slows the growth of the variance upper bound, reducing it from exponential to polynomial growth. Lemma 2 shows that, after applying the scaling, the variances of $x'_\ell$ and $x_\ell$ exhibit the same growth trend, as follows:

$$\sigma^2_{x_{\ell+1}} = \sigma_{x_\ell}^2\, \Theta\!\left( 1 + \frac{1}{\sqrt{\ell}\, \sigma_{x_\ell}} \right)$$

Theorem 2 shows that, for the scaled Pre-LN Transformers, the Euclidean norm of $\frac{\partial y_L}{\partial x_1}$ is bounded by:

$$\left\| \frac{\partial y_L}{\partial x_1} \right\|_2 \leq \prod_{\ell=1}^{L-1} \left( 1 + \frac{1}{\ell\, \sigma_{x_\ell}} A + \frac{1}{\ell^2 \sigma_{x_\ell}^2} B \right)$$

Where:

  • $A$ and $B$ depend on the parameters of the scaled network
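
Iterating the two variance trends side by side (again with all $\Theta$-constants set to one) shows how much more slowly the scaled recurrence grows; as before, only the relative trend is meaningful:

```python
import math

var_pre_ln, var_lns = 1.0, 1.0
for layer in range(1, 65):
    var_pre_ln *= 1.0 + 1.0 / math.sqrt(var_pre_ln)                     # Lemma 1 (Pre-LN)
    var_lns    *= 1.0 + 1.0 / (math.sqrt(layer) * math.sqrt(var_lns))   # Lemma 2 (with LNS)
    if layer % 16 == 0:
        print(f"layer {layer:3d}: Pre-LN variance ≈ {var_pre_ln:8.1f}   "
              f"LNS variance ≈ {var_lns:6.1f}")
# The scaled recurrence grows much more slowly, illustrating how LayerNorm Scaling
# damps the depth-wise variance growth of Pre-LN.
```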

Experiments include LLM pre-training and supervised fine-tuning. Results demonstrate that LayerNorm Scaling consistently outperforms other normalization methods across different model sizes. For instance, on LLaMA-130M and LLaMA-1B, LayerNorm Scaling reduces perplexity by 0.97 and 1.31, respectively, compared to Pre-LN.

The authors conduct a layer pruning experiment on LLaMA-130M, removing individual layers and measuring the performance drop ($\Delta P^{(\ell)}$) on the ARC-e benchmark, to demonstrate how LayerNorm Scaling improves deep layer effectiveness.
