
The Curse of Depth in Large Language Models (2502.05795v2)

Published 9 Feb 2025 in cs.LG and cs.AI

Abstract: In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern LLMs where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at \href{https://github.com/lmsdss/LayerNorm-Scaling}{LayerNorm-Scaling}.

Summary

  • The paper introduces the Curse of Depth, showing that deeper layers in LLMs contribute less to learning.
  • It identifies Pre-LN normalization as the cause of exponential variance growth and proposes LayerNorm Scaling to counteract it.
  • Empirical tests on models from 130M to 1B parameters demonstrate that LayerNorm Scaling reduces perplexity and enhances training efficiency.

The paper introduces the concept of the Curse of Depth (CoD) in LLMs, where deeper layers contribute less to learning compared to earlier layers. The authors confirm the existence of this phenomenon across popular LLM families like Llama, Mistral, DeepSeek, and Qwen. They attribute the root cause of CoD to the widespread use of Pre-Layer Normalization (Pre-LN), which leads to an exponential growth of output variance with model depth, causing the derivatives of deep Transformer blocks to approach an identity matrix.

To address this issue, the authors propose LayerNorm Scaling (LNS), which scales the output of the layer normalization at layer $\ell$ by the inverse square root of its depth, $\frac{1}{\sqrt{\ell}}$. This modification mitigates the output variance explosion in deeper layers, improving their contribution.

$$\mathbf{h}'^{(\ell)} = \mathrm{LayerNorm}\bigl(\mathbf{h}^{(\ell)}\bigr) \times \frac{1}{\sqrt{\ell}}$$

Where:

  • $\mathbf{h}'^{(\ell)}$ is the scaled (modified) output
  • $\mathbf{h}^{(\ell)}$ denotes the input to Layer Normalization at layer $\ell$
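
To make the scaling step concrete, the following is a minimal PyTorch-style sketch (the module name `ScaledLayerNorm` and its arguments are illustrative assumptions, not the authors' implementation; their code is in the linked repository):

```python
import math

import torch
import torch.nn as nn


class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is multiplied by 1/sqrt(layer_index), i.e. LayerNorm Scaling."""

    def __init__(self, hidden_dim: int, layer_index: int):
        super().__init__()
        assert layer_index >= 1, "layer indices are assumed to be 1-based"
        self.norm = nn.LayerNorm(hidden_dim)
        self.scale = 1.0 / math.sqrt(layer_index)  # the 1/sqrt(l) factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard layer normalization followed by the depth-dependent down-scaling.
        return self.norm(x) * self.scale
```

In a Pre-LN block at depth $\ell$, such a module would simply replace the layer norms in front of the attention and feed-forward sub-layers.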

The paper includes theoretical and empirical evidence to support these claims. Experiments on models ranging from 130M to 1B parameters show that LayerNorm Scaling enhances LLM pre-training performance compared to Pre-LN, and this improvement carries over to supervised fine-tuning.

Empirical evidence for CoD comes from layer pruning experiments on LLMs. The performance drop, $\Delta P^{(\ell)}$, is quantified as:

$$\Delta P^{(\ell)} = P^{(\ell)}_{\text{pruned}} - P_{\text{original}}$$

Where:

  • $P_{\text{original}}$ represents the performance of the unpruned model
  • $P^{(\ell)}_{\text{pruned}}$ denotes the performance after removing layer $\ell$
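
A hedged sketch of how such a pruning probe could be scripted is shown below; the `evaluate` scoring function, the `eval_data` argument, and the `model.layers` attribute are assumptions about the evaluation harness, not the paper's code:

```python
import copy


def performance_drop(model, layer_index, evaluate, eval_data):
    """Return Delta P^(l): score after removing layer `layer_index` minus the original score.

    `evaluate(model, eval_data)` is a hypothetical scoring function (e.g. ARC-e accuracy),
    and `model.layers` is assumed to be the list of Transformer blocks.
    """
    p_original = evaluate(model, eval_data)

    pruned = copy.deepcopy(model)
    del pruned.layers[layer_index]        # drop the l-th Transformer block
    p_pruned = evaluate(pruned, eval_data)

    # A drop close to zero means the removed layer was contributing little.
    return p_pruned - p_original
```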

The findings indicate that models with Pre-LN are robust to the removal of deeper layers, whereas BERT with Post-LN shows the opposite trend.

The paper analyzes Pre-LN Transformers, where the output $y_\ell$ of the $\ell$-th layer is calculated as:

$$y_\ell = x_{\ell+1} = x'_\ell + \mathrm{FFN}(\mathrm{LN}(x'_\ell))$$

$$x'_\ell = x_\ell + \mathrm{Attn}(\mathrm{LN}(x_\ell))$$

Where:

  • $x_\ell \in \mathbb{R}^d$ is the input vector at the $\ell$-th layer
  • $d$ denotes the feature dimension of each layer
  • LN denotes the layer normalization function
  • FFN is the feed-forward network
  • Attn is the multi-head self-attention
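
For reference, here is a sketch of one Pre-LN block that mirrors the two equations above (attention masking, dropout, and other details are omitted for brevity):

```python
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    """One Pre-LN Transformer layer: x' = x + Attn(LN(x)); y = x' + FFN(LN(x'))."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x_prime = x + self.attn(h, h, h, need_weights=False)[0]  # x' = x + Attn(LN(x))
        y = x_prime + self.ffn(self.ln2(x_prime))                # y = x' + FFN(LN(x'))
        return y
```

Under LayerNorm Scaling, `self.ln1` and `self.ln2` would additionally scale their outputs by $1/\sqrt{\ell}$, as in the earlier sketch.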

The derivatives of Pre-LN Transformers are analyzed, and the variance $\sigma^2_{x_\ell}$ of the vector $x_\ell$ is computed. Under Assumption 1, which posits that $x_\ell$, $x'_\ell$, and $W_\ell$ (the model parameter matrix at the $\ell$-th layer) follow normal, independent distributions with mean $\mu = 0$, Lemma 1 states that the variances $\sigma^2_{x'_\ell}$ and $\sigma^2_{x_\ell}$ exhibit the same overall growth trend:

$$\sigma^2_{x_{\ell}} = \sigma_{x_1}^2\, \Theta\!\Bigl(\prod_{k=1}^{\ell-1} \Bigl( 1 + \frac{1}{\sigma_{x_k}} \Bigr) \Bigr)$$
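
Treating the $\Theta$-constants as one, the growth trend of Lemma 1 can be iterated numerically; this is a toy calculation of the stated recurrence, not a simulation of a trained model:

```python
import math

# Iterate the Lemma 1 trend sigma_{l+1}^2 ≈ sigma_l^2 * (1 + 1/sigma_l), starting from sigma_1^2 = 1.
variance = 1.0
for layer in range(1, 65):
    variance *= 1.0 + 1.0 / math.sqrt(variance)
    if layer % 16 == 0:
        print(f"after layer {layer:3d}: variance ≈ {variance:8.1f}")
# The variance keeps compounding with depth, so inputs reaching deep layers have a
# much larger scale than those reaching early layers.
```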

Theorem 1 indicates that for a Pre-LN Transformer with $L$ layers, the partial derivative $\frac{\partial y_L}{\partial x_1}$ can be written as:

$$\frac{\partial y_L}{\partial x_1} = \prod_{\ell=1}^{L-1} \left( \frac{\partial y_\ell}{\partial x'_\ell} \cdot \frac{\partial x'_\ell}{\partial x_\ell} \right)$$

The Euclidean norm of $\frac{\partial y_L}{\partial x_1}$ is bounded by:

$$\left\| \frac{\partial y_L}{\partial x_1} \right\|_2 \leq \prod_{\ell=1}^{L-1} \left( 1 + \frac{1}{\sigma_{x_\ell}} A + \frac{1}{\sigma_{x_\ell}^2} B \right)$$

Where:

  • $A$ and $B$ are constants of the Transformer network
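
To see why this bound implies near-identity behaviour in deep blocks, one can plug the growing variance sequence into the per-layer factors; $A$ and $B$ are set to illustrative placeholder values here, so only the trend of the numbers is meaningful:

```python
import math

A, B = 1.0, 1.0   # placeholder constants; Theorem 1 only requires them to be fixed
variance = 1.0    # sigma_{x_1}^2
for layer in range(1, 33):
    sigma = math.sqrt(variance)
    factor = 1.0 + A / sigma + B / sigma ** 2  # per-layer term of the Theorem 1 bound
    if layer % 8 == 0:
        print(f"layer {layer:3d}: bound factor ≈ {factor:.3f}")
    variance *= 1.0 + 1.0 / sigma              # Lemma 1 growth of the variance
# As sigma_{x_l} grows with depth, each factor approaches 1, so deep blocks alter the
# signal (and its gradient) less and less, behaving almost like identity maps.
```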

The authors also present a theoretical analysis of LayerNorm Scaling, showing that it effectively slows the growth of the variance upper bound, reducing it from exponential to polynomial growth. Lemma 2 shows that, after applying the scaling, the variances of $x'_\ell$ and $x_\ell$ exhibit the same growth trend, as follows:

$$\sigma^2_{x_{\ell+1}} = \sigma_{x_\ell}^2\, \Theta\!\left( 1 + \frac{1}{\sqrt{\ell}\, \sigma_{x_\ell}} \right)$$

Theorem 2 shows that, for the scaled Pre-LN Transformers, the Euclidean norm of $\frac{\partial y_L}{\partial x_1}$ is bounded by:

$$\left\| \frac{\partial y_L}{\partial x_1} \right\|_2 \leq \prod_{\ell=1}^{L-1} \left( 1 + \frac{1}{\ell\, \sigma_{x_\ell}} A + \frac{1}{\ell^2 \sigma_{x_\ell}^2} B \right)$$

Where:

  • $A$ and $B$ depend on the parameters of the scaled network
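
Iterating the two variance trends side by side (again with all $\Theta$-constants set to one) shows how much more slowly the scaled recurrence grows; as before, only the relative trend is meaningful:

```python
import math

var_pre_ln, var_lns = 1.0, 1.0
for layer in range(1, 65):
    var_pre_ln *= 1.0 + 1.0 / math.sqrt(var_pre_ln)                     # Lemma 1 (Pre-LN)
    var_lns    *= 1.0 + 1.0 / (math.sqrt(layer) * math.sqrt(var_lns))   # Lemma 2 (with LNS)
    if layer % 16 == 0:
        print(f"layer {layer:3d}: Pre-LN variance ≈ {var_pre_ln:8.1f}   "
              f"LNS variance ≈ {var_lns:6.1f}")
# The scaled recurrence grows much more slowly, illustrating how LayerNorm Scaling
# damps the depth-wise variance growth of Pre-LN.
```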

Experiments include LLM pre-training and supervised fine-tuning. Results demonstrate that LayerNorm Scaling consistently outperforms other normalization methods across different model sizes. For instance, on LLaMA-130M and LLaMA-1B, LayerNorm Scaling reduces perplexity by 0.97 and 1.31, respectively, compared to Pre-LN.

The authors conduct a layer pruning experiment on LLaMA-130M, removing individual layers and measuring the performance drop ($\Delta P^{(\ell)}$) on the ARC-e benchmark, to demonstrate how LayerNorm Scaling improves deep layer effectiveness.
