The Curse of Depth in Large Language Models
(2502.05795v2)
Published 9 Feb 2025 in cs.LG and cs.AI
Abstract: In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern LLMs where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at https://github.com/lmsdss/LayerNorm-Scaling.
Summary
The paper introduces the Curse of Depth, showing that deeper layers in LLMs contribute less to learning.
It identifies Pre-LN normalization as the cause of exponential variance growth and proposes LayerNorm Scaling to counteract it.
Empirical tests on models from 130M to 1B parameters demonstrate that LayerNorm Scaling reduces perplexity and enhances training efficiency.
The paper introduces the concept of the Curse of Depth (CoD) in LLMs, where deeper layers contribute less to learning compared to earlier layers. The authors confirm the existence of this phenomenon across popular LLM families like Llama, Mistral, DeepSeek, and Qwen. They attribute the root cause of CoD to the widespread use of Pre-Layer Normalization (Pre-LN), which leads to an exponential growth of output variance with model depth, causing the derivatives of deep Transformer blocks to approach an identity matrix.
To address this issue, the authors propose LayerNorm Scaling (LNS), which multiplies the output of layer normalization at layer ℓ by the inverse square root of its depth, 1/√ℓ (a short code sketch follows the definition below). This modification mitigates the output variance explosion in deeper layers, improving their contribution:
h'_\ell = \mathrm{LayerNorm}(h_\ell) \times \frac{1}{\sqrt{\ell}}
Where:
h′_ℓ is the scaled output at layer ℓ
h_ℓ denotes the input to layer normalization at layer ℓ
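A minimal PyTorch sketch of this scaling rule (illustrative only; the module and argument names are not taken from the authors' released code):

```python
import math
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is multiplied by 1/sqrt(layer_index),
    i.e. h'_l = LayerNorm(h_l) * 1/sqrt(l)."""

    def __init__(self, hidden_dim: int, layer_index: int):
        super().__init__()
        assert layer_index >= 1, "layer indices are 1-based here"
        self.norm = nn.LayerNorm(hidden_dim)
        self.scale = 1.0 / math.sqrt(layer_index)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale the normalized activations by 1/sqrt(l) to damp the
        # variance contribution of deeper layers.
        return self.norm(x) * self.scale
```

Inside a Pre-LN block, such a module would stand in for each plain LayerNorm, with layer_index set to the block's 1-based position in the stack.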
The paper includes theoretical and empirical evidence to support these claims. Experiments on models ranging from 130M to 1B parameters show that LayerNorm Scaling enhances LLM pre-training performance compared to Pre-LN, and this improvement carries over to supervised fine-tuning.
Empirical evidence for CoD comes from layer-pruning experiments on LLMs (a pruning-sweep sketch is given below). The performance drop from removing layer ℓ, ΔP(ℓ), is quantified as:
\Delta P^{(\ell)} = P_{\mathrm{pruned}}^{(\ell)} - P_{\mathrm{original}}
Where:
P_original represents the performance of the unpruned model
P_pruned^(ℓ) denotes the performance after removing layer ℓ
The findings indicate that Pre-LN models are largely unaffected by the removal of their deeper layers, suggesting those layers contribute little, whereas BERT, which uses Post-LN, shows the opposite trend.
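As a rough illustration of such a pruning sweep (not the paper's evaluation pipeline), the sketch below assumes a decoder-style model that exposes its Transformer blocks as model.layers and a user-supplied evaluate() callable returning a scalar benchmark score (e.g. ARC-e accuracy); both names are placeholders:

```python
import copy
import torch.nn as nn

def layer_pruning_sweep(model: nn.Module, evaluate) -> dict:
    """Compute Delta P(l) = P_pruned(l) - P_original for every block.

    Assumes the blocks live in `model.layers` (an nn.ModuleList) and that
    `evaluate(model)` returns a scalar benchmark score; both are placeholder
    assumptions, not a fixed API.
    """
    p_original = evaluate(model)
    drops = {}
    for idx in range(len(model.layers)):
        pruned = copy.deepcopy(model)
        # Rebuild the block list with layer idx removed.
        pruned.layers = nn.ModuleList(
            blk for i, blk in enumerate(pruned.layers) if i != idx
        )
        # A strongly negative value means the removed layer mattered.
        drops[idx] = evaluate(pruned) - p_original
    return drops
```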
The paper then analyzes Pre-LN Transformers, where the output y_ℓ of the ℓ-th layer is computed as follows (a minimal block implementation is sketched after the definitions):
y_\ell = x_{\ell+1} = x'_\ell + \mathrm{FFN}(\mathrm{LN}(x'_\ell))
x'_\ell = x_\ell + \mathrm{Attn}(\mathrm{LN}(x_\ell))
Where:
x_ℓ ∈ ℝ^d is the input vector at the ℓ-th layer
x′_ℓ is the intermediate state after the attention sub-layer
d denotes the feature dimension of each layer
LN denotes the layer normalization function
FFN is the feed-forward network
Attn is the multi-head self-attention
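For concreteness, here is a minimal PyTorch sketch of one such Pre-LN block (dropout, masking, and positional details omitted; names are illustrative):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One Pre-LN Transformer block:
    x'_l = x_l + Attn(LN(x_l)),  x_{l+1} = x'_l + FFN(LN(x'_l))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # x'_l
        x = x + self.ffn(self.ln2(x))                       # x_{l+1}
        return x
```

With LayerNorm Scaling, the two LayerNorm calls would instead be scaled by 1/√ℓ, as in the earlier sketch.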
The derivatives of Pre-LN Transformers are then analyzed, along with the variance σ²_{x_ℓ} of the vector x_ℓ. Under Assumption 1, which posits that x_ℓ, x′_ℓ, and W_ℓ (the parameter matrix at the ℓ-th layer) follow independent normal distributions with mean μ = 0, Lemma 1 states that the variances σ²_{x′_ℓ} and σ²_{x_ℓ} exhibit the same overall growth trend:
\sigma_{x_\ell}^2 = \sigma_{x_1}^2\,\Theta\!\left(\prod_{k=1}^{\ell-1}\left(1 + \frac{1}{\sigma_{x_k}}\right)\right)
Theorem 1 states that, for a Pre-LN Transformer with L layers, the partial derivative ∂y_L/∂x_1 can be written as:
\frac{\partial y_L}{\partial x_1} = \prod_{\ell=1}^{L-1}\left(\frac{\partial y_\ell}{\partial x'_\ell} \cdot \frac{\partial x'_\ell}{\partial x_\ell}\right)
The Euclidean norm of ∂y_L/∂x_1 is bounded by:
\left\|\frac{\partial y_L}{\partial x_1}\right\|_2 \le \prod_{\ell=1}^{L-1}\left(1 + \frac{A}{\sigma_{x_\ell}} + \frac{B}{\sigma_{x_\ell}^2}\right)
Where:
A and B are constants for the Transformer network
The authors also present a theoretical analysis of LayerNorm Scaling, showing that it effectively slows the growth of the variance upper bound, reducing it from exponential to polynomial. Lemma 2 shows that, after applying the scaling, the variances of x′_ℓ and x_ℓ still exhibit the same growth trend, which now satisfies:
\sigma_{x_{\ell+1}}^2 = \sigma_{x_\ell}^2\,\Theta\!\left(1 + \frac{1}{\sqrt{\ell}\,\sigma_{x_\ell}}\right)
Theorem 2 shows that, for the scaled Pre-LN Transformer, the Euclidean norm of ∂y_L/∂x_1 is bounded by:
\left\|\frac{\partial y_L}{\partial x_1}\right\|_2 \le \prod_{\ell=1}^{L-1}\left(1 + \frac{A}{\sqrt{\ell}\,\sigma_{x_\ell}} + \frac{B}{\ell^2\,\sigma_{x_\ell}^2}\right)
Where:
A and B are constants that depend on the parameters of the scaled network
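The variance behavior behind these bounds can be illustrated with a toy simulation: stacks of randomly initialized FFN-only Pre-LN blocks, with and without the 1/√ℓ scaling, tracking the residual-stream variance with depth. This is only a sketch under simplified assumptions (no attention, fresh random weights per layer), so the exact growth rates differ from the paper's analysis, but the scaled variant's variance grows visibly more slowly:

```python
import math
import torch

torch.manual_seed(0)
d, depth, batch = 512, 64, 32

def ffn(x):
    # A fresh random two-layer MLP per block (simplified stand-in for a Transformer block).
    w1 = torch.randn(d, 4 * d) / math.sqrt(d)
    w2 = torch.randn(4 * d, d) / math.sqrt(4 * d)
    return torch.relu(x @ w1) @ w2

def layer_norm(x):
    return (x - x.mean(-1, keepdim=True)) / x.std(-1, keepdim=True)

for scaled in (False, True):
    x = torch.randn(batch, d)
    for layer in range(1, depth + 1):
        branch = ffn(layer_norm(x))
        if scaled:
            branch = branch / math.sqrt(layer)  # LayerNorm Scaling: multiply by 1/sqrt(l)
        x = x + branch                          # Pre-LN residual update
    tag = "LayerNorm Scaling" if scaled else "Pre-LN"
    print(f"{tag}: residual-stream variance after {depth} layers = {x.var().item():.1f}")
```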
Experiments include LLM pre-training and supervised fine-tuning. Results demonstrate that LayerNorm Scaling consistently outperforms other normalization methods across different model sizes. For instance, on LLaMA-130M and LLaMA-1B, LayerNorm Scaling reduces perplexity by 0.97 and 1.31, respectively, compared to Pre-LN.
The authors conduct a layer pruning experiment on LLaMA-130M, removing individual layers and measuring the performance drop (ΔP(ℓ)) on the ARC-e benchmark, to demonstrate how LayerNorm Scaling improves deep layer effectiveness.