
DeepScaleLM: Ultra-Deep Transformer Scaling

Updated 31 January 2026
  • DeepScaleLM is an initialization and scaling scheme for Transformers that preserves unit output and gradient moments to enable stable training of models up to 1000 layers.
  • It employs precise residual and skip-connection scaling alongside dropout strategies to prevent vanishing/exploding gradients and rank collapse across various architectures.
  • Empirical results demonstrate its effectiveness in improving language modeling, speech translation, and image classification performance compared to shallow baselines.

DeepScaleLM is an initialization and scaling scheme for Transformer-based deep neural networks that enables stable and efficient training of very deep models (up to 1000 layers) by rigorously conserving unit output and gradient moments at initialization. It was introduced by Kedia et al. as part of a unified theory of signal propagation in Transformers, with exact analytical recurrences for the forward and backward signal variances, and is designed to prevent the vanishing/exploding gradients, rank collapse, and instability associated with large-scale models. Empirically, DeepScaleLM improves language modeling, speech translation, and image classification across encoder-only, decoder-only, and encoder–decoder architectures, outperforming shallow models of equivalent parameter count (Kedia et al., 2024).

1. Unified Signal Propagation Theory

Signal propagation in Transformers governs the behavior of forward and backward variances and covariances of activations throughout the model's layers. At initialization, each sub-layer in the Transformer can be described by its effect on the input's second-moment statistics (mean $\mu$, variance $\sigma^2$, and inter-token correlation $r$). Component-wise moment propagation is analytically tractable for the following computations:

  • Linear layers: For $y = Wx$ with $W_{ij} \sim \mathcal N(0, \sigma_W^2)$: $\mathrm{Var}(y_i) = d\,\sigma_W^2\,\sigma_{\mathrm{in}}^2$.
  • ReLU: $\mathbb{E}[y_i] = \frac{\sigma_{\mathrm{in}}}{\sqrt{2\pi}}$, $\mathrm{Var}(y_i) = \sigma_{\mathrm{in}}^2\left(\tfrac12 - \tfrac1{2\pi}\right)$, and $\mathrm{Corr}(y_i, y_j) = \frac{1}{\pi}\left(\arcsin(r_{\mathrm{in}}) + r_{\mathrm{in}}\right)$.
  • Dropout: Variance is preserved; correlation is updated as $\mathrm{Corr}(y_i, y_j) = p^2 + (1-p)\rho_{ij}$.
  • LayerNorm: Always outputs unit variance: $\mathrm{Var}(\mathrm{LN}(x)_i) = 1$.
  • Attention/Softmax: For $L$ tokens, $\mathrm{Var}(\mathrm{Softmax}(z)) \approx \frac{\mathrm{Var}(z)}{L}$ for $L \gg 1$; gradients in the backward pass are scaled by $L$.
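As a quick sanity check, the linear-layer and ReLU moment formulas above can be verified by Monte Carlo simulation. This is a standalone sketch (not code from the paper); the weight variance is chosen as $\sigma_W^2 = 1/d$ so the predicted output variance is exactly 1:

```python
import math
import random
import statistics

random.seed(0)

d = 128                        # fan-in
sigma_w = math.sqrt(1.0 / d)   # weight std chosen so d * sigma_W^2 * sigma_in^2 = 1
n = 5000                       # Monte Carlo samples

# Linear layer: one output unit y_i = sum_j W_ij x_j with x_j ~ N(0, 1).
# Theory: Var(y_i) = d * sigma_W^2 * sigma_in^2 = 1 here.
ys = [sum(random.gauss(0, sigma_w) * random.gauss(0, 1) for _ in range(d))
      for _ in range(n)]
var_linear = statistics.pvariance(ys)

# ReLU on N(0, 1) inputs.
# Theory: mean = sigma_in / sqrt(2*pi) ~ 0.399, var = 1/2 - 1/(2*pi) ~ 0.341.
relu = [max(0.0, random.gauss(0, 1)) for _ in range(n)]
mean_relu = statistics.fmean(relu)
var_relu = statistics.pvariance(relu)

print(var_linear, mean_relu, var_relu)  # each close to its predicted moment
```

The simulated moments land within sampling noise of the analytical predictions, which is the property DeepScaleLM exploits when composing these maps across sub-layers.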

At the block level, for the $n$th Transformer block:

  • Attention Block:

$$\sigma^2_{\mathrm{attn}, n} \;\approx\; \frac{d}{L}\,\sigma^2_W\bigl(1+(L-1)r_{n-1}\bigr) + \sigma^2_V$$

Backpropagated gradient: $\gamma^2_{\mathrm{attn}, n-1} \approx \bigl(1 + (L-1)r_{n-1}\bigr)\,\gamma^2_{\mathrm{attn}, n}$.

  • FFN Block:

$$\sigma^2_{\mathrm{ffn}, n} \approx \frac{d}{4d}\,\sigma_1^2\left(\tfrac12-\tfrac1{2\pi}\right) + \frac{4d}{d}\,\sigma_2^2$$

Backward gain: $\gamma^2_{\mathrm{ffn}, n-1} \approx \left(1 + \tfrac12\right)\gamma^2_{\mathrm{ffn}, n}$.

Whole-network recurrences for Pre-LN Transformers predict linear growth in forward-pass variance and hyperbolic growth in backward gradients; for Post-LN, forward variance is preserved, but backward gradients grow or decay exponentially in depth. Empirical evaluations verify these recurrences to within 10% accuracy even at extreme percentiles (Kedia et al., 2024).
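These growth patterns can be illustrated with a toy version of the Pre-LN recurrences. The sketch below assumes each block contributes a fresh unit-variance branch output and that the branch LayerNorm rescales gradients by the inverse stream variance; the paper's exact recurrences additionally track correlation terms:

```python
N = 48  # depth

# Forward: Pre-LN adds each block's (roughly unit-variance) output to the
# residual stream, so variance grows linearly: Var(x_n) = 1 + n.
fwd_var = [1.0]
for _ in range(N):
    fwd_var.append(fwd_var[-1] + 1.0)

# Backward (toy model): each block multiplies gradient variance by
# (1 + 1/Var(x_n)), a 1/n ("hyperbolic") per-layer factor whose running
# product still grows with depth.
bwd_var = 1.0
for n in range(N, 0, -1):
    bwd_var *= 1.0 + 1.0 / fwd_var[n]

print(fwd_var[-1], bwd_var)  # -> 49.0 and ~25.0: both diverge with depth
```

Even though each backward factor shrinks like $1/n$, the product telescopes to roughly $N/2$, matching the qualitative prediction that unscaled Pre-LN gradients grow with depth.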

2. DeepScaleLM Initialization and Scaling Principles

DeepScaleLM directly targets the stabilization of signal propagation by enforcing, at initialization, (i) unit forward variance ($\sigma^2_{\mathrm{out}, n} = 1$) and (ii) unit backward variance ($\gamma^2_{\mathrm{in}, n} = 1$) at every sub-layer, and (iii) dropout or residual scaling to guarantee $r_n < 1$, such that rank collapse is provably avoided.

Residual/skip-connection scaling is implemented by expressing the output as:

$$\tilde{x}_{n+1} = \alpha x_n + \beta b_n,$$

with $\alpha^2 + \beta^2 = 1$. DeepScaleLM fixes $\beta^2 = \frac{1}{N}$ (where $N$ is the total number of layers), which results in $\alpha^2 = 1 - \frac{1}{N}$; this ensures neither the residual nor the skip branch dominates as depth increases, and exact variance preservation is achieved if $\mathrm{Var}(x_n) = \mathrm{Var}(b_n) = 1$ and $\mathrm{Cov}(x_n, b_n) = 0$.
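A direct numerical check of this variance-preservation identity (a standalone sketch, with $x_n$ and $b_n$ drawn as independent unit-variance Gaussians):

```python
import math
import random
import statistics

random.seed(1)

N = 100                          # total layers
beta = math.sqrt(1.0 / N)        # skip-connection scale, beta^2 = 1/N
alpha = math.sqrt(1.0 - 1.0 / N) # residual scale, alpha^2 = 1 - 1/N

# With Var(x) = Var(b) = 1 and Cov(x, b) = 0:
# Var(alpha*x + beta*b) = alpha^2 * 1 + beta^2 * 1 = 1 exactly.
out = [alpha * random.gauss(0, 1) + beta * random.gauss(0, 1)
       for _ in range(50000)]
print(statistics.pvariance(out))  # close to 1.0
```

The identity holds for any $N$, which is what lets the same scaling rule cover depths from a few layers to 1000.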

Block-specific weight variances are then set so that each block’s output is unit-variance and gradients are preserved:

  • Embedding table: $\sigma^2_{\mathrm{emb}} = \frac{1}{d_{\mathrm{emb}}(1-p)}$
  • FFN block: Solve $\sigma^2_1\left(\tfrac12-\tfrac1{2\pi}\right) + \sigma^2_2 = 1$, e.g., $\sigma^2_1 = \frac{1}{4d(1-p)}$, $\sigma^2_2 = \frac{1}{d(1-p)}$
  • Attention block: $\sigma^2_Q = \sigma^2_K = \frac{1}{d(1-p)}$, $\sigma^2_V = \frac{1}{d(1-p)}$, with layer-wise refinements for measured correlations (Kedia et al., 2024).

3. Prevention of Rank Collapse

Transformer depth increases the risk of rank collapse, in which all tokens collapse to identical representations as the inter-token correlation grows toward $r \to 1$. DeepScaleLM's dropout and scaling mechanism constrains the fixed point of $r$ below unity. Analytically, the correlation update is:

$$r_{n+1} = (1-p)f_{\mathrm{attn}}(r_n) + (1-p)f_{\mathrm{ffn}}(r_n) \;\rightarrow\; r_\infty < 1$$

where $f_{\mathrm{attn}}$ and $f_{\mathrm{ffn}}$ express the attention and MLP contributions (see Appendix F). A sufficient dropout probability (e.g., $p \approx 0.1$) arrests the collapse, keeping representations expressive across depth.
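To see how dropout bounds the correlation fixed point, one can iterate a toy version of this update, using the ReLU correlation map from Section 1 as a stand-in for the block contribution (the true $f_{\mathrm{attn}}$, $f_{\mathrm{ffn}}$ are given in the paper's Appendix F):

```python
import math

def relu_corr(r):
    # ReLU correlation map from Sec. 1: Corr_out = (arcsin(r) + r) / pi
    return (math.asin(r) + r) / math.pi

p = 0.1    # dropout probability
r = 0.99   # start near total collapse
for _ in range(200):
    # Dropout correlation update composed with the ReLU map (toy recursion)
    r = p * p + (1 - p) * relu_corr(r)

print(r)  # settles at a fixed point strictly below 1
```

Even starting from $r = 0.99$, the iteration contracts to a fixed point well below unity, illustrating why a modest dropout probability suffices to keep token representations distinct.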

4. Pre-LN vs. Post-LN Architectures

Once DeepScaleLM enforces $\alpha^2 + \beta^2 = 1$ and the prescribed weight variances, both Pre-LN and Post-LN Transformers benefit:

  • Pre-LN DeepScaleLM: Forward and backward variances ($\mathrm{Var}(x_n)$, $\gamma^2(x_n)$) are exactly 1 at every layer, so no drift, explosion, or collapse arises.
  • Post-LN DeepScaleLM: The usual exponential drift in gradients is suppressed; both forward and backward variances remain at unity due to normalization before and after the skip addition.

Both architectures thus obtain provable stability properties at initialization, correcting the usual pathologies associated with deep architectures.
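The depth-stability claim can be probed with a small simulation of the scaled residual stream: a toy model in which each block output is a fresh unit-variance signal, which is exactly what the initialization enforces at every layer:

```python
import math
import random
import statistics

random.seed(2)

N = 200  # layers
alpha = math.sqrt(1.0 - 1.0 / N)
beta = math.sqrt(1.0 / N)

# Propagate 10k samples of the residual stream through N scaled-residual
# updates; each block output is modeled as fresh unit-variance noise.
xs = [random.gauss(0, 1) for _ in range(10000)]
for _ in range(N):
    xs = [alpha * x + beta * random.gauss(0, 1) for x in xs]

print(statistics.pvariance(xs))  # still ~1.0 after 200 layers
```

Without the $\alpha, \beta$ scaling (i.e., $\alpha = \beta = 1$), the same loop would show variance growing linearly with depth, as in the unscaled Pre-LN analysis of Section 1.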

5. Implementation Guidelines

The DeepScaleLM methodology is universally applicable across Transformer flavors (encoder-only, decoder-only, encoder–decoder; e.g., BERT, GPT, T5, ViT, speech models). The key steps are:

  1. Initialize embeddings: $W_{\mathrm{emb}} \sim \mathcal{N}(0, 1/[d_{\mathrm{emb}}(1-p)])$.
  2. Initialize FFN weights: $W_1 \sim \mathcal{N}(0, 1/[4d(1-p)])$, $W_2 \sim \mathcal{N}(0, 1/[d(1-p)])$.
  3. Attention weights: $W_Q, W_K, W_V \sim \mathcal{N}(0, 1/[d(1-p)])$.
  4. Residual scaling: $\beta^2 = 1/N$, $\alpha^2 = 1 - 1/N$ for $N$ total layers.
  5. Practical hyperparameters: dropout $p \approx 0.1$; learning-rate schedule with linear warmup (1–2%) and cosine or inverse-sqrt decay; learning rates 1–2× higher than baseline; gradient clipping at 1.0; Adam with $(\beta_1, \beta_2) = (0.9, 0.999)$.

No changes to the model graph or inference code are required; the scaling can be folded into the weights at initialization.
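Steps 1–4 can be folded into a small helper that computes the per-matrix standard deviations. This is a framework-agnostic sketch; the function name and dict keys are illustrative, not from the paper, and the FFN hidden width is assumed to be $4d$:

```python
import math
import random

def deepscalelm_init_stds(d_model, n_layers, p_dropout=0.1, d_emb=None):
    """Init std-devs and residual scales per the recipe above (d_ff = 4*d assumed)."""
    d_emb = d_emb if d_emb is not None else d_model
    keep = 1.0 - p_dropout
    stds = {
        "emb":      math.sqrt(1.0 / (d_emb * keep)),        # step 1
        "ffn_w1":   math.sqrt(1.0 / (4 * d_model * keep)),  # step 2
        "ffn_w2":   math.sqrt(1.0 / (d_model * keep)),      # step 2
        "attn_qkv": math.sqrt(1.0 / (d_model * keep)),      # step 3
    }
    beta = math.sqrt(1.0 / n_layers)          # step 4: skip-connection scale
    alpha = math.sqrt(1.0 - 1.0 / n_layers)   # step 4: residual scale
    return stds, alpha, beta

stds, alpha, beta = deepscalelm_init_stds(d_model=512, n_layers=192)
# Sampling any weight entry is then just, e.g.:
w_q_entry = random.gauss(0.0, stds["attn_qkv"])
print(alpha ** 2 + beta ** 2)  # ~1.0 by construction
```

Because the scales are fixed scalars, they can be multiplied into the sampled weights once at initialization, consistent with the note above that no changes to the model graph or inference code are needed.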

6. Empirical Performance and Robustness

Empirical benchmarks highlight the benefits of DeepScaleLM across tasks and model classes. With 4×–16× more layers (but fewer parameters), DeepScaleLM outperforms standard shallow baselines:

| Model | Config (depth×width) | Parameters | Metric | Baseline (Vanilla) | DeepScaleLM |
|---|---|---|---|---|---|
| BERT-style MLM | 48×512 | 168M | PPL (3B tokens) | 14.8* | 13.1 |
| BERT-style MLM | 192×256 | 160M | PPL (3B tokens) | diverged | 12.9 |
| GPT-style (Post-LN) | 48×512 | 319M | PPL (8B tokens) | — | 11.7 |
| Speech Enc-Dec | 48-24/128 | 28M | BLEU (MuST-C) | 22.9 (12-6/256) | 23.8 |
| ViT | 96×192 (90 ep) | — | ImageNet-1k top-1 | 76.5 (24×384) | 77.2 |

*With careful LR tuning; without, vanilla diverges. DeepScaleLM trains "out of the box."

Further results demonstrate that DeepScaleLM-trained models improve downstream QA accuracy by 2–3 points and offer superior robustness on ImageNet-v2/R/Sketch (by 1–2 points). With 8-bit quantization, DeepScaleLM models show negligible perplexity increase (0.8 PPL), versus large degradation (≈27 PPL) for vanilla baselines (Kedia et al., 2024).

7. Significance and Context within Transformer Research

DeepScaleLM extends and synthesizes prior work on initialization and signal propagation, providing full closed-form variance recurrences and practical variance-preserving scalings for Transformers of arbitrary depth. This method prevents pathologies such as vanishing/exploding gradients and rank collapse without architecture modifications. The theoretical groundwork draws upon, and extends, lines of research by Glorot & Bengio (Xavier initialization), Poole et al. (dynamical isometry), De & Smith (skip initialization), Bachlechner et al. (ReZero), and Noci et al. (rank collapse). The initialization recipes and empirical validations make DeepScaleLM a foundational method for training ultra-deep, parameter-efficient Transformer models across diverse domains (Kedia et al., 2024).
