Signal Propagation in Transformers
- Signal propagation in transformers is the study of how activations and gradients evolve through deep layers, revealing key failure modes like rank collapse and gradient instability.
- The analysis uses deterministic recurrences and spectral properties to track activation norms, cosine similarities, and the gradient Lyapunov exponent for stability assessment.
- Practical strategies such as residual scaling, critical initialization, and spectral clipping are essential for ensuring robust signal flow and trainability in deep transformer models.
Signal propagation in transformers refers to the mathematical and dynamical behavior of input, hidden, and output representations as they are mapped through the blocks of a (deep) transformer architecture. The study of signal propagation characterizes how information is preserved, distorted, or degraded—both in the forward pass (activation statistics) and backward pass (gradient statistics)—as a function of depth, width, architectural choices (e.g., residual scaling, layer normalization, initialization variance), and attention-specific pathologies such as rank collapse or attention-score condensation. Recent theoretical advancements provide precise recurrence relations, stability criteria, and phase diagrams for predicting and controlling various failure modes of deep transformers at initialization and during early training.
1. Forward and Backward Signal Dynamics in Transformers
Let $X^\ell \in \mathbb{R}^{T \times d}$ denote the token representations at layer $\ell$, with $T$ tokens and feature dimension $d$. Signal propagation is analyzed by tracking the joint geometry (typically, Gram statistics) of the tokens as they are mapped through self-attention and MLP blocks, with interleaved residual connections and normalization.
The core objects of study for forward propagation are:
- Mean square norm: $q^\ell = \frac{1}{T}\sum_{t}\frac{1}{d}\|x_t^\ell\|^2$
- Mean pairwise dot product: $p^\ell = \frac{1}{T(T-1)}\sum_{t \neq s}\frac{1}{d}\langle x_t^\ell, x_s^\ell\rangle$
- Cosine similarity: $\rho^\ell = p^\ell / q^\ell$
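These statistics are straightforward to measure. The sketch below is a minimal convention (the per-dimension normalization and the helper name `forward_stats` are choices made here, not taken from the cited papers), applied to i.i.d. Gaussian tokens:

```python
import numpy as np

def forward_stats(X):
    """Forward-pass Gram statistics of a token matrix X with shape (T, d)."""
    T, d = X.shape
    G = X @ X.T / d                               # per-dimension Gram matrix
    q = np.trace(G) / T                           # mean square norm
    p = (G.sum() - np.trace(G)) / (T * (T - 1))   # mean pairwise dot product
    return q, p, p / q                            # cosine similarity rho = p / q

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 512))                 # i.i.d. Gaussian tokens
q, p, rho = forward_stats(X)                      # expect q near 1, rho near 0
```

For independent Gaussian tokens the norms concentrate and the pairwise overlaps vanish; signal-propagation theory asks how a transformer block moves these numbers layer by layer.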
For the backward path, the main quantity is the Frobenius norm of the Jacobian of the loss with respect to parameters and activations, e.g., $\|\partial \mathcal{L}/\partial X^\ell\|_F$, which tracks the flow of gradients.
Deep signal-propagation theory establishes deterministic recursions for these objects under simplifying statistical assumptions, such as permutation-invariant random initialization, large-width limit, and independent Gaussian weights (Cowsik et al., 2024, Dinan et al., 2023, Kedia et al., 2024).
For example, for permutation-symmetric initializations, the update map through a transformer block (attention plus MLP, each with their own residual branch) can be written as coupled polynomial or nonlinear recursions on $(q^\ell, p^\ell)$ (Cowsik et al., 2024), with closed-form expressions for the effect of self-attention and MLP in the large-$d$ limit.
For gradient propagation, the squared gradient norm increases (or decreases) exponentially in depth with rate governed by a "gradient Lyapunov exponent" $\lambda_g$, yielding the scaling law $\|\partial \mathcal{L}/\partial X^0\|_F^2 \sim e^{\lambda_g L}$, where $L$ is the number of layers.
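The exponent can be estimated numerically. The following is a toy sketch that backpropagates through a stack of random linear maps rather than full transformer blocks (the function name and the choice of scale parameter are illustrative assumptions):

```python
import numpy as np

def gradient_lyapunov(L, d=256, sigma=1.0, seed=0):
    """Estimate the gradient Lyapunov exponent of a depth-L stack of random
    linear maps with weight scale sigma/sqrt(d):
    lambda_g = (1/L) * log(||g_L||^2 / ||g_0||^2)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(d)          # incoming gradient vector
    lam = 0.0
    for _ in range(L):
        W = rng.standard_normal((d, d)) * sigma / np.sqrt(d)
        g_next = W.T @ g                # backpropagate through one layer
        lam += np.log(g_next @ g_next) - np.log(g @ g)
        g = g_next
    return lam / L

# sigma = 1 sits at criticality (lambda_g near 0); other values give
# exponential explosion (lambda_g > 0) or vanishing (lambda_g < 0)
lam_crit = gradient_lyapunov(L=50)
lam_explode = gradient_lyapunov(L=50, sigma=1.5)
lam_vanish = gradient_lyapunov(L=50, sigma=0.5)
```

In this linear toy model the critical point is exactly unit weight variance; in real transformer blocks the analogous criticality condition couples attention, MLP, and residual hyperparameters.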
2. Failure Modes: Rank Collapse, Entropy Collapse, and Gradient Instability
Three principal pathologies of signal propagation in deep transformers have been theoretically characterized (Noci et al., 2022, Giorlandino et al., 30 May 2025, Saada et al., 2024):
- Rank collapse: As representations propagate through repeated self-attention layers (without appropriate residual scaling), all tokens collapse to the same direction in representation space, i.e., $\rho^\ell \to 1$, and the token matrix becomes rank-one. The mechanism is spectral: for random-initialized softmax attention, the attention matrix has a dominant eigenvector aligned with the uniform vector, and repeated application exponentially destroys token diversity (Saada et al., 2024, Noci et al., 2022).
- Entropy collapse: At large query/key variance (high attention inverse-temperature $\beta$), the attention matrix becomes sharply peaked, concentrating all weight on a few tokens per row. The Shannon entropy per attention row saturates at $O(1)$ instead of the ideal $\log T$ (with $T$ the sequence length). This leads to instability and nontrainable regimes (Giorlandino et al., 30 May 2025).
- Gradient explosion/vanishing: In the absence of careful initialization and normalization, the Jacobian norm explodes or vanishes with depth, yielding untrainable or unstable networks (Kedia et al., 2024, Dinan et al., 2023).
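Rank collapse is easy to reproduce numerically. The toy sketch below (softmax attention only, random weights, no residual branch or normalization, with dimensions chosen arbitrarily) drives the mean pairwise cosine similarity of the tokens toward 1:

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)   # numerically stable softmax
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def mean_cosine(X):
    """Average pairwise cosine similarity between token rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = Xn @ Xn.T
    T = len(X)
    return (C.sum() - T) / (T * (T - 1))

rng = np.random.default_rng(0)
T, d = 16, 64
X = rng.standard_normal((T, d))
cos_start = mean_cosine(X)                  # near 0 for random tokens
for _ in range(40):                         # attention only: no residual branch
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    A = softmax_rows((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
    X = A @ X                               # row-stochastic averaging of tokens
cos_end = mean_cosine(X)                    # approaches 1: rank collapse
```

Each layer multiplies the tokens by a row-stochastic matrix, so the component along the token mean survives while deviations from it contract geometrically.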
The table below summarizes these phenomena and their main causes:
| Failure Mode | Mechanism | Main Parameter(s) |
|---|---|---|
| Rank collapse (depth/width) | Spectral gap in attention, softmax | Query/key variance below threshold |
| Entropy collapse | Attention-score condensation (REM) | Query/key variance above threshold |
| Gradient explosion/vanishing | Accumulation or decay in Jacobians | Weight variance, residual scaling |
Rank collapse in width arises from the spectral gap in the attention Markov matrix as sequence length increases ($T \to \infty$), causing almost all token variance to be projected along the uniform vector (Saada et al., 2024).
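The spectral-gap mechanism can be checked directly: a random row-stochastic attention matrix has the all-ones (uniform) vector as a right eigenvector with eigenvalue exactly 1, while the rest of the spectrum is much smaller in magnitude (a sketch with an arbitrary choice of $T = 64$):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 64
A = np.exp(rng.standard_normal((T, T)))   # random positive logits
A /= A.sum(axis=1, keepdims=True)         # row-stochastic attention matrix

# the all-ones vector is a right eigenvector with eigenvalue exactly 1
assert np.allclose(A @ np.ones(T), np.ones(T))

evals = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
gap = evals[0] - evals[1]                 # spectral gap below the uniform mode
```

Under repeated application, variance in every direction transverse to the uniform mode is damped by at least the subleading eigenvalue magnitude per layer, which is the quantitative content of the spectral-gap argument.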
3. Theoretical Recurrences and Critical Regimes
Signal propagation dynamics can be reduced to critical discrete-time dynamical systems coupled via initialization and architectural hyperparameters. The key recurrences are (Cowsik et al., 2024, Giorlandino et al., 30 May 2025, Dinan et al., 2023, Kedia et al., 2024):
- Forward recurrences for means/covariances (or kernels) of activations, $K^{\ell+1} = \Phi_\ell(K^\ell)$, where $\Phi_\ell$ is the block-specific update (attention or MLP).
- Backward recurrences for empirical NTK kernels or Jacobian statistics.
- Angle and gradient Lyapunov exponents:
- Angle Lyapunov exponent $\lambda_a$: controls whether token representations collapse ($\lambda_a < 0$) or remain diverse ($\lambda_a \geq 0$).
- Gradient Lyapunov exponent $\lambda_g$: determines whether gradients vanish ($\lambda_g < 0$) or explode ($\lambda_g > 0$) in deep networks. Criticality ($\lambda_a = 0$, $\lambda_g = 0$) enables stable propagation and correlates strongly with trainability and the achievable test loss (Cowsik et al., 2024).
The presence of residual connections with appropriate scaling (e.g., a branch scale $\propto 1/\sqrt{L}$), as well as LayerNorm, ensures that both the mean and variance of activations and gradients can be held constant across depth, preventing both collapse and explosion (Kedia et al., 2024, Dinan et al., 2023).
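A toy calculation illustrates why the $1/\sqrt{L}$ branch scale matters; here random linear maps stand in for attention/MLP branches (an illustrative simplification, not the papers' full setup):

```python
import numpy as np

def norm_growth(L, d=128, branch_scale=1.0, seed=0):
    """Final/initial norm ratio after L residual blocks x -> x + s * W x,
    with random W of scale 1/sqrt(d)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    n0 = np.linalg.norm(x)
    for _ in range(L):
        W = rng.standard_normal((d, d)) / np.sqrt(d)
        x = x + branch_scale * (W @ x)   # residual block with scaled branch
    return np.linalg.norm(x) / n0

L = 100
unscaled = norm_growth(L)                            # grows like 2^(L/2)
scaled = norm_growth(L, branch_scale=1 / np.sqrt(L)) # stays O(1)
```

Each unscaled block doubles the expected squared norm (the skip and the independent branch add in quadrature), whereas the $1/\sqrt{L}$ scale turns the per-layer factor into $1 + 1/L$, whose depth-$L$ product stays bounded.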
4. Architectural and Initialization Solutions
Multiple approaches systematically address pathological signal propagation:
- Residual normalization/scaling: Scaling the residual (attention/MLP) branch as $\propto 1/\sqrt{L}$ with depth $L$, or enforcing a norm-preserving constraint such as $\alpha^2 + \beta^2 = 1$ between the skip weight $\alpha$ and branch weight $\beta$, ensures O(1) propagation of norms and suppresses rank collapse (Noci et al., 2022, Kedia et al., 2024).
- Critical initialization for attention statistics: Initializing query and key weights with variance $\sigma_{qk}^2 \propto 1/d$, and keeping $\sigma_{qk}^2$ below the phase boundary for condensation, guarantees the forward signal is neither destroyed (rank collapse) nor locked (entropy collapse) (Giorlandino et al., 30 May 2025).
- Spectral correction of attention: Subtracting the uniform rank-one eigencomponent from the attention matrix at each layer ("eigenvalue-clipping") restores width-wise rank and controls downstream gradient escalation (Saada et al., 2024).
- Kernel shaping without skips/norms: Directly "engineering" the sequence of NNGP kernels or setting attention matrices to preserve target diversity (e.g., using prescribed Cholesky factors) allows deep vanilla transformers (no skips, no normalization) to propagate signal for many layers without collapse, albeit with slower convergence (He et al., 2023).
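A minimal realization of the eigenvalue-clipping idea (an illustrative sketch, not the exact procedure of Saada et al.) subtracts the uniform rank-one component $\tfrac{1}{T}\mathbf{1}\mathbf{1}^\top$, which removes the outlier eigenvalue at 1:

```python
import numpy as np

def clip_uniform_mode(A):
    """Remove the rank-one uniform component from a row-stochastic
    attention matrix (one simple form of eigenvalue clipping)."""
    T = A.shape[0]
    return A - np.ones((T, T)) / T

rng = np.random.default_rng(2)
T = 64
A = np.exp(rng.standard_normal((T, T)))
A /= A.sum(axis=1, keepdims=True)         # row-stochastic attention matrix

radius_before = np.abs(np.linalg.eigvals(A)).max()                  # exactly 1
radius_after = np.abs(np.linalg.eigvals(clip_uniform_mode(A))).max()
# the dominant uniform mode is annihilated (A - 11^T/T maps the all-ones
# vector to zero); only the small "mixing" spectrum remains, so repeated
# layers no longer average all tokens together
```

Because the all-ones vector is an exact right eigenvector of any row-stochastic matrix, this subtraction deflates the uniform mode in every layer regardless of the particular attention scores.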
Empirical studies corroborate these implementations, with models in the "trainable" region of the critical diagrams converging efficiently and exhibiting minimal loss, while models in collapse regimes either diverge or stagnate (Giorlandino et al., 30 May 2025, Cowsik et al., 2024).
5. Unified Trainability Phase Diagrams
A key development is the explicit mapping of trainability as a function of key architectural and initialization hyperparameters. The (query/key variance, skip scale) phase diagram exposes three operational domains (Giorlandino et al., 30 May 2025):
- Trainable regime: query/key variance is below the condensation threshold (avoiding entropy collapse) and the residual skip scale is above its critical value (avoiding rank collapse).
- Rank collapse zone: Variance is moderate, but skip scale is too small.
- Entropy collapse zone: Variance is so large that attention locks onto O(1) tokens per row, rendering the effective attention entropy constant and breaking information flow.
Critical boundaries are characterized analytically, and the residual scales required for stability grow slowly but predictably with model depth. With precise tuning, deep transformers (hundreds to a thousand layers) with unit-variance signal/gradient propagation are practical (Giorlandino et al., 30 May 2025, Kedia et al., 2024).
6. Empirical Validation and Practical Guidelines
Theoretical prescriptions predict empirical behavior across architectures and tasks. DeepScaleLM (Kedia et al., 2024) and kernel-shaping approaches (He et al., 2023) have established that:
- Very deep transformers (hundreds to a thousand layers) are trainable given strict adherence to initialization and residual-scaling constraints.
- Standard initializations often (but not always) fall in the safe regime for moderate depth, but naive scaling leads to collapse for large T or L.
- Attention-only or attention-MLP blocks without normalization require tailored initialization of the attention kernels to avoid rapid degeneration (He et al., 2023).
- Adaptive learning rates are often necessary to compensate for the mismatch in variance and power-law dependence of query/key and value gradients (Noci et al., 2022).
The following table highlights critical choices for initialization:
| Parameter | Recommended Value/Scaling | Reference |
|---|---|---|
| Query/Key variance | $\sigma_{qk}^2 \propto 1/d$, below the condensation threshold | (Giorlandino et al., 30 May 2025) |
| Skip/residual scaling | Above a critical scale that grows slowly with depth $L$ | (Giorlandino et al., 30 May 2025) |
| Layer normalization | Pre-LN for extreme depth | (Kedia et al., 2024) |
| Attention spectral clipping | Remove uniform rank-one part per layer | (Saada et al., 2024) |
| Adam/LR schedule | Per-block adaptation or analytic scaling | (Dinan et al., 2023) |
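These choices can be bundled into an initialization helper. Everything below (the parameter names, the sample $\sigma_{qk}$ value, the $1/\sqrt{L}$ residual scale) is an illustrative assumption, not any single paper's exact recipe:

```python
import numpy as np

def init_transformer_block(d, L, sigma_qk=0.5, seed=0):
    """Illustrative (hypothetical) initialization following the table:
    controlled query/key variance, 1/d-variance value/output weights,
    and a 1/sqrt(L) residual-branch scale for a depth-L pre-LN stack."""
    rng = np.random.default_rng(seed)
    scale_v = 1.0 / np.sqrt(d)
    return {
        "Wq": rng.standard_normal((d, d)) * sigma_qk / np.sqrt(d),
        "Wk": rng.standard_normal((d, d)) * sigma_qk / np.sqrt(d),
        "Wv": rng.standard_normal((d, d)) * scale_v,
        "Wo": rng.standard_normal((d, d)) * scale_v,
        "residual_scale": 1.0 / np.sqrt(L),  # keeps depth-L norm growth O(1)
    }

params = init_transformer_block(d=64, L=100)
```

The point of centralizing initialization like this is that the depth-dependent quantities (here only `residual_scale`) are computed once from `L`, so scaling the model deeper cannot silently leave a branch at the wrong variance.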
Empirical experiments on language modeling, masked-LM, and vision classification validate the theory: models initialized in the predicted trainable regimes converge with lower test loss, while those outside suffer stagnation, instability, or divergence (Giorlandino et al., 30 May 2025, Cowsik et al., 2024, Kedia et al., 2024, He et al., 2023).
7. Connections, Extensions, and Outlook
Theoretical advances in signal propagation in transformers connect and extend previous results from mean-field theory of MLPs and CNNs to architectures with non-local coupling, softmax nonlinearity, and nontrivial symmetry imposed by attention. Tools from random matrix theory rigorously diagnose and quantify spectral collapse unique to transformers (Saada et al., 2024). The phase structure of trainability—with Lyapunov exponents as predictive metrics—unifies disparate pathologies: rank collapse, entropy collapse, and gradient explosion/vanishing (Cowsik et al., 2024, Giorlandino et al., 30 May 2025).
These insights inform architectural innovations (e.g., very deep models, norm-free kernels), robust initialization, and optimizer design. They also clarify the theoretical limits, revealing that, although pathological collapse in depth is generic for matrix-multiplication architectures, practical remedies allow the engineering of extremely deep, stable, and robust transformer networks.
Further work is ongoing to extend the dynamical analysis to training-time signal evolution, non-Gaussian and structured inputs, and to deeper integration with hardware-aware constraints and compression. The analytic tools established for transformers serve as paradigms for future research on more complex attention-based and memory-augmented architectures.
Key References
- (Saada et al., 2024) Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers
- (Giorlandino et al., 30 May 2025) Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
- (Cowsik et al., 2024) Geometric Dynamics of Signal Propagation Predict Trainability of Transformers
- (Noci et al., 2022) Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
- (Dinan et al., 2023) Effective Theory of Transformers at Initialization
- (Kedia et al., 2024) Transformers Get Stable: An End-to-End Signal Propagation Theory for LLMs
- (He et al., 2023) Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation