Transformer Initialization

Updated 23 April 2026

Transformer Initialization is the process of setting model weights to maintain stable signal and gradient variances, crucial for effective training.
Structured techniques, such as variance-preserving and depth-scaled methods, encode inductive biases that influence whether models favor reasoning or memorization.
Advanced approaches like transfer initialization and Lipschitz-constrained setups enhance convergence and scalability in deep or specialized transformer architectures.

A transformer’s initialization strategy determines the inductive bias, signal propagation, and learning dynamics at the onset of training, and so affects convergence speed, stability, and the regime (memorization vs. reasoning) in which the model operates. The design and selection of transformer initialization is therefore an active and diverse research area spanning variance-preserving schemes, structural and transfer-based initializations, and task-specific or architecture-dependent methods.

1. Mathematical Principles of Transformer Initialization

For a transformer with $L$ layers, hidden dimension $d$, and projection or feed-forward weights $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, the overarching objective of initialization is to keep both the forward signal variance and backward gradient variance bounded over depth. This requires careful selection of the variance, or occasionally spectral norm, of each parameter group based on the architecture’s recursion relations.

Key variance propagation equations at initialization are: $\mathrm{Var}[y_j] = d_{\mathrm{in}}\;\mathrm{Var}[w]\;\mathrm{Var}[x] \stackrel{!}{=}\; \mathrm{Var}[x] \implies \mathrm{Var}[w] = \frac{1}{d_{\mathrm{in}}}$ for linear layers, with generalization to nonlinear blocks by correcting for the second moment $\mathbb{E}[F(x)^2]$ of the nonlinearity $F$. For MLPs with ReLU or GELU, the critical constants are $c_{\phi} = \mathbb{E}[\phi(z)^2]/\mathrm{Var}[z]$ and $d_{\phi} = \mathbb{E}[\phi'(z)^2]$; fan-in scaling is then given by $\sigma_w^2 = 2/\mathrm{fan\_in}$ (He/Kaiming) for ReLU and $\sigma_w^2 \approx 1/(0.475\,\mathrm{fan\_in})$ for GELU (Han, 10 Oct 2025).
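The fan-in rules above can be checked numerically. The following is a minimal sketch, assuming zero-mean Gaussian weights and a standard-normal input vector; the helper names (`fan_in_std`, `second_moment_after_layer`) are illustrative and not taken from any cited work. It compares the mean squared output of a random layer with the mean squared input, for a linear layer with $\mathrm{Var}[w] = 1/d_{\mathrm{in}}$ and for a ReLU layer with $\mathrm{Var}[w] = 2/d_{\mathrm{in}}$.

```python
import math
import random


def fan_in_std(fan_in, gain_sq=1.0):
    """Std of zero-mean weights with Var[w] = gain_sq / fan_in."""
    return math.sqrt(gain_sq / fan_in)


def relu(z):
    return z if z > 0.0 else 0.0


def second_moment_after_layer(x, std, n_out, rng, activation=None):
    """Sample an (n_out x len(x)) Gaussian weight matrix row by row and
    return the mean squared output, optionally after an activation."""
    total = 0.0
    for _ in range(n_out):
        y = sum(rng.gauss(0.0, std) * xi for xi in x)
        if activation is not None:
            y = activation(y)
        total += y * y
    return total / n_out


rng = random.Random(0)
d_in = 256
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
input_second_moment = sum(xi * xi for xi in x) / d_in

# Linear layer: Var[w] = 1 / d_in keeps the second moment roughly unchanged.
linear_ratio = second_moment_after_layer(
    x, fan_in_std(d_in), 2000, rng) / input_second_moment

# ReLU layer: Var[w] = 2 / d_in compensates for E[relu(z)^2] = Var[z] / 2.
relu_ratio = second_moment_after_layer(
    x, fan_in_std(d_in, gain_sq=2.0), 2000, rng, activation=relu) / input_second_moment
```

Both ratios should come out close to 1, up to sampling noise. Other activations can be handled the same way by replacing the gain with the reciprocal of the activation's second-moment constant.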

In multi-head self-attention, the Q/K/V projections follow a similar fan-in/fan-out scaling, often at a fixed default scale (Li et al., 5 Feb 2026), while the output projection $W_O$ may receive additional scaling in deeper transformers to stabilize residual magnitudes.
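To see why residual-branch output weights are often scaled down with depth, consider the following sketch. It assumes a unit-variance input, independent zero-mean branch outputs, and no normalization on the residual stream. The scaling factor $1/\sqrt{N}$, with $N$ the number of residual branches, is the GPT-2 convention and is used here only as an example; it is not necessarily the factor intended in the passage above.

```python
import math


def residual_stream_variance(num_residual_layers, branch_variance=1.0, branch_scale=1.0):
    """Variance of the residual stream after all branches, assuming a unit-variance
    input and mutually independent, zero-mean branch outputs."""
    variance = 1.0
    for _ in range(num_residual_layers):
        variance += (branch_scale ** 2) * branch_variance
    return variance


num_blocks = 12
num_residual_layers = 2 * num_blocks  # one attention and one MLP branch per block

# Without extra scaling the variance grows linearly with the number of branches.
unscaled = residual_stream_variance(num_residual_layers)

# Scaling each branch output by 1 / sqrt(N) keeps the total growth bounded.
scaled = residual_stream_variance(
    num_residual_layers, branch_scale=1.0 / math.sqrt(num_residual_layers))
```

Under these assumptions the unscaled stream ends with variance $1 + N$, while the scaled stream ends with variance 2 regardless of depth.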

LayerNorm or pre-LayerNorm placement further regularizes activation variances, promoting power-law (rather than exponential) scaling of the averaged partial Jacobian norm (APJN) (Alekseev, 13 Apr 2026). Normalization-free transformers require stricter control on initialization scale to avoid exponential or stretched-exponential signal amplification.

2. Structured Initialization: Encoding Inductive Bias

Standard initialization (e.g., Xavier/Glorot, He/Kaiming) treats weights as i.i.d. elements without architectural structure. However, inductive bias can be encoded at the initialization phase to enhance data efficiency or facilitate learning in challenging regimes, particularly for small datasets or data-limited tasks.

In vision transformers (ViTs), the query and key matrices of each head can be crafted so that the resulting softmax attention map mirrors a convolutional impulse filter, which imposes spatial locality akin to CNNs (Zheng et al., 26 May 2025, Zheng et al., 2024, Zheng et al., 2024). The procedure derives these matrices from the desired impulse pattern via SVD and normalization.

The attention maps constructed in this manner initialize the transformer to mimic local, shift-invariant receptive fields without modifying its architecture. This locality bias improves small-data generalization and convergence (accuracy improvements of 2–25% on tasks like CIFAR, STL-10, Flowers, and Pets), while maintaining or slightly improving performance on large-scale datasets such as ImageNet-1K (Zheng et al., 26 May 2025).
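As an illustration of what an impulse-filter attention map looks like, the sketch below builds the target pattern for a grid of patches: each patch puts all of its attention on the patch at a fixed spatial offset. It shows only the target maps, not the SVD-based factorization into query and key matrices. The border clamping and the choice of one head per offset of a 3x3 kernel are assumptions of this sketch, and the function name is hypothetical.

```python
def impulse_attention_target(height, width, dy, dx):
    """Return an (N x N) 0/1 matrix, N = height * width, where the row for
    patch (r, c) puts all attention on patch (r + dy, c + dx).
    Offsets are clamped at the border (an assumption of this sketch)."""
    n = height * width
    target = [[0.0] * n for _ in range(n)]
    for r in range(height):
        for c in range(width):
            rr = min(max(r + dy, 0), height - 1)
            cc = min(max(c + dx, 0), width - 1)
            target[r * width + c][rr * width + cc] = 1.0
    return target


# One head per offset of a 3x3 kernel gives nine shift-invariant impulse maps.
offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
targets = [impulse_attention_target(4, 4, dy, dx) for dy, dx in offsets]
```

Each row of a target is a valid attention distribution (it sums to 1), and the zero offset reproduces the identity map.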

The structural approach can be extended to Swin Transformers (relative position biases), MLP-Mixer (factorizing token mixers to match convolution), and batch-specific setups.

3. Initialization Scale and Learning Regimes: Reasoning vs. Memorization

The scale parameter $\gamma$ in the weight-initialization standard deviation $d_{\text{in}}^{-\gamma}$ induces a bifurcation in training behavior: large $\gamma$ (smaller initial weights) biases the transformer toward reasoning-driven, compositional, low-complexity solutions, while small $\gamma$ (larger initial weights) pushes the model toward symmetric, memorization-centric solutions (Zhang et al., 2024, Yao et al., 5 Feb 2025).

Empirical phase diagrams reveal a critical value $\gamma_c$ of the initialization scale parameter $\gamma$ for the transformer depths studied; for $\gamma > \gamma_c$, the model learns to generalize via inference over compositional primitives, while for $\gamma < \gamma_c$, it memorizes seen mappings with poor out-of-distribution generalization. The embedding and early attention matrices condense to low-rank, interpretable structures only in the inferential regime (Zhang et al., 2024, Yao et al., 5 Feb 2025). These findings are validated across both synthetic compositional benchmarks and real-world datasets, providing actionable guidance for setting initialization variance to match the desired inductive learning regime.
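A scale-controlled initializer of this kind can be sketched as follows, assuming the standard deviation takes the form $\mathrm{fan\_in}^{-\gamma}$ with $\gamma$ as the scale parameter. This parameterization and the function names are assumptions of the sketch. Under it, $\gamma = 0.5$ reproduces the usual $1/\sqrt{\mathrm{fan\_in}}$ scale, and larger $\gamma$ means smaller initial weights.

```python
import math
import random


def init_std(fan_in, gamma):
    """Standard deviation fan_in ** (-gamma); gamma = 0.5 gives 1 / sqrt(fan_in)."""
    return fan_in ** (-gamma)


def init_weights(fan_out, fan_in, gamma, seed=0):
    """Sample a (fan_out x fan_in) Gaussian weight matrix with std fan_in ** (-gamma)."""
    rng = random.Random(seed)
    std = init_std(fan_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
```

Sweeping `gamma` over a small grid and tracking in-distribution versus out-of-distribution accuracy is one way to locate the transition between the two regimes for a given model and task.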

4. Specialized Initialization Techniques for Deep and Variant Architectures

Transformers with large depth and modern variants (normalization-free, dynamic nonlinearities, etc.) encounter nontrivial optimization pathologies unless initialization is carefully tuned.

Depth-Scaled Initialization (DS-Init): To counteract vanishing gradients in deep post-norm transformers (residuals + LayerNorm), DS-Init shrinks the initialization range of layer $l$ by $1/\sqrt{l}$, which reduces the parameter variance by $1/l$:

$W_l \sim \mathcal{U}\left(-\frac{\alpha}{\sqrt{l}}\sqrt{\frac{6}{d_{\text{in}} + d_{\text{out}}}},\; \frac{\alpha}{\sqrt{l}}\sqrt{\frac{6}{d_{\text{in}} + d_{\text{out}}}}\right)$

with scaling hyperparameter $\alpha$, yielding stable gradient flow and enabling training up to 24 layers (Zhang et al., 2019).
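A minimal sketch of depth-scaled initialization follows, assuming the bound is the Glorot-uniform bound $\sqrt{6/(d_{\text{in}} + d_{\text{out}})}$ multiplied by $\alpha/\sqrt{l}$, with layers indexed from 1. The default `alpha=1.0` and the function names are illustrative choices, not values taken from the cited work.

```python
import math
import random


def ds_init_bound(layer, fan_in, fan_out, alpha=1.0):
    """Uniform bound alpha * sqrt(6 / (fan_in + fan_out)) / sqrt(layer), layer >= 1."""
    return alpha * math.sqrt(6.0 / (fan_in + fan_out)) / math.sqrt(layer)


def ds_init(layer, fan_out, fan_in, alpha=1.0, seed=0):
    """Sample a (fan_out x fan_in) matrix uniformly within the depth-scaled bound."""
    rng = random.Random(seed)
    b = ds_init_bound(layer, fan_in, fan_out, alpha)
    return [[rng.uniform(-b, b) for _ in range(fan_in)] for _ in range(fan_out)]
```

Because a uniform distribution on $(-b, b)$ has variance $b^2/3$, halving the bound at layer 4 relative to layer 1 corresponds to a fourfold reduction in variance.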

Lipschitz-Constrained Parameter Initialization (LCPI): All projection and feedforward matrices $W$ are initialized within bounds chosen so that every sublayer satisfies a Lipschitz constraint at the outset (Xu et al., 2019). This prevents vanishing/shrinking of residuals post-LayerNorm for arbitrarily deep stacks.
Normalization-Free Transformers: In architectures without LayerNorm (e.g., dynamic $\tanh$-like nonlinearities), initial weight scales must be reduced and managed carefully to avoid subcritical stretched-exponential gradient amplification across depth (Alekseev, 13 Apr 2026). APJN analysis provides explicit recurrence relations and scaling laws.
Variance-Preserving Initialization for KANs in Transformers (KAT): When Transformer MLPs are replaced by Kolmogorov-Arnold Networks (with learnable rational activations), weight scaling follows $\mathrm{Var}[w] = \frac{1}{d_{\mathrm{in}}\,\mathbb{E}[F(x)^2]}$, linking the initialization tightly to the activation's second moment (Yang et al., 2024).
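When the activation has no closed-form second moment, as with a learnable rational function, the constant can be estimated numerically. The sketch below assumes the rule $\mathrm{Var}[w] = 1/(d_{\mathrm{in}}\,\mathbb{E}[F(x)^2])$ with standard-normal pre-activations and uses a Monte Carlo estimate. The `rational` function and its coefficients are purely illustrative and do not correspond to any published parameterization.

```python
import math
import random


def second_moment(activation, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[F(x)^2] for x ~ N(0, 1)."""
    rng = random.Random(seed)
    return sum(activation(rng.gauss(0.0, 1.0)) ** 2 for _ in range(n_samples)) / n_samples


def variance_preserving_std(fan_in, activation):
    """Std of weights with Var[w] = 1 / (fan_in * E[F(x)^2])."""
    return math.sqrt(1.0 / (fan_in * second_moment(activation)))


def rational(x):
    # Hypothetical rational activation P(x) / Q(x); coefficients are illustrative only.
    return (x + 0.5 * x * x) / (1.0 + abs(x))
```

For example, `variance_preserving_std(768, rational)` returns the weight standard deviation for a layer with 768 inputs under these assumptions. With the identity function in place of `rational`, the estimate reduces to roughly $1/\sqrt{\mathrm{fan\_in}}$, matching the linear case.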

5. Transfer and Meta-Initialization: Subcloning and Expansibility

Modern application scenarios often demand transformers at multiple sizes or with limited resources for pretraining.

Weight Subcloning: To initialize a target transformer of reduced depth/width, neuron importance ranking and pruning identify the most salient output neurons for each layer in a large, pretrained “parent” model. The selected weights are reordered, rescaled, and copied directly to the destination model, and depth is reduced by omitting Transformer blocks (preferably from the middle). This method preserves both variance structure and channel identity, yielding faster convergence in language and vision models (Samragh et al., 2023).
Linear Expansion of “Learngene” (TLEG): Observation of linearly-varying principal component projections across layers motivates representing the parameters of all $L$ layers as interpolations between two learned “basis-layers”. After a soft-distilled pretraining of an auxiliary network whose layers are linearly expanded from these bases, transformers of arbitrary depth can be initialized by the same linear expansion and fine-tuned for specific tasks. This achieves drastic reductions in pretraining and parameter storage cost while providing high transferability and competitive accuracy across downstream tasks (Xia et al., 2023).
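The subcloning steps (rank neurons, keep the most important, rescale, drop middle blocks) can be sketched on plain nested lists. This is a simplified illustration under stated assumptions: the row L2 norm is a stand-in importance score, `scale` is a free parameter rather than the factor used in the cited work, and each layer is ranked independently, whereas a real implementation would need a consistent ordering across layers that share the residual stream.

```python
def neuron_importance(weight):
    """Score each output neuron (row) by its L2 norm; a stand-in importance measure."""
    return [sum(w * w for w in row) ** 0.5 for row in weight]


def subclone_linear(weight, keep_out, keep_in_idx=None, scale=1.0):
    """Keep the `keep_out` highest-scoring rows (most important first), restrict
    columns to `keep_in_idx` if given, and multiply by `scale`."""
    scores = neuron_importance(weight)
    order = sorted(range(len(weight)), key=lambda i: scores[i], reverse=True)[:keep_out]
    cols = keep_in_idx if keep_in_idx is not None else list(range(len(weight[0])))
    sub = [[scale * weight[i][j] for j in cols] for i in order]
    return sub, order


def drop_middle_blocks(blocks, target_depth):
    """Keep the first and last blocks and omit a contiguous run from the middle."""
    n_drop = len(blocks) - target_depth
    start = (len(blocks) - n_drop) // 2
    return blocks[:start] + blocks[start + n_drop:]
```

The returned `order` can be passed as `keep_in_idx` to the next layer so that its input columns line up with the rows kept in the previous one.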

6. Impact of Initialization on Model Bias and Identity

Contrary to the common assumption that random initialization confers only “neutral” priors to transformer models, mechanistic analysis shows that untrained transformers exhibit strong, seed-dependent structural biases. Two interacting forces—representation contraction via MLP nonlinearities and amplification by self-attention—cause output token probabilities to be highly non-uniform and persistently coupled to the initialization seed. This identity persists throughout training, enabling “birth-to-life” fingerprinting (SeedPrint), which robustly distinguishes between models initialized with different seeds even after extensive fine-tuning (Li et al., 5 Feb 2026). Additionally, these seed-dependent contractions explain the persistent “attention-sink” phenomena (e.g., variance concentration at the first token), which can be mitigated at initialization by explicit variance alignment interventions.

7. Practical Guidelines and Comparative Synthesis

A multifaceted summary of best practices is as follows:

Initialization Type	Recipe / Rationale	When to Use / Outcome
Variance-preserving (He, Xavier)	$\sigma_w^2 = 2/\mathrm{fan\_in}$ (ReLU), $\sigma_w^2 \approx 1/(0.475\,\mathrm{fan\_in})$ (GELU) (Han, 10 Oct 2025)	Default for stable signal and gradient propagation in standard architectures
Structured Impulse	Query/key matrices engineered to produce impulse (convolutional) softmax maps (Zheng et al., 26 May 2025)	Vision transformers on small data; outperforms standard, mimetic, or random init
Depth-Scaled (DS-Init)	$W_l \sim$ Uniform($\pm\,\alpha\sqrt{6/(d_{\text{in}} + d_{\text{out}})}/\sqrt{l}$) (Zhang et al., 2019)	Deep post-norm transformers; enables deeper stacks (up to 24 layers)
Lipschitz-Constrained	$W \sim$ Uniform with bounds set by the Lipschitz constraint (Xu et al., 2019)	Ensures sublayer Lipschitzness; deep encoders/decoders
Reasoning/Memory Bias	Standard deviation $d_{\text{in}}^{-\gamma}$; set large $\gamma$ for reasoning, small $\gamma$ for memorization (Zhang et al., 2024, Yao et al., 5 Feb 2025)	LLMs, compositional tasks: tune for desired inductive regime
Subcloning	Pruning+scaling weights from a larger pretrained parent (Samragh et al., 2023)	Rapid training of scaled-down models; resource scaling, transfer
Linear Expansion/TLEG	Each layer linearly expanded from two shared basis layers, learned via distillation (Xia et al., 2023)	Elastic initialization for arbitrary depth; parameter efficiency
Variance-preserving KAN	$\mathrm{Var}[w] = 1/(d_{\mathrm{in}}\,\mathbb{E}[F(x)^2])$ (Yang et al., 2024)	KATs/KAN-based architectures; stable training of learnable activation models
Positional Variance Calibration	Explicit variance-alignment rescaling at initialization (Li et al., 5 Feb 2026)	Remove attention sinks; improved token variance balance

Parameter initialization thus operates beyond mere numerical stability, providing a mechanism to induce inductive biases, control learning regimes, fingerprint model identity, facilitate transfer, and optimize architectural scaling—all validated empirically with state-of-the-art results across language, vision, and multimodal benchmarks.

See (Zheng et al., 26 May 2025, Zhang et al., 2024, Yao et al., 5 Feb 2025, Han, 10 Oct 2025, Li et al., 5 Feb 2026, Zhang et al., 2019, Xu et al., 2019, Samragh et al., 2023, Xia et al., 2023, Zheng et al., 2024, Zheng et al., 2024, Dinan et al., 2023, Makkuva et al., 2024, Alekseev, 13 Apr 2026, Yang et al., 2024, Geerenstein et al., 2023) for full technical and empirical details.