Growing Transformers: Scalable Architectures

Updated 4 March 2026

Growing Transformers are adaptive architectures that incrementally scale model capacity along axes such as depth, width, and heads while preserving pre-learned functions.
Methodologies including function-preserving expansion, learngene-based growth, and layer-wise modular construction enable efficient, resource-aware training and continual model improvement.
These techniques improve practical deployment by reducing redundant computation and storage, yielding benefits in deep vision, language processing, and multi-task performance.

Growing Transformers encompass a broad set of methodologies that enable scalable, flexible, and efficient expansion of Transformer architectures along various axes—including depth, width, attention heads, and functional modules—during the training or inference lifecycle. These approaches aim to preserve accumulated knowledge, reduce redundant computation and storage, and adapt model capacity to application-specific or resource-constrained scenarios. Notable lines of work include composable function-preserving expansions, layer-wise and modular growth, abstract "learngene" parameterizations, depth-scaling for expressivity, and continuous architectural growth in deep vision Transformers.

1. Function-preserving Expansion and Composability

"Composable Function-preserving Expansions for Transformer Architectures" (Gesmundo et al., 2023) establishes a rigorous framework for incrementally growing Transformers along six independent architectural axes with provable preservation of the pre-expansion input-output function:

MLP internal width: Output-equivalence is guaranteed by zero-initializing the added output projections of expanded hidden units.
Attention head number: Zero-initializing new rows in the output projection ensures new heads do not perturb model function until trained.
Per-head value dimension: Identity preservation is achieved by padding the value and output projection weights, with new subspaces multiplied by zeros initially.
Key/query dimension: Rescaling and zero-padding preserve softmax and attention affinity, preventing spurious changes.
Hidden state dimension: All new coordinates across MLP, MHA, embeddings, and layer norm are zero-padded, ensuring the computation remains restricted to the original subspace.
Layer depth (N→N+1): New layers are inserted with zero-initialized MLP and MHA output weights, functioning as the identity map on the residual path.

These transformations are independent and commutative, enabling synchronous application across axes and supporting staged or bulk model scaling to larger capacity. The implementation attaches only new parameters to memory, maintaining the original optimizer state, and permits continued training from any checkpoint without optimization artifacts. Empirical validation confirms exact logit and gradient matching before and after expansion on BERT-like models. This approach forms an operational backbone for efficient, compute-aware scaling in large-scale model pretraining and continual learning (Gesmundo et al., 2023).
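The MLP-width case can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`mlp`, `expand_mlp_width`), the ReLU nonlinearity, the tensor shapes, and the random initialization of the new input-side weights are assumptions made for illustration. Only the zero-initialization of the new output-projection rows comes from the description above.

```python
import numpy as np

def mlp(x, w_in, b_in, w_out, b_out):
    """Two-layer MLP with ReLU: relu(x @ w_in + b_in) @ w_out + b_out."""
    hidden = np.maximum(x @ w_in + b_in, 0.0)
    return hidden @ w_out + b_out

def expand_mlp_width(w_in, b_in, w_out, extra, rng):
    """Add `extra` hidden units (hypothetical helper).

    New input-side columns may take any values (random here, an assumption);
    the matching new rows of the output projection are zero, so the new
    units contribute nothing to the output until they are trained.
    """
    d_model = w_in.shape[0]
    new_cols = rng.normal(scale=0.02, size=(d_model, extra))
    new_w_in = np.concatenate([w_in, new_cols], axis=1)
    new_b_in = np.concatenate([b_in, np.zeros(extra)])
    new_w_out = np.concatenate([w_out, np.zeros((extra, w_out.shape[1]))], axis=0)
    return new_w_in, new_b_in, new_w_out

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16
w_in = rng.normal(size=(d_model, d_hidden))
b_in = rng.normal(size=d_hidden)
w_out = rng.normal(size=(d_hidden, d_model))
b_out = rng.normal(size=d_model)
x = rng.normal(size=(4, d_model))

before = mlp(x, w_in, b_in, w_out, b_out)
w_in2, b_in2, w_out2 = expand_mlp_width(w_in, b_in, w_out, extra=8, rng=rng)
after = mlp(x, w_in2, b_in2, w_out2, b_out)

# The expanded MLP computes the same function (up to float rounding).
print(np.allclose(before, after))
```

The same pattern, zeroing whatever path carries new capacity into the residual stream, is what the head-count and depth expansions listed above rely on.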

2. Linear Parameter Expansion: Learngene-based Growth

"Transformer as Linear Expansion of learnGene" (TLEG) (Xia et al., 2023) introduces the notion of a "learngene": a compact, knowledge-intensive core consisting of two full-layer parameter tensors, $\theta_a$ and $\theta_b$, whose linear expansion parameterizes arbitrarily deep Transformer stacks:

$\theta_{l} = \theta_b + \frac{l-1}{L}\,\theta_a \quad\text{for}\quad l=1,\ldots,L$

where $L$ is the depth of the auxiliary network used for training. With this formulation, only about two layers' worth of parameters are learned during a one-time soft distillation stage, using a composite loss that combines cross-entropy with a KL divergence to the teacher.

Once the learngene is trained, descendant Transformers ("Desc-Net") of any target depth $L'$ are instantly synthesized by linearly "growing" each layer:

$\theta_{l'} = \theta_b + \frac{l'-1}{L'}\,\theta_a \quad\text{for}\quad l'=1,\ldots,L'$

All architectural modules (MSA, MLP, layer norm, etc.) are initialized via this recipe. The linear constraint on parameters is lifted after instantiation, allowing subsequent fine-tuning.
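As a rough illustration of the expansion rule, the sketch below instantiates per-layer parameters from a learngene pair for two different target depths. The helper name `expand_learngene` and the use of a single small matrix to stand in for a full layer's parameters are assumptions made for brevity; in practice the rule would be applied to every parameter tensor in a layer.

```python
import numpy as np

def expand_learngene(theta_a, theta_b, depth):
    """Return [theta_1, ..., theta_depth] with
    theta_l = theta_b + (l - 1) / depth * theta_a  (hypothetical helper)."""
    return [theta_b + (l - 1) / depth * theta_a for l in range(1, depth + 1)]

rng = np.random.default_rng(0)
theta_a = rng.normal(size=(4, 4))  # stand-in for one layer's parameters
theta_b = rng.normal(size=(4, 4))

layers_6 = expand_learngene(theta_a, theta_b, depth=6)
layers_12 = expand_learngene(theta_a, theta_b, depth=12)

# The first layer always equals theta_b, whatever the target depth.
print(np.allclose(layers_6[0], theta_b))
# Layers at the same relative position coincide across depths:
# layer 4 of 6 and layer 7 of 12 both use the coefficient 0.5.
print(np.allclose(layers_6[3], layers_12[6]))
```

Because the coefficient depends only on the relative position $(l'-1)/L'$, the same two stored tensors serve every member of the depth family.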

Empirical results demonstrate that TLEG matches or exceeds independently trained models on ImageNet-1K with an $\approx2\times$ reduction in pre-training cost, and provides substantial parameter ($19\times$) and storage compression for depth-flexible model families. In transfer scenarios, TLEG-initialized models yield accuracy improvements of +6.87% (iNat2019) and +7.66% (CIFAR-100) over standard initializations (Xia et al., 2023). The approach offers a mechanism for one-shot expansion and reconfiguration of models, substantially reducing redundant storage and compute investment in model deployment pipelines.

3. Depth-scaling and Expressivity: Logarithmic Growth

"A Little Depth Goes a Long Way" (Merrill et al., 5 Mar 2025) situates growing Transformers in the context of expressivity barriers and complexity theory. The key result is that computational depth, not width, is the principal bottleneck for sequential reasoning and multi-hop algorithmic tasks:

Universal transformers with depth $D(n) = \Theta(\log n)$ (where $n$ is input length) recognize all regular languages and solve reachability (graph connectivity), with provable L-uniform constructions.
Constant-depth transformers, regardless of polynomial width, are confined to $\mathsf{TC}^0$ and cannot express even simple counter-based automata or handle NL-complete tasks.
Empirical measurements in sequence-tracking environments confirm that $D\approx5\cdot\log_2 n$ layers suffice for $95\%$ token-wise accuracy, whereas width-only scaling requires exponentially growing width.

Depth acts as a cost-effective lever for increased model reasoning capacity. Layer-wise expansion or dynamic-depth unrolling allows models to maintain state-tracking and connectivity reasoning as context grows. Fixed shallow models ($<12$ layers) fail to generalize beyond $n\approx100$–$200$ in such settings (Merrill et al., 5 Mar 2025). This establishes theoretical and empirical motivation for depth-growing methodologies in Transformer stacks as a primary axis of scalability.
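The empirical rule $D\approx5\cdot\log_2 n$ quoted above can be turned into a small depth-selection helper. The function name `suggested_depth`, the rounding up to an integer, and the treatment of the coefficient as a tunable argument are assumptions of this sketch; the constant 5 is the value reported for the sequence-tracking setting and may not carry over to other tasks.

```python
import math

def suggested_depth(context_length, coefficient=5.0):
    """Heuristic layer count D ~ coefficient * log2(n), rounded up.

    `coefficient=5.0` mirrors the empirical fit quoted in the text;
    treat it as a starting point rather than a guarantee.
    """
    if context_length < 2:
        raise ValueError("context_length must be at least 2")
    return math.ceil(coefficient * math.log2(context_length))

print(suggested_depth(128))   # 35
print(suggested_depth(1024))  # 50
```

Doubling the context length adds only about five layers under this rule, which is what makes depth growth attractive as context grows.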

4. Layer-wise and Modular Construction

"Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate" (Bochkov, 8 Jul 2025) proposes a constructive paradigm enabled by non-trainable, universal token embeddings derived from Unicode glyphs. This substrate supports two principal forms of growth:

Layer-wise constructive training: Layers are appended sequentially, with all previous layers frozen and only the new block trained before integration. This monotonically improves test performance and reasoning (e.g., SQuAD F1 from ~1.2% at layer 1 to 5.55% at layer 6), and mitigates catastrophic forgetting associated with monolithic updates.
Zero-modification modular composition: Multiple independently trained expert models that share frozen embeddings and output heads can be merged post hoc via logit averaging, improving aggregate task performance without retraining or interference (e.g., an EN+RU MoE composition reaches MMLU accuracy 1.6 p.p. above the individual models).

Both approaches decouple parameter growth from network instability and catastrophic forgetting, supporting continual learning and collaborative, democratized model development where modules can be shared, amassed, and composed as needed (Bochkov, 8 Jul 2025).
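Post-hoc composition by logit averaging can be sketched in a few lines. The function name `compose_by_logit_averaging`, the toy logit values, and the three-token vocabulary are illustrative assumptions; the only property taken from the text is that the experts share a vocabulary and output head, so their logits are directly comparable.

```python
import numpy as np

def compose_by_logit_averaging(expert_logits):
    """Average logits from experts that share a vocabulary (hypothetical helper).

    Each entry has shape (batch, vocab); mismatched shapes are rejected
    because averaging only makes sense over a common output space.
    """
    shapes = {logits.shape for logits in expert_logits}
    if len(shapes) != 1:
        raise ValueError("all experts must emit logits of the same shape")
    return np.stack(expert_logits, axis=0).mean(axis=0)

# Toy logits from two experts over a 3-token vocabulary (made-up numbers).
logits_en = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
logits_ru = np.array([[1.0, 1.5, -2.0], [0.3, 1.2, 0.1]])

merged = compose_by_logit_averaging([logits_en, logits_ru])
predictions = merged.argmax(axis=-1)
print(predictions.tolist())  # [0, 1]
```

No expert's weights are modified, which is why this kind of merge cannot introduce interference between the experts.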

5. Depth Scaling in Vision: LayerScale and Optimization Strategies

In the context of deep vision Transformers, "Going deeper with Image Transformers" (Touvron et al., 2021) identifies critical architectural and optimization interventions required for stable training at large depths (up to 100+ layers):

LayerScale: Each residual branch in SA/FFN is modulated by a learnable diagonal matrix initialized to $\varepsilon\ll1$, which keeps each branch's contribution small early in training so that every block starts close to the identity and gradient flow remains stable.
Class-Attention Head: Patch tokens are processed fully before a dedicated decoder-stage class token aggregates global information. This relieves multi-objective pressure from Self-Attention stages, resulting in more effective summarization and accuracy gains.
Stochastic depth and tailored LR schedules: Uniform drop rates and careful warmup are mandatory for ultra-deep stacks.

Empirical results show that, with these methods, models avoid the early saturation/failure mode and scale monotonically in accuracy with depth (e.g., DeiT-S, 80.5% at 12L baseline to 82.9% at 36L with LayerScale). For practical deployment, models may be grown in depth at training stalls, refining model capacity at each increment without destabilizing optimization (Touvron et al., 2021).
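The effect of LayerScale at initialization can be shown with a small NumPy sketch. The helper name `layerscale_residual`, the tanh stand-in for an SA/FFN branch, and the value $\varepsilon = 10^{-4}$ are assumptions chosen for illustration; the per-channel scaling of the residual branch is the mechanism described above.

```python
import numpy as np

def layerscale_residual(x, branch, scale):
    """Compute x + diag(scale) @ branch(x), with `scale` one value per channel."""
    return x + scale * branch(x)

rng = np.random.default_rng(0)
dim = 8
w = rng.normal(size=(dim, dim))

def branch(h):
    # Stand-in for a self-attention or FFN sub-block (assumption);
    # tanh keeps its outputs within [-1, 1].
    return np.tanh(h @ w)

x = rng.normal(size=(4, dim))
eps = 1e-4                   # illustrative small initial value
scale = np.full(dim, eps)    # learnable per-channel scale in a real model

y = layerscale_residual(x, branch, scale)
max_dev = np.abs(y - x).max()

# With a small initial scale, the block is nearly the identity map.
print(max_dev < 2 * eps)
```

During training the per-channel scales are free to grow, so each block can gradually move away from the identity as it learns a useful update.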

6. Practical Considerations and Guidelines

Framework-specific best practices emerging from the surveyed literature include:

Preserve knowledge by zero-initializing only prescribed parameter subsets during expansion (Gesmundo et al., 2023).
Apply staged or parallel expansion depending on capacity and plateau detection; expand along the axis (depth/width/heads) that bottlenecks progress.
Store minimal parameter cores: e.g., TLEG requires only two layer parameter sets for initialization of an entire depth family (Xia et al., 2023).
Select depth according to context length via $D(n)\approx5\log_2 n$ scaling for sequential tasks and state-heavy processing (Merrill et al., 5 Mar 2025).
Use constructive/layerwise training or composition to incorporate new expertise or correct drift without retraining the entire model (Bochkov, 8 Jul 2025).

Growing Transformers, in their various instantiations, underpin a movement away from monolithic, static architectures toward adaptive, resource-efficient, and composable model design, with tangible benefits in empirical performance, scalability, and collaborative research efficiency.