Stacking Initialization in Deep Learning

Updated 27 February 2026

Stacking Initialization is a method that copies weights from a trained shallow model to rapidly initialize a deeper architecture.
It enhances training by accelerating convergence, preserving gradient signals, and reducing overall training tokens and time.
Variants like gradual stacking and MIDAS have demonstrated empirical success in LLM pretraining and sequential recommendation tasks by introducing beneficial inductive biases.

Stacking initialization refers to a class of model growth and deep-learning initialization techniques wherein parameters from a shallow or small “base” model are explicitly and systematically reused—typically by literal duplication—to initialize a much deeper or larger architecture. This approach, recently formalized for applications including LLM pretraining, residual networks, sequential recommendation, and more, leverages symmetry and representational redundancy among trained blocks to drastically accelerate training convergence, improve numerical stability, and even induce algorithmically meaningful inductive biases in the resulting models (Du et al., 2024, Saunshi et al., 2024, Wang et al., 2020). In most cases, stacking initialization is used to create a deeper network by concatenating identical or near-identical copies of lower-depth blocks, which are subsequently finetuned or retrained.

1. Mathematical Formulation and Variants

Stacking initialization operates predominantly at the architectural and parameter-duplication level. For a pretrained network $M$ with $L$ layers, the key operator (a “depthwise stacking” or block-stacking step) creates a new model $\mathcal M$ of depth $gL$, where

$\mathcal M(x) = \underbrace{M \circ M \circ \cdots \circ M}_{g~\text{times}}(x),$

with each layer in $\mathcal M$ initialized as a direct copy of its counterpart in $M$:

$W_{tL+i}^{(\text{new})} = W_i^{(\text{small})}, \quad b_{tL+i}^{(\text{new})} = b_i^{(\text{small})}$

for $t = 0, \dots, g-1$ and $i = 1, \dots, L$ (Du et al., 2024).
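
The index mapping above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the dict-of-lists layer representation and the name `stack_depthwise` are assumptions made for the example.

```python
import copy

def stack_depthwise(base_layers, g):
    """Return g * L layers where new layer t*L + i is a copy of base layer i."""
    if g < 1:
        raise ValueError("growth factor g must be >= 1")
    stacked = []
    for _ in range(g):
        # deepcopy keeps the copies untied, so they can diverge in later training
        stacked.extend(copy.deepcopy(layer) for layer in base_layers)
    return stacked

# Hypothetical toy base model with L = 3 layers; each layer holds W and b.
base = [{"W": [[float(i)]], "b": [0.1 * i]} for i in range(1, 4)]
deep = stack_depthwise(base, g=4)
print(len(deep))  # 12
```

Because every copy is independent, the deep model starts with identical parameters in corresponding positions but is free to break that symmetry once training resumes.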

Important stacking variants include:

Gradual stacking: Model depth is increased in increments (by appending $b$ layers at each stage), with each deeper stage initializing the new layers by copying a selected subset (often the final or middle block) of layers from the previous stage.
MIDAS (Middle Gradual Stacking): At each growth step, the middle $b$ layers of the current model are duplicated and inserted, empirically found to confer a distinct inductive bias (Saunshi et al., 2024).
Iterative stacking for sequential recommenders: Full blocks (adjacent or cross) are duplicated to double depth, leveraging high cosine similarity among intermediate representations (Wang et al., 2020).

There is typically no additional scaling, normalization, or rescaling of the copied parameters at stack time; the original normalization (e.g., RMSNorm, LayerNorm) and activation scaling of the baseline model are preserved.
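
The two gradual growth steps described above can be sketched as list operations. The function names are hypothetical, and the exact insertion position for the middle-block variant (copies placed directly after the block they duplicate) is an assumption of this sketch rather than a detail confirmed by the text.

```python
import copy

def grow_last(layers, b):
    """Gradual stacking: append copies of the final b layers."""
    if not 0 < b <= len(layers):
        raise ValueError("b must be between 1 and the current depth")
    return layers + copy.deepcopy(layers[-b:])

def grow_middle(layers, b):
    """MIDAS-style step: duplicate the middle b layers and insert the copies
    directly after the originals (insertion point is an assumption)."""
    if not 0 < b <= len(layers):
        raise ValueError("b must be between 1 and the current depth")
    start = (len(layers) - b) // 2
    middle = layers[start:start + b]
    return layers[:start + b] + copy.deepcopy(middle) + layers[start + b:]

layers = ["A", "B", "C", "D"]   # stand-ins for trained layers
print(grow_last(layers, 2))     # ['A', 'B', 'C', 'D', 'C', 'D']
print(grow_middle(layers, 2))   # ['A', 'B', 'C', 'B', 'C', 'D']
```

In both cases the copied parameters are inserted unchanged, matching the observation that no rescaling is applied at stack time.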

2. Theoretical Rationale

The central theoretical motivations behind stacking initialization are:

Capacity preservation and expansion: Stacking increases effective network depth/capacity without random initialization of additional layers. The duplicated blocks already implement a nontrivial transformation, ensuring the enlarged model does not start near a random function (Du et al., 2024).
Locality to optimal solution: Because the deeper model is a deterministic expansion of a well-trained, lower-depth solution, it starts “closer” to the optimal high-capacity solution than a randomly initialized model, often reducing optimization path-length and training time.
Accelerated optimization via Nesterov analogy: Stacking has been rigorously analyzed as a functional analogue of Nesterov’s accelerated gradient descent—especially in deep linear residual settings—yielding provable $\exp(-\Omega(T/\sqrt{\kappa}))$ convergence rates under suitable conditions, where $\kappa$ is the condition number (Agarwal et al., 2024).
Gradient signal preservation: Initializing new layers to nonrandom (already trained) parameters mitigates vanishing/exploding gradients typically encountered in very deep nets (Du et al., 2024).
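
To give a feel for the $\sqrt{\kappa}$ rate mentioned above, the following sketch compares plain gradient descent with standard Nesterov momentum on an ill-conditioned quadratic. It illustrates the accelerated rate only; it is not a simulation of stacking, and the test function, step size, and momentum coefficient are textbook choices assumed for the example.

```python
import math

def run(kappa, steps, accelerated):
    """Minimise f(x) = 0.5 * (x1**2 + kappa * x2**2); return distance to the optimum 0."""
    eig = (1.0, float(kappa))           # Hessian eigenvalues: mu = 1, L = kappa
    lr = 1.0 / kappa                    # step size 1/L
    root = math.sqrt(kappa)
    beta = (root - 1) / (root + 1) if accelerated else 0.0
    x = [1.0, 1.0]
    prev = list(x)
    for _ in range(steps):
        # look-ahead point (equals x when beta = 0)
        y = [xi + beta * (xi - pi) for xi, pi in zip(x, prev)]
        prev = x
        # gradient step evaluated at the look-ahead point
        x = [yi - lr * lam * yi for yi, lam in zip(y, eig)]
    return math.hypot(*x)

gd_err = run(kappa=100, steps=200, accelerated=False)
nag_err = run(kappa=100, steps=200, accelerated=True)
print(gd_err, nag_err)
```

With $\kappa = 100$, the accelerated run ends several orders of magnitude closer to the optimum after the same number of steps, which is the gap the $T/\sqrt{\kappa}$ versus $T/\kappa$ exponents predict.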

For gradual or blockwise stacking, the theoretical mechanism extends to the induction of “soft parameter tying” and symmetry among blocks. Empirically, cosine similarity between repeated block parameters or outputs often exceeds 0.9, aligning stacked architectures with the algorithmic properties of looped or recurrent models (Saunshi et al., 2024).
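
A block-similarity check of the kind described above might look like the following sketch. The flat-list parameter layout and the example values are hypothetical; in practice the vectors would come from the trained blocks' weight tensors or their outputs.

```python
import math

def flatten(block):
    """Concatenate a block's parameter lists into one vector (sorted by name)."""
    return [v for name in sorted(block) for v in block[name]]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical block and a copy that has drifted slightly during training.
block_a = {"W": [0.5, -1.0, 2.0], "b": [0.1]}
block_b = {"W": [0.6, -0.9, 2.1], "b": [0.0]}

sim = cosine(flatten(block_a), flatten(block_b))
print(sim > 0.9)  # True
```

Values near 1 indicate that the copies have stayed close to one another, which is the "soft parameter tying" behaviour the paragraph refers to.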

3. Empirical Findings and Scaling Behavior

Stacking initialization consistently yields strong empirical acceleration in token/sample efficiency, wall-clock convergence, and benchmark accuracy when training large models. Notable results and practical findings include:

Model/Task	Depth/Size	Baseline Cost	Stacking Cost	Speedup	Accuracy Gain
LLM ($G_{\text{stack}}$) (Du et al., 2024)	Up to 7B params	300B tokens	194B tokens	54.6%	+2.1 points (Harness), 2–5% on ARC/PIQA
Sequential Recs (Wang et al., 2020)	Up to 100 layers	490 min	280 min	1.75×	+0.0006 MRR@5 (StackRec-32)
Transformer UL2 (Saunshi et al., 2024)	1B, 2B, 8B params	fixed	fixed	1.24–1.26×	5–15% accuracy gain on reasoning/math

In these reports, stacking reaches the baseline loss with substantially fewer training tokens for decoder-only LLMs (a 54.6% speedup at the 7B scale), matches or improves final checkpoint accuracy despite using fewer update steps, and converges stably and faster at depths ranging from several tens to over one hundred layers (Du et al., 2024, Wang et al., 2020, Saunshi et al., 2024). Ablation studies indicate that:

“Block-wise” copying (grouping all layers of a block together and repeating the block as a unit) outperforms interpolated or staggered alternatives that interleave copies of individual layers.
Introducing a small amount of random noise to duplicated parameters (~20%) can further improve final accuracy, indicating strict function preservation is not required.
Stacking only the front layers yields little benefit; middle or back block copying is most effective (Du et al., 2024, Saunshi et al., 2024).
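
The noise ablation above can be illustrated with a small sketch that perturbs weights as they are copied. How the reported "~20%" figure maps onto a noise scale is not specified here, so `rel_std` is a hypothetical knob (noise standard deviation relative to the weights' RMS), and the helper name is an assumption.

```python
import random

def copy_with_noise(weights, rel_std, rng):
    """Copy a flat weight list, adding Gaussian noise scaled to the weights' RMS."""
    rms = (sum(w * w for w in weights) / len(weights)) ** 0.5
    return [w + rng.gauss(0.0, rel_std * rms) for w in weights]

rng = random.Random(0)                 # fixed seed for reproducibility
base = [0.5, -1.0, 2.0, 0.1]           # hypothetical trained weights

noisy = copy_with_noise(base, rel_std=0.05, rng=rng)   # perturbed copy
exact = copy_with_noise(base, rel_std=0.0, rng=rng)    # plain duplication

print(exact == base)   # True
print(noisy == base)   # False
```

Setting `rel_std=0.0` recovers exact duplication, so the same routine covers both the function-preserving and the perturbed variants.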

4. Inductive Bias and Generalization

A key emergent property of stacking initialization is its tendency to induce an algorithmically-relevant inductive bias in the resulting network. Recent analysis of models pretrained via stacking (and, especially, MIDAS) reveals:

Reasoning bias: Stacked models perform substantially better on reasoning-intensive tasks (open-book/closed-book QA, math word problems, synthetic algorithmic tasks), with consistent accuracy improvements relative to baseline/vanilla-pretrained models, even when validation loss/perplexity is slightly inferior (Saunshi et al., 2024).
Block similarity and looped-model analogy: Repeated stacking causes blockwise parameter and output similarity to rise above 0.9, mimicking looped/shared-weight architectures such as Universal Transformers and ALBERT. This “iterative computation” bias is conjectured to be the mechanism underlying improved reasoning and retrieval of algorithmic primitives in LLMs (Saunshi et al., 2024).
Robustness and implicit regularization: StackRec and multilevel stacking display reduced sensitivity to random seeds and hyperparameter settings, a phenomenon ascribed to the regularization and redundancy induced by parameter copying (Wang et al., 2020, Cyr et al., 2019).
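
The looped-model analogy can be made concrete with a toy example: at initialization, a stack of identical copies computes exactly what a single weight-shared block applied repeatedly computes, and the two only differ once the copies drift. The scalar residual block below is a deliberately simplified stand-in, and all names are hypothetical.

```python
def residual_block(x, w):
    """Toy residual block: x + w * x, applied elementwise."""
    return [xi + w * xi for xi in x]

def looped_forward(x, w, g):
    """Weight-shared (looped) model: apply the same block g times."""
    for _ in range(g):
        x = residual_block(x, w)
    return x

def stacked_forward(x, ws):
    """Stacked model: one untied weight per block."""
    for w in ws:
        x = residual_block(x, w)
    return x

x = [1.0, 2.0]
w = 0.1
looped_out = looped_forward(x, w, g=3)
stacked_out = stacked_forward(x, [w, w, w])         # freshly stacked copies
drifted_out = stacked_forward(x, [w, 0.12, w])      # one copy has drifted

print(stacked_out == looped_out)   # True
print(drifted_out == looped_out)   # False
```

A stacked model whose blocks stay highly similar therefore remains close to the looped computation, which is the sense in which stacking biases the network toward iterative computation.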

5. Practical Implementation and Guidelines

Practitioners should attend to several usage and tuning conventions derived from systematic ablation and scaling studies:

Growth factor: Repeating base blocks with $g \in [2, 4]$ performed best in the reported LLM and recommender-system experiments, with $g=4$ recommended as a robust default (Du et al., 2024).
Growth timing: The optimal point for stacking (number of tokens $d$ before growth) is determined by a fitted “Iso-FLOP” law:

$\log_{10}(d) = a \log_{10}(N) + \frac{b}{\log_{10}(C)} + c$

with $a=0.88$, $b=163.27$, $c=-5.74$, where $N$ is the target model parameter count and $C$ the pretraining FLOP budget (Du et al., 2024).

Stacking variant: Both adjacent and cross-block stacking are viable; function preservation and quick fine-tuning are possible with exact parameter copying.
Noise addition: Injecting small random perturbations to stacked weights may further regularize and improve generalization.
No special normalization: Use the original normalization layers as configured in the base model; no extra scaling is needed.
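
As a worked example of the growth-timing law in the guidelines above, the sketch below evaluates the fitted formula with the stated constants. It assumes $N$ is a raw parameter count, $C$ is a raw FLOP count, and $d$ is a token count; the budget of $10^{22}$ FLOPs is an arbitrary illustrative value, not one taken from the source.

```python
import math

# Fitted constants quoted in the text (Du et al., 2024).
A, B, C0 = 0.88, 163.27, -5.74

def growth_tokens(n_params, flops):
    """Tokens d to train the small model before stacking, per the fitted law
    log10(d) = A * log10(N) + B / log10(C) + C0."""
    log_d = A * math.log10(n_params) + B / math.log10(flops) + C0
    return 10 ** log_d

# Hypothetical setting: 7B-parameter target, 1e22 FLOP budget.
d = growth_tokens(n_params=7e9, flops=1e22)
print(f"{d:.3g} tokens before growth")
```

Under these assumptions the law places the growth point in the tens of billions of tokens; it moves later for larger target models and earlier for larger compute budgets.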

Limitations

Stacking requires an initial phase of smaller-model training, which adds overhead unless it is amortized over a sufficiently large training run.
Strict function preservation is not guaranteed beyond stack time, though this has not impaired empirical stability in tested regimes.
Applicability and optimal parameterization may vary for architectures beyond decoder-only transformers and residual models (Du et al., 2024, Saunshi et al., 2024).
MIDAS and similar block-copying approaches are most effective when there is blockwise homogeneity and parameter similarity among trained blocks; in highly heterogeneous architectures, stacking gains may diminish (Wang et al., 2020).

6. Broader Connections and Theoretical Insights

Stacking initialization unifies and extends several traditions in neural network growth and initialization:

ODE-based control and nested iteration: Multilevel schemes refine network depth in analogy to time-discretized optimal control problems, with parameter interpolation between shallow and deep representations guiding initialization (Cyr et al., 2019).
LayerNorm + residual everywhere-criticality: Proper stacking and ordering of normalization and skip connections yields “everywhere-critical” initialization, guaranteeing gradient propagation for arbitrary parameterizations (Doshi et al., 2021).
Boosting/ensemble acceleration: Stagewise stacking is directly analogous to accelerated functional-gradient boosting methods, with provable accelerated convergence for quadratic and strongly convex objectives (Agarwal et al., 2024).

These connections help explain why stacking initialization has been effective across the model classes and data domains studied, and they account for its observed effects on trainability, generalization, and reasoning.
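
The parameter-interpolation idea behind the ODE-based view can be sketched as a simple prolongation from a coarse depth grid to a finer one. This is one possible operator written for illustration, not the scheme of Cyr et al. (2019): it treats each layer's parameter as a scalar sample along depth and interpolates linearly, and it does not rescale the residual step size that an ODE discretization would also adjust.

```python
def prolong(params, new_depth):
    """Linearly interpolate per-layer scalars from len(params) layers to new_depth layers."""
    old = len(params)
    if old < 2 or new_depth < 2:
        raise ValueError("need at least two layers on both grids")
    out = []
    for j in range(new_depth):
        pos = j * (old - 1) / (new_depth - 1)   # position on the coarse depth grid
        lo = min(int(pos), old - 2)             # left neighbour, clamped at the end
        frac = pos - lo
        out.append((1 - frac) * params[lo] + frac * params[lo + 1])
    return out

coarse = [0.0, 1.0, 2.0]        # hypothetical parameters of a 3-layer model
fine = prolong(coarse, 5)
print(fine)  # [0.0, 0.5, 1.0, 1.5, 2.0]
```

Compared with literal duplication, interpolation yields a deeper model whose parameters vary smoothly with depth, which is the property the nested-iteration analogy relies on.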

7. Future Directions and Open Problems

Major open research challenges include:

Formalizing the conditions under which stacking preserves or amplifies parameter similarity and when the looped-model analogy best applies (Saunshi et al., 2024).
Extending stacking initialization to encoder–decoder architectures, multimodal models, or systems with significant block heterogeneity.
Exploring alternative nonuniform growth/splitting/partial-sharing operators to refine or alter the “inductive bias” introduced by stacking (Saunshi et al., 2024).
Integrating stacking initialization into pre-LN/post-LN configurations with nuanced scaling/normalization strategies, especially in unstable very-deep regimes.

The continued mathematical and empirical study of stacking initialization promises to further accelerate the efficient training of deep, large-scale neural networks and to shed light on the algorithmic properties of iterative computation in neural sequence models.