Layerwise Variance Decomposition
- Layerwise variance decomposition is a method that separates the total variance in neural networks into sample variance and network-dependent bias, clarifying the role of randomness.
- It uses both theoretical infinite‐width analysis and empirical verification to show that sample variance decays while bias increases with network depth.
- Techniques like BatchNorm and analyses in in-context learning demonstrate how controlling variance impacts training speed, robustness, and representation quality.
Layerwise variance decomposition is a principled methodology for analyzing the sources and dynamics of variation in neural network representations across depth, focusing on how randomness arising from both network parameters and sample inputs propagates through model layers. This concept enables precise separation of the contributions of these distinct sources of randomness or uncertainty, yielding fundamental insight into initialization behavior, feature collapse, training speed, and representation evolution in both feedforward and attention-based architectures. Recent theoretical and empirical studies further extend the framework to understand compressive and expansive phases in LLMs, linking the bias–variance structure of in-context learning to sample efficiency, robustness, and emergent representation geometry (Luther et al., 2019, Jiang et al., 22 May 2025).
1. Decomposition of Layerwise Variance in Feedforward Networks
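In feedforward ReLU multilayer perceptrons (MLPs) with standard Kaiming (He) initialization, the pre-activation $z_i^{(l)}$ at neuron $i$ in layer $l$ is given by
$$z_i^{(l)} = \sum_{j=1}^{n_{l-1}} W_{ij}^{(l)}\, f\!\left(z_j^{(l-1)}\right),$$
where $f$ is the ReLU, the weights $W_{ij}^{(l)} \sim \mathcal{N}(0,\, 2/n_{l-1})$ are i.i.d., and $n_{l-1}$ is the width of layer $l-1$ (Luther et al., 2019).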
Variance can be analyzed in two distinct ways:
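- Total variance $\mathbb{E}_\theta \mathbb{E}_x\big[z^2\big]$: averaged over both random network initializations $\theta$ and randomly drawn inputs $x$ (the pre-activation $z$ has zero mean over initializations, so this second moment is a variance).
- Sample variance $\mathbb{E}_\theta\big[\mathrm{Var}_x(z)\big]$ and sample mean squared $\mathbb{E}_\theta\big[(\mathbb{E}_x z)^2\big]$: computed over random inputs for a fixed initialized network, then averaged over networks.
Crucially,
$$\mathbb{E}_\theta \mathbb{E}_x\big[z^2\big] = \mathbb{E}_\theta\big[\mathrm{Var}_x(z)\big] + \mathbb{E}_\theta\big[(\mathbb{E}_x z)^2\big],$$
i.e., total variance decomposes cleanly into sample variance (over data) and sample mean squared (network-dependent means), revealing how much variability is due to differences between samples versus randomness in the network initialization (Luther et al., 2019).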
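The decomposition can be checked numerically. The sketch below is a minimal illustration, assuming the pre-activations of one unit are stored in an array with one row per network and one column per input; the function name `decompose_variance` and the toy data are hypothetical and not taken from Luther et al.

```python
import numpy as np

def decompose_variance(z):
    """Split the second moment of z into sample variance and sample mean squared.

    z has shape (num_networks, num_samples): one unit's pre-activation for
    every (network, input) pair.
    """
    total = np.mean(z ** 2)                             # averaged over networks and inputs
    sample_var = np.mean(np.var(z, axis=1))             # variance over inputs, then mean over networks
    sample_mean_sq = np.mean(np.mean(z, axis=1) ** 2)   # squared per-network mean, then mean over networks
    return total, sample_var, sample_mean_sq

rng = np.random.default_rng(0)
# Toy data (an assumption for illustration): each "network" contributes its own
# offset, and each input adds independent noise around that offset.
offsets = rng.normal(size=(200, 1))
z = offsets + 0.5 * rng.normal(size=(200, 1000))

total, sample_var, sample_mean_sq = decompose_variance(z)
print(total, sample_var + sample_mean_sq)
```

The two printed numbers agree up to floating-point error, because the identity holds row by row when the variance is computed with NumPy's default `ddof=0`.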
2. Infinite-Width Analysis and Sample Variance Decay
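In the infinite-width limit, analytic results are obtained via mean-field theory. The propagation of input similarity across layers gives rise to a recursive map for the correlation $c^{(l)}$ between the pre-activations of two inputs:
$$c^{(l+1)} = 2\int Dz_1\, Dz_2\; f(z_1)\, f\!\left(c^{(l)} z_1 + \sqrt{1-\big(c^{(l)}\big)^2}\; z_2\right),$$
with $f$ the ReLU and $Dz$ denoting the standard Gaussian measure (Luther et al., 2019).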
Key findings:
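- Kaiming initialization preserves the total variance at every layer (assuming unit-variance input).
- However, as depth $l \to \infty$:
  - the sample mean squared approaches the total variance (mean squares dominate);
  - the sample variance tends to zero (sample variance collapses).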
This means all random input vectors become nearly collinear in activation space: each unit's pre-activation is almost independent of the input and close to a constant that depends only on the network, even as total variance remains constant.
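The collapse can be illustrated by iterating the infinite-width correlation map directly. For a fixed network, the squared sample mean equals the average product of pre-activations over two independent inputs, so after averaging over networks the sample mean squared is the covariance between two independent inputs, and the fraction of total variance that is sample variance is $1 - c$, where $c$ is their correlation. The sketch below assumes the standard closed form of this map for ReLU (the arc-cosine expression) and a starting correlation of zero for independent zero-mean inputs; the function names are hypothetical.

```python
import numpy as np

def relu_correlation_map(c):
    """One layer of the infinite-width ReLU/Kaiming correlation map (closed form)."""
    c = np.clip(c, -1.0, 1.0)
    return (np.sqrt(1.0 - c ** 2) + c * (np.pi - np.arccos(c))) / np.pi

def sample_variance_fraction(depth, c0=0.0):
    """Fraction 1 - c of the total variance that is sample variance, layer by layer."""
    c = c0
    fractions = []
    for _ in range(depth):
        c = relu_correlation_map(c)
        fractions.append(1.0 - c)
    return np.array(fractions)

fractions = sample_variance_fraction(50)
print(fractions[[0, 9, 49]])  # fraction after 1, 10, and 50 applications of the map
```

The fraction shrinks monotonically because the map has its fixed point at $c = 1$, while the total variance is unaffected by the iteration.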
3. Empirical Verification in Finite-Width Networks
Numerical experiments in MLPs of varying widths and depths demonstrate:
- For small widths, sample variance decay is less pronounced due to imperfect self-averaging.
- As width increases, the empirical ratios closely track the infinite-width predictions.
- Even at finite width, after 50 layers the sample variance decays by an order of magnitude, while total variance remains stable.
- The phenomenon generalizes to contemporary architectures (e.g., ALL-CNN-C on CIFAR-10, U-Net on ISBI), confirming robustness of the observed decay (Luther et al., 2019).
| Setting | Total variance | Sample variance | Ratio of total to sample variance grows with depth? |
|---|---|---|---|
| Kaiming-only (deep) | constant (preserved) | decays with depth | Yes |
| BatchNorm | preserved | preserved | No (fixed) |
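A finite-width version of this experiment can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' experimental setup: the Gaussian inputs, the widths, the number of networks, and the function name `sample_variance_by_depth` are all assumptions.

```python
import numpy as np

def sample_variance_by_depth(width, depth, num_networks=10, num_samples=256, seed=0):
    """Average ratio of sample variance to total variance at each layer
    of randomly initialized ReLU MLPs with Kaiming-scaled Gaussian weights."""
    rng = np.random.default_rng(seed)
    ratios = np.zeros(depth)
    for _ in range(num_networks):
        h = rng.normal(size=(num_samples, width))  # random unit-variance inputs
        for layer in range(depth):
            fan_in = h.shape[1]
            W = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, width))
            z = h @ W                              # pre-activations, shape (samples, units)
            sample_var = z.var(axis=0).mean()      # variance over inputs, averaged over units
            total = (z ** 2).mean()                # second moment over inputs and units
            ratios[layer] += sample_var / total
            h = np.maximum(z, 0.0)                 # ReLU
    return ratios / num_networks

ratios = sample_variance_by_depth(width=128, depth=30)
print(ratios[[0, 9, 29]])
```

Re-running with different `width` values gives a rough view of how closely finite networks follow the infinite-width prediction.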
4. Batch Normalization and Preservation of Sample Variance
Batch Normalization (BatchNorm) standardizes each feature at every layer to have sample mean zero and variance one over the batch:
$$\hat{z}_i = \frac{z_i - \mu_i}{\sigma_i},$$
where $\mu_i$ and $\sigma_i^2$ are the mini-batch empirical mean and variance of feature $i$ (Luther et al., 2019).
With BatchNorm, for random initial networks:
- The sample variance equals one at all layers, by construction.
- Sample variance decay is eliminated.
- As a consequence, each layer amplifies the backward gradient by a constant factor greater than one, driving deep untrained networks towards the “chaotic” regime with exponentially increasing gradients.
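The preservation of sample variance under BatchNorm can be sketched as follows. This is a minimal forward-pass illustration without learnable scale and shift parameters; the small constant `eps`, the Gaussian inputs, and the function names are assumptions added for the example.

```python
import numpy as np

def batchnorm(z, eps=1e-5):
    """Standardize each feature (column) over the batch (rows)."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mu) / np.sqrt(var + eps)

def sample_variance_with_batchnorm(width, depth, batch_size=256, seed=0):
    """Per-layer sample variance of normalized pre-activations in a random ReLU MLP."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch_size, width))
    sample_vars = []
    for _ in range(depth):
        fan_in = h.shape[1]
        W = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, width))
        z_hat = batchnorm(h @ W)
        sample_vars.append(z_hat.var(axis=0).mean())  # close to 1 by construction
        h = np.maximum(z_hat, 0.0)                    # ReLU
    return np.array(sample_vars)

sample_vars = sample_variance_with_batchnorm(width=128, depth=50)
print(sample_vars.min(), sample_vars.max())
```

The sketch covers only the forward pass; the gradient amplification described above would require backpropagating through the normalization and is not shown here.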
5. Layerwise Bias–Variance Decomposition in In-Context Learning
The concept of variance decomposition generalizes to in-context learning (ICL) in LLMs. Here, task representations are extracted at specific layers as the hidden state of a separator token before the query (Jiang et al., 22 May 2025).
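Let $K$ denote the number of demonstrations, $h_K^{(l)}$ the task representation extracted at layer $l$, and $h_\infty^{(l)}$ the oracle (infinite-$K$) task embedding. Then, at layer $l$, with expectations taken over the draw of the $K$ demonstrations:
- Variance: $\mathbb{E}\,\big\lVert h_K^{(l)} - \mathbb{E}\big[h_K^{(l)}\big] \big\rVert^2$
- Bias: $\big\lVert \mathbb{E}\big[h_K^{(l)}\big] - h_\infty^{(l)} \big\rVert$
A primary result (under linear attention) is that both bias and variance decay as $K$ increases.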
This explains why increasing the number of demonstrations improves ICL performance: more demonstrations allow the model to compress task information into a lower-variance, lower-bias “task vector” in early layers. The “expansion” stage in later layers then integrates query information, increasing variance again as the model conditions its prediction (Jiang et al., 22 May 2025).
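The bias and variance of a task representation can be estimated empirically from repeated draws of demonstration sets. The sketch below is a toy illustration only: it replaces the model's hidden states with a hypothetical shrinkage estimator (the sum of $K$ noisy demonstration embeddings divided by $K + 1$), chosen so that both squared bias and variance shrink with $K$. It is not the linear-attention model analyzed by Jiang et al., and all names are assumptions.

```python
import numpy as np

def bias_variance(embeddings, oracle):
    """Squared bias and variance of task vectors relative to an oracle embedding.

    embeddings has shape (num_trials, dim): one task vector per independent
    draw of the demonstration set.
    """
    mean_embedding = embeddings.mean(axis=0)
    bias_sq = np.sum((mean_embedding - oracle) ** 2)
    variance = np.mean(np.sum((embeddings - mean_embedding) ** 2, axis=1))
    return bias_sq, variance

def toy_task_vectors(K, oracle, num_trials, rng, noise=1.0):
    """Toy stand-in for a layer's task vector (NOT real hidden states):
    a shrinkage average of K noisy demonstration embeddings."""
    demos = oracle + noise * rng.normal(size=(num_trials, K, oracle.size))
    return demos.sum(axis=1) / (K + 1)

rng = np.random.default_rng(0)
oracle = np.ones(8)
results = {}
for K in (4, 16, 64):
    results[K] = bias_variance(toy_task_vectors(K, oracle, 2000, rng), oracle)
    print(K, results[K])
```

With real models, `embeddings` would instead hold the separator-token hidden states at a chosen layer, and the oracle would have to be approximated with a large number of demonstrations.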
6. Connections to Training Dynamics and Empirical Performance
Preserving sample variance at initialization—rather than total variance alone—leads to faster convergence:
- Data-dependent scale + bias initializations that ensure mean-zero, unit sample variance per layer accelerate training compared to total-variance-only initializations.
- In benchmarks (e.g., ALL-CNN-C on CIFAR-10), scale+bias schemes reach 10% training loss 40% faster than scale-only schemes, with performance competitive with BatchNorm (Luther et al., 2019).
- Fewer demonstration samples in ICL increase both bias and variance in the compressed task representation, flattening the task/instance separation and reducing performance; larger models further reduce minimum achievable variance, yielding cleaner task compression (Jiang et al., 22 May 2025).
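A data-dependent scale + bias initialization of the kind described in the first bullet can be sketched as follows. This is a minimal illustration in the spirit of that description, not the authors' exact procedure: it processes a single batch layer by layer, rescales each unit's incoming weights to unit sample variance, and sets each bias to cancel the sample mean. The function name and the Gaussian calibration batch are assumptions.

```python
import numpy as np

def data_dependent_init(x, widths, seed=0):
    """Initialize a ReLU MLP so every pre-activation has zero sample mean
    and unit sample variance on the calibration batch x."""
    rng = np.random.default_rng(seed)
    params = []
    h = x
    for width in widths:
        W = rng.normal(size=(h.shape[1], width))
        z = h @ W
        W = W / (z.std(axis=0) + 1e-8)   # scale: unit sample variance per unit
        b = -(h @ W).mean(axis=0)        # bias: zero sample mean per unit
        params.append((W, b))
        h = np.maximum(h @ W + b, 0.0)   # ReLU output feeds the next layer
    return params

x = np.random.default_rng(1).normal(size=(512, 32))  # hypothetical calibration batch
params = data_dependent_init(x, widths=[64, 64, 64])

# Check the per-layer statistics on the same batch.
h = x
for W, b in params:
    z = h @ W + b
    print(abs(z.mean(axis=0)).max(), z.var(axis=0).mean())
    h = np.maximum(z, 0.0)
```

Dropping the bias step yields a scale-only variant, which fixes the spread of each unit but leaves its sample mean uncontrolled.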
7. Broader Implications for Representation Dynamics
Layerwise variance decomposition reveals structurally important phenomena:
- In randomly initialized deep ReLU networks, sample variance decay leads to information collapse in deep layers, explained precisely by the decomposition (Luther et al., 2019).
- BatchNorm and similar mechanisms that preserve or rescale sample variance prevent this collapse, but at a potential cost of gradient explosion.
- In modern attention-based architectures, the compression–expansion cycle observed via layerwise bias–variance analysis of internal representations underpins the effectiveness and robustness of ICL, characterizing how models distill, retain, and leverage task information (Jiang et al., 22 May 2025).
The theoretical and empirical framework of layerwise variance decomposition thus serves as a central tool for dissecting initialization strategies, understanding training speed, and interrogating internal representation evolution in both classical and contemporary neural architectures.