Ensemble structure of middle layers in large language models
Within the ensemble-averaging interpretation of Transformer-based large language models trained for next-token prediction, characterize whether the middle layers form a single ensemble approximating one function group or a fixed number of ensembles, each approximating its own function group. Then determine how increasing total depth allocates layers among these ensembles.
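As a toy illustration of why the two cases may be hard to tell apart from depth scaling alone, the sketch below compares one ensemble with a fixed number of ensembles. It rests on assumptions that are not from the source: each layer is modeled as an independent noisy estimate with variance `sigma2`, each ensemble averages its layers, and layers are split as evenly as possible among groups. The function names `allocate_layers` and `total_error_variance` are hypothetical. How depth is actually allocated is exactly what the problem leaves open.

```python
def allocate_layers(total_layers, num_groups):
    """Split layers as evenly as possible among groups (an assumed rule)."""
    if not 1 <= num_groups <= total_layers:
        raise ValueError("need 1 <= num_groups <= total_layers")
    base, extra = divmod(total_layers, num_groups)
    # The first `extra` groups receive one additional layer.
    return [base + 1 if i < extra else base for i in range(num_groups)]


def total_error_variance(group_sizes, sigma2=1.0):
    """Sum over groups of the variance of an average of n i.i.d. estimates.

    Averaging n independent estimates with variance sigma2 gives sigma2 / n.
    """
    return sum(sigma2 / n for n in group_sizes)


if __name__ == "__main__":
    for depth in (12, 24, 48):
        one = total_error_variance(allocate_layers(depth, 1))
        three = total_error_variance(allocate_layers(depth, 3))
        print(depth, one, three)
```

In this toy model the total variance is `sigma2 / L` for one ensemble and `K**2 * sigma2 / L` for `K` equal groups, so both fall inversely with depth `L` and differ only by a constant factor. A different allocation rule would change that prefactor and possibly the scaling.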
References
Within ensemble averaging, there can be more general situations that our study cannot distinguish. The middle layers may not be in one ensemble for one function group, but a fixed number of ensembles or function groups.
— Inverse Depth Scaling From Most Layers Being Similar
(2602.05970 - Liu et al., 5 Feb 2026) in Section 6 (Discussion)