- The paper shows that representation superposition underpins neural scaling, with loss decaying roughly as 1/m in the strong superposition regime.
- The methodology uses a toy model with controlled weight decay to distinguish weak superposition, whose scaling is sensitive to the feature frequency distribution, from strong superposition, whose scaling is robust thanks to overlapping representations.
- Empirical analysis on models like GPT-2 and OPT confirms a universal exponent near 1, highlighting superposition’s key role in efficient representation learning.
This paper, "Superposition Yields Robust Neural Scaling" (2505.10465), investigates the origin of neural scaling laws in LLMs, particularly the observation that test loss decreases predictably as a power law with increasing model size. The authors propose that representation superposition, the phenomenon where models represent more features than their hidden dimensionality allows by encoding features in overlapping vectors, is a key mechanism driving this scaling.
The research uses a simplified toy model to study the interplay between superposition and data structure. The toy model learns a weight matrix W to reconstruct input data x, where x is composed of n atomic features with varying frequencies $p_i$. The model's hidden dimension m is much smaller than n, and the loss is the squared error between the reconstructed and original data. The authors control the degree of superposition using a modified weight decay term in the optimizer. Strong superposition is achieved when many features (i.e., rows of W) retain significant norms, indicating they are represented even though their number exceeds the model dimension m; weak superposition means only the most frequent features are represented, without significant overlap.
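A minimal PyTorch sketch of this setup is given below. It assumes an Anthropic-style reconstruction $\hat{x} = \mathrm{ReLU}(W^\top W x + b)$, Zipfian feature frequencies, and a row-wise norm penalty folded into the loss as a stand-in for the paper's modified weight decay; the exact architecture, sampling, and regularization details may differ from the paper's.

```python
import torch

# Illustrative sizes (not the paper's): n features, hidden dim m << n,
# and power-law feature frequencies p_i ∝ i^{-alpha}.
n, m, alpha = 1024, 64, 1.5
p = torch.arange(1, n + 1, dtype=torch.float32) ** (-alpha)
p = p / p.sum()

W = torch.nn.Parameter(0.01 * torch.randn(n, m))   # one m-dim vector per feature
b = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W, b], lr=1e-3)

def sample_batch(batch_size: int = 256) -> torch.Tensor:
    # Each feature fires independently with probability p_i, magnitude ~ U(0, 1).
    active = (torch.rand(batch_size, n) < p).float()
    return active * torch.rand(batch_size, n)

wd = 1e-3  # row-wise "weight decay" strength; its sign and size tune superposition
for step in range(5_000):
    x = sample_batch()
    x_hat = torch.relu(x @ W @ W.T + b)              # reconstruct through the m-dim bottleneck
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()     # squared reconstruction error
    loss = recon + wd * W.norm(dim=1).sum()          # positive wd -> weak, negative -> strong superposition
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, rows of W with non-negligible norm indicate which features are represented.
```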
Through experiments with the toy model, the authors identify distinct scaling behaviors based on the degree of superposition and the feature frequency distribution.
- Weak Superposition: When superposition is weak, the loss scaling with model dimension m is sensitive to the distribution of feature frequencies $p_i$. If $p_i$ follows a power law $p_i \propto i^{-\alpha}$ with $\alpha > 1$, the loss scales as a power law $L \propto m^{-\alpha_m}$ with $\alpha_m \approx \alpha - 1$. This is explained by the fact that only the most frequent features up to a certain rank are represented, and the loss is dominated by the sum of probabilities of the unrepresented features. For feature distributions that are not power laws, the loss-vs-model-dimension curve in log-log space is not a straight line, indicating a deviation from power-law scaling.
- Strong Superposition: In the strong superposition regime, where many features are represented with overlapping vectors, the loss scaling is robust across a wide range of feature frequency distributions. The loss scales approximately as $L \propto m^{-\alpha_m}$ with $\alpha_m \approx 1$. This robust scaling is attributed to the geometry of high-dimensional spaces: when many vectors are packed into a lower dimension, the typical squared overlap between normalized vectors scales inversely with the dimension ($1/m$), as suggested by bounds related to equiangular tight frames (ETFs). The loss, which is influenced by these overlaps, consequently scales as $1/m$. For extremely skewed feature distributions (large $\alpha$), the model exponent $\alpha_m$ can exceed 1, becoming approximately $2(\alpha - 1)$, as the loss is dominated by the contribution of less frequent features represented with larger overlaps. Activation density (the expected number of activated features per data point) affects the constant factor of the loss but not the scaling exponent. A quick numerical check of both regimes follows this list.
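The sketch below numerically illustrates both claims with illustrative sizes (not the paper's settings): the tail-mass estimate for weak superposition, whose log-log slope should be roughly $-(\alpha - 1)$, and the $\sim 1/m$ mean squared overlap of many random unit vectors, alongside the Welch bound attained by ETFs.

```python
import numpy as np

n, alpha = 100_000, 2.0
ms = np.array([32, 64, 128, 256, 512, 1024])

# Weak superposition: only the ~m most frequent features are represented, so the
# loss is roughly the tail mass sum_{i>m} p_i ~ m^{1-alpha}.
p = np.arange(1, n + 1, dtype=float) ** (-alpha)
p /= p.sum()
tail_mass = np.array([p[m:].sum() for m in ms])
slope = np.polyfit(np.log(ms), np.log(tail_mass), 1)[0]
print(f"weak-superposition slope ~ {slope:.2f}, expected ~ {-(alpha - 1):.2f}")

# Strong superposition: pack n_vec >> m random unit vectors into R^m; their mean
# squared pairwise overlap is ~1/m, and the Welch/ETF bound on the maximum
# squared overlap is (n_vec - m) / (m * (n_vec - 1)), also ~1/m for n_vec >> m.
n_vec = 4096
for m in (64, 256):
    V = np.random.randn(n_vec, m)
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    G = V @ V.T
    mean_sq = (G[~np.eye(n_vec, dtype=bool)] ** 2).mean()
    welch = (n_vec - m) / (m * (n_vec - 1))
    print(f"m={m:4d}  mean sq overlap={mean_sq:.5f}  1/m={1 / m:.5f}  Welch={welch:.5f}")
```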
To connect these findings to real-world LLMs, the authors analyze several open-sourced LLMs (OPT, GPT-2, Qwen2.5, Pythia) of varying sizes. They treat tokens as atomic features and analyze the LLM head weight matrix W. Their analysis shows that these LLMs exhibit strong superposition: all tokens are represented, and the minimum row norms of W are non-zero. They also observe that empirical token frequencies follow a power law with an exponent $\alpha \approx 1$. Consistent with the toy model's prediction for strong superposition and $\alpha \approx 1$, they find that the mean squared overlaps of normalized rows in the LLM heads roughly scale as $1/m$. Furthermore, by fitting the empirical evaluation losses of these LLMs across different datasets and model dimensions m, they find a universal model exponent $\alpha_m = 0.91 \pm 0.04$, close to 1. They also infer a similar exponent for Chinchilla models. This quantitative agreement suggests that LLMs operate in the strong superposition regime and that their scaling behavior is significantly shaped by this mechanism.
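As a hedged illustration of this kind of head analysis, the snippet below loads GPT-2 via Hugging Face transformers, checks that all unembedding rows have non-zero norm, and estimates the mean squared overlap of normalized rows on a random subsample (to avoid forming the full vocab-by-vocab Gram matrix). The model choice and subsample size are illustrative, not the paper's exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
W = model.get_output_embeddings().weight.detach()   # (vocab_size, m); for GPT-2 this is tied to the input embedding
m = W.shape[1]

row_norms = W.norm(dim=1)                           # non-zero minimum -> every token is represented
print("min / median row norm:", row_norms.min().item(), row_norms.median().item())

# Mean squared overlap of normalized rows, estimated on a random subsample.
V = W / row_norms[:, None]
idx = torch.randperm(V.shape[0])[:2000]
G = V[idx] @ V[idx].T
off_diag = G[~torch.eye(len(idx), dtype=torch.bool)]
print(f"mean squared overlap = {(off_diag ** 2).mean().item():.5f}   1/m = {1 / m:.5f}")
```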
From an implementation perspective, the paper highlights that the efficiency of representation learning, specifically how well models can represent numerous features in a limited-dimensional space, is critical for scaling. The observation that weight decay can tune superposition in the toy model suggests that optimizer choices and regularization techniques could influence representation learning in real LLMs. The authors propose that techniques promoting strong superposition, such as normalizing hidden states or weights (similar to the unit-norm constraint explored with negative weight decay in the toy model or architectures like nGPT), could lead to more efficient models or training strategies. They also note that current LLM architectures might be approaching the limits of efficiency gains from merely increasing width/model dimension, suggesting a potential shift towards depth scaling in larger models.
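Two hedged sketches of such knobs, continuing the toy-model setting above: a row-norm regularizer whose sign tunes weak versus strong superposition (the negative-weight-decay idea), and a hard unit-norm projection of the rows of W in the spirit of normalized-weight architectures like nGPT. Both are illustrations of the idea, not the paper's exact implementation.

```python
import torch

def row_norm_regularizer(W: torch.Tensor, coeff: float) -> torch.Tensor:
    # coeff > 0 shrinks feature vectors (weak superposition); coeff < 0 rewards
    # non-zero rows, pushing the model toward representing many features in
    # superposition. Added to the reconstruction loss before backprop.
    return coeff * W.norm(dim=1).sum()

@torch.no_grad()
def project_rows_to_unit_norm(W: torch.Tensor, eps: float = 1e-8) -> None:
    # Hard-constraint alternative: keep every feature vector on the unit sphere,
    # applied in place after each optimizer step.
    W /= W.norm(dim=1, keepdim=True).clamp_min(eps)
```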
Practical considerations for applying these insights include understanding the trade-offs between different model sizes and architectures based on whether they are representation-limited (addressed by width and superposition) or parsing-limited (potentially addressed by depth). Training models that exhibit strong superposition might also benefit from optimizers designed to manage large numbers of overlapping feature representations efficiently.
The paper acknowledges limitations, including the simplicity of the toy model compared to complex LLMs and the lack of a rigorous analytical solution for the toy model's behavior in all regimes. Future work could explore the parsing-limited scaling regime, the specific mechanisms that lead real LLMs to strong superposition without explicit norm constraints, and how superposition affects emergent abilities beyond minimizing pre-training loss.