
Superposition Yields Robust Neural Scaling (2505.10465v2)

Published 15 May 2025 in cs.LG, cs.AI, and cs.CL

Abstract: The success of today's LLMs depends on the observation that larger models perform better. However, the origin of this neural scaling law -- the finding that loss decreases as a power law with model size -- remains unclear. Starting from two empirical principles -- that LLMs represent more things than the model dimensions (widths) they have (i.e., representations are superposed), and that words or concepts in language occur with varying frequencies -- we constructed a toy model to study the loss scaling with model size. We found that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency; if feature frequencies follow a power law, so does the loss. In contrast, under strong superposition, where all features are represented but overlap with each other, the loss becomes inversely proportional to the model dimension across a wide range of feature frequency distributions. This robust scaling behavior is explained geometrically: when many more vectors are packed into a lower dimensional space, the interference (squared overlaps) between vectors scales inversely with that dimension. We then analyzed four families of open-sourced LLMs and found that they exhibit strong superposition and quantitatively match the predictions of our toy model. The Chinchilla scaling law turned out to also agree with our results. We conclude that representation superposition is an important mechanism underlying the observed neural scaling laws. We anticipate that these insights will inspire new training strategies and model architectures to achieve better performance with less computation and fewer parameters.

Summary

  • The paper shows that representation superposition underpins neural scaling, with loss decaying roughly as 1/m in the strong superposition regime.
  • The methodology uses a toy model with controlled weight decay to distinguish a weak-superposition regime, whose scaling is sensitive to feature frequency, from a strong-superposition regime with robust scaling via overlapping representations.
  • Empirical analysis on models like GPT-2 and OPT confirms a universal exponent near 1, highlighting superposition’s key role in efficient representation learning.

This paper, "Superposition Yields Robust Neural Scaling" (2505.10465), investigates the origin of neural scaling laws in LLMs, particularly the observation that test loss decreases predictably as a power law with increasing model size. The authors propose that representation superposition, the phenomenon where models represent more features than their hidden dimensionality allows by encoding features in overlapping vectors, is a key mechanism driving this scaling.

The research uses a simplified toy model to study the interplay between superposition and data structure. The toy model learns a weight matrix $W$ to reconstruct input data $x$, where $x$ is composed of $n$ atomic features with varying frequencies $p_i$. The model's hidden dimension $m$ is much smaller than $n$. The loss is defined as the squared error between the reconstructed and original data. The authors control the degree of superposition in the toy model using a modified weight decay term in the optimizer. Strong superposition is achieved when many features (i.e., rows of $W$) have significant norms, indicating they are represented, even when their number exceeds the model dimension $m$. Weak superposition means only the most frequent features are represented, without significant overlap.
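As a concrete illustration, the following is a minimal sketch of such a reconstruction setup, assuming one active feature per sample drawn from a power-law distribution, a ReLU decoder, and plain Adam weight decay in place of the paper's modified weight-decay term; it illustrates the structure of the experiment rather than the paper's exact training protocol.

```python
# Toy-model sketch (assumptions: one-hot feature activations sampled from a
# power law, ReLU decoder, plain weight decay instead of the paper's modified
# weight-decay term).
import torch

n, m = 10_000, 64                       # n atomic features, hidden dimension m << n
alpha = 1.0                             # feature-frequency exponent
p = torch.arange(1, n + 1, dtype=torch.float32) ** (-alpha)
p /= p.sum()                            # p_i ∝ i^{-alpha}

W = torch.nn.Parameter(torch.randn(n, m) / m ** 0.5)  # row i = vector for feature i
b = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W, b], lr=1e-3, weight_decay=1e-4)

for step in range(5_000):
    # Sample a batch of one-hot feature activations according to p_i.
    idx = torch.multinomial(p, num_samples=512, replacement=True)
    x = torch.zeros(512, n)
    x[torch.arange(512), idx] = 1.0

    h = x @ W                           # encode into m dimensions
    x_hat = torch.relu(h @ W.T + b)     # decode back to n features
    loss = ((x_hat - x) ** 2).sum(dim=1).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

# Features whose row norms stay well above zero are "represented".
print((W.norm(dim=1) > 0.1).sum().item(), "of", n, "features represented")
```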

Through experiments with the toy model, the authors identify distinct scaling behaviors based on the degree of superposition and the feature frequency distribution.

  • Weak Superposition: When superposition is weak, the loss scaling with model dimension $m$ is sensitive to the distribution of feature frequencies $p_i$. If $p_i$ follows a power law $p_i \propto i^{-\alpha}$ with $\alpha > 1$, the loss scales as a power law $L \propto m^{-\alpha_m}$ with $\alpha_m \approx \alpha - 1$. This is explained by the fact that only the most frequent features up to a certain rank are represented, and the loss is dominated by the sum of probabilities of the unrepresented features. For feature distributions that are not power laws, the loss-vs-model-dimension curve in log-log space is not a straight line, indicating a deviation from power-law scaling.
  • Strong Superposition: In the strong superposition regime, where many features are represented with overlapping vectors, the loss scaling is robust across a wide range of feature frequency distributions. The loss scales approximately as $L \propto m^{-\alpha_m}$ with $\alpha_m \approx 1$. This robust scaling is attributed to the geometry of high-dimensional spaces: when many vectors are packed into a lower-dimensional space, the typical squared overlap between normalized vectors scales inversely with the dimension ($1/m$), as suggested by bounds related to equiangular tight frames (ETFs). The loss, which is influenced by these overlaps, consequently scales as $1/m$ (see the numerical sketch after this list). For extremely skewed feature distributions (large $\alpha$), the model exponent $\alpha_m$ can become larger than 1, approximately $2(\alpha - 1)$, as the loss becomes dominated by the contribution of less frequent features represented with larger overlaps. Activation density (the expected number of activated features per data point) affects the constant factor of the loss but not the scaling exponent.
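The geometric claim behind the strong-superposition result can be checked numerically. The sketch below uses random unit vectors as a stand-in for learned feature directions (an assumption; the paper argues via overlap bounds related to ETFs) and shows that the mean squared overlap tracks $1/m$.

```python
# Numerical check: pack n >> m random unit vectors into m dimensions and
# measure the mean squared overlap, which should track 1/m.
import numpy as np

rng = np.random.default_rng(0)
n = 2048                                            # number of feature vectors
for m in (32, 64, 128, 256):
    V = rng.standard_normal((n, m))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalize each vector
    G = V @ V.T                                     # pairwise overlaps
    off_diag = G[~np.eye(n, dtype=bool)]
    print(f"m={m:4d}  mean squared overlap={np.mean(off_diag**2):.5f}  1/m={1/m:.5f}")
```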

To connect these findings to real-world LLMs, the authors analyze several open-sourced LLM families (OPT, GPT-2, Qwen2.5, Pythia) of varying sizes. They treat tokens as atomic features and analyze the LLM head weight matrix $W$. Their analysis shows that these LLMs exhibit strong superposition: all tokens are represented, and the minimum row norms of $W$ are non-zero. They also observe that empirical token frequencies follow a power law with an exponent $\alpha \approx 1$. Consistent with the toy model's prediction for strong superposition and $\alpha \approx 1$, they find that the mean squared overlaps of normalized rows in the LLM heads roughly scale as $1/m$. Furthermore, by fitting the empirical evaluation losses of these LLMs across different datasets and model dimensions $m$, they find a universal model exponent $\alpha_m = 0.91 \pm 0.04$, close to 1. They also infer a similar exponent for Chinchilla models. This quantitative agreement suggests that LLMs operate in the strong superposition regime and that their scaling behavior is significantly shaped by this mechanism.
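A rough version of this head-matrix analysis can be reproduced with off-the-shelf tooling. The sketch below loads GPT-2 through Hugging Face transformers and estimates the mean squared overlap of normalized head rows on a random subsample of tokens (subsampling is a memory-saving choice made here, not part of the paper's procedure), so the number only approximates the full-vocabulary statistic.

```python
# Sketch: mean squared overlap between normalized rows of an LLM head matrix,
# estimated on a random subsample of token rows (GPT-2 shown as an example).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
W = model.lm_head.weight.detach()            # shape: (vocab_size, hidden dim m)
m = W.shape[1]

idx = torch.randperm(W.shape[0])[:4000]      # subsample tokens to keep the Gram matrix small
V = W[idx]
V = V / V.norm(dim=1, keepdim=True)          # normalize each token vector

G = V @ V.T                                  # pairwise overlaps
mask = ~torch.eye(len(idx), dtype=torch.bool)
mean_sq = (G[mask] ** 2).mean().item()

print(f"hidden dim m={m}, mean squared overlap={mean_sq:.5f}, 1/m={1/m:.5f}")
```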

From an implementation perspective, the paper highlights that the efficiency of representation learning, specifically how well models can represent numerous features in a limited-dimensional space, is critical for scaling. The observation that weight decay can tune superposition in the toy model suggests that optimizer choices and regularization techniques could influence representation learning in real LLMs. The authors propose that techniques promoting strong superposition, such as normalizing hidden states or weights (similar to the unit-norm constraint explored with negative weight decay in the toy model or architectures like nGPT), could lead to more efficient models or training strategies. They also note that current LLM architectures might be approaching the limits of efficiency gains from merely increasing width/model dimension, suggesting a potential shift towards depth scaling in larger models.
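To make the weight-normalization idea concrete, here is a minimal sketch of a post-step hook that projects each row of a head matrix back onto the unit sphere; this is an illustrative pattern in the spirit of the unit-norm constraint discussed above, not the paper's or nGPT's exact procedure.

```python
# Sketch: keep each feature/token vector (row of a head matrix) at unit norm
# after every optimizer step, in the spirit of the norm-constraint idea above.
import torch

def renormalize_rows_(weight: torch.Tensor, eps: float = 1e-8) -> None:
    """In-place projection of each row of `weight` back onto the unit sphere."""
    with torch.no_grad():
        weight /= weight.norm(dim=1, keepdim=True).clamp_min(eps)

# Usage inside a training loop (head.weight has shape (n_features, m)):
#   loss.backward()
#   optimizer.step()
#   renormalize_rows_(head.weight)
```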

Practical considerations for applying these insights include understanding the trade-offs between different model sizes and architectures based on whether they are representation-limited (addressed by width and superposition) or parsing-limited (potentially addressed by depth). Computational resources required for training models exhibiting strong superposition might benefit from optimizers designed to manage large numbers of overlapping feature representations efficiently.

The paper acknowledges limitations, including the simplicity of the toy model compared to complex LLMs and the lack of a rigorous analytical solution for the toy model behavior in all regimes. Future work could explore the parsing-limited scaling regime, the specific mechanisms in LLMs that lead to strong superposition without explicit norm constraints, and how superposition affects emergent abilities beyond just minimizing pre-training loss.