- The paper establishes nearly tight bounds for k-sparse representations: linear decoding is achievable with d = O(k² log m) dimensions and requires at least d = Ω((k²/log k) log(m/k)).
- The methodology leverages classical compressed sensing, random matrix theory, and combinatorial graph analysis to model feature superposition within neural activations.
- The results imply that with sparse activations, neural models can store exponentially many features, impacting interpretability and guiding future architectural designs.
Quantitative Limits on Linear Feature Storage in LLMs under the Linear Representation Hypothesis
Introduction and Motivation
This paper develops a rigorous mathematical characterization of the Linear Representation Hypothesis (LRH) in the context of LLMs. The LRH, which has informed much empirical analysis and interpretability work in deep learning, posits that intermediate activations in neural networks store diverse features in a linearly structured fashion. The hypothesis is often taken to mean that features are not only embedded linearly within neuron activations (“linear representation”) but can also be efficiently extracted with linear probes (“linear accessibility”). Despite its widespread empirical and conceptual use, the theoretical capacity implied by the LRH, particularly how many features can be stored and accessed with fixed resources, remains unclear.
The central contribution of this paper is to formalize and sharply delineate the distinction between linear representation and linear accessibility for feature storage, and to derive tight upper and lower bounds on how many features can be stored in a d-dimensional neural activation under different decoding strategies. The work leverages and extends results from classical compressed sensing, culminating in new theoretical tools that define the expressiveness (and constraints) of superposition in neural networks.
Mathematical Framework
Activations, Features, and Probes
The formulation maps an input ℓ from some language L to a d-dimensional activation vector f(ℓ) ∈ ℝ^d, with “features” being arbitrary functions z_i : L → ℝ. Linear representation entails that there exists a matrix A ∈ ℝ^{d×m} (with m features) such that f(ℓ) = A z(ℓ), where z(ℓ) ∈ ℝ^m is the vector of feature values for a given input. Linear accessibility requires that each feature can be approximately recovered from activations by a linear probe b_i ∈ ℝ^d: |⟨b_i, f(ℓ)⟩ − z_i(ℓ)| < ε for all inputs ℓ in some set S.
A key abstraction is the notion of k-sparsity: on any given input, at most k features are active (nonzero). This matches linguistic intuition and is critical in the resulting bounds.
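The definitions above can be illustrated with a small numerical sketch. It is not taken from the paper: the random sign columns, the choice of probes b_i = a_i, the specific sizes, and all function names are assumptions made for illustration, and only the Python standard library is used to keep it dependency-free.

```python
import math
import random

def random_unit_columns(d, m, rng):
    """Return m random sign vectors in R^d, each scaled to unit norm."""
    scale = 1.0 / math.sqrt(d)
    return [[rng.choice((-scale, scale)) for _ in range(d)] for _ in range(m)]

def represent(columns, z):
    """Compute f = A z, where columns[i] is column i of A and
    z is a dict {feature index: value} holding only the active features."""
    d = len(columns[0])
    f = [0.0] * d
    for i, value in z.items():
        col = columns[i]
        for j in range(d):
            f[j] += value * col[j]
    return f

def probe(b, f):
    """Linear probe: the inner product <b, f>."""
    return sum(bj * fj for bj, fj in zip(b, f))

rng = random.Random(0)
d, m, k = 256, 400, 3            # illustrative sizes with m > d (superposition)
A_cols = random_unit_columns(d, m, rng)
B_cols = A_cols                  # assumption: probe b_i equals feature direction a_i

active = rng.sample(range(m), k)  # k-sparse input: only k features are nonzero
z = {i: 1.0 for i in active}
f = represent(A_cols, z)

readout = [probe(B_cols[i], f) for i in range(m)]
max_error = max(abs(readout[i] - z.get(i, 0.0)) for i in range(m))
print(f"largest probe error over all {m} features: {max_error:.3f}")
```

The printed value is the ℓ∞ reconstruction error that the accessibility condition asks to keep below ε. Increasing k or shrinking d in this sketch should make the error grow, which is the trade-off the bounds quantify.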
Accessibility Paradigms
Two paradigms are defined:
- General (Nonlinear) Accessibility: Features are linearly embedded, but recovery can use arbitrary (nonlinear) procedures.
- Linear Accessibility: Both embedding and decoding are linear.
The main analytical question is: for fixed sparsity k and feature count m, what is the minimum d such that all k-sparse m-dimensional inputs can be faithfully represented and recovered?
Theoretical Results
Classical Compressed Sensing (Allowing Nonlinear Decoding)
Under general accessibility, classical compressed sensing results hold: a random A with d = O(k log(m/k)) suffices for exact recovery of all k-sparse signals using nonlinear decoders, e.g., ℓ1 minimization. This underpins much of the empirical superposition hypothesis and is close to information-theoretic optimality. However, this regime does not align with the structural constraint that only linear probes, i.e., linear maps of the form B^T f(ℓ), are permitted in neural architectures.
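For reference, the ℓ1 decoder mentioned above is commonly written as the basis pursuit program below. This is the standard textbook formulation rather than a formula quoted from the paper, and the symbols \hat{z} and z' are introduced here only for illustration.

```latex
\hat{z} \;=\; \arg\min_{z' \in \mathbb{R}^m} \; \|z'\|_1
\quad \text{subject to} \quad A z' = f(\ell)
```

The map from f(ℓ) to \hat{z} is the solution of an optimization problem, so it is not a linear function of the activation. That is why this regime falls under general rather than linear accessibility.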
Linear Compressed Sensing (Linear Decoding Constraint)
The novel contribution concerns linear accessibility:
- Upper Bound: d = O_ε(k² log m). There exists a (random) A and B such that any k-sparse vector z can be approximately reconstructed by B^T A z within ε in ℓ∞ norm.
- Lower Bound: d = Ω_ε((k²/log k) log(m/k)). For any such A and B achieving the required recovery for all k-sparse z, this lower bound is necessary.
Thus, compared to classical compressed sensing (linear in k), linear accessibility introduces a substantial quantitative penalty: d must grow quadratically in k, up to logarithmic factors. This defines the strict representational bottleneck for linear feature superposition in practical LLMs, as posed by the LRH.
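To get a feel for the gap between the regimes, the three scaling laws can be evaluated side by side. This is a rough sketch only: the hidden constants and the ε-dependence are set to 1, so the numbers indicate growth rates and not actual dimensions, and the function names are made up for this example.

```python
import math

def classical_dim(k, m):
    """k log(m/k): nonlinear decoding (classical compressed sensing)."""
    return k * math.log(m / k)

def linear_lower_dim(k, m):
    """(k^2 / log k) log(m/k): lower bound for linear decoding (needs k >= 2)."""
    return (k ** 2 / math.log(k)) * math.log(m / k)

def linear_upper_dim(k, m):
    """k^2 log m: upper bound for linear decoding."""
    return k ** 2 * math.log(m)

m = 1_000_000  # illustrative feature count
rows = [
    (k, classical_dim(k, m), linear_lower_dim(k, m), linear_upper_dim(k, m))
    for k in (4, 16, 64)
]
for k, classical, lower, upper in rows:
    print(f"k={k:3d}  classical~{classical:9.0f}  "
          f"linear lower~{lower:9.0f}  linear upper~{upper:9.0f}")
```

The ratio between the linear-decoding columns and the classical column grows roughly like k/log k to k, which is the penalty described above.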
Implications for Feature Geometry
An important sub-result is that these bounds do not force feature directions (representation or probes) to be orthogonal; in fact, highly correlated probe and representation vectors are permissible, provided cross-interference is controlled. However, imposing unit-norm constraints on representations and probes enforces near-orthogonality between features, consistent with standard geometric intuition from neural embedding analyses.
Robustness to Nonlinear Activation and Bias in Probes
The paper generalizes the lower bound results to settings where the probing operation is of the form g(x) = σ(W^T x + b), with σ monotonic (e.g., ReLU) and b a bias. It is shown that this does not increase the number of retrievable features beyond the established linear bounds, confirming that the quadratic-in-k scaling is fundamental even for standard neural activation functions.
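A probe of this form can be written in a few lines. The sketch below handles a single probe direction w (one column of W); the function names and the toy vectors are hypothetical and serve only to show the shape of the computation.

```python
def relu(t):
    """ReLU, one example of a monotonic sigma."""
    return max(0.0, t)

def nonlinear_probe(w, b, x, sigma=relu):
    """Evaluate g(x) = sigma(<w, x> + b) for one probe direction w."""
    return sigma(sum(wj * xj for wj, xj in zip(w, x)) + b)

x = [0.5, -1.0, 2.0]   # toy activation vector
w = [1.0, 0.0, 0.5]    # toy probe direction

above = nonlinear_probe(w, -0.25, x)  # pre-activation 1.5 - 0.25 is positive
below = nonlinear_probe(w, -2.0, x)   # pre-activation 1.5 - 2.0 is negative
print(above, below)
```

A negative bias lets the probe suppress small pre-activations, which might seem to help against interference; the result summarized above says this does not change the quadratic-in-k scaling.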
Proof Techniques
The upper bound proof leverages random matrix theory and the construction of incoherent matrices, in which every column is nearly orthogonal to every other, up to a parameter tunable with d. The total interference under k-sparsity is then bounded via the union bound.
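The following is the standard incoherence calculation that produces a k² log m rate. It is offered as a sketch of why the rate arises, not as the paper's exact proof, and it rests on assumptions introduced here: unit-norm columns a_i of A, probes b_i = a_i, and feature values bounded by |z_j| ≤ 1.

```latex
\begin{align*}
\mu &= \max_{i \neq j} \bigl|\langle a_i, a_j \rangle\bigr|
  && \text{(coherence of } A\text{)} \\
\langle a_i, A z \rangle - z_i
  &= \sum_{j \in \operatorname{supp}(z),\; j \neq i} z_j \,\langle a_i, a_j \rangle
  && \text{(interference on probe } i\text{)} \\
\bigl|\langle a_i, A z \rangle - z_i\bigr|
  &\le k \mu
  && \text{(at most } k \text{ active terms)} \\
\mu &= O\!\left(\sqrt{\frac{\log m}{d}}\right)
  && \text{(random unit vectors, union bound over pairs)} \\
k \mu \le \epsilon
  &\quad\text{once}\quad d = O\!\left(\frac{k^{2} \log m}{\epsilon^{2}}\right)
\end{align*}
```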
The lower bound combines results about the rank of near-identity matrices (Alon et al.) with Turán's theorem from extremal graph theory. The combinatorial structure among probe and feature directions is encoded as a graph whose edges represent significant interference; the existence of cliques or independent sets in this graph bounds the required rank and thus the required embedding dimension.
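For orientation, the two ingredients are stated below in their commonly cited forms; the exact variants and constants used in the paper may differ, and how the paper combines them is not reproduced here. The only link asserted is the elementary one: the m × m matrix B^T A has rank at most d.

```latex
% Turan's theorem, independent-set form:
% a graph G on n vertices with average degree \bar{d} satisfies
\alpha(G) \;\ge\; \frac{n}{\bar{d} + 1}

% Rank of near-identity matrices (Alon):
% if M \in \mathbb{R}^{n \times n} has M_{ii} = 1 and |M_{ij}| \le \epsilon
% for i \ne j, with 1/\sqrt{n} \le \epsilon < 1/2, then
\operatorname{rank}(M) \;=\; \Omega\!\left( \frac{\log n}{\epsilon^{2} \log(1/\epsilon)} \right)

% Elementary link to the embedding dimension:
\operatorname{rank}\!\left(B^{\top} A\right) \;\le\; d
```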
Implications and Future Directions
The theoretical findings have multiple critical implications:
- Exponential Storage: Even with linear accessibility, it is possible—with sufficient sparsity—for a neural layer to store exponentially many features (m) relative to its width (d).
- Expressive Power and Bottlenecks: There is a sharp difference between representational and accessible capacity; empirical findings using probes or autoencoders should be interpreted in light of these constraints.
- Interpretability and Causality: The results provide rigorous support for empirical phenomena around linear probes and sparse autoencoders as tools for interpreting network behavior, but caution that not all linearly represented content is necessarily linearly accessible.
- Architectural Considerations: Because a subsequent layer can realize only a number of probes that is linear in its width, network depth and feature utilization across layers remain areas ripe for mathematical exploration.
- Extensions: The paper points to further formalization of alternate representation hypotheses, exploration of nonlinear representation with linear accessibility, and the compositionality of feature extraction across many layers.
Conclusion
By precisely quantifying the capacity for linearly retrieving features stored via superposition in neural network activations, this work provides foundational upper and lower bounds that anchor the linear representation hypothesis within a rigorous mathematical landscape. The distinction between linear representation and accessibility is shown to be quantitatively fundamental, affecting both theoretical interpretation and practical exploitability of neural activations for probing and manipulation. These results enable more principled reasoning about the expressive and bottleneck properties of deep networks and motivate further study of how these constraints propagate and interact in multilayer architectures.
Reference: "How Many Features Can a LLM Store Under the Linear Representation Hypothesis?" (2602.11246)