How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?

Published 11 Feb 2026 in cs.LG, cs.AI, cs.CL, cs.IT, and math.CO | (2602.11246v1)

Abstract: We introduce a mathematical framework for the linear representation hypothesis (LRH), which asserts that intermediate layers of LLMs store features linearly. We separate the hypothesis into two claims: linear representation (features are linearly embedded in neuron activations) and linear accessibility (features can be linearly decoded). We then ask: How many neurons $d$ suffice to both linearly represent and linearly access $m$ features? Classical results in compressed sensing imply that for $k$-sparse inputs, $d = O(k\log (m/k))$ suffices if we allow non-linear decoding algorithms (Candes and Tao, 2006; Candes et al., 2006; Donoho, 2006). However, the additional requirement of linear decoding takes the problem out of the classical compressed sensing, into linear compressed sensing. Our main theoretical result establishes nearly-matching upper and lower bounds for linear compressed sensing. We prove that $d = \Omega_\epsilon(\frac{k^2}{\log k}\log (m/k))$ is required while $d = O_\epsilon(k^2\log m)$ suffices. The lower bound establishes a quantitative gap between classical and linear compressed setting, illustrating how linear accessibility is a meaningfully stronger hypothesis than linear representation alone. The upper bound confirms that neurons can store an exponential number of features under the LRH, giving theoretical evidence for the "superposition hypothesis" (Elhage et al., 2022). The upper bound proof uses standard random constructions of matrices with approximately orthogonal columns. The lower bound proof uses rank bounds for near-identity matrices (Alon, 2003) together with Turán's theorem (bounding the number of edges in clique-free graphs). We also show how our results do and do not constrain the geometry of feature representations and extend our results to allow decoders with an activation function and bias.


Authors (3)

Summary

The paper establishes nearly matching bounds for linear decoding of k-sparse representations: d = O(k² log m) neurons suffice, and d = Ω((k²/log k) log(m/k)) are necessary.
The methodology leverages classical compressed sensing, random matrix theory, and combinatorial graph analysis to model feature superposition within neural activations.
The results imply that with sparse activations, neural models can store exponentially many features, impacting interpretability and guiding future architectural designs.

Quantitative Limits on Linear Feature Storage in LLMs under the Linear Representation Hypothesis

Introduction and Motivation

This paper develops a rigorous mathematical characterization of the Linear Representation Hypothesis (LRH) in the context of LLMs. The LRH, which has informed much empirical analysis and interpretability work in deep learning, posits that intermediate activations in neural networks store diverse features in a linearly structured fashion. The hypothesis is often taken to mean that features are not only embedded linearly within neuron activations (“linear representation”) but can also be extracted with linear probes (“linear accessibility”). Despite its widespread empirical and conceptual use, the theoretical capacity implied by the LRH, in particular how many features can be stored and accessed with a fixed number of neurons, has remained unclear.

The central contribution of this paper is to formalize the distinction between linear representation and linear accessibility for feature storage, and to derive nearly matching upper and lower bounds on how many features can be stored in a $d$-dimensional neural activation under different decoding strategies. The work builds on and extends results from classical compressed sensing, yielding bounds that quantify both the expressiveness and the limits of superposition in neural networks.

Mathematical Framework

Activations, Features, and Probes

The formulation maps an input from some language $L$ to a $d$-dimensional vector $f(\ell)$, with “features” being arbitrary functions $z_i: L \to \mathbb{R}$. Linear representation entails that there exists a matrix $A \in \mathbb{R}^{d \times m}$ (with $m$ features) such that $f(\ell) = Az(\ell)$, where $z(\ell)$ is the vector of feature values for a given input. Linear accessibility requires that each feature can be approximately recovered from activations by a linear probe: $|\langle b_i, f(\ell)\rangle - z_i(\ell)| < \epsilon$ for all inputs $\ell$ in some set $S$.

A key abstraction is the notion of $k$-sparsity: on any given input, at most $k$ features are active (nonzero). This matches linguistic intuition and is critical in the resulting bounds.
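The definitions above can be made concrete with a minimal NumPy sketch. The helper names `max_probe_error` and `is_k_sparse`, and the toy matrices, are illustrative assumptions rather than anything from the paper; the columns of `B` play the role of the probes $b_i$.

```python
import numpy as np

def max_probe_error(A, B, Z):
    """Largest |<b_i, A z> - z_i| over all features i and all columns z of Z."""
    F = A @ Z            # activations f = A z, one column per input
    Z_hat = B.T @ F      # every linear probe applied to every activation
    return float(np.max(np.abs(Z_hat - Z)))

def is_k_sparse(Z, k):
    """True if every column of Z has at most k nonzero entries."""
    return bool(np.all(np.count_nonzero(Z, axis=0) <= k))

# Toy check (hypothetical data): with d = m and A = B = I, recovery is exact.
m = 5
A = np.eye(m)
B = np.eye(m)
Z = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, -2.0],
              [0.0, 0.0],
              [0.5, 0.0]])
toy_error = max_probe_error(A, B, Z)
toy_sparse = is_k_sparse(Z, k=2)
```

Linear accessibility on a set of inputs then amounts to `max_probe_error(A, B, Z) < eps` for the matrix `Z` whose columns are the feature vectors of those inputs. The interesting regime is $d \ll m$, where `A` cannot be the identity.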

Accessibility Paradigms

Two paradigms are defined:

General (Nonlinear) Accessibility: Features are linearly embedded, but recovery can use arbitrary (nonlinear) procedures.
Linear Accessibility: Both embedding and decoding are linear.

The main analytical question is: for fixed sparsity $k$ and feature count $m$, what is the minimum $d$ such that all $k$-sparse $m$-dimensional inputs can be faithfully represented and recovered?

Theoretical Results

Classical Compressed Sensing (Allowing Nonlinear Decoding)

Under general accessibility, classical compressed sensing results hold: a random $A$ with $d = O(k \log (m/k))$ suffices for exact recovery of all $k$-sparse signals using nonlinear decoders, e.g., $\ell_1$ minimization. This underpins much of the superposition hypothesis and is close to information-theoretic optimality. However, this regime does not respect the structural constraint that only linear probes, i.e., maps of the form $B^T f(\ell)$, are permitted in neural architectures.
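To illustrate what a nonlinear decoder buys, here is a rough sketch of sparse recovery from $y = Az$. It swaps in greedy orthogonal matching pursuit for the $\ell_1$ minimization mentioned above, only because it needs nothing beyond NumPy. The dimensions, seed, and $\pm 1$ feature values are arbitrary assumptions, and recovery for a Gaussian $A$ holds with high probability rather than with certainty.

```python
import numpy as np

def omp(A, y, k):
    """Greedy orthogonal matching pursuit: estimate a k-sparse z from y = A z."""
    m = A.shape[1]
    support = []
    residual = y.copy()
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        scores = np.abs(A.T @ residual)
        if support:
            scores[support] = -np.inf   # never pick the same column twice
        support.append(int(np.argmax(scores)))
        # Least-squares fit on the chosen columns, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    z_hat = np.zeros(m)
    z_hat[support] = coef
    return z_hat

rng = np.random.default_rng(0)
d, m, k = 128, 512, 4                       # assumed toy sizes, d << m
A = rng.standard_normal((d, m)) / np.sqrt(d)
z = np.zeros(m)
true_support = rng.choice(m, size=k, replace=False)
z[true_support] = rng.choice([-1.0, 1.0], size=k)

z_hat = omp(A, A @ z, k)
recovery_error = float(np.max(np.abs(z_hat - z)))
```

The decoder is nonlinear because the selected support depends on the data through `argmax`; no single fixed matrix $B^T$ reproduces this behavior for all $k$-sparse inputs.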

Linear Compressed Sensing (Linear Decoding Constraint)

The novel contribution concerns linear accessibility:

Upper Bound: $d = O_\epsilon(k^2 \log m)$. There exists a (random) $A$ and $B$ such that any $k$-sparse vector $z$ can be approximately reconstructed by $B^T A z$ within $\epsilon$ in $\ell_\infty$ norm.
Lower Bound: $d = \Omega_\epsilon\left(\frac{k^2}{\log k} \log \frac{m}{k}\right)$. For any such $A$ and $B$ achieving the required recovery for all $k$-sparse $z$, this lower bound is necessary.

Thus, compared to classical compressed sensing, where $d$ scales linearly in $k$ up to a logarithmic factor, linear accessibility introduces a substantial quantitative penalty: $d$ must grow nearly quadratically in $k$, at least as $k^2/\log k$. This is the representational bottleneck that the LRH imposes on linear feature superposition in LLMs.
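To get a feel for the gap, the leading-order terms of the three bounds can be compared numerically. This is only a rough sketch: all hidden constants and the dependence on $\epsilon$ are set to 1 by assumption, so the numbers indicate scaling, not actual neuron counts. The function name and the example values of $m$ and $k$ are hypothetical.

```python
import math

def dimension_scalings(m, k):
    """Leading-order terms of the three bounds, with constants set to 1 (requires k >= 2)."""
    return {
        "classical_upper": k * math.log(m / k),                    # nonlinear decoding
        "linear_lower": (k**2 / math.log(k)) * math.log(m / k),    # linear decoding, necessary
        "linear_upper": k**2 * math.log(m),                        # linear decoding, sufficient
    }

scalings = dimension_scalings(m=10**6, k=10)
```

For these example values the linear-decoding terms sit well above the classical one, and the two linear-decoding terms differ by roughly a $\log k$ factor, matching the "nearly matching" description of the bounds.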

Implications for Feature Geometry

An important sub-result is that these bounds do not force feature directions (representation or probes) to be orthogonal, and in fact, highly correlated probe and representation vectors are permissible, provided cross-interference is controlled. However, imposing normalized (unit-norm) constraints on representations and probes enforces near-orthogonality between features, consistent with standard geometric intuition from neural embedding analyses.
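A toy example may help show why exact linear recovery does not by itself force orthogonality. The construction below is an illustrative assumption, not the paper's; it uses $d = 3 > m = 2$, so it involves no superposition. Two representation vectors share a large common component, yet probes that ignore that component recover both features exactly.

```python
import numpy as np

c = 10.0  # size of the shared component (assumed); larger c means more correlated columns

# Representation vectors (columns of A): two features sharing the direction e3.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [c,   c]])

# Probes (columns of B) read only the first two coordinates and ignore e3.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

# Cosine similarity between the two representation vectors: c^2 / (1 + c^2).
cosine = float(A[:, 0] @ A[:, 1] / (np.linalg.norm(A[:, 0]) * np.linalg.norm(A[:, 1])))

# B^T A is the identity, so B^T (A z) = z for every z.
recovery = B.T @ A
```

Here the representation vectors are far from unit norm, which is what the normalization constraint discussed above rules out.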

Robustness to Nonlinear Activation and Bias in Probes

The paper generalizes the lower bound results to settings where the probing operation is of the form $g(x) = \sigma(W^T x + b)$ with $\sigma$ monotonic (e.g., ReLU) and $b$ a bias. It is shown that this does not increase the number of retrievable features beyond the established linear bounds, confirming that the quadratic-in-$k$ scaling is fundamental even for standard neural activation functions.

Proof Techniques

The upper bound proof uses random matrix theory and the construction of incoherent matrices, in which every column is close to orthogonal to every other column, up to a coherence parameter that shrinks as $d$ grows. The total interference under $k$-sparsity is then bounded via the union bound.
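The incoherence idea can be sketched numerically. The code below assumes one standard random construction (normalized Gaussian columns, with the probes taken equal to the representation vectors, $B = A$); the sizes, seed, and the restriction of feature values to $[-1, 1]$ are illustrative choices, not the paper's parameters. It checks the elementary fact that the probe error on a $k$-sparse input is at most $k$ times the coherence.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 256, 1024, 3          # assumed toy sizes, with m > d

# Random unit-norm columns; the same matrix serves as representation and probes.
A = rng.standard_normal((d, m))
A /= np.linalg.norm(A, axis=0, keepdims=True)
B = A

# Coherence: largest |<b_i, a_j>| over distinct i, j (the diagonal of G is 1).
G = B.T @ A
mu = float(np.max(np.abs(G - np.eye(m))))

# Worst l_infinity probe error over random k-sparse inputs with entries in [-1, 1].
worst = 0.0
for _ in range(200):
    z = np.zeros(m)
    support = rng.choice(m, size=k, replace=False)
    z[support] = rng.uniform(-1.0, 1.0, size=k)
    worst = max(worst, float(np.max(np.abs(G @ z - z))))
```

Each entry of `G @ z - z` is a sum of at most `k` off-diagonal entries of `G` weighted by values in $[-1, 1]$, so `worst <= k * mu` always holds. Driving the error below $\epsilon$ therefore requires coherence of about $\epsilon / k$, which is where the $k^2$ in the upper bound comes from under this construction.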

The lower bound uses results on the rank of near-identity matrices (Alon, 2003) together with Turán's theorem from extremal graph theory. By examining the combinatorial structure among probe and feature directions (encoded as a graph where edges represent significant interference), the existence of cliques/independent sets bounds the required rank and thus the required embedding dimension.
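The flavor of a rank bound for near-identity matrices can be seen in its most elementary form: for a nonzero symmetric real matrix $M$, $\operatorname{rank}(M) \ge \operatorname{tr}(M)^2 / \|M\|_F^2$. The sketch below checks this on a Gram matrix of random unit vectors. It is an assumption-laden illustration only: the bounds cited in the paper are stronger, and the matrix $B^T A$ arising there need not be symmetric. The helper name and sizes are hypothetical.

```python
import numpy as np

def rank_lower_bound(M):
    """Elementary bound rank(M) >= tr(M)^2 / ||M||_F^2 for a nonzero symmetric real M."""
    return float(np.trace(M) ** 2 / np.sum(M * M))

rng = np.random.default_rng(1)
d, m = 32, 200                               # assumed toy sizes
A = rng.standard_normal((d, m))
A /= np.linalg.norm(A, axis=0, keepdims=True)

G = A.T @ A                                  # unit diagonal, small off-diagonal, rank <= d
bound = rank_lower_bound(G)
rank = int(np.linalg.matrix_rank(G))
```

Since `G` has rank at most `d`, the inequality `bound <= d` turns small off-diagonal entries (small Frobenius norm) into a lower bound on the dimension, which is the basic mechanism a rank argument exploits.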

Implications and Future Directions

The theoretical findings have multiple critical implications:

Exponential Storage: Even with linear accessibility, it is possible, with sufficient sparsity, for a neural layer to store exponentially many features ($m$) relative to its width ($d$).
Expressive Power and Bottlenecks: There is a sharp difference between representational and accessible capacity; empirical findings using probes or autoencoders should be interpreted in light of these constraints.
Interpretability and Causality: The results provide rigorous support for empirical phenomena around linear probes and sparse autoencoders as tools for interpreting network behavior, but caution that not all linearly represented content is necessarily linearly accessible.
Architectural Considerations: As subsequent layers can only realize a linear number of probes per layer, network depth and feature utilization across layers remain areas ripe for mathematical exploration.
Extensions: The paper points to further formalization of alternate representation hypotheses, exploration of nonlinear representation with linear accessibility, and the compositionality of feature extraction across many layers.

Conclusion

By precisely quantifying the capacity for linearly retrieving features stored via superposition in neural network activations, this work provides foundational upper and lower bounds that anchor the linear representation hypothesis within a rigorous mathematical landscape. The distinction between linear representation and accessibility is shown to be quantitatively fundamental, affecting both theoretical interpretation and practical exploitability of neural activations for probing and manipulation. These results enable more principled reasoning about the expressive and bottleneck properties of deep networks and motivate further study of how these constraints propagate and interact in multilayer architectures.

Reference: "How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?" (2602.11246)
