How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?

Published 11 Feb 2026 in cs.LG, cs.AI, cs.CL, cs.IT, and math.CO | (2602.11246v1)

Abstract: We introduce a mathematical framework for the linear representation hypothesis (LRH), which asserts that intermediate layers of LLMs store features linearly. We separate the hypothesis into two claims: linear representation (features are linearly embedded in neuron activations) and linear accessibility (features can be linearly decoded). We then ask: How many neurons $d$ suffice to both linearly represent and linearly access $m$ features? Classical results in compressed sensing imply that for $k$-sparse inputs, $d = O(k\log (m/k))$ suffices if we allow non-linear decoding algorithms (Candes and Tao, 2006; Candes et al., 2006; Donoho, 2006). However, the additional requirement of linear decoding takes the problem out of the classical compressed sensing, into linear compressed sensing. Our main theoretical result establishes nearly-matching upper and lower bounds for linear compressed sensing. We prove that $d = \Omega_\epsilon(\frac{k^2}{\log k}\log (m/k))$ is required while $d = O_\epsilon(k^2\log m)$ suffices. The lower bound establishes a quantitative gap between classical and linear compressed setting, illustrating how linear accessibility is a meaningfully stronger hypothesis than linear representation alone. The upper bound confirms that neurons can store an exponential number of features under the LRH, giving theoretical evidence for the "superposition hypothesis" (Elhage et al., 2022). The upper bound proof uses standard random constructions of matrices with approximately orthogonal columns. The lower bound proof uses rank bounds for near-identity matrices (Alon, 2003) together with Turán's theorem (bounding the number of edges in clique-free graphs). We also show how our results do and do not constrain the geometry of feature representations and extend our results to allow decoders with an activation function and bias.

Summary

  • The paper establishes nearly matching bounds for linear decoding of k-sparse feature vectors: d = O(k² log m) neurons suffice, while d = Ω((k²/log k) log(m/k)) neurons are required.
  • The methodology leverages classical compressed sensing, random matrix theory, and combinatorial graph analysis to model feature superposition within neural activations.
  • The results imply that with sparse activations, neural models can store exponentially many features, impacting interpretability and guiding future architectural designs.

Quantitative Limits on Linear Feature Storage in LLMs under the Linear Representation Hypothesis

Introduction and Motivation

This paper develops a rigorous mathematical characterization of the Linear Representation Hypothesis (LRH) in the context of neural LLMs. The LRH, which has informed much empirical analysis and interpretability work in deep learning, posits that intermediate activations in neural networks store diverse features in a linearly structured fashion. The hypothesis is often taken to mean that features are not only embedded linearly within neuron activations (“linear representation”) but can also be efficiently extracted with linear probes (“linear accessibility”). Despite widespread empirical and conceptual use, the theoretical capacity implied by the LRH, in particular how many features can be maintained and accessed with a fixed number of neurons, has not been pinned down.

The central contribution of this paper is to formalize the distinction between linear representation and linear accessibility for feature storage, and to derive nearly matching upper and lower bounds on how many features can be stored in a $d$-dimensional neural activation under different decoding strategies. The work builds on and extends results from classical compressed sensing, yielding new theoretical tools that characterize the expressiveness (and the constraints) of superposition in neural networks.

Mathematical Framework

Activations, Features, and Probes

The formulation maps an input $\ell$ from some language $L$ to a $d$-dimensional vector $f(\ell)$, with “features” being arbitrary functions $z_i: L \to \mathbb{R}$. Linear representation entails that there exists a matrix $A \in \mathbb{R}^{d \times m}$ (with $m$ features) such that $f(\ell) = Az(\ell)$, where $z(\ell)$ is the vector of feature values for a given input. Linear accessibility requires that each feature can be approximately recovered from activations by a linear probe: $|\langle b_i, f(\ell)\rangle - z_i(\ell)| < \epsilon$ for all inputs $\ell$ in some set $S$.

A key abstraction is the notion of $k$-sparsity: on any given input, at most $k$ features are active (nonzero). This matches linguistic intuition and is critical in the resulting bounds.
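The definitions above can be made concrete with a small numerical sketch. It is illustrative only and not taken from the paper: it assumes random unit-norm columns for $A$, the simple probe choice $b_i = a_i$, and active features with value 1. The helper names (`random_unit_columns`, `embed`, `probe`, `max_probe_error`) and the sizes are hypothetical.

```python
import math
import random

def random_unit_columns(d, m, rng):
    """Return m random unit vectors in R^d, playing the role of the columns of A."""
    cols = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        cols.append([x / norm for x in v])
    return cols

def embed(cols, z_sparse):
    """Linear representation f = A z, with sparse z given as {feature index: value}."""
    f = [0.0] * len(cols[0])
    for j, zj in z_sparse.items():
        for t, a_jt in enumerate(cols[j]):
            f[t] += zj * a_jt
    return f

def probe(b, f):
    """Linear probe <b, f>."""
    return sum(x * y for x, y in zip(b, f))

def max_probe_error(cols, z_sparse):
    """Largest |<b_i, f> - z_i| over all features, assuming b_i = a_i."""
    f = embed(cols, z_sparse)
    return max(abs(probe(cols[i], f) - z_sparse.get(i, 0.0))
               for i in range(len(cols)))

rng = random.Random(0)
d, m, k = 256, 400, 3              # more features than neurons (m > d)
A_cols = random_unit_columns(d, m, rng)
active = rng.sample(range(m), k)   # indices of the k active features
z = {j: 1.0 for j in active}       # k-sparse feature vector
err = max_probe_error(A_cols, z)
print(f"d={d}, m={m}, k={k}, max probe error={err:.3f}")
```

In this setup the error plays the role of $\epsilon$: it comes entirely from inner products between distinct columns, and it typically shrinks as $d$ grows or $k$ falls.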

Accessibility Paradigms

Two paradigms are defined:

  1. General (Nonlinear) Accessibility: Features are linearly embedded, but recovery can use arbitrary (nonlinear) procedures.
  2. Linear Accessibility: Both embedding and decoding are linear.

The main analytical question is: for fixed sparsity $k$ and feature count $m$, what is the minimum $d$ such that all $k$-sparse $m$-dimensional inputs can be faithfully represented and recovered?

Theoretical Results

Classical Compressed Sensing (Allowing Nonlinear Decoding)

Under general accessibility, classical compressed sensing results hold: a random $A$ with $d = O(k \log (m/k))$ suffices for exact recovery of all $k$-sparse signals using nonlinear decoders, e.g., $\ell_1$ minimization. This underpins much of the empirical superposition hypothesis and is close to information-theoretic optimality. However, this regime does not respect the structural constraint considered here, namely that features are read out only by linear probes, i.e. linear maps of the form $B^T f(\ell)$.

Linear Compressed Sensing (Linear Decoding Constraint)

The novel contribution concerns linear accessibility:

  • Upper Bound: $d = O_\epsilon(k^2 \log m)$. There exist a (random) $A$ and a $B$ such that any $k$-sparse vector $z$ can be approximately reconstructed by $B^T A z$ within $\epsilon$ in $\ell_\infty$ norm.
  • Lower Bound: $d = \Omega_\epsilon\left(\frac{k^2}{\log k} \log \frac{m}{k}\right)$. Any $A$ and $B$ that achieve the required recovery for all $k$-sparse $z$ must use at least this many dimensions.

Thus, compared to classical compressed sensing (linear in $k$), linear accessibility introduces a substantial quantitative penalty: $d$ must be quadratic in $k$, up to a $\log k$ factor. This is the representational bottleneck that the LRH places on linear feature superposition in LLMs.
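To get a feel for the size of this gap, the three scalings can be evaluated side by side. The sketch below is only arithmetic on the stated asymptotic forms: it assumes every hidden constant equals 1, ignores the $\epsilon$-dependence, and uses natural logarithms, so the numbers indicate relative growth rather than actual neuron counts. The function names are hypothetical.

```python
import math

def classical_cs_dim(k, m):
    """k log(m/k): classical compressed sensing scaling (nonlinear decoding)."""
    return k * math.log(m / k)

def linear_upper_dim(k, m):
    """k^2 log m: sufficient scaling under linear decoding."""
    return k**2 * math.log(m)

def linear_lower_dim(k, m):
    """(k^2 / log k) log(m/k): necessary scaling under linear decoding (k >= 2)."""
    return k**2 / math.log(k) * math.log(m / k)

def capacity_from_width(d, k):
    """Invert d = k^2 log m for m, i.e. m = exp(d / k^2), with the constant taken as 1."""
    return math.exp(d / k**2)

# Compare the scalings for a few (k, m) pairs; constants are assumed to be 1.
for k, m in [(8, 10**6), (32, 10**6), (32, 10**9)]:
    print(f"k={k:>3} m={m:.0e}  classical={classical_cs_dim(k, m):9.0f}  "
          f"linear lower={linear_lower_dim(k, m):9.0f}  "
          f"linear upper={linear_upper_dim(k, m):9.0f}")

# Number of features that the upper-bound scaling allows at width d=1024, k=8.
print(f"m at d=1024, k=8: {capacity_from_width(1024, 8):.3e}")
```

Inverting the upper bound, as in `capacity_from_width`, shows why the feature count can be exponential in the width: $m$ grows like $\exp(d/k^2)$ when $k$ is held fixed.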

Implications for Feature Geometry

An important sub-result is that these bounds do not force feature directions (representation or probes) to be orthogonal, and in fact, highly correlated probe and representation vectors are permissible, provided cross-interference is controlled. However, imposing normalized (unit-norm) constraints on representations and probes enforces near-orthogonality between features, consistent with standard geometric intuition from neural embedding analyses.

Robustness to Nonlinear Activation and Bias in Probes

The paper generalizes lower bound results to settings where the probing operation is of the form $g(x) = \sigma(W^T x + b)$ with $\sigma$ monotonic (e.g., ReLU) and $b$ a bias. It is shown that this does not increase the number of retrievable features beyond the established linear bounds, confirming that the quadratic-in-$k$ scaling is fundamental even for standard neural activation functions.

Proof Techniques

The upper bound proof uses random matrix theory and the construction of incoherent matrices, in which every column is close to orthogonal to every other, with the degree of incoherence controlled by $d$. The total interference under $k$-sparsity is then bounded via the union bound.
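The interference bound can be written out in a few lines. What follows is a sketch of the standard incoherence argument rather than the paper's exact proof. It assumes unit-norm columns $a_i$ of $A$, the choice $B = A$, feature values with $|z_j| \le 1$, and coherence $\mu = \max_{i \ne j} |\langle a_i, a_j\rangle|$. The $1/\epsilon^2$ dependence in the last comment is what this simple argument yields; the source states only $O_\epsilon$.

```latex
\begin{align*}
\langle a_i, Az\rangle - z_i
  &= \sum_{j \ne i} z_j \,\langle a_i, a_j\rangle
  && (\text{since } \langle a_i, a_i\rangle = 1),\\
\bigl|\langle a_i, Az\rangle - z_i\bigr|
  &\le \mu \sum_{j \ne i} |z_j| \;\le\; \mu k
  && (z \text{ is } k\text{-sparse},\ |z_j| \le 1).
\end{align*}
% For m independent random unit vectors in R^d, a union bound over the
% \binom{m}{2} pairs gives \mu = O(\sqrt{\log m / d}) with high probability,
% so \mu k \le \epsilon holds once d = O(k^2 \log m / \epsilon^2).
```

The quadratic dependence on $k$ enters because the per-pair interference $\mu$ must be pushed down to $\epsilon / k$, and $\mu$ decays only like $1/\sqrt{d}$.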

The lower bound uses results on the rank of near-identity matrices (Alon, 2003) together with Turán's theorem from extremal graph theory. The combinatorial structure among probe and feature directions is encoded as a graph whose edges represent significant interference; the existence of cliques or independent sets in this graph then bounds the required rank and thus the required embedding dimension.
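The starting point of such an argument can be sketched as follows. This is a hedged outline and not the paper's proof: it assumes the accessibility condition is applied to the standard basis vectors $e_j$ (which are $k$-sparse for every $k \ge 1$), and it quotes the two cited tools in their commonly stated forms. How the paper combines them to obtain the $k^2/\log k$ factor is not reproduced here.

```latex
% M = B^T A is an m x m matrix of rank at most d.
% Applying |<b_i, A z> - z_i| < \epsilon to z = e_j gives
\[
  |M_{ii} - 1| < \epsilon, \qquad |M_{ij}| < \epsilon \quad (i \ne j),
  \qquad \operatorname{rank}(M) \le d .
\]
% Rank bound for near-identity matrices (Alon, 2003): if an n x n real matrix N
% has N_{ii} = 1 and |N_{ij}| \le \delta for i \ne j, with
% 1/\sqrt{n} \le \delta < 1/2, then
\[
  \operatorname{rank}(N) = \Omega\!\left(\frac{\log n}{\delta^{2}\log(1/\delta)}\right).
\]
% Turan's theorem: a graph on n vertices with no clique of size r+1 has at most
\[
  \left(1 - \frac{1}{r}\right)\frac{n^{2}}{2} \quad \text{edges.}
\]
```

The first display already shows why rank matters: $M$ must look like the identity on an $m \times m$ scale while having rank at most $d$, which is only possible if $d$ is not too small relative to $\log m$.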

Implications and Future Directions

The theoretical findings have multiple critical implications:

  • Exponential Storage: Even with linear accessibility, a neural layer can, given sufficient sparsity, store a number of features $m$ that is exponential in its width $d$.
  • Expressive Power and Bottlenecks: There is a sharp difference between representational and accessible capacity; empirical findings using probes or autoencoders should be interpreted in light of these constraints.
  • Interpretability and Causality: The results provide rigorous support for empirical phenomena around linear probes and sparse autoencoders as tools for interpreting network behavior, but caution that not all linearly represented content is necessarily linearly accessible.
  • Architectural Considerations: As subsequent layers can only realize a linear number of probes per layer, network depth and feature utilization across layers remain areas ripe for mathematical exploration.
  • Extensions: The paper points to further formalization of alternate representation hypotheses, exploration of nonlinear representation with linear accessibility, and the compositionality of feature extraction across many layers.

Conclusion

By precisely quantifying the capacity for linearly retrieving features stored via superposition in neural network activations, this work provides foundational upper and lower bounds that anchor the linear representation hypothesis within a rigorous mathematical landscape. The distinction between linear representation and accessibility is shown to be quantitatively fundamental, affecting both theoretical interpretation and practical exploitability of neural activations for probing and manipulation. These results enable more principled reasoning about the expressive and bottleneck properties of deep networks and motivate further study of how these constraints propagate and interact in multilayer architectures.

Reference: "How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?" (2602.11246)
