Neural Network Superposition

Updated 22 February 2026

Superposition in neural networks is the encoding of many overlapping features within a lower-dimensional activation space, enabling lossy but efficient representation.
It boosts model expressivity by representing far more features than available neurons while creating challenges for clear interpretability and robust performance.
Entropy-based and sparse autoencoder techniques quantify and disentangle overlapping features, guiding improvements in model scaling, adversarial resilience, and interpretability.

Superposition in neural networks refers to the phenomenon where a network encodes many more features—interpretable directions or structured concepts—than there are physical neurons or activation space dimensions, by overlaying these features as non-orthogonal directions within shared representations. This lossy compression of information is a central mechanism by which deep models achieve high capacity, but it also introduces fundamental challenges in interpretability, robustness, and architectural scaling.

1. Mathematical Foundations and Formal Definitions

Superposition is formally defined as the regime where a set of $D \gg N$ interpretable features is encoded in an $N$-dimensional activation space, so that feature directions overlap and interfere. The canonical model posits that network activations $x \in \mathbb{R}^N$ are formed as linear combinations $x = W f + \epsilon$, where $f \in \mathbb{R}^D$ is a $K$-sparse latent variable, $W \in \mathbb{R}^{N \times D}$ is a mixing matrix, and $\epsilon$ denotes residual noise (Bereska et al., 15 Dec 2025, Longon et al., 3 Oct 2025). Because $D > N$, the columns of $W$ cannot all be mutually orthogonal, so multiple features necessarily share the same neuronal subspace and an interference-free encoding is infeasible.
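The linear model above can be made concrete with a small sketch. The setup is an assumption chosen for illustration, not taken from the cited works: $D = 3$ unit-norm features packed into $N = 2$ dimensions at 120-degree spacing, with the noise term set to zero by default. The names `encode` and `readout` are hypothetical helpers.

```python
import math

# Hypothetical toy setup: D = 3 features packed into N = 2 dimensions.
N, D = 2, 3

# Columns of W are three unit vectors spaced 120 degrees apart.
angles = [2 * math.pi * k / D for k in range(D)]
W = [[math.cos(a) for a in angles],   # weights into neuron 0
     [math.sin(a) for a in angles]]   # weights into neuron 1

def encode(f, noise=(0.0, 0.0)):
    """Return x = W f + eps for a length-D feature vector f."""
    return [sum(W[n][d] * f[d] for d in range(D)) + noise[n] for n in range(N)]

def readout(x):
    """Linear readout W^T x: one score per feature."""
    return [sum(W[n][d] * x[n] for n in range(N)) for d in range(D)]

f = [1.0, 0.0, 0.0]           # 1-sparse input: only feature 0 is active
x = encode(f)
scores = readout(x)
print([round(s, 3) for s in scores])   # [1.0, -0.5, -0.5]
```

Feature 0 is read back at full strength, but the two inactive features receive a nonzero score of -0.5. That leakage is the interference that any encoding with $D > N$ must tolerate.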

An information-theoretic metric for superposition is derived as follows:

Let $p$ denote the empirical usage distribution of sparse autoencoder (SAE) features, estimated as $p_i = \frac{1}{Z}\sum_{s}|z_{i,s}|$ over all samples $s$, where $z_{i,s}$ is the activation of feature $i$ on sample $s$ and $Z = \sum_i \sum_s |z_{i,s}|$ normalizes $p$ to sum to one.
The Shannon entropy $H(p) = -\sum_{i} p_i \log p_i$ yields the effective number of interference-free “virtual neurons” as $F = \exp H(p)$.
The superposition ratio, $\psi = F/N$, quantifies the degree of overlap relative to the $N$ available dimensions: $\psi = 1$ signals no superposition, whereas $\psi > 1$ indicates lossy compression and hence superposed representations (Bereska et al., 15 Dec 2025).
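A worked instance with illustrative numbers (assumed here, not drawn from the cited work) may help fix the scale of $\psi$. Take $N = 2$ neurons and three SAE features, first with equal usage and then with skewed usage:

$$
\begin{aligned}
p = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right):&\quad H(p) = \ln 3, \quad F = e^{\ln 3} = 3, \quad \psi = \tfrac{3}{2} = 1.5 \\
p = \left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right):&\quad H(p) = \tfrac{1}{2}\ln 2 + 2 \cdot \tfrac{1}{4}\ln 4 = \tfrac{3}{2}\ln 2, \quad F = 2^{3/2} \approx 2.83, \quad \psi \approx 1.41
\end{aligned}
$$

Both cases give $\psi > 1$, but uneven usage lowers the effective feature count $F$ below the nominal count of three.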

Superposition is distinct from mere representational compression: computing new features in superposition, rather than merely encoding many features, imposes stricter lower bounds on the required number of neurons, as proven by recent complexity-theoretic work (Adler et al., 2024).

2. Measuring and Disentangling Superposed Representations

Disentangling superposition requires methods that recover interpretable feature axes from mixed-layer activations. Sparse autoencoders and related dictionary learning techniques are central to this, allowing recovery of nearly orthogonal latent codes from highly polysemantic neuronal responses (Klindt et al., 3 Mar 2025, Longon et al., 3 Oct 2025). The practical procedure, as detailed by Bereska et al. (15 Dec 2025), is:

Collect a large sample of activations $x^{(s)} \in \mathbb{R}^N$.
Train a sparse autoencoder with an overcomplete dictionary ($D \gg N$) and an $\ell_1$ sparsity constraint on the activations.
Compute the empirical feature distribution $p_i$ from the codes $z_{i,s}$.
Calculate the entropy $H(p)$, compute $F = \exp H(p)$, and infer $\psi = F/N$.
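The last two steps of this procedure can be sketched in a few lines. This is a minimal illustration that assumes the SAE codes are already available as a list of per-sample rows. The function name `superposition_ratio` and the example codes are hypothetical.

```python
import math

def superposition_ratio(codes, n_dims):
    """Estimate psi = F / N from SAE codes.

    codes[s][i] is the activation z_{i,s} of feature i on sample s;
    n_dims is N, the dimension of the activation space being analyzed.
    """
    n_features = len(codes[0])
    usage = [sum(abs(row[i]) for row in codes) for i in range(n_features)]
    total = sum(usage)                       # normalizer Z
    if total == 0:
        raise ValueError("all codes are zero; usage distribution undefined")
    p = [u / total for u in usage]
    # Shannon entropy with natural log; unused features contribute nothing.
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    effective_features = math.exp(entropy)   # F = exp H(p)
    return effective_features / n_dims       # psi = F / N

# Made-up codes: 4 samples, 4 features, each feature with total usage 2.0.
codes = [
    [1.0, 0.0, 0.0, -2.0],
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 1.5, 0.0],
]
psi = superposition_ratio(codes, n_dims=2)
print(round(psi, 3))   # 2.0: four equally used features in two dimensions
```

Absolute values are used so that negative code entries count toward usage, matching the definition of $p_i$.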

This entropy-based metric correlates strongly with ground-truth superposition in controlled experiments where feature weights and sparsity can be exactly specified (Bereska et al., 15 Dec 2025). Moreover, superposition arrangements (the precise way that features are mixed within neurons) can deflate alignment measures across models or brain recordings, obscuring invariance across time or models unless sparse codes are explicitly recovered for both source and target (Longon et al., 3 Oct 2025).

3. Functional and Architectural Consequences

The functional significance of superposition is multifold:

Expressivity: Superposition enables neural networks to represent orders of magnitude more features than raw width would allow. Polysemanticity, wherein single neurons participate in the representation of multiple features, is a direct consequence of such dense packing (Elhage et al., 2022, Bereska et al., 15 Dec 2025).
Interference and Interpretability: Overlapping feature directions lead to interference—measurable as dot-products or cosine overlaps between code vectors—that undermines axis-aligned interpretability. Effective rank and superposition index measures have been developed to quantify the extent to which representational axes are shared or entangled (Pertl et al., 31 Aug 2025, Hollard et al., 21 Jul 2025).
Practical Model Performance: Empirically, wider architectures and architectural interventions (e.g., LayerNorm, depthwise convolutions in native channel space) can reduce unwanted interference, stabilizing scaling accuracy in low-parameter models (Hollard et al., 21 Jul 2025).
Computation in Superposition: Beyond representation, superposition allows direct computation of large families of Boolean functions (e.g., pairwise ANDs), with nearly tight complexity bounds showing that computing $m'$ features in superposition requires at least $\Omega(\sqrt{m' \log m'})$ neurons, in contrast to the much lower requirements for merely storing sparse feature sets (Hänni et al., 2024, Adler et al., 2024).
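The interference described above, measured as cosine overlaps between feature directions, can be computed directly. The sketch below is an illustration under assumed inputs: the random feature directions stand in for a learned dictionary, and the helper names are hypothetical.

```python
import math
import random

def random_unit_vectors(n_features, n_dims, seed=0):
    """Draw n_features random directions on the unit sphere in n_dims."""
    rng = random.Random(seed)
    vecs = []
    for _ in range(n_features):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        norm = math.sqrt(sum(c * c for c in v))
        vecs.append([c / norm for c in v])
    return vecs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def interference_stats(vecs):
    """Return (max |cosine|, mean squared cosine) over distinct feature pairs."""
    overlaps = [abs(cosine(vecs[i], vecs[j]))
                for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return max(overlaps), sum(o * o for o in overlaps) / len(overlaps)

# An orthonormal basis has no interference at all.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(interference_stats(basis))            # (0.0, 0.0)

# 64 features in 16 dimensions cannot be orthogonal, so overlaps are nonzero.
max_overlap, mean_sq = interference_stats(random_unit_vectors(64, 16))
print(round(max_overlap, 2), round(mean_sq, 3))
```

For random directions the mean squared overlap is expected to be near $1/N$, so packing more features into a fixed width raises the number of interfering pairs rather than the typical size of each overlap.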

4. Superposition, Adversarial Vulnerability, and Robustness

Superposition provides a latent basis for adversarial vulnerability. Interference between overlapping features opens attack vectors: small perturbations in input space can produce substantial activation changes along unintended feature directions (Gorton et al., 24 Aug 2025, Stevinson et al., 13 Oct 2025). Key mechanisms and findings include:

Non-orthogonal feature directions let adversarial perturbations exploit structured overlaps to activate latent features, resulting in fragile decision boundaries and transferability of attack patterns between similarly superposed models.
Causal experiments in controlled settings show that increasing the degree of superposition (as measured by “features per dimension” or SAE entropy) monotonically increases adversarial vulnerability, while adversarial training reduces the level of superposition in capacity-constrained regimes (Bereska et al., 15 Dec 2025, Gorton et al., 24 Aug 2025).
The interaction between superposition, capacity, and robustness is nonetheless nuanced. Adversarial training can either prune features (in scarce-capacity, complex-task regimes) or induce the emergence of additional robust features (in abundant-capacity, simple-task regimes), so there is no universal monotonic link between superposition and vulnerability across regimes (Bereska et al., 15 Dec 2025).
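The first mechanism, perturbations leaking across non-orthogonal feature directions, can be illustrated with a toy linear readout. This is a sketch under assumed geometry (two orthogonal features versus three features at 120-degree spacing in two dimensions), not an attack from the cited works. `readout_shift` is a hypothetical helper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def readout_shift(features, perturbed, observed, eps):
    """Change in the readout <w_observed, x> when the input x is moved
    by an L2 step of size eps along the direction of w_perturbed."""
    w_p = features[perturbed]
    norm = math.sqrt(dot(w_p, w_p))
    delta = [eps * c / norm for c in w_p]
    return dot(features[observed], delta)

# Two orthogonal features in two dimensions: no superposition.
orthogonal = [[1.0, 0.0], [0.0, 1.0]]

# Three features squeezed into two dimensions, 120 degrees apart.
superposed = [[math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)]
              for k in range(3)]

eps = 0.1
print(readout_shift(orthogonal, 1, 0, eps))             # 0.0
print(round(readout_shift(superposed, 1, 0, eps), 3))   # -0.05
```

In the orthogonal case a push along feature 1 leaves feature 0 untouched. In the superposed case the same push shifts feature 0 by `eps` times the cosine overlap, which is the side channel an adversary can exploit.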

5. Superposition and Training/Scaling Laws

Superposition alters the fundamental scaling behavior of deep networks. Under strong superposition (when nearly all features are represented, with substantial overlap, in a compressed space), the loss of a deep network generically scales as $1/m$ across diverse domains and architectures, where $m$ is model width. This robust “one-over-width” law arises from the high-dimensional geometry of nearly isotropic feature codes, consistent with empirical scaling in LLMs and theoretical predictions from the Welch bound on tight frames (Liu et al., 15 May 2025, Chen et al., 1 Feb 2026).
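The geometric ingredient behind the $1/m$ dependence can be checked numerically. The Welch bound states that for $D$ unit vectors in $\mathbb{R}^m$ with $D > m$, the largest squared pairwise overlap is at least $(D - m)/(m(D - 1))$. The sketch below only evaluates that bound for assumed values of $D$ and $m$. It illustrates the geometry and does not reproduce the loss derivation in the cited works.

```python
def welch_bound_sq(n_features, width):
    """Welch lower bound on the maximum squared overlap among
    n_features unit vectors in a width-dimensional space."""
    if n_features <= width:
        return 0.0   # enough room for mutually orthogonal directions
    return (n_features - width) / (width * (n_features - 1))

# Three vectors in two dimensions: the bound is 0.25, i.e. |cos| >= 0.5.
print(welch_bound_sq(3, 2))   # 0.25

# With many more features than dimensions, width * bound approaches 1,
# so the squared overlap floor behaves like 1 / width.
n_features = 100_000          # assumed feature count, for illustration only
for width in (64, 128, 256, 512):
    print(width, round(width * welch_bound_sq(n_features, width), 4))
```

Doubling the width roughly halves the floor on squared interference once $D \gg m$, which matches the inverse-width trend described above.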

Under weak superposition (when only the most frequent features are represented), power-law scaling with model width emerges only if the data’s feature frequency distribution is itself heavy-tailed. The strong superposition regime, achieved by tuning weight decay or selecting wider models, ensures power-law scaling with an exponent close to one regardless of feature spectrum (Liu et al., 15 May 2025, Chen et al., 1 Feb 2026).

6. Superposition in Specialized Architectures and Tasks

Superposition is employed at several operational levels beyond vanilla deep networks:

Continual Learning and Multi-task Models: The SupSup paradigm demonstrates storage of thousands of tasks via superposition of binary masks within a fixed base network; task inference can be performed by entropy minimization without catastrophic forgetting (Wortsman et al., 2020). Parameter superposition techniques allow online packing of multiple models into a single set of parameters, with random context-based “binding” providing interference isolation (Cheung et al., 2019).
Graph Neural Networks and Bottlenecked Models: Message-passing architectures and low-parameter vision networks exhibit phase-like patterns in superposition, with distinct compression/expansion zones and architectural regularities that either exacerbate or suppress interference (Pertl et al., 31 Aug 2025, Hollard et al., 21 Jul 2025).
Computation in Superposition: Networks can efficiently compute many Boolean functions in superposed form; for instance, $m$-input circuits can compute all $\binom{m}{2}$ ANDs with $\tilde O(m^{2/3})$ hidden units, a substantial sublinear improvement over naive implementations (Hänni et al., 2024).
Quantum- and VSA-inspired Encodings: Theoretical models analogize superposition to quantum-classical state mixtures or vector-symbolic architectures, employing unitary dynamics, high-dimensional codebooks, and structured superposition/binding with measured channel capacities near 0.5 bits/neuron (Frady et al., 2017, Sun et al., 2020).
Multiple-Input-Multiple-Output (MIMONets): By leveraging variable binding and summation in distributed representations, MIMONets enable simultaneous inference over superposed inputs, offering throughput gains and analytic bounds on cross-channel distortion (Menet et al., 2023).
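Several of the schemes above share one pattern: bind each item to a random key, sum the bound items into a single vector, and unbind with the matching key at read time. The sketch below shows that pattern with random sign keys and elementwise multiplication. It is a simplified stand-in chosen for illustration, not the exact mechanism of any cited method. All names and sizes are assumptions.

```python
import math
import random

def random_key(dim, rng):
    """Random +/-1 key; elementwise binding with it is its own inverse."""
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]

def bind(vec, key):
    return [v * k for v, k in zip(vec, key)]

def superpose(vectors, keys):
    """Sum of key-bound vectors stored in one vector of the same size."""
    memory = [0.0] * len(vectors[0])
    for vec, key in zip(vectors, keys):
        for i, b in enumerate(bind(vec, key)):
            memory[i] += b
    return memory

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

rng = random.Random(0)
dim, n_items = 2048, 4
items = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_items)]
keys = [random_key(dim, rng) for _ in range(n_items)]

memory = superpose(items, keys)
recovered = bind(memory, keys[0])          # unbind with the key of item 0

match = cosine(recovered, items[0])        # roughly 1 / sqrt(n_items)
mismatch = cosine(recovered, items[1])     # close to zero
print(round(match, 2), round(mismatch, 2))
```

The recovered vector is the stored item plus noise from the other bound items, so retrieval quality degrades gradually as more items share the same parameters.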

7. Interpretability, Dictionary Learning, and Implications

Superposition severely compromises conventional feature-interpretability techniques, as underlying concepts are not axis-aligned but dispersed among mixed directions inside the activation space. Recovery of interpretable latent codes necessitates dictionary learning, typically via sparse autoencoders, non-negative matrix factorizations, or similar compressed sensing frameworks (Klindt et al., 3 Mar 2025, Longon et al., 3 Oct 2025). Post-hoc disentanglement is empirically verified to restore alignment metrics between independent models and between networks and brain data, with interpretability and alignment scores increasing significantly after sparse coding (Longon et al., 3 Oct 2025).
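The sparse autoencoder objective behind this kind of dictionary learning can be sketched as a single forward pass. The code assumes a generic ReLU encoder, a linear decoder, and an $\ell_1$ penalty, with hand-set rather than trained weights. The specific architectures in the cited works may differ, and `sae_forward` is a hypothetical name.

```python
import math

def matvec(M, v):
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def sae_forward(x, W_enc, b_enc, W_dec, l1_coeff):
    """One forward pass: codes z, reconstruction x_hat, and the loss
    ||x - x_hat||^2 + l1_coeff * ||z||_1."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    z = [max(0.0, c) for c in pre]                   # ReLU codes
    x_hat = matvec(W_dec, z)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(c) for c in z)
    return z, x_hat, recon + l1_coeff * sparsity

# Hand-set dictionary: 3 feature directions in 2 dimensions, 120 degrees apart.
directions = [[math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)]
              for k in range(3)]
W_enc = directions                                   # shape D x N
W_dec = [[d[0] for d in directions],                 # shape N x D
         [d[1] for d in directions]]
b_enc = [0.0, 0.0, 0.0]

x = [1.0, 0.0]                                       # activation equal to feature 0
z, x_hat, loss = sae_forward(x, W_enc, b_enc, W_dec, l1_coeff=0.1)
print(z, round(loss, 3))   # [1.0, 0.0, 0.0] 0.1
```

The ReLU zeroes the negative interference on the two inactive features, so the code is exactly one-hot and the only remaining cost is the sparsity penalty. A trained SAE would learn the dictionary, biases, and scales from data instead of having them set by hand.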

A rigorous, information-theoretic basis for quantifying the representational budget, specifically via $\psi = F/N$, gives researchers a principled means to analyze and compare models, diagnose compression-induced pathologies, and tailor designs for robust, interpretable, and computationally efficient learning (Bereska et al., 15 Dec 2025). The centrality of superposition in model expressivity, adversarial phenomena, and scaling laws positions it as a foundational concept in modern deep learning theory, mechanistic interpretability, and applied large-scale architecture design.