Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry (2503.01822v1)

Published 3 Mar 2025 in cs.LG and cs.AI

Abstract: Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain kinds of concepts? We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem, revealing a fundamental challenge: each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shapes what it can and cannot detect. This means different SAEs are not interchangeable -- switching architectures can expose entirely new concepts or obscure existing ones. To systematically probe this effect, we evaluate SAEs across a spectrum of settings: from controlled toy models that isolate key variables, to semi-synthetic experiments on real model activations and finally to large-scale, naturalistic datasets. Across this progression, we examine two fundamental properties that real-world concepts often exhibit: heterogeneity in intrinsic dimensionality (some concepts are inherently low-dimensional, others are not) and nonlinear separability. We show that SAEs fail to recover concepts when these properties are ignored, and we design a new SAE that explicitly incorporates both, enabling the discovery of previously hidden concepts and reinforcing our theoretical insights. Our findings challenge the idea of a universal SAE and underscores the need for architecture-specific choices in model interpretability. Overall, we argue an SAE does not just reveal concepts -- it determines what can be seen at all.

Summary

  • The paper demonstrates that SAE biases dictate which latent concepts are uncovered in neural representations.
  • It formulates SAE learning as a bilevel optimization using projection nonlinearities to enforce architecture-specific constraints.
  • Empirical results show that SpaDE, with adaptive sparsity and nonlinear receptive fields, outperforms traditional SAEs in concept recovery.

Duality Between Sparse Autoencoders and Concept Geometry: A Rigorous Analysis

This work provides a rigorous theoretical and empirical analysis of Sparse Autoencoders (SAEs) in the context of model interpretability, introducing a unifying framework that reveals a duality between SAE architectural assumptions and the organizational geometry of "concepts" within model representations. The paper directly challenges the notion that a universal SAE architecture can reliably uncover all relevant underlying concepts used by neural networks, instead demonstrating that each SAE instantiates biases which determine the classes of concepts it can reveal.

Theoretical Contributions

The authors recast the process of learning SAEs as a bilevel optimization problem: the outer optimization aligns with classic sparse dictionary learning, while the inner encodes architecture-dependent constraints as projection operations onto specific sets. This construction—formalized via the notion of "projection nonlinearities"—shows that the encoder's nonlinearity can be interpreted as an orthogonal projection onto an architecture-specific constraint set. For example, ReLU-based encoders project onto the positive orthant, TopK onto the subset of k-sparse nonnegative vectors, and JumpReLU uses a combination of ReLU and thresholded activations, each imposing distinctive geometric constraints on encoded concepts.
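To make the projection view concrete, the sketch below casts each encoder nonlinearity as a map onto its constraint set in PyTorch. The function names, tensor shapes, and threshold handling are illustrative assumptions for exposition, not the paper's implementation.

```python
import torch

def relu_projection(z: torch.Tensor) -> torch.Tensor:
    """Orthogonal projection onto the nonnegative orthant (ReLU encoder)."""
    return torch.clamp(z, min=0.0)

def topk_projection(z: torch.Tensor, k: int) -> torch.Tensor:
    """Map onto k-sparse nonnegative codes (TopK encoder): keep the k largest
    pre-activations, zero the rest, then clip at zero."""
    values, indices = torch.topk(z, k, dim=-1)
    out = torch.zeros_like(z)
    out.scatter_(-1, indices, values)
    return torch.clamp(out, min=0.0)

def jumprelu_activation(z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """JumpReLU-style encoder: pass pre-activations only above a per-latent
    threshold theta, zeroing everything below it."""
    return torch.where(z > theta, z, torch.zeros_like(z))

# Toy usage on a batch of pre-activations (batch of 4, dictionary size 16).
z = torch.randn(4, 16)
codes_relu = relu_projection(z)
codes_topk = topk_projection(z, k=3)                              # at most 3 active latents
codes_jump = jumprelu_activation(z, theta=torch.full((16,), 0.5))
```

In each case the latent code is the pre-activation pushed onto the architecture's feasible set, which is exactly the structural assumption the bilevel view makes explicit.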

The key insight is that these constraints enforce dual assumptions about the structure of concepts present in the data. Thus, an SAE's ability to recover certain conceptual features is tightly coupled to how well the encoder's architectural assumptions match the true geometry of the data's latent factors. The paper demonstrates that the popular choices embed such assumptions: ReLU and JumpReLU enforce linear separability, TopK enforces angular separability, and all three implicitly assume concepts of uniform intrinsic dimensionality.

A direct corollary is that concepts violating these assumptions (e.g., those with nonlinear decision boundaries or heterogeneous intrinsic dimension) are systematically missed by such SAEs. Empirically, this makes SAE architectures non-interchangeable: training SAEs of different types on the same model activations can yield divergent concept decompositions, with some features captured exclusively by specific architectures.

Empirical Results

The empirical section progresses from synthetic, controlled datasets to large-scale, naturalistic activations, analyzing two critical real-world properties:

  1. Nonlinear Separability: Many meaningful model features are not linearly separable but instead require more flexible (possibly nonlinear) partitioning in latent space.
  2. Concept Heterogeneity: Latent concepts can be inherently multidimensional, with variable intrinsic dimensionalities.

Key experiments include the following (an illustrative data-generation sketch appears after the list):

  • Synthetic Gaussian Clusters (Separability): SAEs assuming linear separability (ReLU, JumpReLU) achieve high F1 scores only on linearly separable clusters; their performance on nonlinearly separable clusters is bounded. TopK, enforcing angular separability, also fails on certain structures. In contrast, a new architecture (SpaDE) incorporating both nonlinear separability and adaptive sparsity achieves perfect F1 on both classes—demonstrating the importance of matching inductive bias to data geometry.
  • High-dimensional Gaussian Clusters (Heterogeneity): TopK's fixed sparsity prevents it from adapting to clusters of different intrinsic dimension, resulting in poor reconstruction on higher-dimensional concepts unless a prohibitive number of latents is used (see normalized MSE > 20% unless k exceeds the true concept dimension). ReLU, JumpReLU, and SpaDE architectures, when tuned appropriately, adapt their effective sparsity and reconstruct each concept with error proportional to its dimension.
  • Language and Vision Model Activations: On semi-synthetic language datasets structured by syntax, and on real DINOv2 activations from Imagenette, SpaDE exhibits superior specialization of latents (high monosemanticity, low latent co-activation across concepts), outcompeting baselines. In LLMs, latent representations in SpaDE align closely with interpretable parts of speech, even capturing variations in intrinsic dimensionality and non-uniform clustering structures.
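The sketch below shows one way to construct synthetic data exhibiting the two properties probed above, nonlinear separability and heterogeneous intrinsic dimensionality. The particular shapes (a Gaussian blob nested inside a spherical shell, low-dimensional subspaces embedded in a shared ambient space), dimensions, and scales are illustrative assumptions rather than the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def linearly_separable_clusters(n: int, dim: int = 32, sep: float = 6.0):
    """Two Gaussian blobs with well-separated means: a hyperplane suffices."""
    return torch.randn(n, dim), torch.randn(n, dim) + sep

def nonlinearly_separable_clusters(n: int, dim: int = 32):
    """A tight Gaussian blob nested inside a thin spherical shell: no single
    hyperplane (or direction) separates the two concepts."""
    inner = 0.5 * torch.randn(n, dim)
    shell = 4.0 * F.normalize(torch.randn(n, dim), dim=-1) + 0.1 * torch.randn(n, dim)
    return inner, shell

def heterogeneous_dimension_clusters(n: int, ambient: int = 64, dims=(2, 16)):
    """Concepts with different intrinsic dimensionality, each embedded into a
    shared ambient space through a random linear map."""
    return [torch.randn(n, d) @ torch.randn(d, ambient) for d in dims]
```

Training SAEs with different nonlinearities on data like this offers a simple way to probe the qualitative pattern described above: architectures whose constraint sets match the cluster geometry tend to separate the concepts cleanly, while mismatched ones conflate them.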

Methodological Implications

The authors introduce SpaDE (Sparsemax Distance Encoder), exemplifying how architectural design matching the dual geometry of the data can yield improved interpretability. SpaDE replaces inner-product-based encoders with a Euclidean-distance-to-prototypes operation followed by a Sparsemax projection (projection onto the simplex). This enables:

  • Nonlinear receptive fields (supporting nonlinear separability)
  • Adaptive sparsity (supporting heterogeneous concept dimensionality)

Theoretical analysis shows that SpaDE's encoder is piecewise linear and that its receptive fields are unions of convex polytopes, providing geometric flexibility that standard SAEs lack.
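As a rough illustration of this design, the sketch below pairs a Euclidean distance-to-prototypes encoder with a sparsemax projection onto the probability simplex (Martins & Astudillo, 2016). The temperature parameter, prototype initialization, and class name are assumptions for exposition; the authors' actual SpaDE implementation may differ in its parameterization.

```python
import torch
import torch.nn as nn

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Euclidean projection of each row of z onto the probability simplex
    (sparsemax). Yields sparse, nonnegative codes summing to 1, with the
    number of active entries adapting to the input."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    cumsum = z_sorted.cumsum(dim=-1) - 1.0
    ks = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    support = z_sorted * ks > cumsum        # entries kept by the projection
    k = support.sum(dim=-1, keepdim=True)   # support size per row (always >= 1)
    tau = cumsum.gather(-1, k - 1) / k      # per-row threshold
    return torch.clamp(z - tau, min=0.0)

class SpaDELikeEncoder(nn.Module):
    """Sketch of a distance-based encoder: negative squared Euclidean distances
    to learned prototypes, scaled by a temperature, then passed through sparsemax."""
    def __init__(self, input_dim: int, n_latents: int, temperature: float = 1.0):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_latents, input_dim))
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d2 = torch.cdist(x, self.prototypes, p=2.0) ** 2   # [batch, n_latents]
        return sparsemax(-d2 / self.temperature)
```

A full autoencoder would pair this encoder with a linear decoder (the learned dictionary) under a reconstruction objective. Because sparsemax zeroes out contributions from distant prototypes, the number of active latents adapts per sample rather than being fixed in advance as in TopK, and the resulting receptive fields are polyhedral rather than simple half-spaces.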

Bold Claims and Empirical Support

The paper makes several strong claims, all supported both theoretically and empirically:

  • No single SAE architecture is universally optimal: Systematically demonstrated via both synthetic and real-data experiments—different SAEs recover disjoint sets of concepts when their inductive biases are mismatched with data geometry.
  • Standard SAEs may miss entire classes of concepts: Shown both with poor F1 scores and reconstructive performance on mismatched data structures.
  • Tailoring SAE architecture to data geometry is essential: Incorporating even simple geometric properties (via SpaDE) leads to the emergence of previously undetected concepts.

Practical Considerations

Implementing SpaDE and similar geometry-aware SAEs may introduce computational overhead due to per-sample computation of distances and sparsemax projections, especially on high-dimensional data. However, these operations are parallelizable and remain tractable for moderate dictionary sizes. For practitioners seeking semantic interpretability, carefully matching the SAE's constraint geometry to the target data domain is now strongly motivated.

For instance, deploying SAEs for model editing or steering in language and vision models may require empirical pilot runs to probe the underlying geometry (e.g., separability, dimensionality) of key concepts before selecting or tuning the autoencoder architecture. Over-specialization may emerge if sparsity regularization is aggressive or if the data geometry is not well-understood, underscoring the importance of validation on realistic downstream interpretability tasks.

Implications and Future Directions

This work provides a robust formal grounding for debates around the reliability and universality of SAE-based interpretability. It likely explains several observed algorithmic instability phenomena and negative results in the community, as the optimization landscape and the nature of the extracted features are fundamentally architecture-dependent.

A direct implication is that interpretability research should prioritize data-aware SAE design, possibly via weak supervision or prior knowledge about latent factor geometry (analogous to developments in the disentanglement literature). Extensions may include:

  • Learning “meta-SAEs” where the projection set itself is learned or regularized to match the latent structure.
  • Hybrid models incorporating non-Euclidean similarity measures or manifold-aware projections.
  • Automatic geometry probing pipelines to suggest suitable SAE architectures for arbitrary domains.

Moreover, future work may investigate the integration of domain-specific priors, dataset-specific geometric constraints, or multi-stage interpretability pipelines that adjust SAE assumptions iteratively as new geometric evidence accumulates.

Conclusion

This paper establishes that SAE-based concept discovery is fundamentally constrained by the architectural dual assumptions encoded within the encoder’s projection nonlinearity. There is a strong imperative for interpretability researchers and practitioners to move beyond default architectures, and to design, select, and tune SAEs with explicit reference to the underlying geometry of model representations. This theoretical and empirical reframing offers a new foundation for both principled critique and advancement in the design of interpretable representation learning systems.
