Do Sparse Autoencoders Capture Concept Manifolds?
Abstract: Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so, and how? We develop a theoretical framework that answers these questions and show that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing it across features that each selectively tile a restricted region of the underlying geometry. Empirically, we find that SAEs suboptimally recover continuous structures, mixing the global subspace and local tiling solutions in a fragmented regime we call dilution. This explains why manifold structure is rarely visible at the level of individual concepts and motivates post-hoc unsupervised discovery methods that search for coherent groups of atoms rather than isolated directions. More broadly, our results suggest that future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability.
Explain it Like I'm 14
Overview: What this paper is about
This paper asks a simple but important question: when we look inside big AI models (like LLMs), how are ideas like “day of the week,” “temperature,” or “color” stored? Many people assume each idea points in a single straight direction inside the model’s “activation space.” The authors show that this is often not true. Instead, many ideas are spread out along smooth shapes—like loops or surfaces—called manifolds. They then study whether a popular tool called a Sparse Autoencoder (SAE) can find and explain these curved shapes.
Key objectives and questions
The paper focuses on three easy-to-understand questions:
- What does it mean for an SAE to “capture” a curved shape (a manifold) inside a model’s internal world?
- When do current SAE designs actually do this—and how?
- What do we do if SAEs don’t cleanly show these shapes?
Methods and how to think about them
First, some simple definitions and analogies:
- Manifold: Think of a curved track or surface. For example, “days of the week” might form a loop (Monday back to Sunday to Monday), and “temperature” might trace a smooth line from cold to hot.
- Sparse Autoencoder (SAE): A tool that tries to rebuild a model’s internal activations using only a few “features” at a time. Imagine you have many light switches but you only flip a few to recreate a scene.
- Features/Atoms: The switches the SAE uses. Each feature points in some direction and helps rebuild the original signal.
- PCA (Principal Component Analysis): A simple way to find the main directions of variation. Think of it as rotating the data to see its shape more clearly (a toy example follows these definitions).
- Ising model (from physics): A way to study which switches tend to turn on together or avoid each other, like seeing which students always sit together or never do.
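To make the PCA intuition concrete, here is a minimal synthetic sketch (not the paper's code): a hidden loop, standing in for "days of the week," is embedded in a high-dimensional space by a random linear map, and PCA's top two components recover the ring. All dimensions, seeds, and noise levels are illustrative.

```python
# Toy illustration: recover a hidden circular concept with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d_ambient, n_points = 512, 700

# Latent position on a loop (think: position in the week) -> a 2D circle.
theta = rng.uniform(0, 2 * np.pi, n_points)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Embed the circle in the ambient space with a random linear map plus noise,
# standing in for how a model might store the concept among many others.
embed = rng.normal(size=(2, d_ambient))
activations = circle @ embed + 0.05 * rng.normal(size=(n_points, d_ambient))

# PCA "rotates the data to see its shape": the top 2 components trace the loop.
coords = PCA(n_components=2).fit_transform(activations)
radii = np.linalg.norm(coords, axis=1)
print(f"radius spread: {radii.std() / radii.mean():.3f}")  # small => clean ring
```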
What they did:
- Showed manifold shapes exist in real models:
  - They collected many examples that vary a single concept (like color hue, age, day, or temperature) from a real LLM.
  - Using PCA, they found the points lay along smooth shapes (for example, days forming a loop).
  - When they gently “steered” the model along these shapes, the model’s outputs changed smoothly and sensibly (e.g., moving from “Wednesday” toward “Thursday” changed related word probabilities smoothly). This means the shapes matter for behavior.
- Built a theory for how SAEs could capture manifolds:
  - Best case (global capture): A small fixed team of features works together everywhere on the shape, like a few spotlights lighting the entire track.
  - Local capture (tiling): Many features each cover a small patch of the shape, like tiles covering a curved floor. The whole shape only appears when you look at all the tiles together.
- Ran controlled tests with known shapes:
  - They created synthetic datasets with known curved shapes (circles, spheres, tori, etc.) hidden inside a high‑dimensional space.
  - They trained SAEs under different settings to see whether they’d recover the shapes as whole teams (global capture) or as many patches (tiling).
- Analyzed real LLMs with different SAE designs:
  - They trained several types of SAEs on a real model’s internal activations and checked how the features behaved.
  - They used the Ising model to group features that tend to activate together (or avoid each other), which helps recover the full curved shapes from many small pieces (a minimal sketch of this kind of co-activation fit follows this list).
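The Ising-based grouping mentioned above can be approximated with a standard pseudolikelihood estimator: regress each binarized feature on all the others, and read the logistic weights as coupling estimates. The sketch below assumes this estimator and toy data; the paper's exact fitting procedure may differ.

```python
# Hedged sketch of pairwise co-activation modeling on binarized SAE codes.
# Pseudolikelihood estimation: regress each binary feature on the others, so
# the logistic weights approximate the Ising couplings J.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_couplings(codes_binary: np.ndarray, l1: float = 0.05) -> np.ndarray:
    """codes_binary: (n_samples, n_features) 0/1 matrix of SAE activations."""
    n, m = codes_binary.shape
    J = np.zeros((m, m))
    for j in range(m):
        y = codes_binary[:, j]
        if y.min() == y.max():          # feature never (or always) fires
            continue
        X = np.delete(codes_binary, j, axis=1)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / l1)
        clf.fit(X, y)
        J[j, np.arange(m) != j] = clf.coef_[0]
    return 0.5 * (J + J.T)              # symmetrize the coupling estimates

# Toy usage: two features that co-fire should get a positive coupling.
rng = np.random.default_rng(1)
z = rng.random((2000, 4)) < 0.2
z[:, 1] = z[:, 0] | (rng.random(2000) < 0.05)   # feature 1 tracks feature 0
J = fit_couplings(z.astype(float))
print(f"J[0,1] = {J[0, 1]:.2f} (positive => features co-activate)")
```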
Main findings and why they matter
Here are the core takeaways, explained simply:
- Many concepts are curves or surfaces, not single straight directions: the model’s internal world organizes ideas like days and colors along smooth shapes. This matches how these concepts behave in real life (days loop; temperatures vary smoothly).
- SAEs can represent manifolds in two ways:
  - Global capture: a small, stable set of features spans the whole shape.
  - Tiling: many localized features each cover a region; together, they trace the shape.
- In practice, current SAEs mostly do tiling, and often in a “diluted” way: instead of picking a small, consistent team of features, they spread the job across many overlapping features. This makes the structure hard to see when you look at any single feature by itself.
- There is a “sweet spot” for sparsity: in controlled tests, there’s a middle ground where SAEs best capture the shapes with small teams. Too few active features “shatter” the shape into unrelated chunks; too many cause “dilution,” where everything overlaps and becomes messy.
- Grouping features by co-activation works better than comparing directions: looking at which features turn on together (or avoid each other) helps discover the curved shapes. The Ising-based grouping outperforms simple “angle” comparisons between features. It also helped find a new manifold related to scientific uncertainty (how text signals measurement error and imprecision).
Why this matters:
- It explains real frustrations people have with SAEs:
  - Features can look unstable across training runs (different tilings can cover the same shape).
  - Editing or steering with one feature often fails (one feature covers only a small patch).
  - Automated labeling of single features is hard and sometimes misleading.
Implications and impact
- Rethink interpretability: Don’t just look for single directions or single features. Instead, look for geometric objects—curves and surfaces—and the groups of features that form them.
- Better tools: We need new methods that aim directly at finding and representing manifolds, not just directions. Until then, post-hoc grouping (like the Ising approach) is a useful workaround.
- More reliable editing and control: If we can identify the full group of features that together form a manifold, we can steer the model more smoothly along meaningful paths (like moving one day forward or adjusting temperature descriptions).
- Clearer science of representations: Treating “geometry” as the basic unit (not isolated features) can help us explain model behavior, compare models, and build safer, more understandable systems.
In short, this paper argues that meaning in AI models often lives on smooth shapes, and that we should design our interpretability tools to find and work with these shapes—not just single directions.
Knowledge Gaps
Below is a concrete list of what remains missing, uncertain, or unexplored in the paper, phrased to guide future research.
- External validity across models and layers: Results are shown primarily for Llama3.1‑8B at a single intermediate layer (residual stream, layer 19). It is unknown whether the same tiling/dilution behavior, capture failures, and Ising-based recovery hold across earlier/later layers, different streams (MLP, attention), architectures (e.g., Mistral, GPT‑NeoX, T5), and model sizes.
- Modal and domain generality: The study focuses on LLMs and a few language-manifold concepts. Whether analogous manifold tiling/dilution occurs in vision, audio, and multimodal models (e.g., ViT, CLIP, Whisper) remains untested.
- Concept coverage and construct validity: Only a small set of curated “continuous” concepts (e.g., days, colors, temperature, years) is probed. It is unclear how broadly manifolds appear for less curated, compositional, or highly context-dependent concepts, and how to systematically generate probe datasets for them.
- Additive mixture assumption: The framework assumes representations are an additive mixture (Minkowski sum) of immersed manifolds. There is no empirical test of additivity in real activations, nor an analysis of when nonlinear interactions or non-additive composition break the model.
- Identifiability under superposition: When multiple manifolds overlap in ambient subspaces or co-occur frequently, it is unclear what identifiability guarantees (if any) exist for separating their atom groups, even with perfect reconstruction.
- Theoretical training guarantees: Theorem 1 gives subspace-capture conditions under an idealized decoder, μ-incoherent dictionaries, and aligned sparsity, but there is no characterization for trained SAEs with amortized ReLU encoders, realistic regularizers, or optimization noise. Conditions that predict capture vs. tiling vs. dilution during training are missing.
- Dilution/shattering transition theory: The empirical “phase diagram” across sparsity k is documented, but a formal analysis predicting transition thresholds as a function of dictionary width, sparsity, regularization strength, and data geometry is absent.
- Impossibility claims unspecified: The paper mentions an impossibility claim for current SAE architectures “directly motivated by manifold learning” but does not detail the assumptions, scope, or constructive counterexamples in the main text.
- Metrics for manifold fidelity: Evaluation in LLMs uses restricted R² and PCA visuals; there are no topology- or geometry-aware metrics (e.g., geodesic distortion, neighborhood preservation, homology/Betti numbers, curvature estimates) to quantify whether SAE reconstructions preserve manifold structure (a minimal sketch of one such metric follows this list).
- Ground-truth scarcity in real data: Outside synthetic benchmarks, there is no ground truth for manifold membership. The paper lacks a methodology to validate discovered groups without supervision or to estimate false discovery rates for manifold hypotheses.
- Robustness of Ising-based grouping: The Ising approach relies on binarized codes and pairwise couplings. Open questions include sensitivity to binarization thresholds, regularization choice, dataset shifts, and run-to-run variance; how well it scales to very large dictionaries; and whether higher-order interactions (beyond pairwise) are necessary.
- Disentangling co-occurrence confounds: Although the Ising model conditions on other features, there is no quantitative assessment of residual confounding from dataset co-occurrence patterns or latent common causes, nor counterfactual tests that separate structural co-activation from statistical co-occurrence.
- Mixed-selectivity handling: The paper notes mixed-selectivity atoms but provides no method to attribute a feature’s activation mass across multiple manifolds or to re-factor features into manifold-specific components.
- Architecture–tiling bias mapping: Observed architectural biases (e.g., “angular separability” in TopK vs “linear separability” in L1) are shown qualitatively. There is no systematic analysis linking SAE architectural choices and hyperparameters (expansion, k, sparsity penalty) to tiling granularity, overlap, and interpretability.
- Encoder amortization gap: Theoretical guarantees reference an ideal sparse decoder; the practical “amortization gap” from learned encoders is neither quantified nor reduced via training strategies (e.g., unrolled inference, encoder–decoder consistency losses).
- Causal manipulation with groups: While centroid-steering shows smooth behavioral changes, the paper does not test interventions using discovered feature groups to move along a manifold while staying on it, nor assess unintended off-manifold side effects and reliability under distribution shift.
- Learning objective design: The paper argues for manifold-aware featurizers but does not propose or evaluate concrete objectives/architectures (e.g., multi-dimensional atoms, spline/surface decoders, topology-regularized losses, geodesic/neighbor-preserving penalties) to encourage compact capture over tiling/dilution.
- Comparison to alternative methods: There is no quantitative comparison against non-SAE approaches for manifold discovery/recovery (e.g., diffusion maps, Isomap, LLE, Laplacian eigenmaps, sparse manifold transforms, nonlinear dictionary learning, subspace clustering variants) on identical data.
- Scalability and efficiency: Computing pairwise Ising couplings over large dictionaries is potentially prohibitive. The paper does not report complexity, approximations (e.g., sparse graphical models, neighborhood selection), or performance for 100k+ atom settings used in practice.
- Topology beyond smooth manifolds: Many neural representations may form stratified/branched structures or graphs rather than smooth manifolds. The current framework and evaluations do not address non-smooth, piecewise, or branching geometries.
- Layerwise manifold evolution: How manifold geometry (dimension, curvature, separability) evolves across layers and how SAE behavior co-evolves (capture vs tiling) is not explored.
- Data dependence: SAEs are trained on The Pile (500M tokens). It is unknown how corpus/domain composition, prompt templates, and tokenization choices affect discovered manifolds and grouping stability.
- Reliability across seeds/runs: The paper notes dictionary instability but does not quantify variability of manifold groups, coupling matrices, or downstream reconstructions across random initializations and training seeds.
- Evaluation for downstream utility: Beyond visualization, there is no task-based evaluation showing that manifold-based groupings improve interpretability workflows, controllable generation, or debugging compared to direction-based analyses.
- Safety and misuse considerations: Group-level steering along manifolds could have broader behavioral effects than single-feature edits; the paper does not discuss guardrails or failure analyses for such interventions.
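As one concrete candidate for the missing geometry-aware metrics flagged above, the sketch below computes k-NN neighborhood preservation between original activations and their SAE reconstructions. The function name and the Jaccard-overlap choice are illustrative, not the paper's protocol.

```python
# One candidate geometry-aware metric: k-NN neighborhood preservation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_preservation(x, x_hat, k=10):
    """Mean Jaccard overlap of k-NN sets before vs. after reconstruction."""
    idx_x = NearestNeighbors(n_neighbors=k + 1).fit(x).kneighbors(
        x, return_distance=False)[:, 1:]           # drop self-neighbor
    idx_r = NearestNeighbors(n_neighbors=k + 1).fit(x_hat).kneighbors(
        x_hat, return_distance=False)[:, 1:]
    return float(np.mean([
        len(set(a) & set(b)) / len(set(a) | set(b))
        for a, b in zip(idx_x, idx_r)
    ]))

# Scores near 1 mean local manifold structure survives reconstruction;
# low scores flag shattering/dilution that a global R² can miss.
x = np.random.default_rng(2).normal(size=(300, 32))
x_hat = x + 0.01 * np.random.default_rng(3).normal(size=(300, 32))
print(neighborhood_preservation(x, x_hat))
```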
Practical Applications
Immediate Applications
These applications can be deployed with today’s models and tooling (e.g., existing SAEs and the paper’s unsupervised Ising-based pipeline).
- Industry (Software/ML): Post-hoc manifold discovery from SAE codes
  - What: Cluster SAE features into coherent concept manifolds using the paper’s Ising-based coupling model on binarized activations, then visualize and probe those groups.
  - Workflow/Tools: Fit pairwise Ising model on SAE codes; community detection on coupling matrix J (one concrete stand-in is sketched below); project inputs onto span of decoder atoms in each group; inspect reconstructed geometry (PCA); visualize receptive fields/tuning curves.
  - Assumptions/Dependencies: Access to model activations and trained SAE; sufficient data to estimate couplings (pseudolikelihood/graphical lasso); compute overhead for fitting Ising models; works best when manifolds are reasonably sampled.
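For the community-detection step, one concrete stand-in (assuming a coupling matrix J such as the pseudolikelihood estimate sketched earlier) is to keep strong positive couplings as graph edges and run networkx's greedy modularity; the threshold and algorithm choice are illustrative.

```python
# Group features by clustering the coupling matrix J (illustrative choice:
# networkx greedy modularity over strong positive couplings).
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def group_features(J, threshold=0.1):
    """Keep couplings above `threshold` as weighted edges, then find communities."""
    m = J.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(m))
    rows, cols = np.where(np.triu(J, k=1) > threshold)
    G.add_weighted_edges_from((i, j, J[i, j]) for i, j in zip(rows, cols))
    return [set(c) for c in greedy_modularity_communities(G, weight="weight")]
```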
- Industry (MLOps/Safety) + Policy: Manifold-aware auditing and monitoring
  - What: Audit and monitor behaviors along identified manifolds (e.g., political bias, epistemic uncertainty, style, temporal concepts) rather than single features.
  - Workflow/Tools: Build dashboards around restricted-R² curves, support size, receptive-field spread, and Ising coupling blocks; set alerts for shifts in manifold group couplings/activation distributions across training snapshots or deployments.
  - Assumptions/Dependencies: Internal interpretability stack; acceptance that group-based units are more faithful than individual directions; access to periodic model snapshots for monitoring drift.
- Industry (Product/UX for LLM Apps): Robust steering via manifold navigation
  - What: Replace single-feature steering with movement along manifold paths (e.g., interpolate between centroids of “Wednesday”→“Thursday,” “cold”→“warm,” “informal”→“formal”).
  - Workflow/Tools: Compute manifold centroids; perform controlled interpolation in representation space (see the sketch below); decode or use representational steering hooks; validate behavior smoothness via token probability transitions.
  - Assumptions/Dependencies: Ability to extract and intervene in hidden representations; validation set to ensure steering remains on-manifold and task-relevant; guardrails against off-manifold artifacts.
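The centroid-interpolation step reduces to simple vector arithmetic; the sketch below shows it in isolation, with model hooks and centroid estimation left as placeholders rather than taken from the paper.

```python
# Centroid-path steering: pure vector arithmetic on a hidden state.
import numpy as np

def steer(hidden, c_src, c_dst, alpha):
    """Shift `hidden` a fraction `alpha` of the way from c_src toward c_dst."""
    return hidden + alpha * (c_dst - c_src)

# Usage pattern: estimate c_src/c_dst as mean activations over prompts for
# each concept value (e.g., "Wednesday", "Thursday"), inject steer(...) at the
# chosen layer via a forward hook, and sweep alpha in [0, 1] to check that
# output token probabilities shift smoothly rather than jumping.
```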
- Academia/Research: Standardized diagnostics for geometry capture
  - What: Use the paper’s metrics to compare SAEs and alternative featurizers: restricted R² vs. support size, phase identification (capture/shatter/dilution), and Ising coupling structures (a restricted-R² sweep is sketched below).
  - Workflow/Tools: Adopt the synthetic benchmark (circles, tori, Möbius, etc.) and provided code; run sweeps over sparsity and expansions; evaluate block-diagonality of J and reconstruction from grouped atoms.
  - Assumptions/Dependencies: Reproducible training of SAEs/featurizers; compute for hyperparameter sweeps; awareness that current SAEs often operate in the dilution regime.
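A minimal version of the restricted-R² diagnostic, assuming SAE codes z, decoder atoms D, and activations x drawn from one concept manifold; the atom-ranking rule (mean absolute activation) is an illustrative choice, not necessarily the paper's.

```python
# Sketch of a restricted-R² sweep: how much variance do the top-k most-used
# atoms explain for a given concept's activations?
import numpy as np

def restricted_r2(x, z, D, k):
    """x: (n, d) activations; z: (n, m) SAE codes; D: (m, d) decoder atoms."""
    usage = np.abs(z).mean(axis=0)              # rank atoms by mean activation
    keep = np.argsort(usage)[::-1][:k]          # top-k atoms for this manifold
    x_hat = z[:, keep] @ D[keep]                # reconstruct with only those
    ss_res = ((x - x_hat) ** 2).sum()
    ss_tot = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Capture: R² saturates at small k. Dilution: the curve rises slowly, needing
# many atoms. Shattering: small supports that never jointly explain the shape.
```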
- Industry (Data Engineering) + Academia: Data curation along concept manifolds
  - What: Ensure training/eval prompts cover the full range of continuous concepts (e.g., temperature ranges, color hue/lightness, days/time) to avoid sparse or biased manifold regions.
  - Workflow/Tools: Generate prompt templates that sweep intrinsic variables; verify manifold coverage using PCA and tuning curves; augment underrepresented segments.
  - Assumptions/Dependencies: Capability to synthesize prompts systematically; reliable identification of manifold axes via centroids or supervised seeds.
- Industry (Interpretability/Compliance): Reporting practices that reflect manifold structure
  - What: Report feature-group-level findings (e.g., identified manifolds and their atom groups) rather than isolated feature interpretations; note expected instability of dictionaries across runs.
  - Workflow/Tools: Bundle atom groups with coupling matrices, community assignments, and subspace projections in interpretability reports.
  - Assumptions/Dependencies: Organizational agreement to shift from direction-centric to group-centric interpretability; tooling to export and version groups.
- Cross-sector (Healthcare, Finance, Scientific QA): Improved readouts using manifold coordinates
  - What: Build simple readouts/regressors from projections onto the span of manifold groups (e.g., map a latent “uncertainty” manifold to a calibration score; map physiological/temporal manifolds to interpretable scalars).
  - Workflow/Tools: Project hidden states onto the identified group subspace; fit low-dimensional regressors/classifiers (sketched below); validate monotonicity and smoothness.
  - Assumptions/Dependencies: Domain-specific validation sets; confidence that the identified manifold captures the relevant semantics; careful handling of mixed-selectivity artifacts.
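A sketch of such a readout, assuming the atoms of one discovered group are stacked as rows of D_group: orthonormalize their span, project hidden states into it, and fit a small regressor. Names and the ridge choice are illustrative.

```python
# Manifold-coordinate readout from one group's decoder atoms.
import numpy as np
from sklearn.linear_model import Ridge

def project_onto_group(x, D_group):
    """Coordinates of x (n, d) in the span of D_group's atoms (k, d)."""
    Q, _ = np.linalg.qr(D_group.T)   # orthonormal basis of the atom span
    return x @ Q                     # (n, k) manifold coordinates

def fit_readout(x, D_group, y):
    """Small regressor from manifold coordinates to a target scalar y."""
    return Ridge(alpha=1.0).fit(project_onto_group(x, D_group), y)
```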
- Education/Communication: Visual teaching aids for representation geometry
  - What: Use PCA plots, tuning curves, and piecewise-linear reconstructions to teach continuous concept representations and population coding.
  - Workflow/Tools: Classroom notebooks with PCA and decoding demos; interactive sliders that traverse manifolds.
  - Assumptions/Dependencies: Access to pretrained models or saved activations; ethical/secure datasets for demos.
- Vision/CV and Multimodal Labs: Immediate cross-modal probes
  - What: Apply the same Ising-grouping and tiling diagnostics to vision or vision-language embeddings (e.g., hue, pose, spatial axes) to recover manifold groups.
  - Workflow/Tools: Train SAEs on vision-layer activations; reuse the Ising pipeline; analyze receptive fields in image latent space.
  - Assumptions/Dependencies: Domain adaptation of prompts/data (e.g., controlled pose/lighting); adequate sampling of continuous factors.
- Deployment/Monitoring: Lightweight telemetry using group activations
  - What: Monitor only manifold-group activation statistics (fields and couplings) to reduce telemetry volume and focus on meaningful structure.
  - Workflow/Tools: Aggregate per-group activation rates and coupling summaries (a minimal version is sketched below); drift detection on group statistics.
  - Assumptions/Dependencies: Stable group discovery in pre-deployment; privacy constraints on activation logging.
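A minimal version of this telemetry, assuming binarized codes and previously discovered groups: track per-group any-active rates and compare snapshots with a simple L1 drift score. Thresholds are deployment choices, not prescriptions from the paper.

```python
# Per-group telemetry: any-active rates per group, plus a simple drift score.
import numpy as np

def group_rates(codes_binary, groups):
    """Fraction of samples in which any feature of each group is active."""
    return np.array([codes_binary[:, sorted(g)].any(axis=1).mean() for g in groups])

def drift_score(reference_rates, current_rates):
    """L1 distance between group-rate vectors; alert above a tuned threshold."""
    return float(np.abs(reference_rates - current_rates).sum())
```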
Long-Term Applications
These require new methods, scaling work, or integration into model training and product ecosystems.
- Core ML (Representation Learning): Manifold-aware autoencoders/featurizers
  - What: Replace 1D atoms with multidimensional/subspace or spline-based atoms that explicitly parameterize curved manifolds; incorporate group sparsity and topological priors.
  - Potential Tools/Products: “Manifold SAE” libraries with subspace atoms; curvature-aware losses; topology-preserving regularizers.
  - Assumptions/Dependencies: Algorithmic advances; stable optimization; benchmarking against the paper’s synthetic suite.
- Industry (Model Control) + Safety: Manifold-constrained editing and guardrails
  - What: Editors that move representations within identified manifolds (not off them) for reliable style, bias, tone, or temporal adjustments; reject edits that drift off-manifold.
  - Potential Tools/Products: “Manifold Steering API” integrating centroid paths, geodesic approximations, and constraint solvers.
  - Assumptions/Dependencies: Accurate manifold discovery and tracking over fine-tuning; efficient on-the-fly projection/steering at inference; defenses against distribution shift.
- Systems/Compression: Manifold-group modularization and pruning
  - What: Treat manifold groups as functional modules for pruning, specialization, or routing; compress by retaining manifold-critical groups for target tasks.
  - Potential Tools/Products: “Manifold Router/Pruner” that prunes or gates at group granularity; module distillation pipelines.
  - Assumptions/Dependencies: Demonstrated task-performance retention; stability of groups across domains and updates.
- Governance/Policy: Standards for group-based interpretability and audits
  - What: Establish guidelines that interpretability claims must be made at the level of coherent feature groups/manifolds; require manifold drift and off-manifold detection in safety audits.
  - Potential Tools/Products: Audit checklists and reporting templates; compliance suites measuring coupling structures and coverage.
  - Assumptions/Dependencies: Community consensus; regulator acceptance; reproducible, standardized metrics.
- Reliability/Anomaly Detection: Off-manifold monitors
  - What: Detect when internal activations leave known manifolds (e.g., abnormal inputs or emergent failure modes) and trigger mitigations (a minimal residual monitor is sketched below).
  - Potential Tools/Products: Real-time projection residual monitors; “Manifold Health” scores integrated into serving stacks.
  - Assumptions/Dependencies: Robust manifold models; low-latency projections; low false-positive rates.
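One way such a monitor could work, assuming a group subspace has already been identified: score each hidden state by its projection residual relative to that subspace. Calibrating the alert threshold is left open.

```python
# Off-manifold monitor: residual after projecting onto a group's subspace.
import numpy as np

def off_manifold_residual(x, D_group):
    """Per-sample residual norm of x (n, d) w.r.t. D_group's span (k, d)."""
    Q, _ = np.linalg.qr(D_group.T)             # orthonormal basis of atom span
    x_proj = (x @ Q) @ Q.T                     # projection onto the subspace
    return np.linalg.norm(x - x_proj, axis=1)  # large residual => off-manifold
```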
- Training-Time Objectives: Geometry-aligned representation learning
  - What: Losses that reward compact capture of manifolds (e.g., small group span with high restricted R²) and penalize excessive dilution/shattering; curricula that sweep continuous factors (one illustrative loss is sketched below).
  - Potential Tools/Products: Geometry-regularized pretraining/fine-tuning recipes; synthetic manifold curricula.
  - Assumptions/Dependencies: Reliable manifold supervision signals (unsupervised or weakly supervised); compatibility with large-scale training.
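The paper does not propose such an objective; as one illustrative possibility, the sketch below adds a group-lasso penalty over predeclared atom groups to the usual reconstruction loss, nudging each input to be explained by few coherent groups. The grouping and weighting are hypothetical design knobs.

```python
# Hypothetical geometry-aligned objective: reconstruction + group sparsity.
import torch

def geometry_loss(x, x_hat, z, groups, lam=1e-3):
    """x, x_hat: (n, d) inputs/reconstructions; z: (n, m) codes;
    groups: list of index sets over atoms (assumed given, e.g., from grouping)."""
    recon = ((x - x_hat) ** 2).mean()
    # L2-over-groups (group lasso): cheap to zero out a whole unused group,
    # costly to spread activation thinly across many groups (dilution).
    group_pen = sum(z[:, list(g)].norm(dim=1).mean() for g in groups)
    return recon + lam * group_pen
```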
- Domain Science (Healthcare, Genomics, Scientific NLP): Manifold discovery for latent variables
  - What: Use unsupervised manifold groups to surface latent scientific structure (e.g., disease-progression axes, measurement uncertainty, experimental conditions) and build calibrated tools.
  - Potential Tools/Products: “Manifold Inspector” for scientific corpora; calibrated evidence/uncertainty readouts in scientific assistants.
  - Assumptions/Dependencies: Domain validation; careful handling of spurious correlations; integration with expert workflows.
- Robotics/Control: Geometry-aware policies
  - What: Use manifold coordinates for continuous control factors (pose, trajectory phase, affordances) in policy learning and interpretability.
  - Potential Tools/Products: Latent-manifold controllers; interpretable policy sliders for operators.
  - Assumptions/Dependencies: Extension of findings from language to control embeddings; data to map control factors onto manifolds; safety validation.
- Consumer Apps (Writing/Design Tools): User-facing manifold sliders
  - What: Expose continuous controls (e.g., formality, sentiment, color semantics) backed by manifold-aware steering, delivering predictable, smooth adjustments.
  - Potential Tools/Products: UI components linked to manifold coordinates; presets anchored at manifold centroids.
  - Assumptions/Dependencies: Real-time, stable steering; clear semantics for end users; safeguards against harmful shifts (e.g., political bias).
- Cross-Model Portability: Manifold alignment across model variants
  - What: Align manifold groups across model sizes/versions to stabilize interpretability, monitoring, and control during upgrades (one alignment primitive is sketched below).
  - Potential Tools/Products: “Manifold Aligner” that matches coupling communities and spans across checkpoints; transfer of steering policies.
  - Assumptions/Dependencies: Partial invariance of representation geometry across architectures/training; alignment algorithms robust to dictionary permutations.
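One possible alignment primitive, assuming corresponding groups (and row-matched atoms within them) have already been identified across checkpoints: orthogonal Procrustes finds the rotation mapping one group's atoms onto the other's. The matching step itself, which must survive dictionary permutations, is the hard part and is assumed solved here.

```python
# Align one group's atoms across checkpoints with orthogonal Procrustes.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_group(D_old, D_new):
    """Rotation R such that D_old @ R ~= D_new, for row-matched (k, d) groups."""
    R, _ = orthogonal_procrustes(D_old, D_new)
    return R
```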
Notes on feasibility and key assumptions across applications
- Current SAEs typically operate in the dilution regime; coherent manifold recovery requires post-hoc grouping (e.g., Ising pipeline). Results depend on data coverage along the manifold and on access to intermediate activations.
- Steering and auditing assume manifold structure causally influences behavior (supported in the paper by centroid interpolation experiments) and that projections/edits stay near the manifold.
- The Ising approach reduces spurious correlations but needs sufficient samples and careful regularization; coupling estimates may degrade under heavy superposition or mixed selectivity.
- Long-term advances hinge on new featurizers and objectives that treat manifolds—not single directions—as the basic unit of interpretability.
Glossary
- Additive Mixture of Manifolds: A representation model where concepts vary over low-dimensional manifolds that add together in activation space. Example: "Definition 2 (Additive Mixture of Manifolds)."
- affine subspace: A linear subspace translated by a vector; the smallest flat space containing a manifold. Example: "Let M lie in a k-dimensional affine subspace with orthonormal basis V."
- amortization gap: The discrepancy between an idealized decoder and the trained encoder’s practical performance. Example: "including a discussion of the amortization gap between the idealized decoder and the trained encoder."
- ambient dimensionality: The dimension of the surrounding space in which a manifold is embedded. Example: "with dim(M_i) = d_i < d."
- ambient space: The high-dimensional space in which a lower-dimensional manifold resides. Example: "We further explore the selectivity of features by plotting SAE feature activations in the ambient space (defined by the top 3 principal components) a manifold lives in."
- atoms (decoder atoms): Basis elements (rows of the SAE decoder) used to reconstruct inputs; fundamental units in the dictionary. Example: "allocating a compact group of atoms whose linear span contains the entire manifold"
- BatchTopK: An SAE variant that selects top-k activations per batch for sparsity. Example: "We train five SAE architectures: Standard (§1), JumpReLU (Rajamanoharan et al., 2024), TopK (Gao et al., 2024), BatchTopK (Bussmann et al., 2024), and Matryoshka (Bussmann et al., 2025)"
- co-activation statistics: Measures of how often features activate together across inputs. Example: "one must use co-activation statistics"
- cone (non-negative span): A set closed under positive scaling and addition; here, the sparse non-negative span of decoder atoms. Example: "Consequently, the localized reconstructions x̂ = zD lie in a sparse non-negative span (a cone)."
- dictionary learning: Learning a set of basis vectors (atoms) enabling sparse representations of data. Example: "The proof relies on classical results in sparse dictionary learning"
- dilution: A regime where many redundant atoms represent a manifold with overlapping, mixed-selectivity features. Example: "mixing the global subspace and local tiling solutions in a fragmented regime we call dilution."
- Grassmannian frame: A set of directions arranged to maximize minimal pairwise angles; used to pack directions efficiently. Example: "packed as a Grassmannian frame (Strohmer & Heath Jr, 2003)."
- immersion map: A smooth injective map embedding a manifold into a higher-dimensional space. Example: "Let f_i : M_i → R^d be the immersion maps from each submanifold into R^d."
- incoherence (μ-incoherent dictionary): A property limiting similarity between dictionary atoms to enable sparse recovery. Example: "Let D be μ-incoherent"
- interaction matrix J: The matrix of pairwise interactions in an Ising model capturing dependencies among feature activations. Example: "These regimes therefore induce distinct signatures in the interaction matrix J."
- Ising couplings: Pairwise interaction parameters in the Ising model reflecting cooperative or inhibitory relationships. Example: "Ising couplings and conditional co-activation yield the cleanest separation"
- Ising model: A probabilistic model with binary variables and pairwise interactions used here to model feature co-activation. Example: "we model the joint activation statistics of SAE features using a pairwise Ising model over binarized codes"
- JumpReLU: An SAE activation variant introducing jump-like behavior to encourage sparsity/structure. Example: "JumpReLU (Rajamanoharan et al., 2024)"
- Linear Representation Hypothesis (LRH): The assumption that concepts correspond to independent linear directions in representation space. Example: "Called the Linear Representation Hypothesis (LRH)"
- Minkowski sum: The set of all sums of points from two sets; here, sums of manifold embeddings. Example: "x lives in a Minkowski sum of the immersed submanifolds M_i."
- overcomplete dictionary: A dictionary with more atoms than the ambient dimension, enabling sparse representations. Example: "the goal is to extract its underlying generative factors using an overcomplete dictionary."
- piecewise-linear: Composed of linear segments; here, approximating manifolds via unions of linear patches. Example: "approximates the manifold in a piecewise-linear fashion"
- pointwise mutual information (PMI): A measure of association between two events/features based on their co-occurrence probabilities. Example: "pointwise mutual information"
- population code: Representation where information is distributed across many units with overlapping tuning. Example: "via the population code (Khona & Fiete, 2022; Eichenbaum, 2018)."
- receptive field: The region of input (or manifold) space to which a feature is sensitive. Example: "Each atom then acts as a localized detector with a receptive field on the manifold"
- Restricted R²: A reconstruction quality metric calculated while restricting to selected atoms. Example: "Restricted R² measures whether k_i atoms suffice to reconstruct each manifold from the superposed codes."
- row span: The linear span of the rows of a matrix; the set of all linear combinations of its rows. Example: "Im(V) = {xV : x ∈ R^k} ⊆ R^d for its row span"
- sparsity budget: The allowed number of active features per input in a sparse model. Example: "train TopK SAEs across a range of sparsity budgets"
- sparsity-promoting regularizer: A penalty encouraging solutions with few nonzero activations. Example: "R(z) is a sparsity-promoting regularizer"
- Sparse Autoencoders (SAEs): Models that reconstruct inputs via a sparse latent code and learned dictionary. Example: "Definition 1 (Sparse Autoencoders)."
- sparse coding: A framework assuming data are generated by sparse combinations of latent factors. Example: "sparse coding assumes a generative model where data points are produced by a sparse linear combination of latent variables"
- sparse subspace clustering: Clustering method assuming data lie in a union of low-dimensional subspaces and seeking subspace-preserving representations. Example: "the subspace-preserving recovery condition that grounds sparse subspace clustering"
- subspace capture: The criterion that a fixed small set of atoms spans the subspace containing a manifold and consistently activates for its points. Example: "Definition 3 (Subspace capture)."
- subspace-preserving recovery condition: A condition ensuring points are represented using atoms from their own subspace. Example: "the subspace-preserving recovery condition that grounds sparse subspace clustering"
- superposition: The additive overlap of multiple manifold-derived components in representations. Example: "Importantly, superposition arises when Σ_i k_i > d."
- support size: The number of active atoms used to reconstruct a point. Example: "The phase diagram tracks this transition via support size and receptive field spread"
- tiling: Representing a manifold via many localized features that collectively cover its geometry. Example: "we call this phenomenon tiling: localized features with overlapping support whose joint activity encodes position along the manifold."
- TopK: An SAE mechanism selecting the k largest activations per input to enforce sparsity. Example: "TopK (Gao et al., 2024)"
- tuning curves: Smooth response profiles of features as a function of a latent variable’s value. Example: "We showcase "tuning curves" (Butts & Goldman, 2006)"
- variance explained: The fraction of variance in activations reconstructed by selected features or models. Example: "Variance explained as a function of the number of restricted features, averaged across manifolds and SAE architectures."