
Finding Belief Geometries with Sparse Autoencoders

Published 3 Apr 2026 in cs.LG and cs.AI | (2604.02685v1)

Abstract: Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether LLMs trained on naturalistic text develop analogous geometric representations remains an open question. We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations, combining sparse autoencoders (SAEs), $k$-subspace clustering of SAE features, and simplex fitting using AANet. We validate the pipeline on a transformer trained on a multipartite hidden Markov model with known belief-state geometry. Applied to Gemma-2-9B, we identify 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$). A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying predictive signal beyond any individual feature. We therefore adopt barycentric prediction as our primary discriminating test. Among the 13 priority clusters, 3 exhibit a highly significant advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$) and 4 on simplex-interior samples. Together, 5 distinct real clusters pass at least one split, while no null cluster passes either. One cluster, 768_596, additionally achieves the highest causal steering score in the dataset. This is the only case where passive prediction and active intervention converge. We present these findings as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B's representation space, and identify the structured evaluation that would be required to confirm this interpretation.


Summary

  • The paper presents a discovery pipeline combining sparse autoencoders, k-subspace clustering, and neural simplex fitting to identify simplex-structured belief geometries in language model activations.
  • It demonstrates robust recovery of simplex components, achieving significant barycentric predictive advantages and mean R² improvements in both toy models and Gemma-2-9B data.
  • Key insights include distinguishing genuine mixture encoding via causal steering and predictive tests, suggesting implicit Bayesian inference in natural language models.

Detailed Analysis of "Finding Belief Geometries with Sparse Autoencoders" (2604.02685)

Background and Motivation

The paper addresses the geometric structure of latent representations in pretrained LLMs, focusing on the presence and functional significance of simplex-shaped subspaces encoding probabilistic belief states ("belief geometries"). Prior evidence exists for such geometric encodings in transformers trained on hidden Markov models (HMMs), where barycentric coordinates of simplices correspond to belief distributions over discrete latent states [shai2024transformers, piotrowski2025constrained]. This work investigates whether analogous structures arise in the residual streams of LLMs trained on naturalistic text, without access to ground-truth latent information.
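To ground the terminology: for an HMM, the observer's posterior over hidden states (its "belief state") is a point on a probability simplex, and its coordinates are exactly the barycentric coordinates referenced throughout. A minimal sketch of the Bayesian filtering update that traces out such a simplex trajectory, using an illustrative 3-state HMM that is not from the paper:

```python
import numpy as np

# Illustrative 3-state, 2-token HMM (not the paper's): T[i, j] = P(next state j | state i),
# O[i, v] = P(emit token v | state i). Any row-stochastic matrices work.
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
O = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

def update_belief(belief, token):
    """One step of Bayesian filtering: propagate the belief through T,
    reweight by the emission likelihood of the observed token, renormalize."""
    b = (belief @ T) * O[:, token]
    return b / b.sum()

# The belief trajectory lives on the 2-simplex: entries are non-negative
# and sum to 1. These are the barycentric coordinates that the paper's
# candidate simplex geometries are hypothesized to encode.
belief = np.ones(3) / 3  # uniform prior
for tok in [0, 0, 1, 1, 1]:
    belief = update_belief(belief, tok)
    print(belief)
```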

Methodology: Belief Geometry Discovery Pipeline

The proposed discovery pipeline integrates sparse autoencoders (SAEs), $k$-subspace clustering, and neural simplex fitting via AANet. The approach is designed to identify and validate candidate simplex-structured clusters in model activation spaces:

  1. SAE Decomposition: Residual-stream activations (Gemma-2-9B, layer 20) are encoded with GemmaScope JumpReLU SAE, generating high-dimensional sparse latent features.
  2. $k$-Subspace Clustering: Decoder directions in the SAE latent space are clustered to identify feature groups potentially co-encoding contextual variables (preferentially in low-rank subspaces).
  3. Simplex Fitting with AANet: Each cluster is fit with AANet, a neural archetypal analysis method, to test for simplex structure and recover barycentric coordinates relative to learned extreme points (a minimal archetypal-analysis stand-in is sketched after Figure 1).
  4. Functional Validation: Candidate geometries are evaluated via barycentric predictive advantage (primary discrimination test), causal steering, KL divergence, and semantic coherence (Figure 1).

    Figure 1: Overview of the belief geometry discovery pipeline combining SAEs, clustering, simplex fitting, and rigorous validation.
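AANet itself is a neural archetypal-analysis model; as a minimal stand-in for step 3, the sketch below fits classical archetypal analysis with PyTorch, keeping both the archetypes and the barycentric coordinates on the simplex via softmax parameterizations. Function names and hyperparameters are illustrative, not the paper's.

```python
import torch

def fit_simplex(X, K, steps=2000, lr=0.05):
    """Minimal archetypal analysis: learn K archetypes Z = softmax(B) @ X
    (convex combinations of data points) such that each point is well
    approximated by a convex combination softmax(A) of the archetypes.
    Returns the archetypes and the barycentric coordinates."""
    N, D = X.shape
    A = torch.zeros(N, K, requires_grad=True)  # logits for barycentric coords
    B = torch.zeros(K, N, requires_grad=True)  # logits for archetype weights
    opt = torch.optim.Adam([A, B], lr=lr)
    for _ in range(steps):
        Z = torch.softmax(B, dim=1) @ X        # archetypes (simplex vertices)
        X_hat = torch.softmax(A, dim=1) @ Z    # reconstruction on the simplex
        loss = ((X - X_hat) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.softmax(B, dim=1) @ X, torch.softmax(A, dim=1)

# Usage on synthetic data drawn from a 3-vertex simplex embedded in 10-D:
V = torch.randn(3, 10)                                        # true vertices
W = torch.distributions.Dirichlet(torch.ones(3)).sample((500,))
archetypes, coords = fit_simplex(W @ V, K=3)
```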

Proof of Concept on Toy Model

The pipeline first undergoes validation on a multipartite HMM-based toy model with controlled ground-truth belief geometry. In this regime, characterized by high joint entropy and cross-component signal entanglement, SAEs and clustering robustly recover all five generative simplex components, achieving mean $R^2 = 0.61$ on held-out data. This demonstrates the pipeline's efficacy in identifying true belief geometries even under noisy, entangled signal conditions (Figure 2).

Figure 2: Token classes for the Tom Quantum component are fully separated in the principal subspace, exemplifying strong geometric structure.

Figure 3: Representative latents tile complementary regions of the simplex, showing geometry-consistent partitioning of belief-state information.

Application to Gemma-2-9B: Real Data Analysis

The pipeline is applied to Gemma-2-9B, using SAE latents at layer 20. $k$-subspace clustering at two resolutions yields 13 priority clusters (with $K \geq 3$), selected for further validation. Null clusters constructed by random partitioning show low geometric pass rates, underscoring the specificity of the pipeline.
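As a reference point for this clustering step, the following is a minimal alternating-minimization sketch of $k$-subspace clustering over SAE decoder directions; the hyperparameters and rank choice are illustrative rather than the paper's.

```python
import numpy as np

def k_subspace_cluster(D, k=8, rank=3, iters=20, seed=0):
    """Minimal k-subspace clustering of decoder directions D (n, d):
    alternate between (1) assigning each direction to the rank-`rank`
    subspace with the smallest reconstruction residual and (2) refitting
    each cluster's basis from its top right-singular vectors."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(D))
    for _ in range(iters):
        bases = []
        for c in range(k):
            Dc = D[labels == c]
            if len(Dc) < rank:  # degenerate cluster: reseed with a random basis
                Dc = rng.standard_normal((rank, D.shape[1]))
            _, _, Vt = np.linalg.svd(Dc, full_matrices=False)
            bases.append(Vt[:rank])  # (rank, d), orthonormal rows
        # Residual of each direction against each subspace's projection.
        resid = np.stack([np.linalg.norm(D - (D @ B.T) @ B, axis=1)
                          for B in bases], axis=1)
        new_labels = resid.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```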

Validation: Discriminating Genuine Belief-State Encoding

Barycentric Predictive Advantage

The primary test is whether barycentric coordinate vectors for each cluster predict next-token log-probabilities better than any individual latent, evaluated via paired $R^2$ comparisons and Wilcoxon signed-rank tests.
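Concretely, a minimal sketch of this paired comparison (the in-sample linear fits and variable names are illustrative simplifications; a full evaluation would use held-out splits):

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.linear_model import LinearRegression

def barycentric_advantage(coords, latents, logprobs):
    """Compare a regressor on barycentric coordinates against the best
    single-latent regressor via paired per-token squared errors and a
    one-sided Wilcoxon signed-rank test. Shapes: coords (n, K),
    latents (n, m) SAE activations, logprobs (n,)."""
    err_bary = (LinearRegression().fit(coords, logprobs)
                .predict(coords) - logprobs) ** 2
    # Best individual latent = lowest mean squared error among the m latents.
    errs = [(LinearRegression().fit(latents[:, [j]], logprobs)
             .predict(latents[:, [j]]) - logprobs) ** 2
            for j in range(latents.shape[1])]
    err_best = min(errs, key=np.mean)
    # H1: barycentric errors are systematically smaller than best-latent errors.
    _, p = wilcoxon(err_bary, err_best, alternative="less")
    rel_gain = 1 - err_bary.sum() / err_best.sum()  # relative error reduction
    return p, rel_gain
```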

  • Result: 5 of 13 real clusters exhibit a significant barycentric advantage (Wilcoxon $p < 10^{-14}$), outperforming the best single latent on at least one of the near-vertex and simplex-interior splits. No null cluster achieves significance on either split.
  • Interpretation: This rules out tiling artifacts, indicating that these clusters encode genuine mixture representations rather than coverage of distinct context regions by single latents.

    Figure 4: For cluster 512_181, barycentric coordinates ($R^2 = 0.612$) significantly outperform the best single latent ($R^2 = 0.539$) for all tokens.

    Figure 5: For cluster 512_229, barycentric coordinates ($R^2 = 0.378$) outperform the best single latent for every token.

    Figure 6: For cluster 768_596, barycentric coordinates outperform the best single latent for 98% of tokens.

Causal Steering

Only clusters with semantically interpretable and consistent simplex labeling are eligible for steering evaluation. In causal intervention tests (steering the residual stream toward simplex vertices), cluster 768_596 achieves a steering score of 0.419, the highest in the dataset. However, steering scores for real and null clusters overlap substantially, limiting the test's discriminative utility; 768_596 is the only cluster where a significant barycentric advantage and a causal steering effect converge.
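A minimal sketch of such a vertex-directed intervention, assuming a PyTorch transformer whose decoder layers expose forward hooks; the layer path, steering scale, and scoring loop are illustrative assumptions, not the paper's implementation:

```python
import torch

def steer_toward_vertex(layer, vertex_dir, alpha=8.0):
    """Register a forward hook that nudges the residual stream toward a
    simplex vertex direction. `vertex_dir` is a direction in residual
    space (e.g., an AANet archetype mapped through the SAE decoder);
    `alpha` is an illustrative steering scale."""
    vertex_dir = vertex_dir / vertex_dir.norm()

    def hook(module, inputs, output):
        # Decoder blocks often return tuples; hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vertex_dir.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# Illustrative usage: steer layer 20, generate, then remove the hook and
# compare the output shift against unsteered baselines and null clusters.
# handle = steer_toward_vertex(model.model.layers[20], vertex_dir)
# ... generate and score ...
# handle.remove()
```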

Functional and Semantic Interpretations of Geometries

Cluster 768_596 partitions latents across simplex vertices corresponding to grammatical person (third, first, second), with evidence for mixture encoding at extreme points (Figure 7).

Figure 7: Mean barycentric centroid of latents in 768_596, partitioned across simplex vertices (consistent with feature specialization).

Cluster 512_181 demonstrates a strong barycentric predictive advantage across both near-vertex and interior samples, suggesting robust mixture encoding. Its semantic coherence is weak, however, potentially indicating encoding of distributional statistics rather than crisp lexical poles.

Implications and Future Directions

Theoretical Implications

  • The presence of simplex-structured geometries encoding mixture representations implies that pretrained LLMs internally model abstract distributions over latent variables—potentially indicative of implicit Bayesian inference strategies in natural language contexts [akyurek2023icl, xie2022meta-learning].
  • Some clusters encode functionally real but semantically diffuse distinctions, challenging existing interpretability paradigms reliant on crisp semantic labeling.

Practical Directions and Limitations

  • Full confirmation of belief-state tracking in naturalistic settings mandates structured datasets with reliably assignable simplex-position labels, a current gap in LLM interpretability infrastructure.
  • Effect sizes for steering are modest, and signal dissociations (phantom vertices, weak semantic interpretability) are prevalent, suggesting mixture encoding may be partial or highly context-dependent.
  • The analysis is limited to Gemma-2-9B, layer 20; assessing generality across layers, architectures, and model families is necessary.

Conclusion

The paper establishes a rigorous methodology to discover and validate simplex-structured belief geometries in SAE latent spaces of LLMs. The barycentric predictive advantage test robustly discriminates true mixture encoding from tiling artifacts in Gemma-2-9B. The convergence of prediction and causal intervention in Cluster 768_596 offers the strongest evidence for functional belief-geometric encoding. Critical future steps include the construction of structured evaluation datasets and extension of analysis to alternative architectures and interpretability tooling.
