- The paper proposes a novel framework that models features as metric spaces, revealing low-dimensional manifolds in LLM representations.
- The paper shows that cosine similarity decreases with squared feature distance, supporting the continuous correspondence hypothesis.
- The paper validates isometry between geodesic distances on manifolds and feature spaces, uncovering non-linear scaling in temporal data.
This paper (2505.18235) addresses a key challenge in mechanistic interpretability: understanding how LLMs represent human-understandable concepts. While the Linear Representation Hypothesis (LRH) suggests features are encoded as sparse linear combinations of vectors, empirical evidence shows that representations often exhibit more complex, non-linear structure, frequently appearing as low-dimensional manifolds within the higher-dimensional representation space. Sparse Autoencoders (SAEs), the standard tool for recovering features, are typically built on the LRH, and the paper argues they do not fully capture the geometric richness observed.
The authors propose a mathematical framework in which a feature is defined as a metric space $(Z_f, d_f)$: a set $Z_f$ equipped with a distance function $d_f$. This gives a unified way to formalize different types of features:
- Atomic features: $Z_f$ is a single point.
- Hierarchical features: $Z_f$ is a discrete set with a tree distance.
- Continuous features: $Z_f$ is a space such as an interval or a circle with an appropriate distance.
For any input $x$ in which feature $f$ is present, the paper assumes there exists a value $z_f(x) \in Z_f$ corresponding to the feature's state in that input. The core of the paper's theory rests on two main hypotheses:
- Continuous Correspondence Hypothesis: There is a continuous, one-to-one map $\phi_f: Z_f \to S^{D-1}$ (from the feature space to the unit hypersphere in representation space) such that the feature's direction vector is $v_f(x) = \phi_f(z_f(x))$. The image of this map, $M_f = \phi_f(Z_f)$, is the representation manifold. A crucial implication is that $M_f$ is homeomorphic to $Z_f$, meaning they share the same topological structure (e.g., number of holes and connected components).
- Cosine Similarity Reflects Distance Hypothesis: Locally, the cosine similarity between two representation vectors is a decreasing function of the squared distance between their corresponding feature values: $\langle \phi_f(z), \phi_f(z') \rangle = g_f(d_f(z, z')^2)$ for small $d_f(z, z')$, where $g_f'(0) < 0$.
Combining these hypotheses leads to a significant practical insight (Theorem 1): the geodesic (shortest-path) distance on the representation manifold $M_f$ is proportional to the geodesic distance on the feature metric space $Z_f$. This means the intrinsic geometry of the feature space is encoded in the shortest paths on the representation manifold.
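To make the hypotheses concrete, here is a toy sketch (my own illustration, not from the paper): a circular feature space $Z_f = [0, 2\pi)$ with arc-length distance, mapped onto a planar circle inside the unit hypersphere. The inner product between representations is then exactly a function of the circular distance, and near zero it decreases with the squared distance, as Hypothesis 2 requires.

```python
import numpy as np

# Toy example: circular feature Z_f = [0, 2*pi) with arc-length distance d_f,
# mapped onto a planar circle inside the unit sphere S^{D-1}.
D = 64
rng = np.random.default_rng(0)
e1, e2 = np.linalg.qr(rng.normal(size=(D, 2)))[0].T    # orthonormal pair in R^D

z = np.linspace(0, 2 * np.pi, 200, endpoint=False)     # feature values
V = np.cos(z)[:, None] * e1 + np.sin(z)[:, None] * e2  # phi_f(z), unit vectors

# The image M_f is a circle, homeomorphic to Z_f (Hypothesis 1), and
# <phi_f(z), phi_f(z')> = cos(d_f(z, z')), which near 0 behaves like
# 1 - d_f^2 / 2: a decreasing function of squared distance, g_f'(0) < 0
# (Hypothesis 2). Arc length along M_f equals d_f exactly here, so
# Theorem 1 holds with scale factor 1.
d_f = np.abs(z[:, None] - z[None, :])
d_f = np.minimum(d_f, 2 * np.pi - d_f)                 # circular metric
assert np.allclose(V @ V.T, np.cos(d_f))
```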
The paper empirically validates these ideas using text embeddings from OpenAI's text-embedding-3-large model and token activations from GPT-2 small.
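For the text-embedding experiments, collecting representations might look like the following minimal sketch using the OpenAI Python client; the prompt template here is an illustrative assumption, not necessarily the paper's exact inputs.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompts varying a "year" feature; the paper's exact
# input texts may differ.
years = range(1900, 2000)
texts = [f"The year {y}" for y in years]

resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
V = np.array([item.embedding for item in resp.data])   # shape (100, 3072)
V /= np.linalg.norm(V, axis=1, keepdims=True)          # ensure unit norm
```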
Practical Implementation and Validation Steps:
To test these hypotheses in practice, one can follow these steps, inspired by the paper's experiments:
- Select a Feature and Collect Representations: Identify a potential feature (e.g., colors, years, dates). Create a set of inputs $X$ designed to vary this feature systematically. Obtain the relevant representations (e.g., text embeddings, or internal-layer activations processed by tools like SAEs to isolate a potential feature).
- Hypothesize a Metric Space: Define the feature space $Z_f$ and a plausible metric $d_f$ for the chosen feature (e.g., for years 1900-1999, $Z_f = [1900, 1999]$ and $d_f(x, y) = |x - y|$); see the first sketch after this list.
- Dimensionality Reduction (Optional but often necessary): Apply dimensionality reduction techniques like PCA to visualize and potentially analyze the representations in a lower dimension (e.g., 3D). The paper notes that for text embeddings, projecting onto uncentered principal components seemed effective. The choice of projection dimension and method can impact results and might reflect different aspects of the feature.
- Check for Homeomorphism: Visualize the reduced-dimension representation points. Does the point cloud have the same topological structure as the hypothesized feature space $Z_f$? For features with an intrinsic order (such as years or dates), compute rank correlations (e.g., Kendall, Spearman) between the feature values and their estimated positions along the manifold (e.g., geodesic distance from a reference point; this check appears in the isometry sketch after this list). High rank correlation supports homeomorphism.
- Test Hypothesis 2 (Cosine Similarity vs. Feature Distance); a code sketch follows this list:
- For all pairs of representations $(v_i, v_j)$ corresponding to feature values $(z_i, z_j)$, calculate their cosine similarity $\langle v_i, v_j \rangle$ and the squared distance $d_f(z_i, z_j)^2$.
- Plot $\langle v_i, v_j \rangle$ against $d_f(z_i, z_j)^2$. Visually check that cosine similarity is high for small feature distances and decreases as distance grows, especially near zero.
- Quantify the functional dependence using a measure such as Chatterjee's correlation coefficient $\xi$.
- Test Theorem 1 (Geodesic Distance Isometry); a code sketch follows this list:
- Estimate geodesic distances on the representation manifold Mf. A common method is to construct a K-Nearest Neighbor (K-NN) graph on the representation vectors (using Euclidean distance), connect neighbors, and compute shortest path distances on this graph (weighted by edge lengths). The choice of K is critical; it should be small enough to preserve manifold structure but large enough for connectivity. Manual pruning of the graph might be necessary to remove spurious shortcuts caused by noise.
- For all pairs of points $(z_i, z_j)$ in $Z_f$, compute their geodesic distance $d_{\mathrm{geo}}(z_i, z_j)$ under the metric $d_f$. For simple spaces like intervals or circles, this is just the shortest-path distance in that space (e.g., $|x - y|$ for an interval, arc length for a circle).
- Plot the estimated manifold geodesic distance against $d_{\mathrm{geo}}(z_i, z_j)$ and check for a clear linear relationship.
- Quantify the linearity with Pearson's correlation coefficient $\rho$; a high value supports isometry.
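The sketches below, in order, cover steps 2-3 (feature metric and uncentered projection), the Hypothesis 2 test, and the Theorem 1 isometry test. All assume `V` is an $(n, D)$ array of unit-normalized representations and `z` the matching feature values; helper names like `test_isometry` are illustrative, not from the paper. First, steps 2-3 for the years example:

```python
import numpy as np

# Step 2: hypothesized feature space for years 1900-1999,
# Z_f = [1900, 1999] with d_f(x, y) = |x - y|.
z = np.arange(1900, 2000, dtype=float)

# Step 3: project onto uncentered principal components (SVD of V without
# mean subtraction), which the paper found effective for text embeddings.
def project_uncentered(V, k=3):
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    return V @ Vt[:k].T          # (n, k) coordinates for visualization
```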
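Next, the Hypothesis 2 check, sketched here for an interval metric $d_f(x, y) = |x - y|$; Chatterjee's $\xi$ is implemented directly in its no-ties form:

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's xi: how close y is to a function of x (no-ties form)."""
    order = np.argsort(x)                          # sort pairs by x
    r = np.argsort(np.argsort(y[order])) + 1       # ranks of y in that order
    n = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

def test_hypothesis2(V, z):
    """xi of cosine similarity as a function of squared feature distance."""
    i, j = np.triu_indices(len(z), k=1)            # all unordered pairs
    cos_sim = (V @ V.T)[i, j]                      # <v_i, v_j>
    sq_dist = (z[i] - z[j]) ** 2                   # d_f(z_i, z_j)^2
    return chatterjee_xi(sq_dist, cos_sim)
```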
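Finally, the Theorem 1 isometry check: estimate manifold geodesics from a K-NN graph and correlate them with feature-space geodesics. The Kendall $\tau$ line doubles as the step-4 homeomorphism check for interval-like features (for circular features, distance from a single reference point is not monotone in the feature value):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, shortest_path
from scipy.stats import kendalltau, pearsonr
from sklearn.neighbors import kneighbors_graph

def manifold_geodesics(V, k=10):
    """Shortest-path (geodesic) distances on a Euclidean K-NN graph."""
    G = kneighbors_graph(V, n_neighbors=k, mode="distance")
    n_comp, _ = connected_components(G, directed=False)
    assert n_comp == 1, "K-NN graph disconnected: increase k"
    return shortest_path(G, method="D", directed=False)   # Dijkstra

def test_isometry(V, z, period=None, k=10):
    """Pearson rho between manifold and feature geodesics (Theorem 1),
    plus Kendall tau of feature values vs. geodesic distance from a
    reference point (the step-4 homeomorphism check)."""
    D_man = manifold_geodesics(V, k)
    diff = np.abs(z[:, None] - z[None, :])
    # Interval metric by default; circular metric if a period is given.
    D_feat = np.minimum(diff, period - diff) if period is not None else diff
    i, j = np.triu_indices(len(z), k=1)
    rho, _ = pearsonr(D_man[i, j], D_feat[i, j])
    tau, _ = kendalltau(D_man[0], z)
    return rho, tau
```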
Empirical Findings and Insights:
The paper successfully applied these steps to colors, years, and dates:
- Colors and Dates: Embeddings for color names and dates of the year formed approximately circular structures in PCA-reduced space, consistent with a circular metric space ($Z_f = [0, L)$, $d_f(x, y) = \min(|x - y|, L - |x - y|)$). Geodesic distances on the estimated manifolds correlated strongly and linearly with geodesic distances in the circular feature space, supporting isometry (Pearson $\rho > 0.97$).
- Years: Token activations for years (extracted via SAEs as in (2503.17547)) formed a curve. While rank correlation showed they were in chronological order (homeomorphism), a simple linear metric $d_f(x, y) = |x - y|$ did not show isometry. However, using a logarithmic scale for years, $z_{\mathrm{year}} = \log(2019 - \mathrm{year})$, resulted in strong evidence for isometry. This suggests the model represents temporal distances on a non-linear scale.
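In code, this log-scale finding amounts to a one-line reparametrization of the feature values before rerunning the isometry test sketched above (the 2019 offset is the paper's; `test_isometry` is the illustrative helper, and `V` would hold the SAE-extracted year activations):

```python
import numpy as np

years = np.arange(1900, 2000, dtype=float)
z_log = np.log(2019 - years)       # z_year = log(2019 - year)

# rho_lin, _ = test_isometry(V, years)   # weak isometry under |x - y|
# rho_log, _ = test_isometry(V, z_log)   # strong isometry, per the paper
```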
Implementation Considerations and Limitations:
- Scalability: The current hypothesis-driven approach requires manual definition of the metric space Zf and df for each feature, which is not scalable for complex features like emotions. Developing methods to learn the feature space geometry from the representations is a direction for future work.
- Dimensionality: Using PCA projections simplifies analysis but might discard dimensions containing important, albeit perhaps less easily interpretable, geometric structure. Semantic similarity is likely richer than simple 1D or 2D metric spaces.
- Manifold Estimation Noise: Estimating geodesic distances on the manifold using K-NN graphs is sensitive to noise and requires careful selection of K and potentially manual graph pruning to avoid "short circuits". More robust manifold learning techniques could improve this.
- SAEs and Manifolds: The paper conjectures that SAEs, while based on a linear sparse coding idea, might learn dictionaries whose vectors trace representation manifolds. This could explain phenomena like feature splitting and suggests the potential for "manifold-aware" SAEs designed to explicitly recover these geometric structures.
In conclusion, this research provides a valuable theoretical framework and empirical methodology for analyzing and interpreting the geometric structure of concept representations in LLMs. It highlights that these representations can capture the intrinsic geometry of features and opens avenues for developing new interpretability tools that account for manifold structure.