- The paper proposes a novel framework that models features as metric spaces, revealing low-dimensional manifolds in LLM representations.
- The paper shows that cosine similarity decreases with squared feature distance, supporting the continuous correspondence hypothesis.
- The paper validates isometry between geodesic distances on manifolds and feature spaces, uncovering non-linear scaling in temporal data.
This paper (2505.18235) addresses a key challenge in mechanistic interpretability: understanding how LLMs represent human-understandable concepts. While the Linear Representation Hypothesis (LRH) suggests features are encoded as sparse linear combinations of vectors, empirical evidence shows that representations often exhibit more complex, non-linear structure, frequently appearing as low-dimensional manifolds within the higher-dimensional representation space. Sparse Autoencoders (SAEs), the standard tool for recovering features, are typically built on the LRH, and the paper argues they do not fully capture the geometric richness observed.
The authors propose a mathematical framework in which a feature is defined as a metric space $(Z_f, d_f)$: a set $Z_f$ equipped with a distance function $d_f$. This gives a unified way to formalize different types of features:
- Atomic features: $Z_f$ is a single point.
- Hierarchical features: $Z_f$ is a discrete set with a tree distance.
- Continuous features: $Z_f$ is a space such as an interval or a circle with an appropriate distance.
For any input $x$ in which feature $f$ is present, the paper assumes there exists a value $z_f(x) \in Z_f$ corresponding to the feature's state in that input. The core of the paper's theory rests on two main hypotheses:
- Continuous Correspondence Hypothesis: There is a continuous, one-to-one map $\phi_f: Z_f \to S^{D-1}$ (from the feature space to the unit hypersphere in representation space) such that the feature's direction vector is $v_f(x) = \phi_f(z_f(x))$. The image of this map, $M_f = \phi_f(Z_f)$, is the representation manifold. A crucial implication is that $M_f$ is homeomorphic to $Z_f$, meaning they share the same topological structure (e.g., number of holes and connected components).
- Cosine Similarity Reflects Distance Hypothesis: Locally, the cosine similarity between two representation vectors is a decreasing function of the squared distance between their corresponding feature values: $\langle \phi_f(z), \phi_f(z') \rangle = g_f(d_f(z, z')^2)$ for small $d_f(z, z')$, where $g_f'(0) < 0$.
Combining these hypotheses leads to a significant practical insight (Theorem 1): the geodesic (shortest-path) distance on the representation manifold $M_f$ is proportional to the geodesic distance on the feature metric space $Z_f$. This means the intrinsic geometry of the feature space is encoded in the shortest paths on the representation manifold.
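To make the hypotheses concrete, here is a toy sketch (my own illustration, not from the paper): a circular feature space $Z_f = [0, 2\pi)$ with arc-length distance, mapped onto a planar circle inside the unit hypersphere. The inner product between representations is then exactly a function of the circular distance, and near zero it decreases with the squared distance, as Hypothesis 2 requires.

```python
import numpy as np

# Toy example: circular feature Z_f = [0, 2*pi) with arc-length distance d_f,
# mapped onto a planar circle inside the unit sphere S^{D-1}.
D = 64
rng = np.random.default_rng(0)
e1, e2 = np.linalg.qr(rng.normal(size=(D, 2)))[0].T    # orthonormal pair in R^D

z = np.linspace(0, 2 * np.pi, 200, endpoint=False)     # feature values
V = np.cos(z)[:, None] * e1 + np.sin(z)[:, None] * e2  # phi_f(z), unit vectors

# The image M_f is a circle, homeomorphic to Z_f (Hypothesis 1), and
# <phi_f(z), phi_f(z')> = cos(d_f(z, z')), which near 0 behaves like
# 1 - d_f^2 / 2: a decreasing function of squared distance, g_f'(0) < 0
# (Hypothesis 2). Arc length along M_f equals d_f exactly here, so
# Theorem 1 holds with scale factor 1.
d_f = np.abs(z[:, None] - z[None, :])
d_f = np.minimum(d_f, 2 * np.pi - d_f)                 # circular metric
assert np.allclose(V @ V.T, np.cos(d_f))
```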
The paper empirically validates these ideas using text embeddings from OpenAI's text-embedding-3-large model and token activations from GPT-2 small.
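For the text-embedding experiments, collecting representations might look like the following minimal sketch using the OpenAI Python client; the prompt template here is an illustrative assumption, not necessarily the paper's exact inputs.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompts varying a "year" feature; the paper's exact
# input texts may differ.
years = range(1900, 2000)
texts = [f"The year {y}" for y in years]

resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
V = np.array([item.embedding for item in resp.data])   # shape (100, 3072)
V /= np.linalg.norm(V, axis=1, keepdims=True)          # ensure unit norm
```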
Practical Implementation and Validation Steps:
To test these hypotheses in practice, one can follow these steps, inspired by the paper's experiments:
- Select a Feature and Collect Representations: Identify a potential feature (e.g., colors, years, dates). Create a set of inputs $X$ designed to vary this feature systematically. Obtain the relevant representations (e.g., text embeddings, or internal-layer activations processed by tools like SAEs to isolate a potential feature).
- Hypothesize a Metric Space: Define the feature space $Z_f$ and a plausible metric $d_f$ for the chosen feature (e.g., for years 1900-1999, $Z_f = [1900, 1999]$ and $d_f(x, y) = |x - y|$); see the first sketch after this list.
- Dimensionality Reduction (Optional but often necessary): Apply dimensionality reduction techniques like PCA to visualize and potentially analyze the representations in a lower dimension (e.g., 3D). The paper notes that for text embeddings, projecting onto uncentered principal components seemed effective. The choice of projection dimension and method can impact results and might reflect different aspects of the feature.
- Check for Homeomorphism: Visualize the reduced-dimension representation points. Does the point cloud have the same topological structure as the hypothesized feature space $Z_f$? For features with an intrinsic order (such as years or dates), compute rank correlations (e.g., Kendall, Spearman) between the feature values and their estimated positions along the manifold (e.g., geodesic distance from a reference point; this check appears in the isometry sketch after this list). High rank correlation supports homeomorphism.
- Test Hypothesis 2 (Cosine Similarity vs. Feature Distance); a code sketch follows this list:
- For all pairs of representations $(v_i, v_j)$ corresponding to feature values $(z_i, z_j)$, calculate their cosine similarity $\langle v_i, v_j \rangle$ and the squared distance $d_f(z_i, z_j)^2$.
- Plot $\langle v_i, v_j \rangle$ against $d_f(z_i, z_j)^2$. Visually check that cosine similarity is high for small feature distances and decreases as distance grows, especially near zero.
- Quantify the functional dependence using a measure such as Chatterjee's correlation coefficient $\xi$.
- Test Theorem 1 (Geodesic Distance Isometry); a code sketch follows this list:
- Estimate geodesic distances on the representation manifold Mf. A common method is to construct a K-Nearest Neighbor (K-NN) graph on the representation vectors (using Euclidean distance), connect neighbors, and compute shortest path distances on this graph (weighted by edge lengths). The choice of K is critical; it should be small enough to preserve manifold structure but large enough for connectivity. Manual pruning of the graph might be necessary to remove spurious shortcuts caused by noise.
- For all pairs of points $(z_i, z_j)$ in $Z_f$, compute their geodesic distance $d_{\mathrm{geo}}(z_i, z_j)$ under the metric $d_f$. For simple spaces like intervals or circles, this is just the shortest-path distance in that space (e.g., $|x - y|$ for an interval, arc length for a circle).
- Plot the estimated manifold geodesic distance against $d_{\mathrm{geo}}(z_i, z_j)$ and check for a clear linear relationship.
- Quantify the linearity with Pearson's correlation coefficient $\rho$; a high value supports isometry.
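The sketches below, in order, cover steps 2-3 (feature metric and uncentered projection), the Hypothesis 2 test, and the Theorem 1 isometry test. All assume `V` is an $(n, D)$ array of unit-normalized representations and `z` the matching feature values; helper names like `test_isometry` are illustrative, not from the paper. First, steps 2-3 for the years example:

```python
import numpy as np

# Step 2: hypothesized feature space for years 1900-1999,
# Z_f = [1900, 1999] with d_f(x, y) = |x - y|.
z = np.arange(1900, 2000, dtype=float)

# Step 3: project onto uncentered principal components (SVD of V without
# mean subtraction), which the paper found effective for text embeddings.
def project_uncentered(V, k=3):
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    return V @ Vt[:k].T          # (n, k) coordinates for visualization
```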
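Next, the Hypothesis 2 check, sketched here for an interval metric $d_f(x, y) = |x - y|$; Chatterjee's $\xi$ is implemented directly in its no-ties form:

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's xi: how close y is to a function of x (no-ties form)."""
    order = np.argsort(x)                          # sort pairs by x
    r = np.argsort(np.argsort(y[order])) + 1       # ranks of y in that order
    n = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

def test_hypothesis2(V, z):
    """xi of cosine similarity as a function of squared feature distance."""
    i, j = np.triu_indices(len(z), k=1)            # all unordered pairs
    cos_sim = (V @ V.T)[i, j]                      # <v_i, v_j>
    sq_dist = (z[i] - z[j]) ** 2                   # d_f(z_i, z_j)^2
    return chatterjee_xi(sq_dist, cos_sim)
```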
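Finally, the Theorem 1 isometry check: estimate manifold geodesics from a K-NN graph and correlate them with feature-space geodesics. The Kendall $\tau$ line doubles as the step-4 homeomorphism check for interval-like features (for circular features, distance from a single reference point is not monotone in the feature value):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, shortest_path
from scipy.stats import kendalltau, pearsonr
from sklearn.neighbors import kneighbors_graph

def manifold_geodesics(V, k=10):
    """Shortest-path (geodesic) distances on a Euclidean K-NN graph."""
    G = kneighbors_graph(V, n_neighbors=k, mode="distance")
    n_comp, _ = connected_components(G, directed=False)
    assert n_comp == 1, "K-NN graph disconnected: increase k"
    return shortest_path(G, method="D", directed=False)   # Dijkstra

def test_isometry(V, z, period=None, k=10):
    """Pearson rho between manifold and feature geodesics (Theorem 1),
    plus Kendall tau of feature values vs. geodesic distance from a
    reference point (the step-4 homeomorphism check)."""
    D_man = manifold_geodesics(V, k)
    diff = np.abs(z[:, None] - z[None, :])
    # Interval metric by default; circular metric if a period is given.
    D_feat = np.minimum(diff, period - diff) if period is not None else diff
    i, j = np.triu_indices(len(z), k=1)
    rho, _ = pearsonr(D_man[i, j], D_feat[i, j])
    tau, _ = kendalltau(D_man[0], z)
    return rho, tau
```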
Empirical Findings and Insights:
The paper successfully applied these steps to colors, years, and dates:
- Colors and Dates: Embeddings for color names and dates of the year formed approximately circular structures in PCA-reduced space, consistent with a circular metric space ($Z_f = [0, L)$, $d_f(x, y) = \min(|x - y|, L - |x - y|)$). Geodesic distances on the estimated manifolds correlated strongly and linearly with geodesic distances in the circular feature space, supporting isometry (Pearson $\rho > 0.97$).
- Years: Token activations for years (extracted via SAEs as in (2503.17547)) formed a curve. While rank correlation showed they were in chronological order (homeomorphism), a simple linear metric $d_f(x, y) = |x - y|$ did not show isometry. However, using a logarithmic scale for years, $z_{\mathrm{year}} = \log(2019 - \mathrm{year})$, resulted in strong evidence for isometry. This suggests the model represents temporal distances on a non-linear scale.
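In code, this log-scale finding amounts to a one-line reparametrization of the feature values before rerunning the isometry test sketched above (the 2019 offset is the paper's; `test_isometry` is the illustrative helper, and `V` would hold the SAE-extracted year activations):

```python
import numpy as np

years = np.arange(1900, 2000, dtype=float)
z_log = np.log(2019 - years)       # z_year = log(2019 - year)

# rho_lin, _ = test_isometry(V, years)   # weak isometry under |x - y|
# rho_log, _ = test_isometry(V, z_log)   # strong isometry, per the paper
```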
Implementation Considerations and Limitations:
- Scalability: The current hypothesis-driven approach requires manual definition of the metric space Zf and df for each feature, which is not scalable for complex features like emotions. Developing methods to learn the feature space geometry from the representations is a direction for future work.
- Dimensionality: Using PCA projections simplifies analysis but might discard dimensions containing important, albeit perhaps less easily interpretable, geometric structure. Semantic similarity is likely richer than simple 1D or 2D metric spaces.
- Manifold Estimation Noise: Estimating geodesic distances on the manifold using K-NN graphs is sensitive to noise and requires careful selection of K and potentially manual graph pruning to avoid "short circuits". More robust manifold learning techniques could improve this.
- SAEs and Manifolds: The paper conjectures that SAEs, while based on a linear sparse coding idea, might learn dictionaries whose vectors trace representation manifolds. This could explain phenomena like feature splitting and suggests the potential for "manifold-aware" SAEs designed to explicitly recover these geometric structures.
In conclusion, this research provides a valuable theoretical framework and empirical methodology for analyzing and interpreting the geometric structure of concept representations in LLMs. It highlights that these representations can capture the intrinsic geometry of features and opens avenues for developing new interpretability tools that account for manifold structure.