The Geometry of Categorical and Hierarchical Concepts in Large Language Models (2406.01506v3)
Abstract: The linear representation hypothesis is the informal idea that semantic concepts are encoded as linear directions in the representation spaces of LLMs. Previous work has shown how to make this notion precise for representing binary concepts that have natural contrasts (e.g., {male, female}) as directions in representation space. However, many natural concepts do not have natural contrasts (e.g., whether the output is about an animal). In this work, we show how to extend the formalization of the linear representation hypothesis to represent features (e.g., is_animal) as vectors. This allows us to immediately formalize the representation of categorical concepts as polytopes in the representation space. Further, we use the formalization to prove a relationship between the hierarchical structure of concepts and the geometry of their representations. We validate these theoretical results on the Gemma and LLaMA-3 LLMs, estimating representations for 900+ hierarchically related concepts using data from WordNet.
Summary
- The paper demonstrates that LLMs encode semantic concepts via linear geometric structures, enabling manipulation of both categorical and hierarchical representations.
- It leverages a causal inner product to unify the embedding and unembedding spaces, and linear discriminant analysis to estimate concept vectors, revealing orthogonal and simplex structures in representation space.
- Empirical validation on Gemma-2B with WordNet hierarchies confirms the geometric encoding, offering actionable insights for enhancing model interpretability and control.
This paper, "The Geometry of Categorical and Hierarchical Concepts in Large Language Models" (2406.01506), investigates how LLMs encode semantic meaning, focusing on categorical concepts (like "animal") and the hierarchical relationships between them (like "dog" being a kind of "mammal"). Understanding this encoding matters for LLM interpretability and control, since it opens the possibility of monitoring or steering semantic behavior by interacting with internal vector representations.
The research starts from the "linear representation hypothesis," which suggests that high-level concepts are linearly encoded in LLMs' representation spaces. A challenge is formalizing what "linear" and "high-level concept" mean beyond simple binary, counterfactual concepts. This paper extends previous work by formalizing concepts more generally and exploring non-binary and hierarchically related concepts.
The paper defines a concept as a latent variable that is caused by the context and in turn causes the output. Two concepts are "causally separable" if they can be manipulated freely and independently of one another. Hierarchical relations are defined via the sets of tokens associated with attributes: an attribute $z$ is subordinate to $w$ (written $z \prec w$) if the set of tokens expressing $z$ is a subset of the set expressing $w$. A categorical concept is subordinate to another if all of its values are subordinate to a single value of the parent concept.
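As a concrete illustration of the subset criterion, here is a minimal sketch using NLTK's WordNet interface (the helper names are ours, not the paper's): gather each synset's word set from its own lemmas plus those of all its hyponyms, then test subordination as set inclusion.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def word_set(synset):
    """All lemma names for a synset and its transitive hyponyms,
    a proxy for Y(w), the set of words expressing attribute w."""
    synsets = {synset} | set(synset.closure(lambda s: s.hyponyms()))
    return {lemma for s in synsets for lemma in s.lemma_names()}

def is_subordinate(z, w):
    """z is subordinate to w iff every word expressing z also expresses w."""
    return word_set(z) <= word_set(w)

print(is_subordinate(wn.synset("mammal.n.01"), wn.synset("animal.n.01")))
# expected: True, i.e. mammal is subordinate to animal
```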
A key step is unifying the LLM's context embedding space ($\Lambda$) and token unembedding space ($\Gamma$). The paper relies on the idea of a "causal inner product," which transforms these spaces via an invertible matrix (applied as $A$ to unembeddings and $A^{-\top}$ to embeddings) and an origin shift $\bar\gamma_0$, so that the Euclidean inner product in the transformed space aligns with semantic relationships (e.g., orthogonal directions for causally separable concepts). This transformation makes the embedding and unembedding spaces effectively the same for geometric analysis.
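In the experiments described below, the canonical space is estimated by whitening the unembedding matrix, which is one instance of such a transformation. The sketch below is illustrative, not the paper's released code:

```python
import numpy as np

def canonicalize(gamma, eps=1e-6):
    """Whiten raw unembedding vectors gamma (vocab_size x dim):
    center at the mean gamma_bar_0, then apply A = Cov^{-1/2}, so the
    Euclidean inner product in the new space plays the role of a causal
    inner product. Context embeddings would be mapped with A^{-T}."""
    gamma_bar_0 = gamma.mean(axis=0)
    centered = gamma - gamma_bar_0
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    A = eigvecs @ np.diag((eigvals + eps) ** -0.5) @ eigvecs.T
    return centered @ A, A, gamma_bar_0  # A is symmetric, so A.T == A
```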
The definition of a linear representation for a binary concept is refined. Unlike previous definitions, which only required that the representation not affect causally separable concepts, this paper adds the requirement that it also not affect the probabilities of subordinate concepts. This ensures that, for example, an "animal" representation does not accidentally conflate properties specific to "mammals" versus "birds" within the "animal" category.
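Schematically, and only as a paraphrase of the requirements above (not the paper's exact statement): adding $\alpha \bar\ell_W$ to a context embedding $\lambda$ should change the target concept $W$ and nothing else,

$$\frac{\mathbb{P}(W = w_1 \mid \lambda + \alpha \bar\ell_W)}{\mathbb{P}(W = w_0 \mid \lambda + \alpha \bar\ell_W)} \text{ increasing in } \alpha,$$

while the probabilities of concepts causally separable from $W$, and of concepts subordinate to its values, remain unchanged.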
The paper then moves from representations as directions to representations as vectors with magnitude. Theorem 1 (Magnitude) shows that, for a binary feature representing an attribute $w$, there exists a choice of origin in the transformed unembedding space such that the dot product between the concept's representation vector $\bar\ell_w$ and the unembedding vector $g(y)$ of a token $y$ is a positive constant $b_w$ if $y$ has attribute $w$, and zero otherwise. This provides a natural magnitude for the representation, proportional to $b_w$, and licenses vector operations such as addition and subtraction on concept representations.
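In symbols (paraphrasing Theorem 1), with $Y(w)$ denoting the set of tokens that carry attribute $w$:

$$\bar\ell_w^{\top}\, g(y) \;=\; \begin{cases} b_w > 0, & y \in Y(w) \\ 0, & \text{otherwise,} \end{cases}$$

so $\bar\ell_w$ carries a canonical magnitude as well as a direction.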
Using these vector representations, the paper establishes how semantic hierarchy is encoded geometrically in the representation space. Theorem 2 (Orthogonality) predicts several forms of orthogonality tied to the hierarchical structure (a numerical check is sketched after this list):
- The difference between two feature vectors, $\bar\ell_{w_1} - \bar\ell_{w_0}$, represents the contrast between the two attributes.
- A concept vector $\bar\ell_w$ is orthogonal to the vector representing a subordinate attribute contrast, such as $\bar\ell_z - \bar\ell_w$ for $z \prec w$.
- The vector representing a parent contrast (e.g., $\bar\ell_{w_1} - \bar\ell_{w_0}$) is orthogonal to the vector representing a child contrast subordinate to it (e.g., $\bar\ell_{z_1} - \bar\ell_{z_0}$, where $\{z_0, z_1\}$ is subordinate to $\{w_0, w_1\}$).
- Differences between hierarchically nested concepts are orthogonal (e.g., $\bar\ell_{w_1} - \bar\ell_{w_0}$ is orthogonal to $\bar\ell_{w_2} - \bar\ell_{w_1}$ if $w_2 \prec w_1 \prec w_0$).
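As a minimal numerical check of these predictions, assuming concept vectors have already been estimated in the canonical space (the dictionary `ell` and the concept names below are placeholders, not objects from the paper's code):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; values near 0 indicate (near-)orthogonality."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def check_child_parent(ell, parent, child):
    """Theorem 2 predicts the parent vector is orthogonal to the
    child-minus-parent contrast, so this should return a value near 0."""
    contrast = ell[child] - ell[parent]
    return cosine(ell[parent], contrast)

# ell maps attribute names to estimated vectors, e.g.
# ell = {"animal": v1, "mammal": v2, "bird": v3}
# check_child_parent(ell, "animal", "mammal")  # expect ~0
```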
For categorical concepts with multiple values (e.g., {mammal, bird, fish}), the paper defines their representation as the convex hull (polytope) of the vector representations of the constituent binary features. Theorem 3 (Simplex) predicts that for "natural" categorical concepts, where the model can freely manipulate the probabilities of the constituent values, the vector representations form a $(k-1)$-simplex in the representation space, where $k$ is the number of values.
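In symbols (a paraphrase): a categorical concept with values $\{z_1, \dots, z_k\}$ is represented by the polytope

$$\mathrm{conv}\{\bar\ell_{z_1}, \dots, \bar\ell_{z_k}\},$$

and Theorem 3 says that for natural concepts these $k$ vertices are affinely independent, so the polytope is a $(k-1)$-simplex.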
The combined theoretical results suggest a simple geometric structure: hierarchical concepts are represented as direct sums of simplices, with the direct sum structure arising from the orthogonality predictions. This structure is summarized visually in Figure 1 of the paper.
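Concretely, the orthogonality of $\bar\ell_w$ and $\bar\ell_z - \bar\ell_w$ for $z \prec w$ means a subordinate vector splits into a parent component plus an orthogonal refinement, e.g.

$$\bar\ell_{\text{mammal}} = \bar\ell_{\text{animal}} + (\bar\ell_{\text{mammal}} - \bar\ell_{\text{animal}}), \qquad \bar\ell_{\text{animal}} \perp (\bar\ell_{\text{mammal}} - \bar\ell_{\text{animal}}),$$

so each level of the hierarchy occupies its own orthogonal directions and the simplices at different levels combine as a direct sum.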
Empirical validation was performed using the Gemma-2B model and concepts derived from the WordNet hierarchy (Miller, 1995).
- Setup: The canonical space was estimated by whitening the unembedding matrix. Attributes and hierarchies were extracted from WordNet synsets, filtering for those with sufficient vocabulary coverage (593 noun, 364 verb synsets).
- Estimation: Vector representations $\bar\ell_w$ for each attribute were estimated with a variant of Linear Discriminant Analysis (LDA) applied to the unembedding vectors of words in the synset's word set $Y(w)$. The method seeks a projection direction that separates words in $Y(w)$ from the rest of the vocabulary while minimizing variance within $Y(w)$, in line with Theorem 1 (a rough sketch follows this list).
- Results:
- Evaluations (Figure 5 for nouns, Figure 8 in appendix for verbs) showed that projections of test words onto their estimated concept vectors were consistently close to 1 (after normalization), supporting the existence of vector representations as described in Theorem 1.
- Visualizations (Figures 3 and 4) demonstrated the predicted orthogonality and simplex structure for specific concepts like "animal" and its subcategories. Figure 3 shows projections onto 2D subspaces spanned by concept vectors, illustrating orthogonality between parent concepts and child contrasts. Figure 4 shows a 3D simplex for "mammal," "bird," "fish" vectors and its orthogonality to the "animal" vector.
- Large-scale analysis across the WordNet hierarchy (Figure 6 for nouns, Figure 9 in appendix for verbs) showed heatmaps of cosine similarity between concept vectors $\bar\ell_w$ and child-parent difference vectors $\bar\ell_w - \bar\ell_{\mathrm{parent}(w)}$. The $\bar\ell_w$ similarities clearly reflected the WordNet hierarchy (high similarity between parent and child concepts). Crucially, the child-parent difference vectors exhibited low cosine similarity (near orthogonality), empirically confirming Theorem 2's orthogonality predictions across the learned representations.
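A rough sketch of an LDA-style estimator consistent with the description above (not the repository's exact code; `g` is the matrix of canonical unembedding vectors and `mask` flags the rows belonging to $Y(w)$):

```python
import numpy as np

def estimate_concept_vector(g, mask, reg=1e-4):
    """LDA-style estimate of the concept vector for an attribute w.

    Finds a direction separating words in Y(w) (mask == True) from the
    rest of the vocabulary while keeping within-Y(w) variance small,
    then rescales so the mean projection of Y(w) words equals 1
    (mirroring the per-attribute constant b_w of Theorem 1).
    """
    mu_in = g[mask].mean(axis=0)
    mu_out = g[~mask].mean(axis=0)
    # regularized within-class covariance of the attribute's own words
    cov_in = np.cov(g[mask], rowvar=False) + reg * np.eye(g.shape[1])
    direction = np.linalg.solve(cov_in, mu_in - mu_out)
    projections = g[mask] @ direction
    return direction / projections.mean()  # Y(w) projections average to 1
```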
Practical Implications and Future Work:
The findings provide a fundamental understanding of how semantic hierarchy is represented in LLMs' vector spaces.
- Interpretability: This geometry suggests new approaches for interpreting LLMs. Instead of searching for sparse features that align with absolute concept vectors $\bar\ell_w$, which are often highly collinear for related concepts, it may be more effective to search for features corresponding to the orthogonal difference vectors (e.g., $\bar\ell_{\text{mammal}} - \bar\ell_{\text{animal}}$). This would align interpretability methods such as sparse autoencoders with the underlying hierarchical structure.
- Control: A clearer understanding of this geometric structure could lead to more precise methods for steering LLM outputs based on complex, structured concepts.
- Limitations: The analysis primarily focuses on the final layer representation due to the estimation method for the canonical space. Extending this understanding to internal layers is an important open problem.
The code used for the experiments is publicly available at github.com/KihoPark/LLM_Categorical_Hierarchical_Representations, allowing practitioners to replicate the findings and apply the estimation methods to other LLMs or concept hierarchies.
Related Papers
- The Linear Representation Hypothesis and the Geometry of Large Language Models (2023)
- Solving Hard Analogy Questions with Relation Embedding Chains (2023)
- Hierarchical Semantic Tree Concept Whitening for Interpretable Image Classification (2023)
- On the Origins of Linear Representations in Large Language Models (2024)
- On the universal structure of human lexical semantics (2015)