- The paper formalizes the notion of linear representation in LLMs, distinguishing representations of high-level concepts in the embedding (input) space from those in the unembedding (output) space.
- The authors introduce a causal inner product that unifies these geometric interpretations, so that concepts can be both measured and intervened on within a single framework.
- Experimental results on LLaMA-2 demonstrate that the choice of inner product critically impacts model interpretability and control.
The Linear Representation Hypothesis and the Geometry of LLMs: A Formal Investigation
The paper "The Linear Representation Hypothesis and the Geometry of LLMs" explores the formal underpinnings of the often-discussed Linear Representation Hypothesis concerning LLMs. The authors aim to clarify the notion of linear representation, its implications for understanding geometric properties in representation spaces, and its connections to key model interpretability and control methodologies.
The authors begin by defining the Linear Representation Hypothesis in the context of LLMs: the hypothesis that high-level concepts are represented linearly, as directions in the model's representation space. If this holds, simple linear algebraic operations can be used to read off what the model encodes and even to steer its outputs. The paper distinguishes several interpretations of linear representation, namely the subspace, measurement, and intervention notions, and scrutinizes how they relate to one another.
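To make the hypothesis concrete before turning to the contributions, here is a minimal sketch, with random vectors standing in for real model activations, of how a linearly represented concept can be recovered as a direction from counterfactual pairs. None of the names or numbers below come from the paper; they are purely illustrative.

```python
import numpy as np

# Toy illustration of the hypothesis: if a concept (say, grammatical gender)
# is linearly represented, the difference between representations of
# counterfactual pairs ("king"/"queen", "actor"/"actress", ...) should point
# in a roughly consistent direction. Random vectors stand in for activations.
rng = np.random.default_rng(0)
dim = 64
true_direction = rng.normal(size=dim)
true_direction /= np.linalg.norm(true_direction)

def synthetic_pair():
    """Return a (base, counterfactual) pair differing mainly along the concept."""
    base = rng.normal(size=dim)
    counterfactual = base + 2.0 * true_direction + 0.2 * rng.normal(size=dim)
    return base, counterfactual

diffs = np.stack([c - b for b, c in (synthetic_pair() for _ in range(50))])

# Estimate the concept direction as the mean difference vector, then check
# how strongly individual pair differences align with it.
concept_direction = diffs.mean(axis=0)
concept_direction /= np.linalg.norm(concept_direction)
cosines = diffs @ concept_direction / np.linalg.norm(diffs, axis=1)
print(f"mean cosine with estimated concept direction: {cosines.mean():.3f}")
```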
Key Contributions
- Formalization of Linear Representations: The paper gives a rigorous formalization of linear representation, split between the output (unembedding) space and the input (embedding) space. The unembedding-space reading corresponds to measuring a concept with a linear probe, while the embedding-space reading corresponds to intervening on a concept with a steering vector; a minimal sketch of both readings follows this list.
- Geometric Interpretations: A central contribution is the introduction of a causal inner product, an inner product chosen so that causally separable concepts are represented by orthogonal vectors. This choice respects the structure of language and unifies the previously disparate notions of linear representation across the two spaces (see the second sketch below).
- Experimental Validation: Using the LLaMA-2 LLM as a testbed, the authors empirically demonstrate the existence of linear representations for various concepts. Experiments highlight that the choice of inner product fundamentally influences the interpretation and control of LLM behavior, bolstering the connection between linearity, measurement, and intervention.
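As a rough illustration of the measurement and intervention readings above, the sketch below scores a concept by projecting a hidden state onto a hypothetical unembedding-space probe, then intervenes by adding an embedding-space steering vector. The vectors and the strength parameter `alpha` are made up for illustration; this is not the paper's experimental code.

```python
import numpy as np

# Hypothetical vectors only; not taken from the paper or any real model.
rng = np.random.default_rng(1)
dim = 64

# Measurement (unembedding space): a concept vector acts as a linear probe,
# and the projection of the final hidden state onto it indicates how strongly
# the concept is expressed in the next-token distribution.
concept_probe = rng.normal(size=dim)
concept_probe /= np.linalg.norm(concept_probe)
hidden_state = rng.normal(size=dim)
score_before = hidden_state @ concept_probe

# Intervention (embedding space): a steering vector is added to the hidden
# state to push generation toward the concept. For simplicity the steering
# vector is taken to be the probe direction itself; in general the two
# representations differ and are related through the choice of inner product.
steering_vector = concept_probe
alpha = 4.0  # intervention strength (made up for illustration)
steered_state = hidden_state + alpha * steering_vector
score_after = steered_state @ concept_probe

print(f"concept score before: {score_before:.2f}, after: {score_after:.2f}")
```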
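The causal inner product can also be made concrete. One instantiation discussed in the paper takes the metric to be the inverse covariance of the unembedding vectors, so that whitening by that covariance turns the causal inner product into the ordinary Euclidean one. The sketch below assumes that choice and uses a random matrix in place of a real unembedding matrix such as LLaMA-2's.

```python
import numpy as np

# Sketch of a causal inner product, assuming the metric M = Cov(gamma)^{-1},
# where gamma denotes the unembedding vectors. A random matrix stands in for
# the real unembedding matrix; the anisotropy is deliberate.
rng = np.random.default_rng(2)
vocab, dim = 1000, 64
unembeddings = rng.normal(size=(vocab, dim)) @ rng.normal(size=(dim, dim))

cov = np.cov(unembeddings, rowvar=False)
metric = np.linalg.inv(cov)

def causal_inner_product(x, y):
    """<x, y>_C = x^T Cov(gamma)^{-1} y under the assumed choice of metric."""
    return x @ metric @ y

# Equivalently, whiten once with Cov^{-1/2}; the causal inner product then
# reduces to the ordinary Euclidean dot product in the transformed space.
eigvals, eigvecs = np.linalg.eigh(cov)
whiten = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

x, y = rng.normal(size=dim), rng.normal(size=dim)
print(np.isclose(causal_inner_product(x, y), (whiten @ x) @ (whiten @ y)))
```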
Theoretical and Practical Implications
The paper's results mark a significant step toward formalizing how LLMs encode high-level language concepts. By relating the two representation spaces and formalizing the notion of a causal inner product, this work lays the groundwork for more systematic interpretability studies. Practitioners who understand this linear structure can, in principle, derive more reliable methods for probing and controlling LLMs.
Moreover, the findings raise intriguing possibilities for extending LLM capabilities through linear algebraic techniques grounded in a causal understanding of concept separability. Since the choice of inner product directly shapes the methods available for steering and interpreting model behavior, future research may explore alternative inner products that further improve the efficacy of model manipulation.
Speculation on Future Developments
One can speculate that as LLMs continue to evolve, the methods developed in this work might be central to creating models that are not only more interpretable but also safer and more reliable. With a robust understanding of the geometric representation of concepts, future models might offer refined control over their behavior in diverse contextual settings, leading to applications with greater utility and precision.
Overall, this paper provides a meticulous exploration of the linear representation of concepts within LLMs, setting the stage for refined techniques in AI interpretability and control that can align performance with user intentions while maintaining conceptual integrity.