- The paper demonstrates that sentiment in LLMs is captured as a single linear direction in activation space.
- It employs methods such as PCA, K-means, linear probing, and Distributed Alignment Search (DAS) to find this direction; ablating it causes a 76% drop in classification accuracy on SST.
- The study identifies a "summarization motif": attention heads aggregate sentiment information at intermediate tokens such as punctuation, revealing neural circuits crucial for processing sentiment.
Linear Representations of Sentiment in LLMs
This paper addresses how sentiment is internally represented within LLMs, investigating the hypothesis that sentiment can be captured as a single linear direction in activation space. The authors conduct an extensive analysis using various models and datasets, including toy datasets crafted for the paper and the Stanford Sentiment Treebank (SST).
Sentiment Direction and Causal Interventions
The central claim of the paper is that sentiment in LLMs is represented linearly, with a single direction in activation space distinguishing positive from negative sentiment. This linearity is observed consistently across tasks and models, including GPT-2 and Pythia variants. Several methods for finding the direction, namely Mean Difference, K-means clustering, Linear Probing, Distributed Alignment Search (DAS), and Principal Component Analysis (PCA), converge on similar directions, suggesting that the identified representation is robust to the choice of method.
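As a concrete illustration of this convergence, here is a minimal sketch (not the authors' code) comparing two of these methods on hypothetical activation arrays: the mean-difference direction and the top PCA component. The inputs `pos_acts` and `neg_acts` are assumed to be residual-stream activations collected at a fixed token position for positive and negative prompts.

```python
# Minimal sketch: do two simple direction-finding methods agree?
# `pos_acts` / `neg_acts` are hypothetical arrays of shape (n_prompts, d_model).
import numpy as np

def mean_difference_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def top_pca_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """First principal component of the pooled, centered activations."""
    acts = np.concatenate([pos_acts, neg_acts], axis=0)
    acts = acts - acts.mean(axis=0)
    # Right singular vectors of the centered data matrix are the principal components.
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    return vt[0]

def abs_cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# If sentiment really is captured by one linear direction, the two estimates should
# agree up to sign, i.e. abs_cosine_similarity(...) should be close to 1.
```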
The authors apply causal interventions, in particular directional activation patching, to validate the importance of the sentiment direction. This approach isolates the component of the activations along that direction and demonstrates its causal role in sentiment predictions on both the toy and real-world datasets. Notably, ablating this direction produces a 76% drop in classification accuracy on SST, underscoring its critical role.
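To make the intervention concrete, the following is a rough sketch of directional patching and directional ablation applied to a residual-stream activation, assuming a unit-norm sentiment direction has already been estimated. The function names and tensor shapes are illustrative, not taken from the paper's implementation.

```python
# Illustrative sketch of interventions along a single direction (assumed names/shapes).
import torch

def patch_along_direction(act: torch.Tensor,
                          counterfactual_act: torch.Tensor,
                          direction: torch.Tensor) -> torch.Tensor:
    """Directional activation patching: keep `act` unchanged except for its component
    along `direction`, which is swapped in from a counterfactual (opposite-sentiment) run.
    act, counterfactual_act: (..., d_model); direction: unit vector of shape (d_model,)."""
    coef = act @ direction                    # component along the sentiment direction
    cf_coef = counterfactual_act @ direction  # counterfactual component
    return act + (cf_coef - coef).unsqueeze(-1) * direction

def ablate_direction(act: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Directional ablation: project out the component along `direction` entirely."""
    coef = act @ direction
    return act - coef.unsqueeze(-1) * direction
```

In practice such a function would be applied via a forward hook on the residual stream; the paper's exact ablation may differ in detail (for instance, replacing the component with a dataset mean rather than zeroing it).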
Mechanisms and Summarization
In dissecting LLM circuitry, the paper identifies specific attention heads and neurons that participate in processing sentiment, coining the term "summarization motif". This motif refers to the model's strategy of aggregating sentiment information at intermediate tokens, such as punctuation and proper nouns, rather than relying solely on emotion-laden words. This insight is pivotal for understanding how models abstract and condense sentiment information. The paper finds that this summarized sentiment contributes substantially to classification: ablating the sentiment direction at comma positions alone accounts for roughly 36% of the performance loss on the SST validation set, as in the position-restricted ablation sketched below.
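Restricting the same kind of ablation to particular token positions is what isolates the summarization effect. Below is a hypothetical sketch, assuming a boolean mask marking comma positions is available; the helper name and shapes are illustrative, not the paper's.

```python
# Hypothetical helper: ablate the sentiment direction only at masked ("summary") positions.
import torch

def ablate_direction_at_positions(resid: torch.Tensor,
                                  direction: torch.Tensor,
                                  position_mask: torch.Tensor) -> torch.Tensor:
    """resid: (batch, seq, d_model); direction: unit vector (d_model,);
    position_mask: boolean (batch, seq), True at comma tokens."""
    coef = resid @ direction                        # (batch, seq) components
    coef = coef * position_mask.to(resid.dtype)     # remove only at masked positions
    return resid - coef.unsqueeze(-1) * direction

# Comparing accuracy under comma-only ablation with full-sequence ablation is one way
# to estimate the share of the drop attributable to summarization; the paper reports
# roughly 36% on the SST validation set.
```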
Implications and Future Directions
The implications are multifaceted, impacting both theoretical understanding and practical applications. A single, consistent direction for sentiment points to the model's capacity for linear abstraction, a feature that could extend beyond sentiment to other latent variables. The identification of summarization motifs suggests that LLMs possess mechanisms akin to attention-based summarization, albeit more organically integrated within their architecture.
For future research, exploring the summarization motif in other contexts could unveil more about how the model structures abstract information. Additionally, extending this methodology to other latent constructs within language, such as emotion gradients or syntactic roles, may reveal similarly structured representations.
Conclusion
While many questions remain, particularly regarding the exact nature and universality of sentiment representation, this paper provides a compelling case for the linearity and causal significance of sentiment in LLMs. The methodologies and insights offered have potential applications in the interpretability of models, contributing to safer and more transparent AI systems. The exploration establishes a framework for future research into deciphering the complex latent spaces within neural networks.