Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders

Published 9 Oct 2024 in cs.LG, cs.AI, and cs.CL | arXiv:2410.06981v4

Abstract: The Universality Hypothesis in LLMs claims that different models converge towards similar concept representations in their latent spaces. Providing evidence for this hypothesis would enable researchers to exploit universal properties, facilitating the generalization of mechanistic interpretability techniques across models. Previous works studied if LLMs learned the same features, which are internal representations that activate on specific concepts. Since comparing features across LLMs is challenging due to polysemanticity, in which LLM neurons often correspond to multiple unrelated features rather than to distinct concepts, sparse autoencoders (SAEs) have been employed to disentangle LLM neurons into SAE features corresponding to distinct concepts. In this paper, we introduce a new variation of the universality hypothesis called Analogous Feature Universality: we hypothesize that even if SAEs across different models learn different feature representations, the spaces spanned by SAE features are similar, such that one SAE space is similar to another SAE space under rotation-invariant transformations. Evidence for this hypothesis would imply that interpretability techniques related to latent spaces, such as steering vectors, may be transferred across models via certain transformations. To investigate this hypothesis, we first pair SAE features across different models via activation correlation, and then measure spatial relation similarities between paired features via representational similarity measures, which transform spaces into representations that reveal hidden relational similarities. Our experiments demonstrate high similarities for SAE feature spaces across various LLMs, providing evidence for feature space universality.

Summary

  • The paper employs sparse autoencoders to transform LLM activations into interpretable feature spaces, revealing substantial universality across different models.
  • It uses dictionary learning with SAEs and SVCCA to rigorously quantify feature similarities and highlight the significance of middle-layer representations.
  • The findings suggest potential for efficient training paradigms and enhanced AI safety through improved interpretability and control of LLM representations.

Sparse Autoencoders Reveal Universal Feature Spaces Across LLMs

This paper addresses feature universality in LLMs: the question of whether and how different models represent concepts similarly in their latent spaces. Comparing features across LLMs is complicated primarily by polysemanticity, the phenomenon in which individual neurons encode multiple unrelated features. To disentangle and analyze these representations, the authors employ sparse autoencoders (SAEs) that transform LLM activations into more interpretable feature spaces.
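As a concrete illustration, below is a minimal sketch of the kind of sparse autoencoder used for this purpose: an overcomplete linear encoder/decoder trained to reconstruct LLM activations under an L1 sparsity penalty. The dimensions, expansion factor, and loss coefficient here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: maps d_model-dimensional LLM activations into an
    overcomplete, sparsely activating feature space and reconstructs them."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # non-negative, sparse feature activations
        x_hat = self.decoder(f)           # reconstruction in model space
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Hypothetical usage: 768-dim residual-stream activations, 8x feature expansion.
sae = SparseAutoencoder(d_model=768, d_features=8 * 768)
acts = torch.randn(32, 768)               # stand-in for a batch of LLM activations
x_hat, feats = sae(acts)
loss = sae_loss(acts, x_hat, feats)
loss.backward()
```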

The methodology uses dictionary learning with SAEs to decompose LLM activations into individual, largely monosemantic features. Features are then paired across models via activation correlation, and representational similarity metrics, such as Singular Value Canonical Correlation Analysis (SVCCA), quantify how similarly the paired feature spaces are organized across various LLMs. Through comprehensive experiments, the paper demonstrates that SAE feature spaces exhibit substantial similarities across different models, providing compelling evidence for universal features.
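The two-step comparison described in the abstract, pairing SAE features across models by activation correlation and then scoring the paired spaces with a representational similarity measure, might be sketched as follows. The function names, shapes, the simple highest-correlation matching, and the variance threshold are assumptions for illustration; the paper's exact pairing and preprocessing choices may differ.

```python
import numpy as np

def pair_features(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Pair SAE features across two models by activation correlation.

    acts_a: (n_tokens, n_feat_a) and acts_b: (n_tokens, n_feat_b) are SAE
    feature activations on the same token set. Returns, for each feature of
    model A, the index of its most-correlated feature in model B.
    """
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = a.T @ b / a.shape[0]            # (n_feat_a, n_feat_b) Pearson correlations
    return corr.argmax(axis=1)

def svcca(x: np.ndarray, y: np.ndarray, keep: float = 0.99) -> float:
    """SVCCA score between two row-aligned representations.

    Centers each space, keeps the top singular directions explaining `keep`
    of the variance, runs CCA on the reduced spaces (via QR), and returns
    the mean canonical correlation.
    """
    def reduce(z):
        z = z - z.mean(0)
        u, s, _ = np.linalg.svd(z, full_matrices=False)
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep)) + 1
        return u[:, :k] * s[:k]

    qx, _ = np.linalg.qr(reduce(x))
    qy, _ = np.linalg.qr(reduce(y))
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)  # canonical correlations
    return float(rho.mean())

# Hypothetical usage with stand-in data: activations over 2048 shared tokens,
# two SAEs with 512 and 768 features whose decoder rows live in 256-dim space.
rng = np.random.default_rng(0)
acts_a, acts_b = rng.random((2048, 512)), rng.random((2048, 768))
dec_a, dec_b = rng.normal(size=(512, 256)), rng.normal(size=(768, 256))
pairs = pair_features(acts_a, acts_b)
score = svcca(dec_a, dec_b[pairs])         # rows aligned by the correlation pairing
```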

The implications of this research are multifaceted. Practically, it suggests the potential for more efficient training paradigms and transfer learning techniques that leverage universal features. Theoretically, it contributes to the ongoing discourse on model interpretability and the internal mechanisms of LLMs. The evidence presented may also facilitate advances in AI safety and control by making it easier to identify and mitigate potentially harmful model representations.

Notably, the paper presents a novel framework that contrasts with traditional evaluations by examining feature weight representations rather than just activations. This methodological innovation allows for a deeper investigation into the model's representational capabilities. Additionally, the work underscores the significance of middle layers in LLMs for representing semantically meaningful subspaces, corroborating previous findings about the efficacy of these layers in capturing complex concepts.
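For concreteness, the "feature weight representation" of an SAE is its decoder matrix: each row (after transposing PyTorch's layout) is the direction in model space that one feature writes to. The short sketch below, reusing the hypothetical SparseAutoencoder defined above, shows how such directions could be extracted; comparing them layer by layer is how an analysis along these lines would connect to the paper's observation about middle-layer representations.

```python
# Decoder weight of nn.Linear(d_features, d_model) has shape (d_model, d_features);
# transposing gives one row per SAE feature, i.e. one feature direction each.
feature_dirs = sae.decoder.weight.data.T                               # (d_features, d_model)
feature_dirs = feature_dirs / feature_dirs.norm(dim=1, keepdim=True)   # unit-normalize directions
weights_np = feature_dirs.cpu().numpy()    # e.g. to feed into svcca() from the sketch above
```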

In conclusion, this paper contributes significantly to the understanding of feature universality in LLMs, advocating for further exploration of how these insights could enhance AI's interpretability and controllability. The findings open new avenues for future research, including potential applications in AI safety and the development of more efficient AI models. As the field continues to advance, the exploration of universal features across diverse architectures and the implications for model design and optimization will remain a critical area of inquiry.
