Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (2403.19647v2)

Published 28 Mar 2024 in cs.LG, cs.AI, and cs.CL

Abstract: We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining LLM behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.

Authors (6)
  1. Samuel Marks (18 papers)
  2. Can Rager (12 papers)
  3. Eric J. Michaud (17 papers)
  4. Yonatan Belinkov (111 papers)
  5. David Bau (62 papers)
  6. Aaron Mueller (35 papers)
Citations (64)

Summary

Discovering and Editing Interpretable Causal Graphs in LLMs

Introduction to Sparse Feature Circuits

Research on interpretability in LLMs has pursued several avenues, one of which is understanding the internal mechanisms, or circuits, that give rise to a model's behavior. Traditional approaches focus on coarse-grained components such as attention heads or MLP modules; these have provided valuable insights, but the polysemantic nature of such units complicates downstream applications. This paper introduces sparse feature circuits as a scalable, interpretable way to dissect the inner workings of LLMs.

Sparse Feature Circuits and Their Discovery

Sparse feature circuits are computational subgraphs of an LLM whose nodes are fine-grained, human-interpretable units. These units come from sparse autoencoders (SAEs) trained to identify interpretable directions in the model's latent space, which sidesteps the difficulty of finding suitable fine-grained units for analysis. To discover circuits efficiently, the method uses linear approximations, specifically attribution patching and integrated gradients, to estimate which sparse features, and which connections between them, are causally implicated in a behavior. Because these approximations avoid exhaustive causal interventions, the approach scales to the vast computational graphs of modern LLMs.
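As a rough illustration of the attribution-patching step, the sketch below estimates each feature's indirect effect on a behavior metric with a first-order approximation, IE(f) ≈ (a_patch[f] − a_clean[f]) · ∂m/∂a[f], and keeps the features whose estimate clears a node threshold. The tensors, the toy `metric` function, and the threshold value are hypothetical stand-ins for illustration, not the paper's actual interfaces.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: SAE feature activations on a clean prompt and on a
# patched (counterfactual) prompt, plus a differentiable behavior metric.
n_features = 16
a_clean = torch.randn(n_features, requires_grad=True)
a_patch = torch.randn(n_features)
readout = torch.randn(n_features)  # pretend downstream readout weights

def metric(acts):
    # Toy behavior metric (e.g. a logit difference) as a function of features.
    return acts @ readout

# Attribution patching: first-order estimate of each feature's indirect effect,
#   IE_hat(f) = (a_patch[f] - a_clean[f]) * d metric / d a_clean[f]
metric(a_clean).backward()
ie_hat = (a_patch - a_clean.detach()) * a_clean.grad

# Keep only features whose estimated effect exceeds a node threshold.
threshold = 0.5
circuit_features = (ie_hat.abs() > threshold).nonzero().flatten()
print(circuit_features)
```

In the actual method the same kind of estimate is computed across many features and layers at once, which is what makes circuit discovery tractable without running a separate intervention per feature.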

Practical Applications and Implications

Sparse feature circuits open new avenues for applying interpretability insights to practical tasks. One such application, SHIFT (Sparse Human-Interpretable Feature Trimming), uses these circuits to improve a classifier's generalization by ablating features that a human judges to be irrelevant to the task. This makes it possible to debias a classifier without labeled data that disambiguates the intended signal from the unintended one, which matters in scenarios where an unintended signal is strongly correlated with the target labels.
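A minimal sketch of the ablation step is shown below, assuming a PyTorch model with a trained SAE attached at one layer; the `sae.encode`/`sae.decode` method names, the layer index, and the feature indices are hypothetical. The idea is simply to zero the judged-irrelevant features in the SAE's feature space on every forward pass, while leaving the SAE's reconstruction error untouched so that only the targeted features change.

```python
import torch

def shift_ablation_hook(sae, irrelevant_features):
    """Forward hook that zeroes human-judged task-irrelevant SAE features.
    `sae` is assumed to expose encode()/decode(); these names are hypothetical."""
    def hook(module, inputs, output):
        feats = sae.encode(output)              # activations in SAE feature space
        error = output - sae.decode(feats)      # reconstruction error, kept as-is
        feats[..., irrelevant_features] = 0.0   # ablate the selected features
        return sae.decode(feats) + error        # edited activation fed downstream
    return hook

# Usage with hypothetical objects: attach the hook at the layer whose features
# were inspected, then evaluate (or further fine-tune) the classifier.
# handle = model.layers[4].register_forward_hook(
#     shift_ablation_hook(sae, irrelevant_features=[12, 803, 4096]))
```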

In addition to targeted applications like SHIFT, the paper describes an unsupervised pipeline that discovers thousands of sparse feature circuits for automatically identified model behaviors. The process starts from raw text, groups contexts into candidate behaviors, and ends with a feature circuit for each behavior, demonstrating that the method scales well beyond hand-picked tasks.
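To make the pipeline concrete, the sketch below clusters per-context summary vectors into candidate behaviors and would then hand each cluster to the circuit-discovery routine sketched earlier. The summary vectors, the cluster count, and the `discover_feature_circuit` helper are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in: one vector per context summarizing how the model
# processes it (in practice these would be derived from the model itself).
context_vectors = rng.normal(size=(10_000, 256)).astype(np.float32)

# 1) Group contexts into candidate "behaviors" without any supervision.
n_behaviors = 200
labels = KMeans(n_clusters=n_behaviors, n_init=10, random_state=0).fit_predict(context_vectors)

# 2) For each behavior, run feature-circuit discovery on its contexts.
#    `texts` and `discover_feature_circuit` are hypothetical helpers.
# for c in range(n_behaviors):
#     cluster_texts = [texts[i] for i in np.flatnonzero(labels == c)]
#     circuit = discover_feature_circuit(model, sae, cluster_texts)
```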

Theoretical and Practical Contributions

The paper's main contribution is to combine the granular insight offered by sparse feature circuits with discovery methods that scale. SHIFT shows how these circuits can be put to practical use against model bias and spurious correlations, while the unsupervised discovery pipeline provides a broad tool for untangling the mechanisms underlying LLM predictions.

Future Directions in AI Interpretability

Looking forward, the development and refinement of sparse feature circuits hold promise for advancing our understanding of LLMs and enhancing their reliability in real-world applications. By shedding light on the specific roles of fine-grained components in model behaviors, researchers can pave the way for more interpretable, fair, and robust AI systems. Furthermore, exploring automated methods for circuit annotation and refinement could streamline the interpretability workflow, making it accessible for a broader range of models and applications.

In conclusion, the advent of sparse feature circuits marks a significant step toward demystifying the black box of LLMs, offering a scalable and interpretable framework for deciphering and editing the causal graphs that drive model behavior.
