Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (2403.19647v2)

Published 28 Mar 2024 in cs.LG, cs.AI, and cs.CL

Abstract: We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining LLM behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.

Authors (6)
  1. Samuel Marks (18 papers)
  2. Can Rager (12 papers)
  3. Eric J. Michaud (17 papers)
  4. Yonatan Belinkov (111 papers)
  5. David Bau (62 papers)
  6. Aaron Mueller (35 papers)
Citations (64)

Summary

Discovering and Editing Interpretable Causal Graphs in LLMs

Introduction to Sparse Feature Circuits

Research on interpretability in LLMs has pursued several avenues, one of which is understanding the internal mechanisms, or circuits, that give rise to a model's behavior. Traditional approaches focus on coarse-grained components such as attention heads or MLP modules; these have provided valuable insights, but the polysemantic nature of such units complicates downstream applications. This paper introduces sparse feature circuits as a scalable, interpretable way to dissect the inner workings of LLMs.

Sparse Feature Circuits and Their Discovery

Sparse feature circuits are computational subgraphs of an LLM whose nodes are fine-grained, human-interpretable units. These units come from sparse autoencoders (SAEs) trained to identify interpretable directions in the model's latent space, which sidesteps the difficulty of finding suitable fine-grained units for analysis. To discover circuits efficiently, the method uses linear approximations, specifically attribution patching and integrated gradients, to estimate which sparse features, and which connections between them, are causally implicated in a behavior. Because these approximations avoid exhaustive causal interventions, the approach scales to the vast computational graphs of modern LLMs.
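As a rough illustration of the attribution-patching step, the sketch below estimates each feature's indirect effect on a behavior metric with a first-order approximation, IE(f) ≈ (a_patch[f] − a_clean[f]) · ∂m/∂a[f], and keeps the features whose estimate clears a node threshold. The tensors, the toy `metric` function, and the threshold value are hypothetical stand-ins for illustration, not the paper's actual interfaces.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: SAE feature activations on a clean prompt and on a
# patched (counterfactual) prompt, plus a differentiable behavior metric.
n_features = 16
a_clean = torch.randn(n_features, requires_grad=True)
a_patch = torch.randn(n_features)
readout = torch.randn(n_features)  # pretend downstream readout weights

def metric(acts):
    # Toy behavior metric (e.g. a logit difference) as a function of features.
    return acts @ readout

# Attribution patching: first-order estimate of each feature's indirect effect,
#   IE_hat(f) = (a_patch[f] - a_clean[f]) * d metric / d a_clean[f]
metric(a_clean).backward()
ie_hat = (a_patch - a_clean.detach()) * a_clean.grad

# Keep only features whose estimated effect exceeds a node threshold.
threshold = 0.5
circuit_features = (ie_hat.abs() > threshold).nonzero().flatten()
print(circuit_features)
```

In the actual method the same kind of estimate is computed across many features and layers at once, which is what makes circuit discovery tractable without running a separate intervention per feature.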

Practical Applications and Implications

Sparse feature circuits open new avenues for applying interpretability insights to practical tasks. One such application, SHIFT (Sparse Human-Interpretable Feature Trimming), uses these circuits to improve a classifier's generalization by ablating features that a human judges to be irrelevant to the task. This makes it possible to debias a classifier without labeled data that disambiguates the intended signal from the unintended one, which matters in scenarios where an unintended signal is strongly correlated with the target labels.
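A minimal sketch of the ablation step is shown below, assuming a PyTorch model with a trained SAE attached at one layer; the `sae.encode`/`sae.decode` method names, the layer index, and the feature indices are hypothetical. The idea is simply to zero the judged-irrelevant features in the SAE's feature space on every forward pass, while leaving the SAE's reconstruction error untouched so that only the targeted features change.

```python
import torch

def shift_ablation_hook(sae, irrelevant_features):
    """Forward hook that zeroes human-judged task-irrelevant SAE features.
    `sae` is assumed to expose encode()/decode(); these names are hypothetical."""
    def hook(module, inputs, output):
        feats = sae.encode(output)              # activations in SAE feature space
        error = output - sae.decode(feats)      # reconstruction error, kept as-is
        feats[..., irrelevant_features] = 0.0   # ablate the selected features
        return sae.decode(feats) + error        # edited activation fed downstream
    return hook

# Usage with hypothetical objects: attach the hook at the layer whose features
# were inspected, then evaluate (or further fine-tune) the classifier.
# handle = model.layers[4].register_forward_hook(
#     shift_ablation_hook(sae, irrelevant_features=[12, 803, 4096]))
```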

In addition to targeted applications like SHIFT, the paper describes an unsupervised pipeline that discovers thousands of sparse feature circuits for automatically identified model behaviors. The process starts from raw text, groups contexts into candidate behaviors, and ends with a feature circuit for each behavior, demonstrating that the method scales well beyond hand-picked tasks.
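To make the pipeline concrete, the sketch below clusters per-context summary vectors into candidate behaviors and would then hand each cluster to the circuit-discovery routine sketched earlier. The summary vectors, the cluster count, and the `discover_feature_circuit` helper are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in: one vector per context summarizing how the model
# processes it (in practice these would be derived from the model itself).
context_vectors = rng.normal(size=(10_000, 256)).astype(np.float32)

# 1) Group contexts into candidate "behaviors" without any supervision.
n_behaviors = 200
labels = KMeans(n_clusters=n_behaviors, n_init=10, random_state=0).fit_predict(context_vectors)

# 2) For each behavior, run feature-circuit discovery on its contexts.
#    `texts` and `discover_feature_circuit` are hypothetical helpers.
# for c in range(n_behaviors):
#     cluster_texts = [texts[i] for i in np.flatnonzero(labels == c)]
#     circuit = discover_feature_circuit(model, sae, cluster_texts)
```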

Theoretical and Practical Contributions

The paper's main contribution is to combine the granular insight offered by sparse feature circuits with discovery methods that scale. SHIFT shows how these circuits can be put to practical use against model bias and spurious correlations, while the unsupervised discovery pipeline provides a broad tool for untangling the mechanisms underlying LLM predictions.

Future Directions in AI Interpretability

Looking forward, the development and refinement of sparse feature circuits hold promise for advancing our understanding of LLMs and enhancing their reliability in real-world applications. By shedding light on the specific roles of fine-grained components in model behaviors, researchers can pave the way for more interpretable, fair, and robust AI systems. Furthermore, exploring automated methods for circuit annotation and refinement could streamline the interpretability workflow, making it accessible for a broader range of models and applications.

In conclusion, the advent of sparse feature circuits marks a significant step toward demystifying the black box of LLMs, offering a scalable and interpretable framework for deciphering and editing the causal graphs that drive model behavior.
