Circuit Component Reuse Across Tasks in Transformer Language Models (2310.08744v3)

Published 12 Oct 2023 in cs.CL and cs.LG

Abstract: Recent work in mechanistic interpretability has shown that behaviors in LLMs can be successfully reverse-engineered through circuit analysis. A common criticism, however, is that each circuit is task-specific, and thus such analysis cannot contribute to understanding the models at a higher level. In this work, we present evidence that insights (both low-level findings about specific heads and higher-level findings about general algorithms) can indeed generalize across tasks. Specifically, we study the circuit discovered in Wang et al. (2022) for the Indirect Object Identification (IOI) task and 1.) show that it reproduces on a larger GPT-2 model, and 2.) that it is mostly reused to solve a seemingly different task: Colored Objects (Ippolito & Callison-Burch, 2023). We provide evidence that the process underlying both tasks is functionally very similar, and contains about a 78% overlap in in-circuit attention heads. We further present a proof-of-concept intervention experiment, in which we adjust four attention heads in middle layers in order to 'repair' the Colored Objects circuit and make it behave like the IOI circuit. In doing so, we boost accuracy from 49.6% to 93.7% on the Colored Objects task and explain most sources of error. The intervention affects downstream attention heads in specific ways predicted by their interactions in the IOI circuit, indicating that this subcircuit behavior is invariant to the different task inputs. Overall, our results provide evidence that it may yet be possible to explain LLMs' behavior in terms of a relatively small number of interpretable task-general algorithmic building blocks and computational components.
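The intervention the abstract describes is the kind of experiment that TransformerLens (reference 16) makes easy to run. As a rough, minimal sketch of what "adjusting" attention heads via forward hooks can look like, the snippet below forces a few hypothetical middle-layer heads to put all of their final-position attention on a chosen token; the layer/head indices, prompt, and target position are illustrative placeholders, not the four heads or values identified in the paper.

```python
# Minimal sketch (not the paper's actual experiment): forcing a few
# attention heads to attend to a chosen position with TransformerLens
# forward hooks. Layer/head indices, prompt, and target position are
# hypothetical placeholders, not the four heads the paper modifies.
from functools import partial

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

prompt = "Q: On the desk, I see an orange pen. What color is the pen? A:"
tokens = model.to_tokens(prompt)

# Hypothetical middle-layer (layer, head) pairs standing in for the
# circuit heads a real analysis would identify.
heads_to_patch = [(7, 3), (8, 6)]
target_pos = 8  # illustrative key position to redirect attention onto

def force_attention(pattern, hook, head, pos):
    # pattern: [batch, n_heads, query_pos, key_pos]. Redirect all of the
    # final query position's attention for this head onto `pos`.
    pattern[:, head, -1, :] = 0.0
    pattern[:, head, -1, pos] = 1.0
    return pattern

with torch.no_grad():
    logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[
            (f"blocks.{layer}.attn.hook_pattern",
             partial(force_attention, head=head, pos=target_pos))
            for layer, head in heads_to_patch
        ],
    )

pred = logits[0, -1].argmax().item()
print("Predicted next token:", model.to_string(pred))
```

The sketch only shows the hook mechanics; in the paper's actual intervention the adjustment targets four specific middle-layer heads so that downstream heads respond as their IOI-circuit roles predict.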

References (25)
  1. Towards automated circuit discovery for mechanistic interpretability, 2023.
  2. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
  3. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  5484–5495, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.446. URL https://aclanthology.org/2021.emnlp-main.446.
  4. Dissecting recall of factual associations in auto-regressive language models, 2023.
  5. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
  6. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
  7. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, 2023.
  8. Generative models as a complex systems science: How can we make sense of large language model behavior? preprint, 2023.
  9. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj.
  10. What changed? investigating debiasing methods using causal mediation analysis. In Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pp.  255–265, Seattle, Washington, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.gebnlp-1.26. URL https://aclanthology.org/2022.gebnlp-1.26.
  11. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  7057–7075, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.574. URL https://aclanthology.org/2020.emnlp-main.574.
  12. Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla, 2023.
  13. Learning long-range spatial dependencies with horizontal gated recurrent units. Advances in neural information processing systems, 31, 2018.
  14. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, 2022.
  15. Language models implement simple word2vec-style vector arithmetic, 2023.
  16. TransformerLens, 2022. URL https://github.com/neelnanda-io/TransformerLens.
  17. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, September 2022. URL https://openreview.net/forum?id=9XFSbDPmdW.
  18. In-context learning and induction heads, 2022.
  19. J. Pearl. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 2001.
  20. Language models are unsupervised multitask learners, 2019.
  21. Investigating transferability in pretrained language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  1393–1401, 2020.
  22. Jesse Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.  37–42, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-3007. URL https://aclanthology.org/P19-3007.
  23. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. In Advances in Neural Information Processing Systems, volume 33, pp.  12388–12401. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/92650b2e92217715fe312e6fa7b90d82-Abstract.html.
  24. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  5797–5808, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1580. URL https://aclanthology.org/P19-1580.
  25. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. In The Eleventh International Conference on Learning Representations, September 2022. URL https://openreview.net/forum?id=NpsVSN6o4ul.