Algorithmic interpretability of mechanisms learned by large language models

Determine whether the internal mechanisms that transformer-based large language models (e.g., GPT-style models) learn during training can be characterized by succinct algorithmic descriptions that are identifiable and verifiable through mechanistic interpretability methods across diverse tasks and model scales.

Background

The chapter reviews transformer architectures and recent progress in mechanistic interpretability, noting work that has reverse-engineered specific circuits (e.g., induction heads and syntax-sensitive heads). Despite these advances, the authors emphasize uncertainty about the generality and compactness of algorithmic explanations for mechanisms learned across tasks and scales. Clarifying whether learned mechanisms admit succinct algorithmic descriptions would inform both cognitive modeling and AI safety by bridging behavior and internal computation.
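To make the notion of a "succinct algorithmic description" concrete, the sketch below illustrates the standard informal description of an induction head (find the previous occurrence of the current token and predict the token that followed it) together with a toy diagnostic that checks whether a given attention pattern matches that description. This is an illustrative assumption-laden example, not the paper's method or a real interpretability pipeline; the function names, the threshold-free scoring, and the hand-built attention matrix are all hypothetical.

```python
# Minimal sketch (assumed, not from the paper): the algorithmic description of an
# induction head, plus a toy check of whether an attention pattern realizes it.
import numpy as np

def induction_prediction(tokens: list[str]) -> str | None:
    """Succinct algorithm: if the sequence contained ...[A][B]... earlier and the
    current token is [A], predict [B] (copy the token that followed the match)."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards for a prior match
        if tokens[i] == current:
            return tokens[i + 1]               # copy the successor of the match
    return None

def induction_score(attn: np.ndarray, tokens: list[str]) -> float:
    """Toy diagnostic: fraction of the last query position's attention mass placed
    on positions immediately after earlier occurrences of its own token."""
    q = len(tokens) - 1
    targets = [i + 1 for i in range(q - 1) if tokens[i] == tokens[q]]
    return float(attn[q, targets].sum()) if targets else 0.0

if __name__ == "__main__":
    toks = ["the", "cat", "sat", "on", "the"]
    print(induction_prediction(toks))          # -> "cat"

    # Hypothetical attention row for the final "the", concentrated on position 1
    # ("cat"), i.e. the position right after the earlier occurrence of "the".
    attn = np.zeros((5, 5))
    attn[4, 1] = 0.9
    attn[4, 0] = 0.1
    print(induction_score(attn, toks))         # -> 0.9
```

The open question is whether mechanisms beyond such well-studied cases admit descriptions this compact and this checkable against internal activations.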

References

Rapid progress has been made in methods for discovering the circuits underlying a given capacity in trained LLMs ("mechanistic interpretability"), but it is an important open question whether the mechanisms learned by LLMs in the course of training will all afford these kinds of succinct algorithmic descriptions.

From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks (arXiv:2405.15164, Russin et al., 24 May 2024), Section 3, Neural Networks in the Age of Deep Learning (Transformer discussion)