Algorithmic interpretability of mechanisms learned by large language models
Determine whether the internal mechanisms that transformer-based large language models (e.g., GPT-style models) learn during training can be characterized by succinct algorithmic descriptions that mechanistic interpretability methods can identify and verify across diverse tasks and model scales.
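As a concrete illustration of the kind of verification such methods perform, the sketch below applies activation patching: cache an activation from a "clean" run, splice it into a "corrupted" run, and check how much of the behavior it restores. This is a minimal sketch, not a method prescribed by the source; the model ("gpt2" via Hugging Face transformers), the prompts, the patched layer, and the tracked token are all illustrative assumptions.

```python
# Minimal activation-patching sketch. Everything concrete here (model, prompts,
# layer index, target token) is an illustrative assumption, not from the source.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")
paris_id = tok(" Paris")["input_ids"][0]  # token whose logit we track

layer = 6  # hypothetical layer implicated by a candidate circuit description
block = model.transformer.h[layer]
cache = {}

def save_hook(module, inputs, output):
    # Cache the clean run's hidden states at this block.
    cache["clean"] = output[0].detach()

def patch_hook(module, inputs, output):
    # Splice the cached clean activation into the corrupted run at the final position.
    patched = output[0].clone()
    patched[:, -1, :] = cache["clean"][:, -1, :]
    return (patched,) + output[1:]

with torch.no_grad():
    handle = block.register_forward_hook(save_hook)
    model(**clean)
    handle.remove()

    corrupted_logit = model(**corrupt).logits[0, -1, paris_id].item()

    handle = block.register_forward_hook(patch_hook)
    patched_logit = model(**corrupt).logits[0, -1, paris_id].item()
    handle.remove()

print(f"logit(' Paris'): corrupted={corrupted_logit:.3f}, patched={patched_logit:.3f}")
```

If patching only the components singled out by a proposed succinct description restores most of the behavior, that is evidence the description is causally faithful; the open question stated above is whether every mechanism learned during training admits a description of this kind.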
References
Rapid progress has been made in methods for discovering the circuits underlying a given capacity in trained LLMs [“mechanistic interpretability”], but it is an important open question whether the mechanisms learned by LLMs in the course of training will all afford these kinds of succinct algorithmic descriptions.
— From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks
(Russin et al., arXiv:2405.15164, 24 May 2024), Section 3, "Neural Networks in the Age of Deep Learning" (Transformer discussion)