Universality of Features and Circuits

Determine the degrees of universality of features and circuits across transformer-based language models and tasks, and ascertain how this universality depends on model training factors such as random initialization, model size, and the loss function used during training.

Background

Mechanistic interpretability investigates whether similar features and circuits recur across different LLMs and tasks, a property referred to as universality. Some studies report recurring components (e.g., induction heads, successor heads), while other work finds qualitatively different circuits emerging under different random initializations, or only low rates of universal neurons across GPT-2 models.

Because many mechanistic analyses have been performed on toy or small models, establishing universality would enable transferring insights to larger models with less bespoke effort. Mixed empirical findings highlight the need to rigorously characterize the extent and conditions under which universality holds.
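One common way to quantify neuron-level universality, in the spirit of the "universal neurons" studies mentioned above, is to run two models on a shared batch of inputs and count the neurons in one model whose activations correlate strongly with some neuron in the other. The sketch below is illustrative only: the function name, threshold, and synthetic activations are assumptions, not the procedure from any particular paper.

```python
import numpy as np

def universal_neuron_fraction(acts_a, acts_b, threshold=0.95):
    """Fraction of model A's neurons with a highly correlated partner in model B.

    acts_a: (n_inputs, n_neurons_a) activations of model A on a shared batch.
    acts_b: (n_inputs, n_neurons_b) activations of model B on the same batch.
    A neuron counts as "universal" if its max |Pearson r| with any
    neuron in B meets the threshold (a hypothetical cutoff).
    """
    # Standardize each neuron's activations across the batch.
    a = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + 1e-8)
    b = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + 1e-8)
    # Pearson correlation matrix between all neuron pairs: (n_a, n_b).
    corr = (a.T @ b) / acts_a.shape[0]
    return float((np.abs(corr).max(axis=1) >= threshold).mean())

# Toy demo: two "models" sharing 4 neurons up to an affine rescaling,
# plus 4 independent random neurons each.
rng = np.random.default_rng(0)
shared = rng.normal(size=(256, 4))
acts_a = np.concatenate([shared, rng.normal(size=(256, 4))], axis=1)
acts_b = np.concatenate([2.0 * shared + 1.0, rng.normal(size=(256, 4))], axis=1)
frac = universal_neuron_fraction(acts_a, acts_b)  # 4 of A's 8 neurons match
```

Affine rescaling leaves Pearson correlation unchanged, so the shared neurons are recovered exactly; sweeping the threshold (or comparing models trained from different seeds) is how mixed findings like those cited above arise in practice.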

References

Understanding the degrees of feature and circuit universality and their dependency on various aspects of model training (e.g., initialization, model size, and loss function) remains a crucial open problem.

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models (2407.02646 - Rai et al., 2 Jul 2024) in Findings and Applications, Findings on Universality (Section 7, Subsection "Findings on Universality")