Universality of Features and Circuits
Determine the degrees of universality of features and circuits across transformer-based language models and tasks, and ascertain how this universality depends on model training factors such as random initialization, model size, and the loss function used during training.
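Because the problem is ultimately empirical, a concrete starting point is to quantify how well individual features line up across two models trained from different random seeds. The sketch below is a minimal, hypothetical illustration of one common operationalization (maximum activation correlation between matched units, in the spirit of convergent-learning analyses); it is not a method from the cited survey. It assumes activation matrices have already been collected by running both models on the same inputs, and the synthetic data, function name, and dimensions are placeholders.

```python
import numpy as np

def max_correlation_match(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """For each unit in model A, find its best-correlated unit in model B.

    acts_a, acts_b: (n_examples, n_units) activation matrices collected by
    running both models on the same inputs. Returns, per unit in A, the
    maximum absolute Pearson correlation with any unit in B -- a simple
    neuron-level universality score (near 1.0 = a matching unit exists,
    near 0.0 = no counterpart).
    """
    # Standardize each unit's activations across examples.
    za = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    zb = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    # (n_units_a, n_units_b) Pearson correlation matrix.
    corr = za.T @ zb / len(acts_a)
    return np.abs(corr).max(axis=1)

# Placeholder activations; in practice these would come from two models
# trained from different random seeds, evaluated on the same token stream.
rng = np.random.default_rng(0)
shared = rng.normal(size=(1000, 64))  # hypothetical shared "universal" signal
acts_a = shared @ rng.normal(size=(64, 128)) + 0.5 * rng.normal(size=(1000, 128))
acts_b = shared @ rng.normal(size=(64, 128)) + 0.5 * rng.normal(size=(1000, 128))

scores = max_correlation_match(acts_a, acts_b)
print(f"mean universality score: {scores.mean():.2f}")
```

A score distribution concentrated near 1.0 would suggest seed-invariant features at that layer, while repeating the comparison across model sizes or training objectives would probe the dependence on training factors the problem statement asks about.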
References
Understanding the degrees of feature and circuit universality and their dependency on various aspects of model training (e.g., initialization, model size, and loss function) remains a crucial open problem.
— A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
(arXiv:2407.02646, Rai et al., 2 Jul 2024), Section 7 "Findings and Applications", Subsection "Findings on Universality"