Determine the generalizability of mechanistic interpretability findings across model families
Determine the extent to which mechanistic interpretability findings derived from CNN-based image models, BERT-based text models, and GPT-based language models generalize to other architectures and application contexts.
References
The degree of generalizability of these findings to other models and contexts is currently a somewhat open question.
— Open Problems in Mechanistic Interpretability
(arXiv:2501.16496, Sharkey et al., 27 Jan 2025), Section 3.6: "Mechanistic interpretability on a broader range of models and model families"