Evaluate unsupervised behavior clusters and their sparse feature circuits
Develop evaluation methodologies for (i) the behavior clusters obtained by clustering contexts using Pythia-70M activations and/or gradients, following the quanta discovery approach of Michaud et al. (2023), and (ii) the sparse feature circuits derived for those clusters. The goal is to validate that the discovered clusters reflect consistent model behaviors and that the associated circuits faithfully explain the model’s predictions.
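As a rough illustration of what such an evaluation could compute, the sketch below defines two candidate scores. It is an assumption-laden starting point, not an established methodology. `faithfulness` is modeled on the normalized faithfulness ratio used in the cited paper, (m(C) - m(∅)) / (m(M) - m(∅)), where m is a task metric measured for the circuit C, the empty circuit ∅, and the full model M. `cluster_coherence` is a hypothetical consistency score (mean within-cluster cosine similarity minus mean between-cluster cosine similarity) and is not taken from either paper. All function and parameter names are illustrative.

```python
import numpy as np


def faithfulness(m_circuit, m_empty, m_full):
    """Fraction of the full model's performance recovered by the circuit.

    m_circuit: metric with only the circuit active (the rest ablated).
    m_empty:   metric with everything ablated (empty circuit).
    m_full:    metric of the unmodified model.
    """
    denom = m_full - m_empty
    if denom == 0:
        raise ValueError("full and empty metrics coincide; ratio undefined")
    return (m_circuit - m_empty) / denom


def cluster_coherence(vectors, labels):
    """Hypothetical score: within-cluster minus between-cluster cosine similarity.

    vectors: (n, d) array, one nonzero activation or gradient vector per context.
    labels:  length-n array of cluster assignments.
    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    X = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    # Normalize rows so that dot products are cosine similarities.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # exclude self-similarity
    within = sims[same & off_diag].mean()
    between = sims[~same].mean()
    return within - between
```

A faithfulness value near 1 would indicate that the circuit recovers most of the full model's behavior on the cluster, while a coherence value near 0 would suggest that the cluster is no tighter than the surrounding data. Neither score by itself establishes that a cluster corresponds to a single model behavior.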
References
While evaluating these clusters and circuits is an important open problem, we generally find that these clusters expose interesting LM behaviors, and that their respective feature circuits can provide useful insights on mechanisms of LM behavior.
— Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
(2403.19647 - Marks et al., 28 Mar 2024) in Section 5, Unsupervised Circuit Discovery at Scale