Evaluate unsupervised behavior clusters and their sparse feature circuits

Develop evaluation methodologies for (1) the behavior clusters produced by clustering contexts using Pythia-70M activations and/or gradients, following the quanta discovery approach of Michaud et al. (2023), and (2) the sparse feature circuits derived for those clusters. The goal is to validate that the discovered clusters reflect consistent model behaviors and that the associated circuits faithfully explain the model’s predictions.

Background

The paper introduces a nearly fully automated pipeline with two stages. It first discovers model behaviors by clustering contexts using vectors derived from Pythia-70M activations, gradients, or both. It then discovers a sparse feature circuit for each cluster by estimating indirect effects and selecting the implicated features.
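To make the clustering stage concrete, here is a minimal sketch under stated assumptions. It groups per-context vectors by cosine similarity using a simple spherical k-means, which is swapped in here for illustration and is not claimed to be the clustering method the paper uses. The vectors are synthetic stand-ins for Pythia-70M activation or gradient vectors, and the names `normalize_rows` and `spherical_kmeans` are hypothetical.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row to unit L2 norm, guarding against zero rows."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def spherical_kmeans(vectors, k, n_iter=20):
    """Cluster rows of `vectors` by cosine similarity (illustrative only)."""
    unit = normalize_rows(vectors)

    # Deterministic farthest-point initialization: start from the first
    # row, then repeatedly add the row least similar to the chosen centroids.
    chosen = [unit[0]]
    for _ in range(1, k):
        sims = unit @ np.stack(chosen).T
        chosen.append(unit[np.argmin(sims.max(axis=1))])
    centroids = np.stack(chosen)

    for _ in range(n_iter):
        # Assign each context to its most similar centroid.
        labels = np.argmax(unit @ centroids.T, axis=1)
        # Recompute each centroid as the mean direction of its members.
        for j in range(k):
            members = unit[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        centroids = normalize_rows(centroids)

    labels = np.argmax(unit @ centroids.T, axis=1)
    return labels, centroids

# Synthetic stand-in data (an assumption): two groups of 20 "contexts",
# each scattered around a different direction in an 8-dimensional space.
rng = np.random.default_rng(0)
group_a = np.eye(8)[0] + 0.05 * rng.normal(size=(20, 8))
group_b = np.eye(8)[1] + 0.05 * rng.normal(size=(20, 8))
vectors = np.vstack([group_a, group_b])

labels, centroids = spherical_kmeans(vectors, k=2)
```

In a real setting, each row of `vectors` would be replaced by a vector derived from the model for one context, and the number of clusters would need to be chosen rather than known in advance.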

While qualitative examples (e.g., succession/induction and infinitival "to" mechanisms) suggest the clusters and circuits are meaningful, the authors explicitly note that establishing rigorous evaluation for these automatically discovered clusters and circuits remains unresolved.
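As an illustration of what quantitative checks might look like, the sketch below shows two candidate measures. Neither is the evaluation the authors propose, since the problem is open. The first, `cluster_coherence`, is a hypothetical score comparing within-cluster and between-cluster cosine similarity. The second, `faithfulness`, is a normalized ratio of the kind commonly used in circuit evaluation: the share of the full model's performance that the circuit alone recovers, relative to an ablated baseline. All function names and inputs here are assumptions.

```python
import numpy as np

def cluster_coherence(vectors, labels):
    """Mean within-cluster cosine similarity minus mean between-cluster
    similarity. Assumes at least two clusters and at least one cluster
    with two or more members; otherwise a mean is taken over no entries."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T

    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # exclude self-similarity
    within = sims[same & off_diag].mean()
    between = sims[~same].mean()
    return within - between

def faithfulness(metric_circuit, metric_full, metric_ablated):
    """Fraction of the full model's metric recovered by the circuit,
    normalized against a baseline where the model is fully ablated."""
    denom = metric_full - metric_ablated
    if denom == 0:
        raise ValueError("full and ablated metrics are equal; ratio undefined")
    return (metric_circuit - metric_ablated) / denom

# Toy inputs (assumptions): four 2-D vectors forming two clean groups.
toy_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
good = cluster_coherence(toy_vectors, [0, 0, 1, 1])
bad = cluster_coherence(toy_vectors, [0, 1, 0, 1])

# Toy log-probability style metrics for circuit, full model, and ablation.
ratio = faithfulness(metric_circuit=-2.0, metric_full=-1.0, metric_ablated=-5.0)
```

A coherence score computed in the same vector space used for clustering is partly circular, so a more convincing check would likely rely on held-out contexts or on signals independent of the clustering features.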

References

While evaluating these clusters and circuits is an important open problem, we generally find that these clusters expose interesting LM behaviors, and that their respective feature circuits can provide useful insights on mechanisms of LM behavior.

— Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (2403.19647 - Marks et al., 28 Mar 2024) in Section 5, Unsupervised Circuit Discovery at Scale
