Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii (2505.01372v1)
Abstract: Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question "What makes a good explanation?" We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science - the Bayesian, Kuhnian, Deutschian, and Nomological - to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor, predict, and steer AI systems.
Summary
- The paper establishes a unified framework that applies philosophical virtues to assess and compare neural network explanations.
- It rigorously evaluates common MI methods, balancing criteria such as Accuracy, Simplicity, and Falsifiability.
- The study proposes practical metrics for virtues like Hard-to-Varyness, Unification, and Nomologicity to guide future research.
Mechanistic Interpretability (MI) aims to understand neural networks by finding causal explanations for their behavior. However, a significant challenge in the field is the lack of a unified framework for evaluating the quality of these explanations. Without clear criteria, researchers face difficulties in comparing different explanatory methods or determining which of two inconsistent explanations for the same phenomenon is superior. This paper introduces the Explanatory Virtues Framework, which adapts concepts from the philosophy of science to provide a systematic approach for evaluating and improving explanations in MI.
The framework posits that good explanations possess certain "explanatory virtues," properties that are reliable indicators of truth and should be valued in scientific inquiry. These virtues are drawn from four philosophical perspectives:
- Bayesian: Focuses on how well an explanation fits the data and its predictive power. Key virtues include Accuracy (fitting observed data, quantified by log-likelihood), Precision (making constraining predictions, quantified by expected log-likelihood), Prior (the prior probability of the explanation), Descriptiveness (explaining individual data points), Co-Explanation (explaining multiple data points jointly), Power (constraining predictions about individual points), and Unification (connecting multiple disparate observations). A minimal sketch of how Accuracy and Precision might be quantified follows this list.
- Kuhnian: Adds criteria related to the theory's structure and success. Virtues include Consistency (lack of internal contradictions), Scope (which the framework equates with Unification), Simplicity (lack of unnecessary complexity), and Fruitfulness (predicting novel phenomena). Simplicity is further broken down into Parsimony (number of entities), Conciseness (description length), and K-Complexity (Kolmogorov complexity). Fruitfulness is analogous to generalization performance on held-out data, including novel empirical success on out-of-distribution data or utility in downstream tasks.
- Deutschian: Emphasizes testability and robustness. Virtues are Falsifiability (yielding testable predictions) and Hard-to-Varyness (being difficult to modify without reducing accuracy or increasing complexity). Hard-to-Varyness implies the explanation's components are essential and not easily swapped or removed.
- Nomological: Highlights the role of general laws. The virtue is Nomologicity (appealing to or deriving universal principles about neural networks).
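To make the Bayesian virtues more concrete, the following minimal Python sketch scores an explanation on Accuracy and Precision. It assumes, purely for illustration, that an explanation is operationalized as a predictive distribution over the model's outputs on a set of inputs: Accuracy is then the mean log-likelihood assigned to the observed outputs, and Precision is the expected log-likelihood under the explanation's own predictions (its negative entropy). The function names and this operationalization are illustrative assumptions, not definitions from the paper.

```python
import numpy as np

EPS = 1e-12  # avoid log(0)

def accuracy(pred_probs: np.ndarray, observed: np.ndarray) -> float:
    """Accuracy: mean log-likelihood the explanation assigns to observed outputs.

    pred_probs: (n_inputs, n_classes) predicted output distribution per input.
    observed:   (n_inputs,) indices of the outputs the model actually produced.
    """
    likelihoods = pred_probs[np.arange(len(observed)), observed]
    return float(np.mean(np.log(likelihoods + EPS)))

def precision(pred_probs: np.ndarray) -> float:
    """Precision: expected log-likelihood under the explanation's own predictions
    (negative entropy). More constraining (peaked) predictions score higher."""
    return float(np.mean(np.sum(pred_probs * np.log(pred_probs + EPS), axis=1)))

# Toy comparison: a sharp explanation vs. a vague one over three inputs.
sharp = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
vague = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
obs = np.array([0, 0, 1])  # outputs the model actually produced

print("sharp:", accuracy(sharp, obs), precision(sharp))  # higher (better) on both
print("vague:", accuracy(vague, obs), precision(vague))
```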
For an explanation to be considered a valid MI explanation, it must satisfy four core criteria: Model-level (focusing on the network, not system properties), Ontic (referring to real entities within the model), Causal-Mechanistic (identifying step-by-step causal chains), and Falsifiable (testable). The Explanatory Virtues then serve to assess the quality of these valid explanations, providing reasons to prefer one over another. The paper suggests Simplicity, Fruitfulness, and Hard-to-Varyness are particularly important for distinguishing good explanations.
The framework is used to analyze common MI methods:
- Clustering: Provides simple, falsifiable explanations but often lacks Causal-Mechanistic detail and high Accuracy or Fruitfulness for complex models.
- Sparse Autoencoders (SAEs): Methods like MDL-SAEs (2501.14608) are strong on Accuracy (reconstruction error, patching performance) and Simplicity (Conciseness via description length). They exhibit Falsifiability and some Fruitfulness. However, they often fall short on Unification, Co-Explanation, Hard-to-Varyness (features may be ad hoc or not causally relevant to downstream tasks), and Nomologicity.
- Causal Abstraction Explanations of Circuits: Focus on identifying causal pathways and are strong on Causal-Mechanisticity, Falsifiability, Accuracy, and Fruitfulness under interventions (Faithfulness). They aim for a type of simplicity (Minimality) but can be complex. They typically lack Unification, Co-Explanation (circuits are often task/instance-specific), and Nomologicity.
- Compact Proofs (2405.01538): This method evaluates existing explanations by translating them into formal proofs about model behavior. It directly optimizes for Accuracy (tightness of performance bounds) and Simplicity (computational efficiency/compactness of the proof). It requires explanations to be Causal-Mechanistic and supports assessment of Precision and Hard-to-Varyness. Scaling to large models remains a challenge.
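The trade-off Compact Proofs optimize, tighter performance bounds against cheaper verification, can be made concrete with a small record type. The sketch below is a toy: the `ProofCertificate` structure, the example numbers, and the scalar scoring rule (bound minus a penalty on log-FLOPs) are assumptions for exposition, not the paper's formalism.

```python
from dataclasses import dataclass
import math

@dataclass
class ProofCertificate:
    """Result of turning an explanation into a formal claim about model behaviour."""
    name: str
    accuracy_lower_bound: float  # tightness: provable lower bound on task accuracy
    verification_flops: float    # compactness: compute needed to check the proof

def proof_score(cert: ProofCertificate, flops_penalty: float = 0.01) -> float:
    """Illustrative scalarization: reward tight bounds, penalize expensive proofs.
    (The trade-off weight is a free choice, not prescribed by the method.)"""
    return cert.accuracy_lower_bound - flops_penalty * math.log10(cert.verification_flops)

brute_force = ProofCertificate("enumerate all inputs", accuracy_lower_bound=0.98,
                               verification_flops=1e12)
mechanistic = ProofCertificate("explanation-derived bound", accuracy_lower_bound=0.95,
                               verification_flops=1e7)

for cert in (brute_force, mechanistic):
    print(f"{cert.name}: score={proof_score(cert):.3f}")
```

In this toy comparison the explanation-derived proof scores higher despite a looser bound, because it is orders of magnitude cheaper to verify, which is the kind of accuracy-compactness comparison the method makes explicit.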
A key finding from this analysis (summarized in Table 1 of the paper) is that while methods generally value Accuracy, Falsifiability, and Causal-Mechanisticity, they often neglect Simplicity (specifically, having a clear, quantifiable measure), Unification, Co-Explanation, and Nomologicity.
Practical Implementation and Application:
Applying the Explanatory Virtues Framework in practice involves integrating measures of these virtues into the development and evaluation pipelines for MI methods.
- Quantifying Virtues:
  - Accuracy & Fruitfulness: Standard ML metrics provide measures: log-likelihood, prediction accuracy on test sets (including under distribution shift), reconstruction error for representation explanations (like SAEs), or performance on downstream tasks. For circuits, this involves testing behavioral changes under interventions/ablations.
  - Simplicity:
    - Conciseness/Description Length: Implement MDL principles. For SAEs, this means defining an encoding scheme for the dictionary and activations and calculating the total bit length (2501.14608).
    - K-Complexity/Proof Compactness: For methods producing formal guarantees (like Compact Proofs), this is quantified by the computational resources (e.g., FLOPs) required to verify the explanation or the bound it provides (2405.01538).
  - Hard-to-Varyness: Requires robustness analyses. Perturb components of the explanation, or the model elements it refers to, and measure the impact on the explanation's fidelity or the model's predicted behavior. Define a metric for the "cost" of a perturbation and evaluate hv(E) = log(Acc(E)) − k(E), where Acc is the explanation's accuracy and k its complexity, seeking explanations that sit at local maxima of hv (see the combined sketch after this list).
  - Unification & Co-Explanation: Develop metrics that measure the reusability of explanatory components across different inputs, tasks, or even models. For circuit analysis, this could mean measuring how often specific sub-circuits recur in explanations of different behaviors; for SAEs, how well a single feature dictionary explains activations across varied contexts.
  - Nomologicity: This is harder to quantify. It might involve assessing whether an explanation instantiates or is derivable from a proposed general principle, or whether the explanation itself proposes a principle that generalizes across contexts.
- Evaluation Pipelines: MI benchmark suites (like SAEBench (2402.13036)) can be extended to include metrics for Conciseness, Fruitfulness on diverse test sets, and potentially proxies for Hard-to-Varyness or Unification. Researchers can then compare methods not just on accuracy or reconstruction error but on their Pareto fronts across multiple virtues (e.g., Accuracy vs. Simplicity).
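As a sketch of how such measures might enter an evaluation pipeline (referenced from the Hard-to-Varyness item above), the code below implements simple proxies: an MDL-style bit count for sparse feature codes (Conciseness), the hard-to-varyness score hv(E) = log(Acc(E)) − k(E) checked against perturbed variants, a component-reuse score (Unification/Co-Explanation), and a Pareto-front comparison over Accuracy versus description length. All function names, the encoding scheme, and the dominance criterion are illustrative assumptions, not metrics prescribed by the paper.

```python
import numpy as np
from typing import Callable, Iterable, List, Tuple

def description_length_bits(codes: np.ndarray, n_dict: int, bits_per_value: int = 8) -> float:
    """MDL-style Conciseness proxy for a batch of sparse feature codes.

    Each active (nonzero) entry costs an index into the dictionary plus a
    quantized coefficient; this is one simple encoding scheme, not the only one.
    """
    n_active = int(np.count_nonzero(codes))
    index_bits = np.log2(n_dict)  # which dictionary feature fired
    return float(n_active * (index_bits + bits_per_value))

def hard_to_varyness(accuracy_fn: Callable[[dict], float],
                     complexity_fn: Callable[[dict], float],
                     explanation: dict,
                     perturbations: Iterable[dict]) -> Tuple[float, bool]:
    """Evaluate hv(E) = log(Acc(E)) - k(E) and check whether E is a local maximum,
    i.e. no perturbed variant scores better than the original explanation."""
    def hv(e: dict) -> float:
        return float(np.log(accuracy_fn(e))) - complexity_fn(e)
    base = hv(explanation)
    is_local_max = all(hv(p) <= base for p in perturbations)
    return base, is_local_max

def unification_score(explanations: dict) -> float:
    """Fraction of explanatory components reused across more than one task.

    explanations: mapping task name -> set of component identifiers
    (e.g. circuit heads or SAE feature ids) used to explain that task.
    """
    all_components = set().union(*explanations.values())
    reused = {c for c in all_components
              if sum(c in comps for comps in explanations.values()) > 1}
    return len(reused) / max(len(all_components), 1)

def pareto_front(candidates: List[Tuple[str, float, float]]) -> List[str]:
    """Return explanations not dominated on (accuracy up, description length down).

    candidates: (name, accuracy, description_length_bits) triples.
    """
    front = []
    for name, acc, dl in candidates:
        dominated = any(a >= acc and d <= dl and (a > acc or d < dl)
                        for _, a, d in candidates)
        if not dominated:
            front.append(name)
    return front

# Toy usage: three candidate explanations of the same behaviour.
cands = [("sae_small", 0.91, 4_000.0),
         ("sae_large", 0.93, 9_000.0),
         ("circuit",   0.89, 12_000.0)]
print(pareto_front(cands))  # ['sae_small', 'sae_large'] -- 'circuit' is dominated
```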
The Road Ahead - Implications for MI Research:
The framework points to promising directions for future MI research by highlighting neglected virtues:
- Defining and Prioritizing Simplicity: Developing clear, quantifiable measures of simplicity (like Conciseness or K-Complexity proxies) for different explanation types will enable rigorous comparisons on the accuracy-simplicity trade-off curve, which is crucial for useful explanations.
- Focusing on Unification and Co-Explanation: Methods that can explain multiple phenomena or apply across different tasks/models using shared explanatory components (like identifying universal circuit motifs or reusable features) will lead to more compressed and powerful understanding.
- Seeking Nomological Principles: Moving towards a nomothetic approach, seeking general laws or principles governing neural network behavior (drawing inspiration from fields like Developmental Interpretability (2402.19411) or Physics of AI (2310.18561)), can provide a unifying theoretical basis for MI and lead to simpler, more generalizable explanations.
Implementation Considerations and Trade-offs:
- Quantifying all virtues rigorously can be challenging. Proxies may be necessary.
- Virtues are often in tension (e.g., maximizing Accuracy might increase Complexity). Practical application requires navigating these trade-offs based on the specific goals of interpretation (e.g., debugging vs. scientific understanding).
- Computational cost is a factor, particularly for methods like Hard-to-Varyness analysis or Compact Proofs applied to large networks.
- The framework provides criteria, but the development of algorithms that generate explanations optimizing for these virtues is a separate challenge.
In summary, the Explanatory Virtues Framework provides a valuable, philosophically grounded lens for evaluating the quality of Mechanistic Interpretability explanations. By encouraging researchers to explicitly consider and quantify virtues beyond mere accuracy and fidelity (particularly Simplicity, Unification, and Nomologicity), it offers a path towards more reliable, insightful, and ultimately more useful methods for understanding complex AI systems. Implementing the framework in practice means developing appropriate metrics and evaluation procedures for these virtues within MI research.