- The paper proposes that integrating causal inference clarifies valid mappings from LLM activations to high-level mechanisms.
- It introduces a diagnostic framework using Pearl’s causal hierarchy to assess claim-evidence alignment in interpretability studies.
- The study highlights identifiability gaps and advocates for causal representation learning to overcome failures in current methods.
Summary of "Causality is Key for Interpretability Claims to Generalise"
Introduction
The paper "Causality is Key for Interpretability Claims to Generalise" (2602.16698) addresses a persistent challenge in interpretability research for LLMs: findings that hold locally often fail to generalize to broader contexts. The authors propose that causal inference can specify what counts as a valid mapping from model activations to high-level structures, and that Pearl's causal hierarchy clarifies what a given interpretability study can justify. Observational evidence can establish associations, and interventions can support claims about how edits change behavioral metrics over a set of prompts. Counterfactual claims, however, remain largely unverifiable without controlled supervision. Causal representation learning (CRL) operationalizes this hierarchy by specifying which variables can be recovered from activations under particular assumptions. This motivates a diagnostic framework for selecting methods and evaluations so that claims match the evidence and findings generalize.
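The gap between the observational and interventional rungs can be made concrete with a small sketch. The model below is a hypothetical toy, not taken from the paper: a prompt property Z drives both a feature activation X and a behavior Y, and X has no effect on Y. All names and probabilities are assumptions chosen for illustration.

```python
# Hypothetical toy structural causal model: Z -> X and Z -> Y, with no edge X -> Y.
# Z: a prompt property, X: a feature activation, Y: a behavior (all binary).
P_Z = {0: 0.5, 1: 0.5}


def p_x_given_z(x, z):
    # The feature fires mostly when the prompt property is present.
    return 0.9 if x == z else 0.1


def p_y_given_z(y, z):
    # The behavior also depends only on the prompt property.
    return 0.9 if y == z else 0.1


def observational_p_y1_given_x1():
    # Rung 1 (association): P(Y=1 | X=1), computed by conditioning.
    num = sum(P_Z[z] * p_x_given_z(1, z) * p_y_given_z(1, z) for z in (0, 1))
    den = sum(P_Z[z] * p_x_given_z(1, z) for z in (0, 1))
    return num / den


def interventional_p_y1_do_x1():
    # Rung 2 (intervention): P(Y=1 | do(X=1)).
    # Forcing X cuts the Z -> X edge; Y still depends only on Z.
    return sum(P_Z[z] * p_y_given_z(1, z) for z in (0, 1))


obs = observational_p_y1_given_x1()
do = interventional_p_y1_do_x1()
print(f"P(Y=1 | X=1)     = {obs:.2f}")  # about 0.82
print(f"P(Y=1 | do(X=1)) = {do:.2f}")   # about 0.50
```

In this toy, the feature predicts the behavior well, yet setting the feature by hand does not change the behavior at all. A probe-style result would therefore support an associational claim only, not a claim about what an edit would do.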
Interpretability and Causality
Interpretability research aims to link model behavior to internal structure, yet high predictive accuracy does not guarantee that the recovered mechanisms are manipulable. The authors argue that interpretability claims should be stated in causal inference terms: specify the estimand, the intervention class, and the equivalence class that the evidence supports. Pearl's ladder of causation characterizes the strength of evidence needed for each kind of causal claim, which helps diagnose claim-evidence mismatches in interpretability research.
Identifiability and Affordances in Interpretability
Identifiability is central to validating interpretability claims: it defines which quantities can be uniquely determined from the available evidence. Identifiability results in CRL suggest that unsupervised recovery of latents is impossible without additional structure. CRL offers a way to interpret LLM activations causally by making explicit the assumptions under which classes of concepts can be identified. Affordances (the interactions possible in a system) guide which structure is interpretable, and identifiability characterizes the invariance of the variables that are observable under those interactions. Applying this view to interpretability research helps determine when recovered features are meaningful beyond proxies such as reconstruction error or sparsity.
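One standard illustration of why unsupervised recovery needs extra structure is rotational ambiguity under Gaussian latents. The sketch below is an assumption-laden toy, not the paper's construction: latents are zero-mean, unit-covariance Gaussians mixed linearly, and the mixing matrix `A` and rotation angle are arbitrary hypothetical values. Two different mixings then induce the same observed covariance, and hence the same observed distribution.

```python
import math


def matmul(a, b):
    # Plain-Python matrix product for small dense matrices.
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


# Hypothetical linear mixing from latents z ~ N(0, I) to observations x = A z.
A = [[2.0, 0.5], [0.0, 1.0]]

# A second mixing that rotates the latent basis first: x = (A R) z.
A_rot = matmul(A, rotation(0.7))

# For zero-mean Gaussian latents the observed distribution is fixed by
# the covariance A A^T, and (A R)(A R)^T = A R R^T A^T = A A^T.
cov_a = matmul(A, transpose(A))
cov_rot = matmul(A_rot, transpose(A_rot))

max_gap = max(abs(cov_a[i][j] - cov_rot[i][j]) for i in range(2) for j in range(2))
print("mixings differ:", A != A_rot)
print("max covariance gap:", max_gap)
```

Because both mixings explain the observations equally well, reconstruction quality alone cannot single out one latent basis as the "true" features in this setting. Additional assumptions (for example sparsity, non-Gaussianity, or interventional data) are what break the tie.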
Diagnosing Failures in Interpretability
Using causal theory, the authors catalog failures in which claimed mechanisms do not match the observed evidence. Such failures stem from rung mismatches or identification gaps, which the diagnostic framework makes explicit. Common inferential gaps include mistaking sufficiency for necessity and assuming uniqueness when multiple explanations fit the observed data equally well. Through practical examples (activation patching, SAE feature alignment, and steering vectors), the paper illustrates the structural assumptions under which claims would generalize.
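The sufficiency-versus-necessity gap can be sketched with a deliberately tiny model. Everything here is hypothetical and only loosely mirrors activation patching: a two-path "model" in which components `h_a` and `h_b` redundantly carry the same signal, and an `override` argument that stands in for patching or ablating a component.

```python
def forward(token_present, override=None):
    """Hypothetical two-path model.

    `override` maps a component name ("h_a" or "h_b") to a forced value,
    standing in for patching (force True) or ablation (force False).
    """
    override = override or {}
    h_a = override.get("h_a", token_present)
    h_b = override.get("h_b", token_present)
    # The output fires if either redundant path carries the signal.
    return h_a or h_b


clean = forward(True)
corrupted = forward(False)

# Patching h_a from the clean run into the corrupted run restores the output,
# so h_a is sufficient for the behavior here.
patched_a = forward(False, {"h_a": True})

# Ablating h_a on the clean run does not break the output, because h_b
# compensates: h_a is not necessary.
ablated_a = forward(True, {"h_a": False})

# Patching h_b works just as well, so the patching result alone does not
# single out a unique mechanism.
patched_b = forward(False, {"h_b": True})

print(clean, corrupted, patched_a, ablated_a, patched_b)
```

In this toy, a successful patch supports "this component can carry the behavior" but not "the behavior runs through this component". Both claims would need separate evidence, such as an ablation that also controls for the redundant path.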
Alternative Views on Interpretability
The paper explores alternative perspectives such as pragmatic interpretability, where proxy tasks aim to connect mechanistic findings with practical deployment success, and symmetry-based interpretation, which focuses on preserving the meaning of explanations under transformation. It discusses how these approaches align with or diverge from the causal framing and identifiability, underscoring potential improvements from integrating causal inference with current methodologies.
Conclusion
The paper advocates for integrating causal inference into interpretability research, promoting a rigorous framework for matching claims to evidence. Aligning interpretability claims with causal inference terminology and evaluating methodologies against identifiability theory predictions offers pathways for developing reliable and generalizable interpretability tools. This integration can assist in achieving dependable control handles within AI, necessary for safe deployment in real-world applications.
The conclusion outlines directions where CRL and mechanistic interpretability meet: structural causality, task-relative equivalence classes, compositional control, and a better understanding of transportability for safe edits and steering in AI models. Pursuing these directions can benefit both fields by connecting empirical phenomena with theoretical insights for practical AI safety.