Causality is Key for Interpretability Claims to Generalise

Published 18 Feb 2026 in cs.LG | (2602.16698v1)

Abstract: Interpretability research on LLMs has yielded important insights into model behaviour, yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl's causal hierarchy clarifies what an interpretability study can justify. Observations establish associations between model behaviour and internal components. Interventions (e.g., ablations or activation patching) support claims about how these edits affect a behavioural metric (e.g., average change in token probabilities) over a set of prompts. However, counterfactual claims -- i.e., asking what the model output would have been for the same prompt under an unobserved intervention -- remain largely unverifiable without controlled supervision. We show how causal representation learning (CRL) operationalises this hierarchy, specifying which variables are recoverable from activations and under what assumptions. Together, these motivate a diagnostic framework that helps practitioners select methods and evaluations matching claims to evidence such that findings generalise.


Authors (6)

Summary

The paper proposes that integrating causal inference clarifies valid mappings from LLM activations to high-level mechanisms.
It introduces a diagnostic framework using Pearl’s causal hierarchy to assess claim-evidence alignment in interpretability studies.
The study highlights identifiability gaps and advocates for causal representation learning to overcome failures in current methods.

Summary of "Causality is Key for Interpretability Claims to Generalise"

Introduction

The paper "Causality is Key for Interpretability Claims to Generalise" (2602.16698) addresses two persistent problems in interpretability research on LLMs: findings that hold locally but fail to generalize, and causal interpretations that outrun the evidence. The authors argue that causal inference specifies what counts as a valid mapping from model activations to invariant high-level structures, and that Pearl's causal hierarchy clarifies what a given study can justify. Observations establish associations between model behavior and internal components. Interventions (e.g., ablations or activation patching) support claims about how edits affect a behavioral metric over a set of prompts. Counterfactual claims, which ask what the output would have been for the same prompt under an unobserved intervention, remain largely unverifiable without controlled supervision. Causal representation learning (CRL) operationalizes this hierarchy by specifying which variables are recoverable from activations and under what assumptions. Together, these points motivate a diagnostic framework for selecting methods and evaluations that match claims to evidence, so that findings generalize.
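The gap between the first two rungs can be made concrete with a toy simulation. The sketch below is not taken from the paper: the variables `z`, `a`, and `y` and the function names are hypothetical, and the structural equations are an assumption chosen so that a hidden common cause `z` drives both an "activation" `a` and an "output" `y`, while `a` has no effect on `y` at all.

```python
import random

def sample(n, do_a=None, seed=0):
    """Draw (a, y) pairs; passing `do_a` overrides the activation (an intervention)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)        # hidden common cause (assumed, e.g. a prompt property)
        noise_a = rng.gauss(0.0, 0.1)
        noise_y = rng.gauss(0.0, 0.1)
        a = (z + noise_a) if do_a is None else do_a   # rung 2: set the activation by hand
        y = 2.0 * z + noise_y          # the output never reads a
        rows.append((a, y))
    return rows

def correlation(rows):
    """Pearson correlation between the two columns of `rows`."""
    n = len(rows)
    mean_a = sum(a for a, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((a - mean_a) * (y - mean_y) for a, y in rows)
    var_a = sum((a - mean_a) ** 2 for a, _ in rows)
    var_y = sum((y - mean_y) ** 2 for _, y in rows)
    return cov / (var_a * var_y) ** 0.5

def mean_output(rows):
    return sum(y for _, y in rows) / len(rows)

# Rung 1: passive observation shows a strong association between a and y.
observed_corr = correlation(sample(2000))

# Rung 2: forcing a to different values leaves the average output unchanged.
effect = mean_output(sample(2000, do_a=1.0)) - mean_output(sample(2000, do_a=0.0))

print(f"observational correlation: {observed_corr:.3f}")
print(f"average effect of do(a=1) vs do(a=0): {effect:.3f}")
```

In this construction the observational correlation is close to 1, yet the interventional effect is zero, so a probe-style association alone would not license a claim about what editing `a` does.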

Interpretability and Causality

Interpretability research aims to link model behavior to internal structure, yet a component that predicts behavior accurately need not be a mechanism that can be manipulated to change it. The authors argue that interpretability claims should be stated in the vocabulary of causal inference: name the estimand, the class of interventions, and the equivalence class that the evidence can pin down. Pearl's causal hierarchy (the "ladder of causation") characterizes the strength of evidence each kind of causal claim requires, which helps diagnose claim-evidence mismatches in interpretability research.
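Why the top rung demands more than interventional evidence can be seen in a standard textbook-style construction, sketched here under stated assumptions. The two structural models `model_ignore` and `model_xor` are hypothetical and not from the paper; both share an unobserved binary noise term `u` assumed to be uniform on {0, 1}.

```python
def model_ignore(x, u):
    """Hypothetical model A: the output ignores the intervened variable x."""
    return u

def model_xor(x, u):
    """Hypothetical model B: the output flips whenever x flips."""
    return x ^ u

def interventional_prob(model, x):
    """P(y = 1 | do(x)), averaging over u uniform on {0, 1}."""
    return sum(model(x, u) for u in (0, 1)) / 2

def counterfactual(model, x_obs, y_obs, x_new):
    """Infer which u values fit the observation, then rerun the model with x_new."""
    consistent_u = [u for u in (0, 1) if model(x_obs, u) == y_obs]
    return {model(x_new, u) for u in consistent_u}

# Rung 2: both models give the same answer for every intervention.
same_interventions = all(
    interventional_prob(model_ignore, x) == interventional_prob(model_xor, x)
    for x in (0, 1)
)

# Rung 3: "we saw x=0, y=0; what would y have been under x=1?"
cf_ignore = counterfactual(model_ignore, x_obs=0, y_obs=0, x_new=1)
cf_xor = counterfactual(model_xor, x_obs=0, y_obs=0, x_new=1)

print("identical interventional distributions:", same_interventions)
print("counterfactual y under model A:", cf_ignore)
print("counterfactual y under model B:", cf_xor)
```

The two models agree on every interventional quantity but disagree on the counterfactual, so no amount of rung-two evidence can decide between them; some extra assumption or supervision is needed.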

Identifiability and Affordances in Interpretability

Identifiability, which defines what quantities can be uniquely determined from the available evidence, is central to validating interpretability claims. Identifiability results in CRL indicate that unsupervised recovery of latent variables is impossible without additional structure. CRL offers a way to interpret LLM activations causally by making explicit the assumptions under which classes of concepts can be identified. Affordances (the interactions a system permits) guide which structure is interpretable, and identifiability characterizes which variables remain invariant under those interactions. Applying this understanding to interpretability research helps determine when recovered features are meaningful beyond proxies such as reconstruction error or sparsity.
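A minimal sketch of one well-known identifiability gap, rotational ambiguity in a linear decoder, is given below. It is an illustration under assumed toy values (the `decoder`, `latents`, and `rotation` matrices are invented for the example), and it only speaks to reconstruction error; it says nothing about sparsity or about the specific results the paper cites.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def max_abs_gap(m1, m2):
    return max(abs(x - y) for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))

theta = math.pi / 4
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

latents = [[1.0, 0.0, 2.0, -1.0],   # 2 latent variables x 4 samples (toy values)
           [0.0, 1.0, 1.0,  3.0]]
decoder = [[1.0, 0.5],              # maps 2 latents to 3 "activation" dimensions
           [0.0, 2.0],
           [-1.0, 1.0]]

activations = matmul(decoder, latents)

# Rotate the latents and absorb the inverse rotation into the decoder.
alt_latents = matmul(rotation, latents)
alt_decoder = matmul(decoder, transpose(rotation))
alt_activations = matmul(alt_decoder, alt_latents)

reconstruction_gap = max_abs_gap(activations, alt_activations)
latent_gap = max_abs_gap(latents, alt_latents)

print(f"difference in reconstructed activations: {reconstruction_gap:.2e}")
print(f"difference in latent coordinates: {latent_gap:.2f}")
```

Both factorizations reproduce the activations equally well while assigning different latent coordinates, which is why reconstruction error alone cannot certify that a recovered feature basis is the meaningful one.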

Diagnosing Failures in Interpretability

Using causal theory, the authors catalog failures in which the claimed mechanism does not match the observed evidence. Such failures arise either from rung mismatches (the claim sits on a higher rung of the hierarchy than the evidence) or from identification gaps, and the diagnostic framework distinguishes the two. Common inferential gaps include mistaking sufficiency for necessity and assuming uniqueness when multiple explanations fit the observed data equally well. Through practical examples (activation patching, SAE feature alignment, and steering vectors), the paper illustrates the structural assumptions under which such claims would generalize.
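The sufficiency-versus-necessity gap can be sketched with a toy model that has two redundant components. Everything here is hypothetical: `toy_model`, `head_a`, and `head_b` are illustrative names, and the redundancy is an assumption built into the example rather than a finding about any real network.

```python
def toy_model(feature_present, overrides=None):
    """Two redundant components both detect the feature; either one drives the output."""
    activations = {"head_a": feature_present, "head_b": feature_present}
    if overrides:
        activations.update(overrides)   # patch or ablate individual components
    return int(activations["head_a"] or activations["head_b"])

clean_input, corrupted_input = 1, 0

# Activation patching: copy head_a's clean activation into the corrupted run.
patched_output = toy_model(corrupted_input, {"head_a": 1})

# Ablation: zero out head_a in the clean run.
ablated_output = toy_model(clean_input, {"head_a": 0})

# The same patching experiment on head_b succeeds too, so the explanation is not unique.
patched_other = toy_model(corrupted_input, {"head_b": 1})

print("patching head_a restores behaviour:", patched_output == 1)
print("ablating head_a breaks behaviour:", ablated_output == 0)
print("patching head_b also restores behaviour:", patched_other == 1)
```

Patching shows that `head_a` is sufficient to restore the behavior, yet ablation shows it is not necessary, and `head_b` passes the same sufficiency test. A claim that `head_a` is "the" mechanism would therefore overreach on both counts.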

Alternative Views on Interpretability

The paper also considers alternative perspectives: pragmatic interpretability, in which proxy tasks aim to connect mechanistic findings with practical deployment success, and symmetry-based interpretation, which focuses on preserving the meaning of explanations under transformation. It discusses where these approaches align with or diverge from the causal framing and identifiability, and where integrating causal inference could improve current methodologies.

Conclusion

The paper advocates integrating causal inference into interpretability research as a rigorous framework for matching claims to evidence. Stating interpretability claims in causal terms and evaluating methods against the predictions of identifiability theory offers a path toward reliable and generalizable interpretability tools. This integration can help provide dependable control handles for AI systems, which safe deployment in real-world applications requires.

The conclusion also outlines directions where CRL and mechanistic interpretability meet, emphasizing structural causality, task-relative equivalence classes, compositional control, and a better understanding of transportability for safe edits and steering in AI models. Pursuing these directions could benefit both fields by combining empirical phenomena with theoretical insights for practical AI safety.
