Causal Fidelity Score in Concept Graphs
- Causal Fidelity Score (CFS) is an intervention-based metric that quantifies how faithfully a learned Concept Graph captures genuine causal relationships.
- It measures causal reach by comparing the effects of ablating high-centrality nodes versus random nodes in LLM latent spaces.
- Reported benchmark results indicate that CFS values well above 1 distinguish genuine causal influence from mere correlation.
The Causal Fidelity Score (CFS) is an intervention-based quantitative metric that evaluates the extent to which a learned Causal Concept Graph (CCG) over sparse latent representations in LLMs encodes genuine causal relationships, rather than mere statistical correlations. CFS assesses whether targeted ablations of high-centrality concepts—identified through the graph structure—produce more significant downstream changes in their children concepts than comparable ablations at randomly chosen nodes, thereby measuring the “causal reach” of the graph’s most influential components (Meherab et al., 11 Mar 2026).
1. Foundational Motivation
The core motivation for CFS arises from the interpretability challenges in learned concept graphs over LLM latent spaces. Sparse autoencoders localize interpretable features (“concepts”) but do not, in isolation, elucidate the mechanisms of conceptual interaction during multi-step reasoning. The directed acyclic graph learned in a CCG proposes edges as putative causal dependencies. However, mere edge presence does not guarantee genuine causal influence. CFS addresses this by formalizing an intervention protocol: an explicit experiment that compares the effect of ablating (zeroing the activation of) high-centrality nodes versus random nodes, evaluating the difference in downstream impact on their children concepts.
A plausible implication is that a consistently high CFS across tasks signals robust graph structure and non-trivial encoding of reasoning dynamics in the LLM’s latent space.
2. Formal Mathematical Definition
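Let $W \in \mathbb{R}^{K \times K}$ denote the learned weighted adjacency matrix over the top $K$ discovered concepts. For node $i$, its set of “children” is defined as

$$\mathrm{Ch}(i) = \{\, j : |W_{ij}| > \tau \,\},$$

where the edge weight must exceed a fixed edge threshold $\tau$.

The intervention on node $i$ is realized by ablating its activation: $\mathrm{do}(z_i = 0)$. The downstream effect is computed as the average change in the children’s predicted activations (taken as $0$ when node $i$ has no children):

$$E(i) = \frac{1}{|\mathrm{Ch}(i)|} \sum_{j \in \mathrm{Ch}(i)} \left\lVert \hat{z}_j - \hat{z}_j^{\mathrm{do}(z_i = 0)} \right\rVert .$$

To assess causal specificity, $n$ high-centrality nodes $h_1, \dots, h_n$ (chosen by out-degree) and $n$ random indices $r_1, \dots, r_n$ are selected. For each $t = 1, \dots, n$, form the ratio

$$\rho_t = \frac{E(h_t)}{\max\bigl(E(r_t), \varepsilon\bigr)},$$

with a floor $\varepsilon > 0$ to avoid zero division. Clipping is enforced with a cap $c = 10$:

$$\tilde{\rho}_t = \min(\rho_t, c).$$

The final score aggregates over samples:

$$\mathrm{CFS} = \frac{1}{n} \sum_{t=1}^{n} \tilde{\rho}_t .$$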
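The ratio, clipping, and aggregation steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `causal_fidelity_score`, the default floor `eps=1e-6`, and the choice to apply the floor as `max(effect, eps)` are assumptions; only the cap of 10 is taken from the text.

```python
from typing import Sequence


def causal_fidelity_score(
    high_effects: Sequence[float],
    random_effects: Sequence[float],
    eps: float = 1e-6,   # hypothetical floor value; the source's value is not shown
    cap: float = 10.0,   # cap of 10, as stated for the clipping step
) -> float:
    """Mean of floored, clipped effect ratios over paired trials."""
    if len(high_effects) != len(random_effects) or not high_effects:
        raise ValueError("need equally many (and at least one) paired effects")
    clipped = []
    for e_high, e_rand in zip(high_effects, random_effects):
        ratio = e_high / max(e_rand, eps)  # floor the denominator (assumed form)
        clipped.append(min(ratio, cap))    # clip heavy-tailed ratios at the cap
    return sum(clipped) / len(clipped)


# Three paired trials; the second random node has zero measured effect,
# so its ratio is enormous before clipping and exactly 10 after.
print(causal_fidelity_score([0.8, 0.5, 0.3], [0.2, 0.0, 0.3]))  # 5.0
```

The per-trial values here are 4, 10 (clipped), and 1, so the mean is 5. Without the cap, the single zero-effect random node would dominate the average.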
3. Computational Protocol
CFS is computed through a structured four-step process:
- Intervention Target Selection: Nodes are ranked by out-degree. The top-ranked nodes are chosen for graph-guided ablation. Independently, an equal number of nodes is sampled uniformly as a random baseline.
- Ablation and Response Measurement: For each chosen node, its activation is set to zero; all other concept activations are unchanged. The downstream prediction is recalculated, and the resulting change is averaged over all of the node's children.
- Pairwise Ratio Computation: For each pair of a graph-selected node and a random node, calculate the ratio of the graph-selected node's effect to the random node's effect (with the floor applied to the denominator), then clip it at the cap of 10.
- Aggregation: Compute the mean clipped ratio across all trials.
This methodology ensures both the statistical stability of the metric (via paired sampling) and its robustness to the graph’s sparsity or anomalies in the effect sizes.
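The first two steps of the protocol (target selection and ablation measurement) might look roughly like the sketch below. Several details are assumptions made for illustration and are not confirmed by the source: the adjacency matrix is read as `W[i][j]` meaning an edge from concept `i` to concept `j`, edges are thresholded on absolute weight, a child's predicted activation is the linear sum of `W[p][j] * z[p]` over parents `p`, the change is measured as an absolute difference, and random nodes are drawn without replacement. All function names are hypothetical.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def children(W: Matrix, i: int, tau: float) -> List[int]:
    """Indices j whose edge weight from i exceeds the threshold (absolute value assumed)."""
    return [j for j, w in enumerate(W[i]) if j != i and abs(w) > tau]


def ablation_effect(W: Matrix, z: List[float], i: int, tau: float) -> float:
    """Mean absolute change in the children's linear predictions when z[i] is zeroed."""
    kids = children(W, i, tau)
    if not kids:
        return 0.0  # no children: no measurable downstream effect
    # With z_hat[j] = sum_p W[p][j] * z[p], zeroing z[i] shifts z_hat[j]
    # by exactly W[i][j] * z[i]; every other term is unchanged.
    return sum(abs(W[i][j] * z[i]) for j in kids) / len(kids)


def paired_effects(
    W: Matrix, z: List[float], tau: float, k: int, seed: int = 0
) -> Tuple[List[float], List[float]]:
    """Effects for the top-k out-degree nodes and for k uniformly sampled nodes."""
    n = len(W)
    out_degree = [len(children(W, i, tau)) for i in range(n)]
    high = sorted(range(n), key=lambda i: out_degree[i], reverse=True)[:k]
    rand = random.Random(seed).sample(range(n), k)  # without replacement (assumed)
    return (
        [ablation_effect(W, z, i, tau) for i in high],
        [ablation_effect(W, z, i, tau) for i in rand],
    )


# Toy 4-concept graph: node 0 has three children, node 1 has one,
# and the 0.05 edge out of node 2 falls below the threshold.
W = [
    [0.0, 0.9, 0.8, 0.7],
    [0.0, 0.0, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.05],
    [0.0, 0.0, 0.0, 0.0],
]
z = [1.0, 2.0, 1.0, 1.0]
high_eff, rand_eff = paired_effects(W, z, tau=0.1, k=2)
```

The two returned lists feed the remaining two steps: each pair is turned into a floored ratio, clipped at 10, and the clipped ratios are averaged.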
4. Properties, Bounds, and Model Assumptions
CFS has several carefully engineered properties:
- Chance-level Baseline: CFS approaches 1 if ablations at graph-selected nodes exert effects comparable to random ablations, i.e., when identified graph structures have low causal specificity.
- Sensitivity to True Influence: CFS exceeds 1 when the graph’s “influential” nodes reliably induce outsized downstream disruptions relative to random nodes.
- Sparsity-robust Lower Bound: The floor parameter ensures ratios are well-defined even in highly sparse graphs, where random nodes may have no children. The cap (10) constrains heavy-tailed ratios, yielding a conservative lower bound on CFS.
- Structural Model Constraint: The underlying structural equation model is linear, and the edge threshold governs graph sparsity. Extensions to nonlinear or multilayer SCMs are not explored in the original work.
This design provides a stable, interpretable metric for causal graph evaluation that is robust under model and data constraints.
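A small worked example may make the floor and cap concrete. The effect values and the floor $\varepsilon = 10^{-6}$ below are invented for illustration; only the cap of 10 comes from the text, and the floor is assumed to act as a maximum on the denominator. Take three paired trials with (graph-selected, random) effects $(0.6,\ 0.3)$, $(0.4,\ 0)$, and $(0.2,\ 0.2)$:

$$
\begin{aligned}
\text{trial 1:}\quad & \frac{0.6}{\max(0.3,\ 10^{-6})} = 2, && \min(2,\ 10) = 2,\\
\text{trial 2:}\quad & \frac{0.4}{\max(0,\ 10^{-6})} = 4 \times 10^{5}, && \min(4 \times 10^{5},\ 10) = 10,\\
\text{trial 3:}\quad & \frac{0.2}{\max(0.2,\ 10^{-6})} = 1, && \min(1,\ 10) = 1,\\
\text{score:}\quad & \frac{2 + 10 + 1}{3} = \frac{13}{3} \approx 4.33.
\end{aligned}
$$

Trial 3 sits exactly at the chance level of 1. In trial 2 the random node has no downstream effect, so the floor keeps the ratio finite and the cap stops it from dominating the mean, which is why the resulting score understates rather than overstates the unclipped average.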
5. Empirical Evaluation and Comparative Results
CFS was evaluated across three reasoning-focused NLP benchmarks—ARC-Challenge, StrategyQA, and LogiQA—using GPT-2 Medium. In each case, paired runs (across five random seeds on 300 examples per task) compared CCGs, ROME-style tracing, feature-only ranking, and random baselines. Results are summarized below:
| Benchmark | CCG (CFS) | ROME | SAE-only | Random |
|---|---|---|---|---|
| ARC-Challenge | | | | |
| StrategyQA | | | | |
| LogiQA | | | | |
| Avg. | | | | |
Statistical testing (one-sided paired t-tests, Bonferroni-corrected) confirmed strong effect sizes, reported as a $t$ statistic, a $p$-value, and Cohen’s $d$ for each comparison:
- CCG vs. ROME
- CCG vs. SAE-only
- CCG vs. Random
These results indicate substantial, statistically significant performance improvement for CCGs over competitive baselines (Meherab et al., 11 Mar 2026).
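For readers who want to see how such a paired comparison is assembled, the sketch below computes a paired $t$ statistic, a Cohen's $d$, and a Bonferroni-adjusted significance level using only the Python standard library. The per-seed scores are made-up numbers, not the reported results, and the base level `alpha=0.05` is a placeholder because the level used in the source is not shown. Computing Cohen's $d$ on the paired differences (often written $d_z$) is also an assumption; the source does not state which variant it used.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_and_cohens_d(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Paired t statistic and Cohen's d_z for per-seed scores a versus b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    d_z = mean_diff / sd_diff          # effect size on the paired differences
    return t_stat, d_z


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical CFS values over five seeds (illustrative only).
ccg_scores = [5.0, 5.5, 6.0, 5.5, 5.0]
baseline_scores = [3.0, 3.5, 3.0, 3.5, 3.0]

t_stat, d_z = paired_t_and_cohens_d(ccg_scores, baseline_scores)
alpha_per_test = bonferroni_alpha(0.05, n_comparisons=3)  # 0.05 is a placeholder
```

With five seeds the $t$ statistic has four degrees of freedom; a one-sided $p$-value would then be read from the Student $t$ distribution (for example with a statistics library) and compared against `alpha_per_test`.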
6. Interpretation and Analytical Significance
A high CFS (≫1) signifies that the learned CCG is reliably identifying concepts where intervention (ablation) causes substantially greater downstream change to their children than random ablations. This is interpreted as evidence that the concept graph has learned non-trivial, causally salient structure, rather than capturing spurious statistical relations. Conversely, a CFS close to $1$ would indicate that graph-identified “influential” nodes are no more likely than random nodes to affect child concept activations, implying poor causal fidelity.
Because the CFS is explicitly clipped (at 10) and floored (via the floor parameter), the metric is a conservative lower bound on the graph’s “true” causal reach; this is particularly relevant in sparse graphs, where many random nodes might otherwise yield undefined or inflated ratios. The results support the stability and domain-specificity of the learned CCGs, with reported sparse edge densities ($5$–) and robustness across seeds.
7. Contextual and Methodological Implications
CFS contributes a principled evaluation mechanism for causal graph discovery in LLM latent spaces, emphasizing intervention over correlation as the key criterion. The reliance on do-calculus-style ablations grounds the metric in the interventionist tradition of causal inference. The conservative nature of its construction highlights its utility for reliable benchmarking in future work, including potential extension to nonlinear or multilayer SCMs.
Current limitations include the restriction to linear SEMs and the need for threshold selection for graph edges. Further work may investigate the adaptation of CFS to more complex structural motifs or richer types of node interventions.
CFS thus operationalizes causal evaluation for learned concept graphs and demonstrates, in the context of LLM reasoning benchmarks, that structure-informed intervention metrics provide rigorous, stable, and informative measures of latent causal dynamics (Meherab et al., 11 Mar 2026).