Causal Fidelity Score in Concept Graphs

Updated 18 March 2026

Causal Fidelity Score (CFS) is an intervention-based metric that estimates how well a learned Causal Concept Graph captures genuine causal relationships rather than correlations.
It measures causal reach by comparing the downstream effect of ablating high-centrality nodes with that of ablating random nodes in LLM latent spaces.
In the reported benchmarks, graph-guided ablations yield CFS values well above the chance level of about 1, consistent with genuine causal influence rather than mere correlation.

The Causal Fidelity Score (CFS) is an intervention-based quantitative metric that evaluates the extent to which a learned Causal Concept Graph (CCG) over sparse latent representations in LLMs encodes genuine causal relationships, rather than mere statistical correlations. CFS assesses whether targeted ablations of high-centrality concepts—identified through the graph structure—produce more significant downstream changes in their children concepts than comparable ablations at randomly chosen nodes, thereby measuring the “causal reach” of the graph’s most influential components (Meherab et al., 11 Mar 2026).

1. Foundational Motivation

The core motivation for CFS arises from the interpretability challenges in learned concept graphs over LLM latent spaces. Sparse autoencoders localize interpretable features (“concepts”) but do not, in isolation, elucidate the mechanisms of conceptual interaction during multi-step reasoning. The directed acyclic graph learned in a CCG proposes edges as putative causal dependencies. However, mere edge presence does not guarantee genuine causal influence. CFS addresses this by formalizing an intervention protocol: an explicit experiment that compares the effect of ablating (zeroing the activation of) high-centrality nodes versus random nodes, evaluating the difference in downstream impact on their children concepts.

A plausible implication is that a consistently high CFS across tasks signals robust graph structure and non-trivial encoding of reasoning dynamics in the LLM’s latent space.

2. Formal Mathematical Definition

Let $W \in \mathbb{R}^{M \times M}$ denote the learned weighted adjacency matrix over the top $M$ discovered concepts. For node $i$, the set of “children” is defined as

$D_i = \{j : W_{ij} > 0.01\},$

that is, the nodes $j$ whose edge weight from $i$ exceeds the fixed edge threshold of $0.01$.

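The intervention on node $i$ is realized by ablating its activation: $c_i \leftarrow 0$. Here $C$ is the matrix of concept activations (columns index concepts) and $[CW]_{\cdot j}$ is the column of predicted activations for child $j$. The downstream effect $\Delta_i$ is computed as the average $\ell_1$ change in the children’s predicted activations: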

$\Delta_i = \frac{1}{|D_i|} \sum_{j \in D_i} \| [CW]_{\cdot j}\big|_{c_i=0} - [CW]_{\cdot j}\big|_{\text{orig}} \|_1.$
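The two definitions above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the toy matrices, the row-per-input layout of `C`, and the choice to return zero for a node with no children are all assumptions.

```python
import numpy as np

EDGE_THRESHOLD = 0.01  # edge threshold used in the definition of D_i


def children(W, i, threshold=EDGE_THRESHOLD):
    """Return the indices j with W[i, j] > threshold, i.e. the set D_i."""
    return np.flatnonzero(W[i] > threshold)


def downstream_effect(C, W, i, threshold=EDGE_THRESHOLD):
    """Average l1 change in the children's predicted activations after c_i <- 0."""
    kids = children(W, i, threshold)
    if kids.size == 0:
        return 0.0  # assumption: a node without children has zero effect
    C_ablated = C.copy()
    C_ablated[:, i] = 0.0                             # ablate concept i for every input
    diff = C_ablated @ W - C @ W                      # change in the linear prediction CW
    per_child_l1 = np.abs(diff[:, kids]).sum(axis=0)  # l1 norm of each child column
    return float(per_child_l1.mean())


# Toy data (hypothetical): 2 inputs, 3 concepts.
C = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0]])
W = np.array([[0.0, 0.5, 0.2],
              [0.0, 0.0, 0.005],  # below threshold, so node 1 has no children
              [0.0, 0.0, 0.0]])

delta_0 = downstream_effect(C, W, 0)  # children of node 0 are nodes 1 and 2
delta_1 = downstream_effect(C, W, 1)  # no children above the threshold
```

In this toy case only the first input activates concept 0, so the per-child changes are 0.5 and 0.2 and `delta_0` is their mean.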

To assess causal specificity, $S$ high-centrality nodes $\{i_c^{(s)}\}$ (chosen by out-degree) and $S$ random indices $\{i_r^{(s)}\}$ are selected ($S=20$). For each $s=1, \dots, S$, form the ratio:

$r_s = \frac{\Delta_{i_c^{(s)}}}{\max(\Delta_{i_r^{(s)}},\ \delta)},$

with $\delta=10^{-3}$ to avoid division by zero. Each ratio is then clipped at $\tau=10$:

$r_s \leftarrow \min(r_s, \tau).$

The final score aggregates over $S$ samples:

$\text{CFS} = \frac{1}{S} \sum_{s=1}^S r_s.$
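A small worked example of the ratio, floor, clip, and mean steps, in plain Python. The effect values are invented for illustration and the function name is an assumption; only $\delta$ and $\tau$ come from the definition above.

```python
DELTA = 1e-3  # floor on the denominator
TAU = 10.0    # cap on each ratio


def cfs_from_effects(central_effects, random_effects, delta=DELTA, tau=TAU):
    """Mean of clipped, floored ratios between paired downstream effects."""
    if len(central_effects) != len(random_effects):
        raise ValueError("effects must be paired one-to-one")
    if not central_effects:
        raise ValueError("need at least one pair")
    ratios = [
        min(c / max(r, delta), tau)  # floor the denominator, then cap the ratio
        for c, r in zip(central_effects, random_effects)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical effects for S = 4 pairs (not taken from the paper).
central = [0.8, 0.6, 0.9, 0.5]
random_baseline = [0.2, 0.0, 0.3, 0.25]

score = cfs_from_effects(central, random_baseline)
```

The second pair shows both safeguards at work: the random node has zero effect, so the denominator is floored to $10^{-3}$, and the resulting ratio of 600 is clipped to 10. The four ratios are then approximately 4, 10, 3, and 2, giving a score of about 4.75.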

3. Computational Protocol

CFS is computed through a structured four-step process:

Intervention Target Selection: Nodes are ranked by out-degree. The top $S=20$ are chosen for graph-guided ablation. Independently, $S=20$ nodes are sampled uniformly as a random baseline.
Ablation and Response Measurement: For each chosen node $i$, its activation is set to zero ($c_i=0$); all other concept activations are unchanged. The downstream prediction $CW$ is recalculated, and the $\ell_1$-difference $\Delta_i$ is averaged over all children $j\in D_i$.
Pairwise Ratio Computation: For each $s$, calculate $r_s$ as the ratio of $\Delta_{i_c^{(s)}}$ to $\max(\Delta_{i_r^{(s)}}, \delta)$, then clip with $\tau=10$.
Aggregation: Compute the mean value across all $S$ trials.

This protocol is intended to keep the metric statistically stable (through paired sampling) and robust to graph sparsity and to outliers in the effect sizes (through the floor and the cap).
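The four steps can be put together in one NumPy function. This is a sketch under stated assumptions, not the reference implementation: the function name, the seed handling, sampling random nodes without replacement, pairing central and random nodes by position, and treating childless nodes as having zero effect are all choices made here for illustration.

```python
import numpy as np


def causal_fidelity_score(C, W, S=20, threshold=0.01, delta=1e-3, tau=10.0, seed=0):
    """Sketch of CFS: C is (inputs x concepts), W is (concepts x concepts)."""
    M = W.shape[0]
    S = min(S, M)                  # cannot select more nodes than exist
    adj = W > threshold            # boolean edge mask defining the children sets

    # Step 1: top-S nodes by out-degree, plus S uniformly sampled nodes.
    out_degree = adj.sum(axis=1)
    central = np.argsort(-out_degree, kind="stable")[:S]
    rng = np.random.default_rng(seed)
    random_nodes = rng.choice(M, size=S, replace=False)  # assumption: no repeats

    base = C @ W                   # unablated linear prediction

    def effect(i):
        # Step 2: ablate concept i and average the l1 change over its children.
        kids = np.flatnonzero(adj[i])
        if kids.size == 0:
            return 0.0
        C_ablated = C.copy()
        C_ablated[:, i] = 0.0
        return float(np.abs((C_ablated @ W - base)[:, kids]).sum(axis=0).mean())

    # Step 3: floored, clipped ratio for each (central, random) pair.
    ratios = [
        min(effect(c) / max(effect(r), delta), tau)
        for c, r in zip(central, random_nodes)
    ]
    # Step 4: aggregate.
    return float(np.mean(ratios))


# Toy graph (hypothetical): node 0 is a hub with five children, the rest are leaves.
M = 6
W = np.zeros((M, M))
W[0, 1:] = 0.5
C = np.ones((4, M))
score = causal_fidelity_score(C, W, S=M)
```

Recomputing `C @ W` per ablation keeps the sketch close to the text. Because the model is linear, the same change could be obtained from the ablated column of `C` and row `i` of `W` alone.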

4. Properties, Bounds, and Model Assumptions

CFS has the following properties:

Chance-level Baseline: CFS approaches 1 if ablations at graph-selected nodes exert effects comparable to random ablations, i.e., when identified graph structures have low causal specificity.
Sensitivity to True Influence: CFS exceeds 1 when the graph’s “influential” nodes reliably induce outsized downstream disruptions relative to random nodes.
Sparsity-robust Lower Bound: The floor parameter $\delta$ ($10^{-3}$) keeps ratios well-defined even in highly sparse graphs, where random nodes may have no children ($\Delta_{i_r}=0$). The cap $\tau$ (10) constrains heavy-tailed ratios. Together, they make the reported CFS a conservative lower bound on the unclipped mean ratio.
Structural Model Constraint: The underlying structural equation model is linear (i.e., $C \approx CW$), and the edge threshold $W_{ij}>0.01$ governs graph sparsity. Extensions to nonlinear or multilayer SCMs are not explored in the original work.

This design provides a stable, interpretable metric for causal graph evaluation that is robust under model and data constraints.
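The bounds implied by the floor and the cap can be written out explicitly. The following short derivation uses only the definitions from Section 2 and the fact that each $\Delta_i$ is an average of $\ell_1$ norms and is therefore non-negative. Clipping bounds every ratio, and hence the score:

$0 \le r_s \le \tau \ \text{ for all } s \quad \Longrightarrow \quad 0 \le \text{CFS} \le \tau = 10.$

Flooring the denominator can only shrink a ratio relative to its unfloored value:

$\frac{\Delta_{i_c^{(s)}}}{\max(\Delta_{i_r^{(s)}},\ \delta)} \le \frac{\Delta_{i_c^{(s)}}}{\Delta_{i_r^{(s)}}} \quad \text{whenever } \Delta_{i_r^{(s)}} > 0.$

So wherever the unfloored, unclipped ratio is defined, the reported $r_s$ does not exceed it, which is the sense in which the score is conservative.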

5. Empirical Evaluation and Comparative Results

CFS was evaluated on three reasoning-focused NLP benchmarks (ARC-Challenge, StrategyQA, and LogiQA) using GPT-2 Medium. In each case, $n=15$ paired runs (across five random seeds on 300 examples per task) compared CCGs against ROME-style tracing, SAE feature-only ranking (SAE-only), and random baselines. All columns in the table below report CFS:

Benchmark	CCG (CFS)	ROME	SAE-only	Random
ARC-Challenge	$5.729 \pm 0.875$	$3.488 \pm 0.203$	$2.552 \pm 0.189$	$1.032 \pm 0.034$
StrategyQA	$5.461 \pm 0.405$	$3.205 \pm 0.179$	$2.399 \pm 0.170$	$1.032 \pm 0.034$
LogiQA	$5.771 \pm 0.431$	$3.452 \pm 0.204$	$2.487 \pm 0.196$	$1.032 \pm 0.034$
Avg.	$5.654 \pm 0.625$	$3.382 \pm 0.233$	$2.479 \pm 0.196$	$1.032 \pm 0.034$
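As a quick arithmetic check, the central values in the Avg. row appear to be simple means of the three benchmark rows. The snippet below recomputes them from the table; it says nothing about how the ± terms were obtained, which the source does not specify.

```python
# CFS central values per benchmark (ARC-Challenge, StrategyQA, LogiQA), copied from the table.
reported = {
    "CCG":      [5.729, 5.461, 5.771],
    "ROME":     [3.488, 3.205, 3.452],
    "SAE-only": [2.552, 2.399, 2.487],
    "Random":   [1.032, 1.032, 1.032],
}
# Central values from the Avg. row.
reported_avg = {"CCG": 5.654, "ROME": 3.382, "SAE-only": 2.479, "Random": 1.032}

column_means = {name: sum(vals) / len(vals) for name, vals in reported.items()}

# Largest gap between a recomputed mean and the tabulated average.
max_gap = max(abs(column_means[k] - reported_avg[k]) for k in reported)
```

A `max_gap` below half a unit in the third decimal place is what rounding alone would produce.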

One-sided paired t-tests (Bonferroni-corrected, $n=15$) gave large effect sizes for each comparison:

CCG vs. ROME: $t=14.32$, $p<0.0001$, Cohen’s $d=4.82$
CCG vs. SAE-only: $t=19.83$, $p<0.0001$, Cohen’s $d=6.86$
CCG vs. Random: $t=27.95$, $p<0.0001$, Cohen’s $d=10.45$
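For readers who want to run this kind of comparison on their own scores, the paired t statistic and the Bonferroni adjustment can be sketched with the standard library alone. The function names and the sample scores are hypothetical, and the source does not say which variant of Cohen's $d$ it reports, so no effect size is computed here.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for paired samples: mean difference over its standard error."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_d / (sd_d / math.sqrt(len(diffs)))


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical paired CFS scores from three runs (not from the paper).
method_scores = [5.0, 6.0, 7.0]
baseline_scores = [3.0, 3.5, 5.5]
t_stat = paired_t_statistic(method_scores, baseline_scores)

# Hypothetical raw p-values for three comparisons.
adjusted = bonferroni([0.01, 0.02, 0.5])
```

Turning the t statistic into a one-sided p-value needs the t distribution with $n-1$ degrees of freedom. One option is SciPy's `scipy.stats.ttest_rel` with `alternative="greater"`, which returns the statistic and p-value together.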

These results indicate that CCGs achieve substantially and statistically significantly higher CFS than each of the baselines (Meherab et al., 11 Mar 2026).

6. Interpretation and Analytical Significance

A high CFS (≫1) signifies that the learned CCG is reliably identifying concepts where intervention (ablation) causes substantially greater downstream change to their children than random ablations. This is interpreted as evidence that the concept graph has learned non-trivial, causally salient structure, rather than capturing spurious statistical relations. Conversely, a CFS close to $1$ would indicate that graph-identified “influential” nodes are no more likely than random nodes to affect child concept activations, implying poor causal fidelity.

Because the CFS is explicitly clipped (at $\tau=10$) and floored (at $\delta=10^{-3}$), the metric is a conservative lower bound on the graph’s “true” causal reach; this is particularly relevant in sparse graphs, where many random nodes might otherwise yield undefined or inflated ratios. The results support the stability and domain-specificity of the learned CCGs, with reported sparse edge densities ($5$–$6\%$) and robustness across seeds.

7. Contextual and Methodological Implications

CFS contributes a principled evaluation mechanism for causal graph discovery in LLM latent spaces, emphasizing intervention over correlation as the key criterion. The reliance on do-operator-style interventions (ablations) grounds the metric in the interventionist tradition of causal inference. Its conservative construction makes it suitable for benchmarking in future work, including potential extension to nonlinear or multilayer SCMs.

Current limitations include the restriction to linear SEMs and the need for threshold selection for graph edges. Further work may investigate the adaptation of CFS to more complex structural motifs or richer types of node interventions.

CFS thus operationalizes causal evaluation for learned concept graphs and demonstrates, in the context of LLM reasoning benchmarks, that structure-informed intervention metrics provide rigorous, stable, and informative measures of latent causal dynamics (Meherab et al., 11 Mar 2026).

References (1)

Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning (2026)