Causal Fidelity Score in Concept Graphs

Updated 18 March 2026
  • Causal Fidelity Score (CFS) is an intervention-based metric that quantifies how far a learned Causal Concept Graph encodes genuine causal relationships rather than statistical correlations.
  • It measures causal reach by comparing the downstream effect of ablating high-centrality nodes with that of ablating random nodes in LLM latent spaces.
  • In the reported benchmarks, Causal Concept Graphs reach an average CFS of about 5.65, compared with about 1.03 for a random baseline.

The Causal Fidelity Score (CFS) is an intervention-based quantitative metric that evaluates the extent to which a learned Causal Concept Graph (CCG) over sparse latent representations in LLMs encodes genuine causal relationships, rather than mere statistical correlations. CFS assesses whether targeted ablations of high-centrality concepts—identified through the graph structure—produce more significant downstream changes in their children concepts than comparable ablations at randomly chosen nodes, thereby measuring the “causal reach” of the graph’s most influential components (Meherab et al., 11 Mar 2026).

1. Foundational Motivation

The core motivation for CFS arises from the interpretability challenges in learned concept graphs over LLM latent spaces. Sparse autoencoders localize interpretable features (“concepts”) but do not, in isolation, elucidate the mechanisms of conceptual interaction during multi-step reasoning. The directed acyclic graph learned in a CCG proposes edges as putative causal dependencies. However, mere edge presence does not guarantee genuine causal influence. CFS addresses this by formalizing an intervention protocol: an explicit experiment that compares the effect of ablating (zeroing the activation of) high-centrality nodes versus random nodes, evaluating the difference in downstream impact on their children concepts.

A plausible implication is that a consistently high CFS across tasks signals robust graph structure and non-trivial encoding of reasoning dynamics in the LLM’s latent space.

2. Formal Mathematical Definition

Let $W \in \mathbb{R}^{M \times M}$ denote the learned weighted adjacency matrix over the top $M$ discovered concepts. For node $i$, its set of “children” is defined as

$$D_i = \{j : W_{ij} > 0.01\},$$

that is, the nodes $j$ whose edge weight $W_{ij}$ exceeds the fixed edge threshold of $0.01$.

The intervention on node $i$ is realized by ablating its activation: $c_i \leftarrow 0$. The downstream effect $\Delta_i$ is computed as the average $\ell_1$ change in the children’s predicted activations:

$$\Delta_i = \frac{1}{|D_i|} \sum_{j \in D_i} \left\| [CW]_{\cdot j}\big|_{c_i=0} - [CW]_{\cdot j}\big|_{\text{orig}} \right\|_1.$$
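The definition of $\Delta_i$ can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes `C` is an `(N, M)` array of concept activations over `N` examples, takes the $\ell_1$ norm over those examples, and returns `0.0` for a node without children. The function name `downstream_effect` and the toy arrays are hypothetical.

```python
import numpy as np

def downstream_effect(C, W, i, edge_threshold=0.01):
    """Average l1 change in the children's predicted activations when concept i is ablated.

    C: (N, M) concept activations (assumed layout), W: (M, M) weighted adjacency matrix.
    """
    children = np.flatnonzero(W[i] > edge_threshold)  # D_i
    if children.size == 0:
        return 0.0  # assumption: a node without children has zero downstream effect
    C_ablated = C.copy()
    C_ablated[:, i] = 0.0  # c_i <- 0, all other concepts unchanged
    pred_orig = C @ W
    pred_ablated = C_ablated @ W
    # l1 norm over the N examples for each child column, then the mean over children
    per_child = np.abs(pred_ablated[:, children] - pred_orig[:, children]).sum(axis=0)
    return float(per_child.mean())

# Toy data for illustration only (not taken from the paper).
C = np.array([[1.0, 2.0, 0.0],
              [3.0, 0.0, 1.0]])
W = np.array([[0.0, 0.5, 0.2],
              [0.0, 0.0, 0.3],
              [0.0, 0.0, 0.0]])
print(downstream_effect(C, W, 0))
```

Because the predicted activations are linear in `C`, ablating concept `i` changes child column `j` by exactly `C[:, i] * W[i, j]`, so the effect grows with both the concept's activation mass and its outgoing edge weights.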

To assess causal specificity, $S$ high-centrality nodes $\{i_c^{(s)}\}$ (chosen by out-degree) and $S$ random indices $\{i_r^{(s)}\}$ are selected ($S=20$). For each $s=1, \dots, S$, form the ratio:

$$r_s = \frac{\Delta_{i_c^{(s)}}}{\max(\Delta_{i_r^{(s)}},\ \delta)},$$

with $\delta=10^{-3}$ to avoid zero division. Clipping is enforced with $\tau=10$:

$$r_s \leftarrow \min(r_s, \tau).$$

The final score aggregates over $S$ samples:

$$\text{CFS} = \frac{1}{S} \sum_{s=1}^S r_s.$$
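As a worked illustration of the ratio, floor, and clip, consider $S=3$ pairs with hypothetical effect sizes $\Delta_{i_c} = (0.8,\ 0.5,\ 0.2)$ and $\Delta_{i_r} = (0.2,\ 0,\ 0.1)$. These numbers are invented for the arithmetic and do not come from the paper; $\delta=10^{-3}$ and $\tau=10$ are the values given above.

```latex
\begin{aligned}
r_1 &= \frac{0.8}{\max(0.2,\ 10^{-3})} = 4, \\
r_2 &= \frac{0.5}{\max(0,\ 10^{-3})} = 500 \;\longrightarrow\; \min(500,\ 10) = 10, \\
r_3 &= \frac{0.2}{\max(0.1,\ 10^{-3})} = 2, \\
\text{CFS} &= \tfrac{1}{3}\,(4 + 10 + 2) = \tfrac{16}{3} \approx 5.33.
\end{aligned}
```

The second pair shows why both safeguards matter: the floor keeps the ratio finite when the random node has no downstream effect, and the clip stops that single pair from dominating the mean.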

3. Computational Protocol

CFS is computed through a structured four-step process:

  1. Intervention Target Selection: Nodes are ranked by out-degree. The top $S=20$ are chosen for graph-guided ablation. Independently, $S=20$ nodes are sampled uniformly as a random baseline.
  2. Ablation and Response Measurement: For each chosen node $i$, its activation is set to zero ($c_i=0$); all other concept activations are unchanged. The downstream prediction $CW$ is recalculated, and the $\ell_1$-difference $\Delta_i$ is averaged over all children $j\in D_i$.
  3. Pairwise Ratio Computation: For each $s$, calculate $r_s$ as the ratio of $\Delta_{i_c^{(s)}}$ to $\max(\Delta_{i_r^{(s)}}, \delta)$, then clip with $\tau=10$.
  4. Aggregation: Compute the mean value across all $S$ trials.

Pairing each graph-guided ablation with a random one, flooring the denominator at $\delta$, and clipping each ratio at $\tau$ keep the score stable in sparse graphs and limit the influence of outlying effect sizes.
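The four steps can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the paper's code: out-degree is taken as the count of edges above the threshold, random nodes are drawn without replacement, and `C` is an `(N, M)` activation matrix. The function names, the `seed` parameter, and the toy data are hypothetical. Instead of re-running the ablation per node, the sketch uses the fact that in the linear model the $\ell_1$ change in child column $j$ equals $\|C_{\cdot i}\|_1 \, W_{ij}$.

```python
import numpy as np

def all_downstream_effects(C, W, edge_threshold=0.01):
    """Delta_i for every node i, using the closed form that holds for C @ W."""
    edges = W > edge_threshold                 # boolean (M, M) edge mask
    out_degree = edges.sum(axis=1)             # |D_i| for each node
    col_l1 = np.abs(C).sum(axis=0)             # ||C[:, i]||_1 over the N examples
    child_weight_sum = np.where(edges, W, 0.0).sum(axis=1)
    # Mean edge weight to children; nodes without children get 0.
    mean_child_weight = np.divide(child_weight_sum, out_degree,
                                  out=np.zeros(W.shape[0]),
                                  where=out_degree > 0)
    return col_l1 * mean_child_weight

def causal_fidelity_score(C, W, S=20, delta=1e-3, tau=10.0,
                          edge_threshold=0.01, seed=0):
    """Sketch of the four-step CFS protocol."""
    M = W.shape[0]
    if S > M:
        raise ValueError("S must not exceed the number of concepts M")

    # Step 1: top-S nodes by out-degree, plus S uniformly sampled nodes.
    out_degree = (W > edge_threshold).sum(axis=1)
    central = np.argsort(-out_degree, kind="stable")[:S]
    rng = np.random.default_rng(seed)
    random_nodes = rng.choice(M, size=S, replace=False)  # assumption: no replacement

    # Step 2: ablation effect of every node.
    effects = all_downstream_effects(C, W, edge_threshold)

    # Step 3: paired ratios with a floored denominator, clipped at tau.
    ratios = effects[central] / np.maximum(effects[random_nodes], delta)
    ratios = np.minimum(ratios, tau)

    # Step 4: mean over the S pairs.
    return float(ratios.mean())

# Toy data for illustration only: random activations and a sparse,
# strictly upper-triangular (hence acyclic) weight matrix.
rng = np.random.default_rng(1)
C = rng.random((300, 50))
mask = rng.random((50, 50)) < 0.06
W = np.triu(rng.random((50, 50)) * mask, k=1)
print(causal_fidelity_score(C, W))
```

The closed form avoids `S` separate matrix products but is only valid while the structural model stays linear; a nonlinear predictor would need the explicit ablate-and-recompute loop.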

4. Properties, Bounds, and Model Assumptions

CFS has several carefully engineered properties:

  • Chance-level Baseline: CFS approaches 1 if ablations at graph-selected nodes exert effects comparable to random ablations, i.e., when identified graph structures have low causal specificity.
  • Sensitivity to True Influence: CFS exceeds 1 when the graph’s “influential” nodes reliably induce outsized downstream disruptions relative to random nodes.
  • Sparsity-robust Lower Bound: The floor parameter $\delta$ ($10^{-3}$) ensures ratios are well-defined, even in highly sparse graphs where random nodes may have no children ($\Delta_{i_r}=0$). The cap $\tau$ ($10$) constrains heavy-tailed ratios, yielding a conservative lower bound on CFS.
  • Structural Model Constraint: The underlying structural equation model is linear (i.e., $C \approx CW$), and the edge threshold $W_{ij}>0.01$ governs graph sparsity. Extensions to nonlinear or multilayer SCMs are not explored in the original work.

This design provides a stable, interpretable metric for causal graph evaluation that is robust under model and data constraints.
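The bounds implied by the floor and cap follow directly from the definitions in Section 2. The short derivation below is a reader's sketch rather than a result stated in the source; it uses only $\Delta_i \ge 0$, $\delta = 10^{-3}$, and $\tau = 10$.

```latex
\begin{aligned}
&0 \le r_s = \min\!\left(\frac{\Delta_{i_c^{(s)}}}{\max(\Delta_{i_r^{(s)}},\ \delta)},\ \tau\right) \le \tau
\quad\Longrightarrow\quad 0 \le \text{CFS} \le \tau = 10, \\[4pt]
&\Delta_{i_r^{(s)}} \le \delta \ \text{ and } \ \Delta_{i_c^{(s)}} \ge \tau\delta = 10^{-2}
\quad\Longrightarrow\quad r_s = \tau, \\[4pt]
&\text{CFS} \;\le\; \frac{1}{S}\sum_{s=1}^{S} \frac{\Delta_{i_c^{(s)}}}{\max(\Delta_{i_r^{(s)}},\ \delta)}
\quad\text{(clipping can only lower each ratio).}
\end{aligned}
```

The last line is the sense in which the reported score is conservative: it never exceeds the mean of the unclipped ratios.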

5. Empirical Evaluation and Comparative Results

CFS was evaluated across three reasoning-focused NLP benchmarks (ARC-Challenge, StrategyQA, and LogiQA) using GPT-2 Medium. In each case, $n=15$ paired runs (across five random seeds on 300 examples per task) compared CCGs, ROME-style tracing, feature-only ranking, and random baselines. Results are summarized below:

| Benchmark | CCG (CFS) | ROME | SAE-only | Random |
|---|---|---|---|---|
| ARC-Challenge | $5.729 \pm 0.875$ | $3.488 \pm 0.203$ | $2.552 \pm 0.189$ | $1.032 \pm 0.034$ |
| StrategyQA | $5.461 \pm 0.405$ | $3.205 \pm 0.179$ | $2.399 \pm 0.170$ | $1.032 \pm 0.034$ |
| LogiQA | $5.771 \pm 0.431$ | $3.452 \pm 0.204$ | $2.487 \pm 0.196$ | $1.032 \pm 0.034$ |
| Avg. | $5.654 \pm 0.625$ | $3.382 \pm 0.233$ | $2.479 \pm 0.196$ | $1.032 \pm 0.034$ |

Statistical testing (one-sided paired t-tests, Bonferroni corrected, $n=15$) confirmed strong effect sizes:

  • CCG vs. ROME: $t=14.32$, $p<0.0001$, Cohen’s $d=4.82$
  • CCG vs. SAE-only: $t=19.83$, $p<0.0001$, $d=6.86$
  • CCG vs. Random: $t=27.95$, $p<0.0001$, $d=10.45$

These results indicate that CCGs achieve substantially and statistically significantly higher CFS than all three baselines (Meherab et al., 11 Mar 2026).
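For readers who want to run the same kind of comparison on their own scores, the paired t statistic and the Bonferroni adjustment can be sketched with the Python standard library. This is a generic textbook formulation, not the authors' analysis script: the function names are hypothetical, the p-value itself would need a t-distribution CDF from a statistics library, and the source does not say which variant of Cohen's $d$ was used, so it is left out.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    # Raises ZeroDivisionError if all differences are identical.
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

def bonferroni(p_value, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_comparisons)

# Toy paired scores for illustration only (not values from the paper).
method_scores = [2.0, 4.0, 6.0]
baseline_scores = [1.0, 2.0, 3.0]
t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
print(t_stat, dof)
```

With three baseline comparisons, as in the list above, `n_comparisons` would be 3.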

6. Interpretation and Analytical Significance

A high CFS (≫1) signifies that the learned CCG is reliably identifying concepts where intervention (ablation) causes substantially greater downstream change to their children than random ablations. This is interpreted as evidence that the concept graph has learned non-trivial, causally salient structure, rather than capturing spurious statistical relations. Conversely, a CFS close to $1$ would indicate that graph-identified “influential” nodes are no more likely than random nodes to affect child concept activations, implying poor causal fidelity.

Because the CFS is explicitly clipped (at $\tau=10$) and floored (at $\delta=10^{-3}$), the metric is a conservative lower bound on the graph’s “true” causal reach; this is particularly relevant in sparse graphs, where many random nodes might otherwise yield undefined or inflated ratios. The results support the stability and domain-specificity of the learned CCGs, with reported sparse edge densities ($5$–$6\%$) and robustness across seeds.

7. Contextual and Methodological Implications

CFS contributes a principled evaluation mechanism for causal graph discovery in LLM latent spaces, emphasizing intervention over correlation as the key criterion. The reliance on do-calculus-style ablations grounds the metric in the interventionist tradition of causal inference. The conservative nature of its construction highlights its utility for reliable benchmarking in future work, including potential extension to nonlinear or multilayer SCMs.

Current limitations include the restriction to linear SEMs and the need for threshold selection for graph edges. Further work may investigate the adaptation of CFS to more complex structural motifs or richer types of node interventions.

CFS thus operationalizes causal evaluation for learned concept graphs and demonstrates, in the context of LLM reasoning benchmarks, that structure-informed intervention metrics provide rigorous, stable, and informative measures of latent causal dynamics (Meherab et al., 11 Mar 2026).

References (1)
