FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning

Published 12 Apr 2026 in cs.AI | (2604.10693v1)

Abstract: Chain-of-Thought (CoT) prompting has improved LLM reasoning, but models often generate explanations that appear coherent while containing unfaithful intermediate steps. Existing self-evaluation approaches are prone to inherent biases: the model may confidently endorse coherence even when the step-to-step implication is not valid, leading to unreliable faithfulness evaluation. We propose FACT-E, a causality-inspired framework for evaluating CoT quality. FACT-E uses controlled perturbations as an instrumental signal to separate genuine step-to-step dependence from bias-driven artifacts, producing more reliable faithfulness estimates (intra-chain faithfulness). To select trustworthy trajectories, FACT-E jointly considers intra-chain faithfulness and CoT-to-answer consistency, ensuring that selected chains are both faithful internally and supportive of the correct final answer. Experiments on GSM8K, MATH, and CommonsenseQA show that FACT-E improves reasoning-trajectory selection and yields stronger in-context learning exemplars. FACT-E also reliably detects flawed reasoning under noisy conditions, providing a robust metric for trustworthy LLM reasoning.


Authors (6)

Summary

The paper introduces FACT-E as a causality-inspired framework that uses structural causal models to assess both chain-of-thought fidelity and answer consistency.
It employs contrastive comparisons with perturbed reasoning chains to isolate genuine logical dependencies and mitigate inherent self-assessment biases.
Empirical evaluations on benchmarks like GSM8K and MATH-500 show FACT-E reliably improves reasoning accuracy while effectively filtering noisy exemplars.


Motivation and Problem Setting

Chain-of-Thought (CoT) prompting has substantially advanced LLM reasoning by eliciting explicit intermediate rationales. However, a persistent challenge is that LLMs generate stepwise reasoning chains that, while fluent, often harbor intra-chain logical failures or unnecessary steps that do not support subsequent inference. Conventional self-assessment methods, which typically apply the LLM as a judge (e.g., self-reflect), are susceptible to internal biases, including self-affirmation and reliance on statistical shortcuts, leading to unreliable faithfulness evaluations. These methods tend to favor coherence and final-answer correctness without robust scrutiny of intermediate logical dependencies.

Figure 1: Illustration of LLM self-assessment limitations: Conventional methods assign similarly high scores to both reasoning chains, missing logical breakdowns, whereas FACT-E accurately identifies the faithful chain.

Methodology: Causal Framework for CoT Evaluation

The FACT-E (Faithfulness and Consistency Tandem Estimation) framework proposes a causality-driven approach to CoT quality estimation. The process leverages Structural Causal Models (SCMs) to disentangle genuine step-to-step dependency from artifacts induced by LLM biases. Specifically:

CoT-to-Answer Consistency

CoT-to-Answer Consistency quantifies whether a reasoning chain leads to the correct outcome, modeled as the causal path $Q \rightarrow \mathbf{S} \rightarrow A$ , with $\mathbf{S}$ mediating the effect of $Q$ on final answer $A$ . The consistency score is computed via multiple LLM judgment samples, reflecting the probability that $\mathbf{S}$ is sufficient for $A$ .
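A minimal sketch of how such a sampled consistency score could be computed, assuming each judgment is a binary verdict and the score is simply the fraction of positive verdicts. The `judge` callable and the function name `consistency_score` are hypothetical stand-ins for an LLM call; the paper's actual prompt and aggregation rule are not given in this summary.

```python
from typing import Callable, Sequence


def consistency_score(
    question: str,
    steps: Sequence[str],
    answer: str,
    judge: Callable[[str, Sequence[str], str], bool],
    n_samples: int = 3,
) -> float:
    """Fraction of sampled judgments saying the chain is sufficient for the answer.

    `judge` is a hypothetical wrapper around one LLM judgment call that
    returns True when it deems the steps sufficient to reach `answer`.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Each call is one independent judgment sample.
    votes = [judge(question, steps, answer) for _ in range(n_samples)]
    return sum(votes) / n_samples
```

Under this assumption, a score of 0 means no sampled judgment found the chain sufficient for the answer, which matches the case the framework discards before scoring faithfulness.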

Intra-Chain Faithfulness

Intra-Chain Faithfulness measures causal integrity between successive CoT segments, partitioning a chain at all intermediate steps and evaluating whether each prefix logically supports its suffix. Conventional LLM self-assessment is confounded by the LLM’s internal bias $Z$ , as formalized in the causal graph.

Figure 2: Structural causal graphs for CoT quality estimation. (a) The mediated reasoning pathway; (b) self-assessment with confounding bias; (c) FACT-E introduces instrumental noise to mitigate bias.

To overcome these confounding effects, FACT-E introduces external perturbations (noise) as an instrumental variable to create counterfactual continuations for each split. By contrastively comparing model preference for original versus perturbed chains (i.e., measuring the Average Causal Effect), the method isolates the genuine logical dependency. This contrastive design ensures stylistic parity and eliminates shortcut reliance, as both chains are self-generated and differ only in the injected logical corruption.
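The contrastive comparison at a single split point can be sketched as follows. This is an illustration under stated assumptions, not the paper's estimator: the effect is approximated as the difference between how often a judge prefers the original continuation and how often it prefers the perturbed one, clipped at zero so it can serve as a score. The `prefers_first` callable (a stand-in for an LLM preference query), the random order swap against position bias, and the clipping are all assumptions introduced here.

```python
import random
from typing import Callable, Optional, Sequence


def split_faithfulness(
    prefix: Sequence[str],
    original: Sequence[str],
    perturbed: Sequence[str],
    prefers_first: Callable[[Sequence[str], Sequence[str], Sequence[str]], bool],
    n_trials: int = 3,
    rng: Optional[random.Random] = None,
) -> float:
    """Contrastive preference score for one split of a reasoning chain.

    `prefers_first(prefix, a, b)` is a hypothetical judge that returns True
    when continuation `a` follows from `prefix` better than continuation `b`.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    rng = rng or random.Random(0)
    wins = 0
    for _ in range(n_trials):
        # Randomize presentation order (an assumption, to limit position bias).
        swap = rng.random() < 0.5
        first, second = (perturbed, original) if swap else (original, perturbed)
        chose_first = prefers_first(prefix, first, second)
        # The original was chosen if the judge picked the slot that held it.
        wins += chose_first != swap
    p_original = wins / n_trials
    # Preference-rate difference, P(original) - P(perturbed), clipped at zero.
    return max(0.0, 2.0 * p_original - 1.0)
```

A judge that cannot tell the two continuations apart lands near zero, while a judge that consistently picks the uncorrupted continuation approaches 1.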

Algorithmic Implementation

FACT-E operates in two modes: standard (all steps checked) and lightweight (random checkpoints), striking a balance between computational overhead and evaluation granularity. For each CoT candidate, FACT-E first computes answer consistency; candidates with a zero score are discarded. For the remaining candidates, intra-chain faithfulness is measured at multiple split points via contrastive comparisons with counterfactual, perturbed continuations. The final reliability score $\mathcal{R}_{\mathbf{S}} = \mathcal{F}_{\mathbf{S}} \times \mathcal{C}_{\mathbf{S}}$ integrates both dimensions: only reasoning chains exhibiting both faithful intermediate dependencies and outcome-supporting logic are selected as trustworthy.
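The filter-then-score procedure described above can be sketched as below. The `Candidate` type, the `select_most_reliable` name, and the choice to return the single highest-scoring chain are assumptions made for illustration; how per-split faithfulness values are aggregated into $\mathcal{F}_{\mathbf{S}}$ is not stated in this summary, so it is left to a caller-supplied `faithfulness_fn`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    steps: List[str]
    answer: str
    consistency: float  # C_S, assumed to lie in [0, 1]


def select_most_reliable(
    candidates: List[Candidate],
    faithfulness_fn: Callable[[Candidate], float],
) -> Optional[Tuple[Candidate, float]]:
    """Return (candidate, R_S) with the highest R_S = F_S * C_S, or None."""
    best: Optional[Tuple[Candidate, float]] = None
    for cand in candidates:
        # Zero-consistency chains are discarded before the costlier
        # faithfulness check is run.
        if cand.consistency == 0:
            continue
        score = faithfulness_fn(cand) * cand.consistency
        if best is None or score > best[1]:
            best = (cand, score)
    return best
```

Because the score is a product, a chain with high consistency but weak intermediate dependencies can rank below a chain that is moderately consistent and strongly faithful.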

Empirical Evaluation

FACT-E is evaluated across GSM8K, MATH-500, and CommonsenseQA, as well as noisy-rationale benchmarks including NoRa-Math and NoRa-Commonsense. The experimental setup assesses three aspects: answer accuracy improvement by selection, in-context learning (ICL) with curated exemplars, and noise detection under adversarial conditions.

Figure 3: FACT-E robustness on noisy-rationale benchmarks: Top accuracy across varying numbers of noisy demonstrations in context.

Empirical results demonstrate that FACT-E, in both its standard and lightweight modes, consistently outperforms all baselines, namely self-reflection (Reflect), iterative refinement (Polish), self-denoising (Denoise), and self-consistency (Consistency), across all LLM backbones and benchmarks. Notably, FACT-E yields absolute gains of up to 5.34% on MATH-500 (ChatGPT) and exhibits high stability across models, suggesting insensitivity to backbone scale and architecture. The method also reliably identifies and filters noisy exemplars, preserving performance even with substantial rationale corruption.

Figure 4: FACT-E performance as a function of the number of sampling trials ( $N$ ); accuracy saturates after three iterations, indicating efficient convergence.

Case studies further substantiate FACT-E’s discriminative power: flawed reasoning chains (e.g., incorrect trigonometric application) receive low scores even when superficially fluent, while rigorous chains are identified and scored near 1.0, demonstrating granularity beyond mere answer correctness.

Ablation and Scalability Analysis

Ablation analysis confirms that joint use of intra-chain faithfulness and answer consistency is essential; individually, each dimension is insufficient for reliable selection. The lightweight variant maintains competitive performance with significantly fewer inference requests, since its cost scales linearly in the number of checkpoints $N$ rather than in chain length.
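To illustrate why the lightweight mode decouples cost from chain length, here is a small sketch of split-point selection. It assumes split points are the positions between consecutive steps and that the lightweight mode draws a uniform random subset of them; the function name `split_points` and the `max_checkpoints` parameter are hypothetical.

```python
import random
from typing import List, Optional


def split_points(
    num_steps: int,
    max_checkpoints: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Indices k at which a chain is split into steps[:k] and steps[k:].

    Standard mode (max_checkpoints=None) returns every intermediate split.
    Lightweight mode returns at most `max_checkpoints` randomly chosen splits.
    """
    if max_checkpoints is not None and max_checkpoints < 0:
        raise ValueError("max_checkpoints must be non-negative")
    all_points = list(range(1, num_steps))
    if max_checkpoints is None or max_checkpoints >= len(all_points):
        return all_points
    rng = rng or random.Random(0)
    # Sample without replacement, then sort so splits are visited in order.
    return sorted(rng.sample(all_points, max_checkpoints))
```

With this scheme a 50-step chain checked at three checkpoints needs the same number of contrastive comparisons as a 5-step chain checked at three checkpoints, whereas the standard mode would need 49 and 4 respectively.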

Figure 5: Ablation study reinforces necessity of both faithfulness and consistency modules in FACT-E.

Analysis across difficulty levels in MATH-500 reveals that FACT-E is robust against "performance cliffs," sustaining accuracy as logical complexity increases, unlike baseline methods that degrade sharply.

Figure 6: Accuracy decay across MATH-500 difficulty levels for FACT-E compared to baselines.

Practical and Theoretical Implications

Bias-mitigation: FACT-E rigorously addresses self-affirmation and shortcut bias via explicit causal intervention, circumventing the closed-loop pitfalls of black-box self-assessment.
Granular Quality Control: The framework enables fine-grained evaluation of reasoning traces, crucial for high-stakes applications where explanation reliability is paramount (e.g., mathematical and scientific domains).
Enhanced Prompt Curation: FACT-E-selected exemplars demonstrably improve in-context learning, supporting scalable transfer to new queries and enhancing few-shot performance.
Noise Robustness: FACT-E's stress-test approach provides principled denoising for prompting, useful in scenarios with uncertain, adversarial, or mixed-quality exemplars.

Future Directions

Future work may extend FACT-E to compositional reasoning contexts (e.g., Graph-of-Thought, multi-turn debates [Besta et al., 2024]), integrate cross-model verification for inter-model bias isolation [Xiong et al., 2023], and optimize perturbation design for higher-order causal relations. Scalable variants can be explored for integrating FACT-E into real-time LLM reasoning pipelines.

Conclusion

FACT-E provides a principled framework for trustworthy CoT evaluation by unifying causal faithfulness assessment with outcome consistency. By systematically injecting controlled perturbations and quantifying the causal effect, FACT-E achieves fine-grained, bias-resilient selection of reasoning chains suitable for downstream use. The approach is shown to be robust, scalable, and effective across diverse models and reasoning tasks, establishing a new standard for CoT quality control in LLMs (2604.10693).
