CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

Published 5 Jun 2026 in cs.CL | (2606.06842v1)

Abstract: Table reasoning remains challenging for LLMs, particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates tabular question answering and fact verification into a general bidirectional verification process. Our method explicitly constructs both declarative statements and their counterfactual variants. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering. Our framework also significantly mitigates performance gaps between different backbone LLMs. This indicates that counterfactual reasoning effectively overcomes the limitations of single-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks. Our code will be made publicly available upon acceptance.


Authors (10)

Summary

The paper introduces CRAFT, a unified framework that integrates factual and counterfactual reasoning to recast tabular QA and fact verification as a statement verification task.
It employs a four-stage pipeline (Rewriter, Reverser, Extractor, Rethinker) that aggregates evidence from both reasoning paths and improves accuracy over single-direction baselines.
The framework improves accuracy on WikiTQ and TabFact, corrects approximately 59% of initially wrong candidates, and holds its gains across backbone models.

CRAFT: Counterfactual Reasoning for Tabular QA and Fact Verification

Introduction and Motivation

The paper presents CRAFT, a unified framework for tabular question answering (QA) and fact verification (FV) that integrates explicit counterfactual reasoning with LLMs (2606.06842). Traditional LLM-based table reasoning systems rely mainly on single-direction inference, which tends to reduce robustness, leave alternative hypotheses unexplored, and yield diminishing gains as self-critique iterations are added. CRAFT is motivated by the hypothesis that explicit bidirectional reasoning, along both a factual and a counterfactual path, addresses these weaknesses better than ensembling or repeated unidirectional self-critique, because it forces exploration of alternative scenarios and aggregates evidence from both directions.

Framework Overview

CRAFT recasts both tabular QA and FV as a single statement verification task and applies a coordinated, four-stage reasoning pipeline:

Rewriter: Reformulates the input question into a declarative, verifiable hypothesis, bridging the representational gap between QA and FV tasks and facilitating uniform downstream processing.
Reverser: Synthesizes a counterfactual statement by semantically inverting aspects of the hypothesis using rule-based reverse templates. Candidate counterfactuals are scored using proxy reasoning structures (e.g., SQL programs) to maximize information coverage.
Extractor: Executes reasoning traces for both the original and counterfactual statements to extract path-specific supporting evidence and candidate answers, evaluating the logical consistency of each evidence item in the table context.
Rethinker: Aggregates evidence and candidate answers from both reasoning paths, applies a weighted decision protocol with self-consistency scoring, and handles discordant cases via cross-evidence checks and fallback strategies.
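The four stages above can be illustrated with a deliberately simplified sketch. It assumes a structured `Claim` in place of natural-language statements and replaces every LLM call with a deterministic rule, so it shows only the control flow (rewrite, reverse, extract along both paths, then reconcile), not the paper's implementation. All names here (`Claim`, `REVERSE_OP`, `rewrite`, `reverse`, `extract`, `rethink`, `verify`) are hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, List

Table = List[Dict[str, Any]]

# Comparison operators used to check a claim against a table cell.
OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
       "<=": operator.le, "<": operator.lt, ">=": operator.ge}

# Assumed toy "reverse templates": each operator maps to its logical negation.
REVERSE_OP = {"==": "!=", "!=": "==", ">": "<=", "<=": ">", "<": ">=", ">=": "<"}


@dataclass(frozen=True)
class Claim:
    row: str      # matched against the table's "name" column
    column: str   # column whose cell is compared
    op: str       # one of the keys of OPS
    value: Any    # value the cell is compared with


def rewrite(row: str, column: str, candidate: Any) -> Claim:
    """Rewriter (toy): turn a question plus candidate answer into a declarative claim."""
    return Claim(row, column, "==", candidate)


def reverse(claim: Claim) -> Claim:
    """Reverser (toy): build the counterfactual claim by negating the comparison."""
    return Claim(claim.row, claim.column, REVERSE_OP[claim.op], claim.value)


def extract(table: Table, claim: Claim) -> bool:
    """Extractor (toy): look up the cell and report whether it supports the claim."""
    record = next(r for r in table if r["name"] == claim.row)
    return OPS[claim.op](record[claim.column], claim.value)


def rethink(factual: bool, counterfactual: bool) -> str:
    """Rethinker (toy): reconcile the verdicts of the two paths."""
    if factual and not counterfactual:
        return "supported"
    if counterfactual and not factual:
        return "refuted"
    return "undecided"  # discordant paths; a real system would apply fallback checks


def verify(table: Table, claim: Claim) -> str:
    return rethink(extract(table, claim), extract(table, reverse(claim)))


medals = [{"name": "Canada", "gold": 4}, {"name": "Norway", "gold": 2}]
# QA: "How many golds did Canada win?" with candidate answer 4.
print(verify(medals, rewrite("Canada", "gold", 4)))
# FV: "Norway won more than 3 golds."
print(verify(medals, Claim("Norway", "gold", ">", 3)))
```

Because the toy Extractor is deterministic and the reversal is an exact negation, the two paths never disagree here. With an LLM in the loop they can, which is the case the `"undecided"` branch stands in for.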

This framework is explicitly model-agnostic: all components operate without modifying backbone model parameters and can integrate both open-source and proprietary LLMs.
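One way to picture the model-agnostic claim is that each stage only needs a text-in, text-out interface. The sketch below is an assumption about how such an interface could look, not the paper's code: `Backbone`, `StubBackbone`, and `verify_statement` are hypothetical names, and the prompt wording is invented for illustration.

```python
from typing import Protocol


class Backbone(Protocol):
    """Anything that maps a prompt to a completion; no access to model weights is needed."""

    def generate(self, prompt: str) -> str: ...


class StubBackbone:
    """Toy stand-in for an LLM client that always returns a fixed reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def generate(self, prompt: str) -> str:
        return self.reply


def verify_statement(backbone: Backbone, statement: str, table_text: str) -> bool:
    """Ask the backbone whether the table supports the statement and parse yes/no."""
    prompt = (
        "Table:\n" + table_text + "\n\n"
        "Statement: " + statement + "\n"
        "Does the table support the statement? Answer yes or no."
    )
    return backbone.generate(prompt).strip().lower().startswith("yes")


print(verify_statement(StubBackbone("Yes."), "Canada won 4 golds.", "name,gold\nCanada,4"))
```

Swapping an open-source model for a proprietary one then amounts to supplying a different object with a `generate` method.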

Experimental Results

Datasets and Models

CRAFT is systematically evaluated on two prominent benchmarks:

TabFact: Table-based binary fact verification.
WikiTQ: Table-based open-ended QA with structured short answers.

Multiple LLM backbones are employed, including Llama3.3-70B, Qwen2.5-72B, Deepseek-R1-14B, GPT-5-mini, and Qwen3.5-27B. CRAFT is compared against standard and strong baselines such as Chain-of-Table and Table-Critic.

Performance Summary

CRAFT achieves strong, cross-model improvements:

Method	WikiTQ (Accuracy %)	TabFact (Accuracy %)
Strongest baseline (avg)	77.7	93.5
CRAFT (avg)	82.4	94.6

Llama3.3-70B gains 11.3 points on WikiTQ over the next-best baseline.
CRAFT reverses the usual backbone performance ranking: Llama3.3 exceeds Qwen2.5 when augmented by CRAFT, underscoring the generality and debiasing effect of counterfactual reasoning.
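The averaged accuracies in the table can be converted into absolute gains and relative error reductions with a few lines of arithmetic. The relative figure is a derived quantity, not one reported in the summary, and it is computed on averaged accuracies, so it should be read as a rough indication only. The function name `gains` is hypothetical.

```python
from typing import Tuple


def gains(baseline_acc: float, craft_acc: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative error reduction in percent)."""
    absolute = craft_acc - baseline_acc
    baseline_error = 100.0 - baseline_acc
    relative = 100.0 * absolute / baseline_error
    return round(absolute, 1), round(relative, 1)


# Averaged accuracies taken from the table above.
for dataset, (baseline, craft) in {"WikiTQ": (77.7, 82.4), "TabFact": (93.5, 94.6)}.items():
    absolute, relative = gains(baseline, craft)
    print(f"{dataset}: +{absolute} points, {relative}% of baseline errors removed")
```

This gives 4.7 points on WikiTQ and 1.1 points on TabFact, or roughly 21% and 17% of the baseline's remaining errors respectively, which puts the smaller TabFact gain in context given its higher starting accuracy.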

Ablation and Analysis

Rewriter-only: Substantial improvement in QA tasks, confirming the advantage of recasting questions as declarative statements.
Reverser-only: Performance is close to Rewriter, showing that counterfactual statements can independently recover evidence, but combining both yields the highest accuracy.
Control for repeated sampling/ensemble: Voting and self-consistency baselines with $N=3$ samples cannot match CRAFT, and Pass@K, even with larger $K$, does not bridge the gap. Performance gains are attributed to semantic diversity in reasoning paths, not brute-force candidate generation.
Multiple counterfactuals: Modest but consistent gains with more counterfactual paths, with diminishing returns and increased computational cost.
Self-critique compatibility: Additional self-critique further improves results, but performance plateaus for single-direction methods; bidirectional reasoning still outperforms them even when they are given more iterations.
Robustness to table size: Degradation in large-table scenarios is flatter for CRAFT, demonstrating strong scalability and context resilience compared to all baselines.
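For reference, the repeated-sampling controls mentioned in the ablation can be sketched as follows. This is a generic illustration of majority voting and of Pass@K read as "any of the first K samples is correct"; the summary does not specify the exact definitions used in the paper, so both functions are assumptions, and the sample answers are made up.

```python
from collections import Counter
from typing import List


def majority_vote(samples: List[str]) -> str:
    """Self-consistency style baseline: return the most frequent sampled answer."""
    return Counter(samples).most_common(1)[0][0]


def pass_at_k(samples: List[str], gold: str, k: int) -> bool:
    """Oracle-style check: is the gold answer among the first k samples?"""
    return gold in samples[:k]


# Hypothetical answers from N=3 samples of the same single-direction prompt.
samples = ["12", "15", "12"]
gold = "15"

print(majority_vote(samples) == gold)   # voting follows the majority, right or wrong
print(pass_at_k(samples, gold, k=3))    # pass@3 only asks whether any sample is right
```

Both baselines draw all samples from the same reasoning direction, which is the contrast the ablation is drawing with CRAFT's factual and counterfactual paths.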

Error-Correction and Step-0 Analysis

CRAFT corrects approximately 59% of initially incorrect candidates, with over 28% of final correct predictions recovered solely via evidence integration across the factual and counterfactual paths.
The Rethinker module, combining structural and model-based constraints, achieves higher accuracy than prompt-only decision strategies.
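A structural decision rule of the kind contrasted with prompt-only strategies might look like the sketch below. It is an assumption-laden illustration: the weights, the margin threshold, and the names `Vote` and `rethink` are all hypothetical, and the summary does not give the paper's actual weighting scheme or fallback procedure.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# A vote is (candidate answer, self-consistency score in [0, 1]).
Vote = Tuple[str, float]


def rethink(
    factual: List[Vote],
    counterfactual: List[Vote],
    w_factual: float = 0.6,         # assumed weight for the factual path
    w_counterfactual: float = 0.4,  # assumed weight for the counterfactual path
    min_margin: float = 0.1,        # assumed threshold for a confident decision
) -> Optional[str]:
    """Combine weighted scores from both paths; return None when they are too close."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in factual:
        totals[answer] += w_factual * score
    for answer, score in counterfactual:
        totals[answer] += w_counterfactual * score
    if not totals:
        return None
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return None  # discordant paths: defer to a fallback such as a cross-evidence check
    return ranked[0][0]


# Both paths agree on the same candidate.
print(rethink([("1998", 0.9)], [("1998", 0.7)]))
# The paths disagree with similar weighted scores, so the caller must fall back.
print(rethink([("1998", 0.5)], [("2001", 0.7)]))
```

Returning `None` rather than forcing a choice keeps the discordant case explicit, so a fallback strategy can be plugged in by the caller.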

Practical and Theoretical Implications

By explicitly constructing and reasoning over both factual and counterfactual statements, CRAFT:

Unifies QA and FV under a general statement-centric, bidirectional reasoning protocol, simplifying multi-task and multi-format systems.
Provides model-agnostic robustness: performance gains hold over open-source/closed-source, high/low-parameter, and instruction-tuned or vanilla architectures.
Mitigates LLM-specific variance by focusing on reasoning path diversity rather than model idiosyncrasies or parameter tuning.
Enhances error-correction/resilience via systematic exploration of alternative hypotheses, a key desideratum for trustworthy AI.
Suggests that inference directionality, not just answer diversity, drives robust reasoning, which has implications for how prompt-based reasoning is designed in structured domains.

The explicit introduction of a counterfactual pathway not only covers more semantic space but also surfaces evidence that conventional, forward-only protocols would not reach.

Future Directions

CRAFT's modular counterfactual design has implications for broader AI reasoning tasks beyond tabular data:

Extending to non-tabular multi-hop QA, text-based verification, and multi-modal reasoning, though challenges remain in semantic definition and atomic fact extraction.
Scaling to lower-parameter models: future work may optimize evidence synthesis for models with limited contextual memory.
Enhanced path selection among multiple counterfactuals, automated template induction, and more sophisticated evidence weighting could further boost both efficiency and coverage.
The framework encourages integration of causal and hypothetical reasoning paradigms for more reliable, auditable, and efficient AI systems.

Conclusion

CRAFT introduces a principled, scalable solution for table understanding, leveraging explicit bidirectional (factual and counterfactual) reasoning. Empirical results highlight both substantial performance improvements and a marked reduction in model-specific biases. The framework provides a platform for further advances in robust, unified reasoning across structured and potentially unstructured domains, advancing the principled application of counterfactual analysis in LLM-driven AI systems.
