- The paper introduces CRAFT, a unified framework that integrates factual and counterfactual reasoning to recast tabular QA and fact verification as a statement verification task.
- It employs a four-stage pipeline (Rewriter, Reverser, Extractor, Rethinker) that aggregates evidence from both reasoning paths and improves accuracy over single-direction baselines.
- The framework improves accuracy on the WikiTQ and TabFact benchmarks, corrects roughly 59% of initially wrong candidates, and holds its gains across different backbone models.
CRAFT: Counterfactual Reasoning for Tabular QA and Fact Verification
Introduction and Motivation
The paper presents CRAFT, a unified framework for tabular question answering (QA) and fact verification (FV) that integrates explicit counterfactual reasoning with LLMs (2606.06842). Traditional LLM-based table reasoning systems rely mainly on single-direction inference, which tends to limit robustness, leave alternative hypotheses unexplored, and yield diminishing gains as self-critique iterations increase. CRAFT is motivated by the hypothesis that explicit bidirectional reasoning, along both a factual and a counterfactual path, addresses these weaknesses better than ensembling or repeated unidirectional self-critique, because it forces exploration of alternative scenarios and aggregates evidence from both directions.
Framework Overview
CRAFT recasts both tabular QA and FV as a statement verification task and applies a coordinated, four-stage reasoning pipeline:
- Rewriter: Reformulates the input question into a declarative, verifiable hypothesis, bridging the representational gap between QA and FV tasks and facilitating uniform downstream processing.
- Reverser: Synthesizes a counterfactual statement by semantically inverting aspects of the hypothesis using rule-based reverse templates. Candidate counterfactuals are scored using proxy reasoning structures (e.g., SQL programs) to maximize information coverage.
- Extractor: Executes reasoning traces for both the original and counterfactual statements to extract path-specific supporting evidence and candidate answers, evaluating the logical consistency of each evidence item in the table context.
- Rethinker: Aggregates evidence and candidate answers from both reasoning paths, applies a weighted decision protocol with self-consistency scoring, and handles discordant cases via cross-evidence checks and fallback strategies.
This framework is explicitly model-agnostic: all components operate without modifying backbone model parameters and can integrate both open-source and proprietary LLMs.
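The four stages can be sketched as a small Python pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation. Every name in it (`LLM`, `PathResult`, `REVERSE_TEMPLATES`, `run_craft`, `stub_llm`, the prompt prefixes, and the `evidence => answer` reply format) is hypothetical. The Reverser uses a toy antonym table in place of the paper's reverse templates and omits the SQL-based scoring of candidate counterfactuals. The Rethinker is reduced to an agreement check with an evidence-count fallback instead of the weighted protocol described above. The backbone is passed in as a plain callable, which mirrors the model-agnostic design.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

Table = List[Dict[str, str]]
LLM = Callable[[str], str]  # assumption: any prompt -> completion function

# Assumption: a toy stand-in for the paper's rule-based reverse templates.
REVERSE_TEMPLATES = [("most", "fewest"), ("highest", "lowest"), ("before", "after")]


@dataclass
class PathResult:
    statement: str       # statement this path reasoned about
    evidence: List[str]  # evidence items extracted from the table
    candidate: str       # candidate answer proposed by this path


def rewrite(llm: LLM, question: str) -> str:
    """Rewriter: turn the question into a declarative, verifiable hypothesis."""
    return llm(f"REWRITE: {question}")


def reverse(hypothesis: str) -> str:
    """Reverser: invert one aspect of the hypothesis with the first matching template."""
    for left, right in REVERSE_TEMPLATES:
        for src, dst in ((left, right), (right, left)):
            pattern = rf"\b{re.escape(src)}\b"
            if re.search(pattern, hypothesis):
                return re.sub(pattern, dst, hypothesis, count=1)
    # Fallback when no template matches: plain negation.
    return "It is not the case that: " + hypothesis


def extract(llm: LLM, statement: str, table: Table) -> PathResult:
    """Extractor: parse a reply of the assumed form 'ev1; ev2 => answer'."""
    raw = llm(f"EXTRACT: {statement}\nTABLE: {table}")
    evidence_text, _, candidate = raw.rpartition("=>")
    evidence = [e.strip() for e in evidence_text.split(";") if e.strip()]
    return PathResult(statement, evidence, candidate.strip())


def rethink(factual: PathResult, counterfactual: PathResult) -> str:
    """Rethinker (simplified): accept agreement, else prefer the better-supported path."""
    if factual.candidate == counterfactual.candidate:
        return factual.candidate
    # max() returns the first maximal item, so ties fall back to the factual path.
    return max((factual, counterfactual), key=lambda p: len(p.evidence)).candidate


def run_craft(llm: LLM, question: str, table: Table) -> str:
    hypothesis = rewrite(llm, question)
    counterfactual = reverse(hypothesis)
    return rethink(extract(llm, hypothesis, table), extract(llm, counterfactual, table))


def stub_llm(prompt: str) -> str:
    """Canned replies so the sketch runs without a real model."""
    if prompt.startswith("REWRITE:"):
        return "The team with the most wins is the answer."
    if "fewest" in prompt:
        return "Tigers has the fewest wins with 7 => Lions"
    return "Lions has 12 wins; Tigers has 7 wins => Lions"


table = [{"team": "Lions", "wins": "12"}, {"team": "Tigers", "wins": "7"}]
print(run_craft(stub_llm, "Which team has the most wins?", table))
```

Swapping `stub_llm` for a real client only requires a function that maps a prompt string to a reply string; none of the stages touch model parameters.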
Experimental Results
Datasets and Models
CRAFT is systematically evaluated on two prominent benchmarks:
- TabFact: Table-based binary fact verification.
- WikiTQ: Table-based open-ended QA with structured short answers.
Experiments use several LLM backbones, including Llama3.3-70B, Qwen2.5-72B, Deepseek-R1-14B, GPT-5-mini, and Qwen3.5-27B, and compare CRAFT against baselines such as Chain-of-Table and Table-Critic.
CRAFT achieves strong, cross-model improvements:
| Method | WikiTQ (Accuracy %) | TabFact (Accuracy %) |
| --- | --- | --- |
| Strongest baseline (avg) | 77.7 | 93.5 |
| CRAFT (avg) | 82.4 | 94.6 |
- With Llama3.3-70B, CRAFT gains 11.3 points on WikiTQ over the next-best baseline.
- CRAFT reverses the usual backbone performance ranking: Llama3.3 exceeds Qwen2.5 when augmented by CRAFT, underscoring the generality and debiasing effect of counterfactual reasoning.
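To put the averaged accuracies in perspective, the short calculation below derives the absolute gain and the share of baseline errors removed from the four reported averages. It is plain arithmetic on those numbers, not an additional result from the paper. The helper name `gains` is hypothetical, and because the inputs are averages over backbones, the error-reduction figures are only indicative.

```python
# Averaged accuracies (%) as reported for the strongest baseline and for CRAFT.
baseline = {"WikiTQ": 77.7, "TabFact": 93.5}
craft = {"WikiTQ": 82.4, "TabFact": 94.6}


def gains(base_acc: float, new_acc: float) -> tuple:
    """Return (absolute gain in points, % of baseline errors removed), rounded to 0.1."""
    absolute = new_acc - base_acc
    relative_error_reduction = absolute / (100.0 - base_acc)
    return round(absolute, 1), round(100.0 * relative_error_reduction, 1)


for name in baseline:
    abs_gain, err_red = gains(baseline[name], craft[name])
    print(f"{name}: +{abs_gain} points, {err_red}% of baseline errors removed")
# WikiTQ: +4.7 points, 21.1% of baseline errors removed
# TabFact: +1.1 points, 16.9% of baseline errors removed
```

The TabFact gain looks small in absolute terms because the baseline is already near ceiling; measured against the remaining errors it is of the same order as the WikiTQ gain.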
Ablation and Analysis
- Rewriter-only: Substantial improvement in QA tasks, confirming the advantage of recasting questions as declarative statements.
- Reverser-only: Performance is close to the Rewriter-only setting, showing that counterfactual statements can recover evidence on their own; combining both paths yields the highest accuracy.
- Control for repeated sampling/ensemble: Voting and self-consistency baselines with N=3 samples do not match CRAFT, and Pass@K does not close the gap even with larger K. The gains are attributed to semantic diversity in reasoning paths rather than brute-force candidate generation.
- Multiple counterfactuals: Modest but consistent gains with more counterfactual paths, with diminishing returns and increased computational cost.
- Self-critique compatibility: Adding self-critique improves results further. Single-direction methods plateau as iterations increase, while bidirectional reasoning stays ahead of them even when they are given more iterations.
- Robustness to table size: Degradation in large-table scenarios is flatter for CRAFT, demonstrating strong scalability and context resilience compared to all baselines.
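The repeated-sampling control compares two quantities that are easy to conflate, so a small sketch may help. It assumes the simplest definitions: majority voting over N sampled answers, and Pass@K as an oracle check of whether any of the first K samples is correct. The source does not spell out the paper's exact protocols, so `majority_vote`, `pass_at_k`, and the toy samples are illustrative only.

```python
from collections import Counter
from typing import List


def majority_vote(samples: List[str]) -> str:
    """Self-consistency style aggregation: return the most frequent answer."""
    return Counter(samples).most_common(1)[0][0]


def pass_at_k(samples: List[str], gold: str, k: int) -> bool:
    """Oracle metric: True if any of the first k samples equals the gold answer."""
    return gold in samples[:k]


# Toy case: three samples from one reasoning direction that share the same mistake.
samples = ["7", "7", "9"]
gold = "9"

print(majority_vote(samples))         # 7
print(pass_at_k(samples, gold, k=2))  # False
print(pass_at_k(samples, gold, k=3))  # True
```

Voting cannot recover an answer that most samples miss, and Pass@K only says the right answer appeared somewhere without selecting it. Under these definitions, resampling the same direction adds candidates but not necessarily new evidence.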
Error-Correction and Step-0 Analysis
- CRAFT corrects approximately 59% of initially incorrect candidates, with over 28% of final correct predictions recovered solely via evidence integration across the factual and counterfactual paths.
- The Rethinker module, combining structural and model-based constraints, achieves higher accuracy than prompt-only decision strategies.
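A correction rate of this kind can be computed as sketched below, assuming "initially incorrect" means the first candidate differs from the gold answer and "corrected" means the final prediction matches it. The `Record` type, the `correction_rate` helper, and the four toy records are hypothetical and do not come from the paper's data.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    initial: str  # first candidate answer, before evidence integration
    final: str    # prediction after the full pipeline
    gold: str     # reference answer


def correction_rate(records: List[Record]) -> float:
    """Fraction of initially wrong candidates whose final prediction is correct."""
    wrong = [r for r in records if r.initial != r.gold]
    if not wrong:
        return 0.0
    return sum(r.final == r.gold for r in wrong) / len(wrong)


# Toy data: one case right from the start, three initially wrong, two of those fixed.
records = [
    Record(initial="A", final="A", gold="A"),
    Record(initial="B", final="C", gold="C"),
    Record(initial="D", final="D", gold="E"),
    Record(initial="F", final="G", gold="G"),
]
print(correction_rate(records))
```

Cases that were already correct are excluded from the denominator, so the rate measures recovery only and says nothing about correct answers that a later stage might overturn.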
Practical and Theoretical Implications
By explicitly constructing and reasoning over both factual and counterfactual statements, CRAFT:
- Unifies QA and FV under a general statement-centric, bidirectional reasoning protocol, simplifying multi-task and multi-format systems.
- Provides model-agnostic robustness: performance gains hold over open-source/closed-source, high/low-parameter, and instruction-tuned or vanilla architectures.
- Mitigates LLM-specific variance by focusing on reasoning path diversity rather than model idiosyncrasies or parameter tuning.
- Enhances error correction and resilience via systematic exploration of alternative hypotheses, a key desideratum for trustworthy AI.
- Suggests that inference directionality, not just answer diversity, drives robust reasoning, which could change how prompt-based reasoning is designed for structured domains.
The explicit introduction of a counterfactual pathway not only covers more semantic space but also surfaces evidence that conventional, forward-only protocols would not reach.
Future Directions
CRAFT's modular counterfactual design has implications for broader AI reasoning tasks beyond tabular data:
- Extending to non-tabular multi-hop QA, text-based verification, and multi-modal reasoning, though challenges remain in semantic definition and atomic fact extraction.
- Scaling to lower-parameter models: future work may optimize evidence synthesis for models with limited contextual memory.
- Enhanced path selection among multiple counterfactuals, automated template induction, and more sophisticated evidence weighting could further boost both efficiency and coverage.
- The framework encourages integration of causal and hypothetical reasoning paradigms for more reliable, auditable, and efficient AI systems.
Conclusion
CRAFT introduces a principled, scalable solution for table understanding, leveraging explicit bidirectional (factual and counterfactual) reasoning. Empirical results highlight both substantial performance improvements and a marked reduction in model-specific biases. The framework provides a platform for further advances in robust, unified reasoning across structured and potentially unstructured domains, advancing the principled application of counterfactual analysis in LLM-driven AI systems.