DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Published 6 Mar 2026 in cs.AI | (2603.05912v1)

Abstract: Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.


Authors (6)

Summary

The paper introduces the Audit-then-Score (AtS) protocol, in which benchmark labels are revisable: agentic verifiers submit evidence when they disagree with a label, and an auditor adjudicates before models are scored.
Expert micro-gold accuracy rises from 60.8% for unassisted labeling to 90.9% under AtS auditing, and agent-only auditing is shown to be a viable route to scaling the protocol.
DeepFact-Eval outperforms snippet-based pipelines, reaching 83.4% accuracy on DeepFact-Bench with GPT-4.1 and 87.2% with GPT-5, and transfers well to external factuality datasets.


Motivation and Problem Context

LLMs augmented with search capabilities have attained significant proficiency at generating Deep Research Reports (DRRs), which require high-level synthesis and cross-document reasoning. Evaluating the factuality of such DRRs has emerged as a core bottleneck for practical deployment in scientific and engineering applications. Prevailing fact-checkers and evaluation benchmarks are constructed around short, factoid-style claims, and often fall short in handling long-form, multi-hop research synthesis. The reliability of these static, expert-annotated benchmarks has not been rigorously validated for cognitively intensive tasks, resulting in potential miscalibration of factuality metrics.

Failure Modes of Static Benchmarking

The standard paradigm treats human expert annotations as authoritative “gold labels” against which verifiers are scored. The authors demonstrate in a controlled PhD-level study that this assumption is brittle: even domain-matched experts, under realistic cognitive load, achieve only 60.8% accuracy on micro-gold “known answer” claims (i.e., adversarially designed supported and unsupported statements) embedded within DRR annotation tasks. This finding exposes fundamental flaws in both static benchmarking and the perceived infallibility of expert annotation for tasks requiring deep, context-rich reasoning. Expertise is highly fragmented; multi-expert adjudication is impractical at DRR scale due to scarcity, cost, and domain drift. This unreliability directly undermines the foundation of current factuality evaluation.
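The micro-gold measurement described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `MicroGoldItem` type, the label strings, and the rule that skipped claims count as incorrect are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MicroGoldItem:
    """A hidden known-answer claim (hypothetical structure)."""
    claim_id: str
    known_label: str  # "supported" or "unsupported"; hidden from the annotator


def micro_gold_accuracy(items: List[MicroGoldItem],
                        annotations: Dict[str, str]) -> float:
    """Fraction of micro-gold claims the annotator labeled correctly.

    Assumption: a claim the annotator did not label counts as incorrect.
    """
    if not items:
        raise ValueError("micro-gold set is empty")
    correct = sum(
        1 for item in items
        if annotations.get(item.claim_id) == item.known_label
    )
    return correct / len(items)


items = [
    MicroGoldItem("c1", "supported"),
    MicroGoldItem("c2", "unsupported"),
    MicroGoldItem("c3", "unsupported"),
    MicroGoldItem("c4", "supported"),
]
# The annotator gets c1 and c3 right, c2 wrong, and skips c4.
annotations = {"c1": "supported", "c2": "supported", "c3": "unsupported"}
print(micro_gold_accuracy(items, annotations))  # 0.5
```

Because the known answers are embedded among ordinary claims, the same function can be applied after every annotation or auditing round to track reliability over time.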

Audit-then-Score (AtS): Dynamic Benchmarking Protocol

To address the limits of static benchmarks, the paper introduces Evolving Benchmarking via Audit-then-Score (AtS). AtS operationalizes benchmarking as a co-evolving protocol: agentic verifiers (“Challengers”) challenge existing labels by submitting evidential proposals when a discrepancy is observed; a designated Auditor (human expert or trusted agent) adjudicates these proposals based on the quality and completeness of presented evidence. Successful challenges update the consensus ground truth, and all models are scored against this refined, versioned benchmark.

Figure 1: AtS workflow: Unlike static benchmarks, AtS dynamically updates ground truth by adjudicating disagreements between verifiers and current labels.

This architecture allows for longitudinal improvement in data quality, rationales, and decisional traceability. Empirically, human expert micro-gold accuracy rises monotonically to 90.9% across four AtS rounds, confirming that experts are far more reliable in the role of auditors—when allowed to reflect and adjudicate in the presence of model-generated evidence—than as one-shot labelers.
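The challenge, adjudicate, update, and score loop described in this section can be sketched as a single round. This is a simplified sketch under stated assumptions: the `Entry` and `Challenge` types, the function signatures, and the toy auditor that accepts any challenge carrying non-empty evidence are hypothetical and do not reproduce the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Entry:
    label: str
    rationale: str


@dataclass
class Challenge:
    claim_id: str
    proposed_label: str
    evidence: str


Verifier = Callable[[str], Tuple[str, str]]   # claim_id -> (label, evidence)
Auditor = Callable[[Entry, Challenge], bool]  # True means "accept revision"


def ats_round(benchmark: Dict[str, Entry],
              verifiers: Dict[str, Verifier],
              auditor: Auditor) -> Dict[str, float]:
    """Run one Audit-then-Score round; mutates `benchmark` in place."""
    predictions: Dict[str, Dict[str, str]] = {}
    challenges: List[Challenge] = []
    # 1. Each verifier labels every claim; a disagreement becomes a challenge.
    for name, verify in verifiers.items():
        predictions[name] = {}
        for claim_id, entry in benchmark.items():
            label, evidence = verify(claim_id)
            predictions[name][claim_id] = label
            if label != entry.label:
                challenges.append(Challenge(claim_id, label, evidence))
    # 2. Audit: accepted challenges revise the label and its rationale.
    for ch in challenges:
        entry = benchmark[ch.claim_id]
        if ch.proposed_label != entry.label and auditor(entry, ch):
            benchmark[ch.claim_id] = Entry(ch.proposed_label, ch.evidence)
    # 3. Score every verifier against the revised benchmark.
    return {
        name: sum(preds[c] == benchmark[c].label for c in benchmark) / len(benchmark)
        for name, preds in predictions.items()
    }


benchmark = {"c1": Entry("supported", "initial"), "c2": Entry("supported", "initial")}
verifiers = {
    "A": lambda cid: ("unsupported", "source X contradicts the claim")
    if cid == "c2" else ("supported", ""),
    "B": lambda cid: ("supported", ""),
}
# Toy auditor (assumption): accept any challenge that carries evidence.
scores = ats_round(benchmark, verifiers, lambda entry, ch: bool(ch.evidence.strip()))
print(scores)  # {'A': 1.0, 'B': 0.5}
```

The point of the ordering is that scoring happens only after accepted revisions land, so verifier A is not penalized for disagreeing with a label that the audit overturned.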

The DeepFact Suite: Benchmark and Evaluation Agent

The protocol is instantiated concretely as DeepFact-Bench and DeepFact-Eval. DeepFact-Bench is a continually versioned, claim-level factuality benchmark constructed from 944 DRR claims sampled from six high-complexity domains. Each entry includes the original context, current label, and a structured rationale. Its test split comprises 621 claims, including a substantial adversarial micro-gold subset to monitor label drift and annotation quality longitudinally.
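One way to picture a versioned entry with an auditable rationale is an append-only revision history. The sketch below is illustrative only: the field names (`claim`, `context`, `history`) and the `revise` method are assumptions, not the released DeepFact-Bench schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Revision:
    version: int
    label: str
    rationale: str


@dataclass
class BenchEntry:
    """Hypothetical claim-level record: context plus an append-only history."""
    claim: str
    context: str
    history: List[Revision] = field(default_factory=list)

    @property
    def current(self) -> Revision:
        # The most recent revision is the label models are scored against.
        if not self.history:
            raise LookupError("entry has no label yet")
        return self.history[-1]

    def revise(self, label: str, rationale: str) -> Revision:
        # Earlier revisions are never overwritten, so every change is traceable.
        revision = Revision(len(self.history) + 1, label, rationale)
        self.history.append(revision)
        return revision


entry = BenchEntry(claim="Claim text", context="Surrounding report paragraph")
entry.revise("supported", "initial expert label")
entry.revise("unsupported", "accepted challenge citing a contradicting source")
print(entry.current.version, entry.current.label)  # 2 unsupported
```

Keeping the full history rather than only the latest label is what makes it possible to report results against a specific benchmark version.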

DeepFact-Eval, a document-level agentic verifier, is developed with two regimes: an expert-level variant for maximum accuracy, and a “lite” variant that optimizes compute by grouping semantically similar claims for joint evidence retrieval (minimizing redundant queries). Unlike snippet-based pipelines (e.g., SAFE, FactCheck-GPT) or general deep-research agents, DeepFact-Eval interleaves breadth-oriented retrieval with depth-oriented, targeted evidence synthesis and cross-document reasoning.
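The grouping idea behind the lite variant can be sketched with a greedy pass over the claims. The source does not specify how similarity is measured, so this example substitutes token-level Jaccard overlap for semantic similarity; the `threshold` value and all function names are assumptions chosen for illustration.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a claim."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_claims(claims: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Greedily attach each claim to the first group whose seed claim is similar.

    Jaccard overlap is a stand-in (assumption) for a semantic similarity model.
    """
    groups: List[Tuple[Set[str], List[str]]] = []
    for claim in claims:
        claim_tokens = tokens(claim)
        for seed_tokens, members in groups:
            if jaccard(seed_tokens, claim_tokens) >= threshold:
                members.append(claim)
                break
        else:
            # No sufficiently similar group exists: start a new one.
            groups.append((claim_tokens, [claim]))
    return [members for _, members in groups]


claims = [
    "Perovskite solar cells exceed 25% efficiency",
    "Perovskite solar cells degrade under humidity",
    "CRISPR base editors reduce off-target edits",
]
for group in group_claims(claims):
    print(group)
```

Here the two perovskite claims share three of nine distinct tokens (overlap of about 0.33) and land in one group, so evidence would be retrieved once per group, in two passes rather than three.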

Figure 2: Side-by-side: traditional snippet-based fact-checker pipeline versus DeepFact-Eval’s multi-step retrieval and in-depth reasoning workflow.

Empirical Evaluation

AtS protocol refinement is validated with both human and agent auditors. Critical findings include:

Audited expert performance surpasses unaided expert annotation: Micro-gold accuracy rises from 60.8% (static) to 90.9% (AtS Round 3) as agents introduce stronger evidence. Experts often learn from correct agent suggestions, and rarely regress by blindly accepting incorrect agent verdicts.
Agent-only auditing is viable for future scalability: Agents, when used as auditors instead of humans, reliably consolidate improvements in the benchmark, and cross-auditing among agents outperforms the solo baselines of each participant in both weaker $\rightarrow$ stronger and stronger $\rightarrow$ weaker pairings.

Figure 3: Micro-gold accuracy evolution across AtS auditing rounds. Auditing produces monotonic gains in expert labeling reliability.

Figure 4: Auditing performance in the agent-only setting, showing that agent auditors reliably improve consensus regardless of the baseline strength of the agent being audited.

Ablations further show that auditing frequency (how many conflicts are actually adjudicated) modulates the speed of quality improvement: near-complete auditing rapidly saturates accuracy, while lower rates require more rounds for convergence. Strictness in revision gating provides only minor marginal benefit, with a potential trade-off in suppressing beneficial corrections.
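The qualitative effect of auditing frequency can be illustrated with a toy model. This is not taken from the paper: it assumes that a fixed fraction $r$ of the remaining incorrect labels is challenged and audited in each round, that the auditor accepts a correct revision with probability $p$, and that no correct label is ever flipped to an incorrect one. Writing $e_t$ for the fraction of incorrect labels after round $t$:

$$
e_{t+1} = (1 - r\,p)\,e_t
\quad\Longrightarrow\quad
e_t = (1 - r\,p)^{t}\,e_0 ,
\qquad
t_{\varepsilon} \;\ge\; \frac{\ln(\varepsilon / e_0)}{\ln(1 - r\,p)} .
$$

Under these assumptions, when $r\,p$ is close to 1 the error collapses within one or two rounds, whereas a small $r$ increases the number of rounds $t_{\varepsilon}$ needed to reach a target error $\varepsilon$. Stricter revision gating corresponds to lowering $p$, which in this simplified picture slows convergence; the model deliberately ignores the harmful revisions that gating is meant to block.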

Comparative Results and Generalization

DeepFact-Eval (GPT-4.1) attains 83.4% accuracy on DeepFact-Bench, a 27.5-point margin over snippet-based pipelines (e.g., SAFE, FactCheck-GPT) and a 14.3-point margin over GPT-Researcher, the strongest prior deep-research agent baseline. The “lite” variant maintains strong accuracy (76.4–77.9%) while reducing cost by 3–5$\times$.

Upgrading to GPT-5 as the backbone provides additional gains (87.2%). Notably, DeepFact-Eval exhibits robust cross-benchmark transfer: on SciFact and Factcheck-Bench, many apparent disagreements reflect original annotation divergence, claim ambiguity, or non-verifiable statements rather than systematic model error. After controlled re-audits of disagreement cases, recalibrated accuracy rises to 94.7% (SciFact) and 93.0% (Factcheck-Bench).

Figure 5: Audit of DeepFact-Eval disagreement cases on external datasets, showing that most residual errors stem from annotation discrepancies rather than true model failures.

Practical Implications, Limitations, and Future Directions

The DeepFact paradigm—the co-evolution of models and benchmarks—addresses an acute need for evaluation in domains where expertise scarcity and cognitive load render traditional gold-label datasets inadequate. The AtS protocol is cost-efficient and amortizes the overhead of deep expert annotation, as the majority of adjudication and improvement occurs in the early rounds, with diminishing human effort in later rounds.

Practically, DeepFact-Bench and DeepFact-Eval instantiate a scalable, transparent infrastructure for ongoing “living” evaluation, necessary as LLMs reach or surpass expert-level literature reasoning capabilities. This lays the foundation for fully autonomous, continuously self-improving benchmarks driven by agent–agent auditing, with periodic expert involvement to ensure calibration.

However, DeepFact-Eval is fundamentally constrained to verification against the extant literature. It is not equipped to conduct original experimental science, and the computational burden of thorough, context-rich verification remains substantial. Real-time or low-latency deployment in decision-critical applications (e.g., clinical, legal, regulatory) is currently limited by this overhead.

Conclusion

DeepFact demonstrates that the factuality bottleneck for research-centric LLM outputs is not due to inherent limitations in modeling, but to the brittleness of static benchmarks and expert annotation workflows. The AtS protocol enables dynamic, evidence-driven evaluation where both agentic verifiers and benchmark labels iteratively improve in a traceable, versioned process. DeepFact-Bench and DeepFact-Eval establish new state-of-the-art accuracy on DRR factuality evaluation, with strong evidence for transfer, modular maintainability, and extensibility toward fully autonomous benchmarking ecosystems.
