Verifying complex multi-hop scientific claims in deep research reports

Develop reliable procedures to verify the factuality of complex, multi-hop scientific claims in deep research reports produced by search-based agentic large language models, so that claim-level judgments can be made accurately in this expert-level, long-context setting.

Background

The paper studies factuality verification for deep research reports (DRRs) generated by search-based agentic LLMs, where claims often require multi-hop reasoning across extensive technical literature. The authors argue that common verifier workflows centered on snippet-level matching are insufficient for DRRs, and that reliable verification must operate over full documents and cross-check the broader literature rather than only in-report citations.

They further show that static expert-labeled benchmarks are brittle in this domain: in a controlled study, unassisted PhD-level specialists achieved only 60.8% accuracy on hidden micro-gold claims. This finding motivates their Audit-then-Score (AtS) paradigm and the DeepFact-Bench and DeepFact-Eval artifacts. The introduction identifies the underlying open challenge: dependable methods are still needed to verify complex, multi-hop scientific claims at the claim level within DRRs.

References

"However, verifying these complex, multi-hop scientific claims remains an open challenge."

DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality (2603.05912 - Huang et al., 6 Mar 2026) in Section 1 (Introduction), first paragraph