Deep Research Bench II
- Deep Research Bench II is a comprehensive evaluation suite for information-seeking AI systems, emphasizing multi-stage web retrieval, analytic synthesis, and structured reporting.
- It uses expert-crafted rubric bundles and fine-grained binary checks to diagnose performance across diverse domains and ensure factual and analytic rigor.
- The benchmark highlights gaps in current deep research agents such as retrieval precision, analytic inference, and citation reliability, guiding future system improvements.
Deep Research Bench II (DRB-II) is a comprehensive evaluation suite for deep research agents: open-ended, information-seeking AI systems tasked with multi-stage web retrieval, synthesis, and report generation. Unlike single-answer QA benchmarks, DRB-II is built to surface the strengths and limitations of systems that must analyze, reason, and communicate through long-form, evidence-based outputs. The benchmark defines atomic, verifiable rubrics and a multidimensional scoring protocol aligned with human expert standards. Two principal instantiations of DRB-II appear in the literature: one centered on expert-constructed rubric bundles and multidimensional metrics for research-agent reports, and another grounded in fine-grained, binary rubrics reverse-engineered from expert-authored review articles to enable atomic scoring of information recall, analytic inference, and presentation. Together, they constitute a granular and methodologically rigorous framework for diagnosing, comparing, and driving progress in research agent architectures (Yao et al., 2 Oct 2025, Li et al., 13 Jan 2026).
1. Benchmark Construction and Task Design
DRB-II comprises two major implementations in state-of-the-art literature. The first, “A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports” (Yao et al., 2 Oct 2025), centers on 214 expert-crafted, high-complexity query-report tasks. Tasks span ten domains: Academia & Research, News & Current Affairs, Sports & Competitions, Commonsense & Education, Law & Politics, Business & Finance, Technology Intelligence, Environment & Sustainability, History & Social Sciences, and Health & Medicine. Each entry undergoes three rounds of human review with interleaved LLM-based audit for factual consistency, rubric validity, and stylistic uniformity. Expert teams construct reference bundles for every task, specifying:
- Query-Specific Rubrics (QSRs): 8–15 binary/ternary checks per task, totaling 30 points, probing factual and causal completeness.
- General-Report Rubrics (GRRs): 48 general indicators (structure, clarity, logic, citation, etc.) totaling 73 points.
- Trustworthy-Source Links (TSLs): canonical URLs for citation verification.
- Focus-Anchor Keywords (FAKs) and Focus-Deviation Keywords (FDKs): keywords to monitor topical completeness and drift.
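As an illustration of how such a reference bundle might be represented, here is a minimal Python sketch. The class and field names (`RubricItem`, `ReferenceBundle`, `validate`) are hypothetical and are not taken from Yao et al.; only the constraints checked (8–15 QSR checks per task, 30 QSR points in total) come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class RubricItem:
    """One query-specific check; max_points is whatever the expert assigned."""
    description: str
    max_points: int


@dataclass
class ReferenceBundle:
    """Hypothetical container for the per-task reference material."""
    query: str
    qsrs: list[RubricItem]                          # Query-Specific Rubrics
    tsls: list[str] = field(default_factory=list)   # Trustworthy-Source Links
    faks: list[str] = field(default_factory=list)   # Focus-Anchor Keywords
    fdks: list[str] = field(default_factory=list)   # Focus-Deviation Keywords

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the bundle looks sane."""
        problems = []
        if not 8 <= len(self.qsrs) <= 15:
            problems.append(f"expected 8-15 QSR checks, found {len(self.qsrs)}")
        total = sum(item.max_points for item in self.qsrs)
        if total != 30:
            problems.append(f"QSR points should total 30, found {total}")
        return problems
```

The GRRs are left out of the per-task structure here on the assumption that the 48 general indicators are shared across all tasks.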
A parallel instantiation, “DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report” (Li et al., 13 Jan 2026), contains 132 open-ended tasks spanning 22 domains, derived from peer-reviewed, open-access surveys and systematic reviews. Across these tasks, a total of 9430 binary rubrics are extracted and refined through a four-stage workflow combining LLMs and human reviewers: LLM rubric extraction from expert articles, iterative self-evaluation to trim hallucinated rubrics, annotator revision for atomicity, and domain-expert review (>400 human-hours). Tasks are phrased to require data retrieval, analytic synthesis, and structured communication, with explicit temporal constraints where applicable.
2. Evaluation Rubrics and Scoring Protocols
DRB-II implements a multidimensional composite rubric framework. In the first instantiation (Yao et al., 2 Oct 2025), evaluation spans:
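- Semantic Quality (Quality): aggregates the QSR and GRR scores through a weighted linear combination, where a weighting coefficient (with a benchmark-defined default) balances the two rubric families.
- Topical Focus (SemanticDrift): quantifies omission of FAKs and intrusion of FDKs. The two components are merged into a single aggregated topical-focus penalty, again using a weighting coefficient with a benchmark-defined default.
- Retrieval Trustworthiness (TrustworthyBoost): rewards correct citation of TSLs, with configurable parameters controlling the size of the reward.
The final integrated evaluation, IntegratedScore, combines these three components multiplicatively.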
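A minimal sketch of how these three components could be wired together. The functional forms, the parameter names (`alpha`, `lam`, `eta`), and their default values are all illustrative assumptions and are not taken from Yao et al.; only the overall shape (a linear mix for Quality, a keyword-based drift penalty, a citation-based boost, and a multiplicative combination) follows the description above.

```python
def quality(qsr_points, grr_points, alpha=0.5, qsr_max=30, grr_max=73):
    """Weighted linear mix of normalized QSR and GRR scores (alpha is assumed)."""
    return alpha * (qsr_points / qsr_max) + (1 - alpha) * (grr_points / grr_max)


def semantic_drift(report, faks, fdks, lam=0.5):
    """Assumed penalty in [0, 1]: share of missing FAKs mixed with share of present FDKs."""
    text = report.lower()
    missing = sum(k.lower() not in text for k in faks) / len(faks) if faks else 0.0
    intruding = sum(k.lower() in text for k in fdks) / len(fdks) if fdks else 0.0
    return lam * missing + (1 - lam) * intruding


def trustworthy_boost(cited_urls, tsls, eta=0.2):
    """Assumed multiplier >= 1 that grows with the share of TSLs the report cites."""
    if not tsls:
        return 1.0
    hit_rate = len(set(cited_urls) & set(tsls)) / len(tsls)
    return 1.0 + eta * hit_rate


def integrated_score(quality_value, drift_value, boost_value):
    """Multiplicative combination: drift shrinks the score, the boost enlarges it."""
    return quality_value * (1.0 - drift_value) * boost_value
```

Under these assumptions, a report with perfect rubric scores, no drift, and no trusted citations would score exactly 1.0, and any drift or missing rubric points would scale that value down.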
In the rubric-extraction instantiation (Li et al., 13 Jan 2026), each agent-produced report for a task is judged against all of that task's binary rubrics, yielding a pass rate: the fraction of rubrics the report satisfies.
Dimension-specific scores aggregate rubric satisfaction for Information Recall, Analysis, and Presentation. Domain-wise and overall averages are reported.
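The binary-rubric scoring can be sketched in a few lines of Python. The function name `score_report` and the input format are hypothetical, and pooling all rubrics for the overall figure (a micro-average) is an assumption; the paper's exact aggregation may differ.

```python
from collections import defaultdict


def score_report(judgments):
    """Compute per-dimension and overall pass rates.

    judgments: iterable of (dimension, passed) pairs, one per binary rubric,
    e.g. ("Information Recall", True).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for dimension, ok in judgments:
        total[dimension] += 1
        passed[dimension] += int(bool(ok))
    if not total:
        raise ValueError("no rubric judgments supplied")
    per_dimension = {d: passed[d] / total[d] for d in total}
    # Assumption: the overall score pools every rubric, so dimensions with
    # more rubrics weigh more heavily than dimensions with few.
    overall = sum(passed.values()) / sum(total.values())
    return per_dimension, overall


example = [
    ("Information Recall", True),
    ("Information Recall", False),
    ("Analysis", True),
    ("Presentation", True),
]
per_dimension, overall = score_report(example)
```

With pooling, a dimension that contributes many rubrics dominates the total, which is one way a high Presentation score could coexist with a much lower overall score.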
3. Task Domains, Reference Bundles, and Ground Truth
The two leading versions of DRB-II differ in breadth and task seeding. The first provides 214 tasks across ten core thematic domains (from quantitative sciences to policy, law, and medicine), each designed to maximize semantic diversity and real-world relevance, structured to elicit reports instead of brief answers (Yao et al., 2 Oct 2025). For every task, human experts develop comprehensive reference bundles, enabling checks for factual accuracy, coverage, report structure, and citation reliability.
The second implementation draws 132 tasks from 22 domains, using open-access, peer-reviewed surveys as gold-standard answer sources (Li et al., 13 Jan 2026). Each task’s rubrics directly map to these articles, providing atomic, verifiable judgments on system outputs. This approach ties task ground truth to expert-authored, citable documentation, ensuring evaluation stability and reproducibility.
| DRB-II Benchmark | #Tasks | #Domains | Rubric Depth | Gold Source Construction |
|---|---|---|---|---|
| (Yao et al., 2 Oct 2025) | 214 | 10 | QSR+GRR+FAK/FDK/TSL | Human/LLM hybrid, reference bundle |
| (Li et al., 13 Jan 2026) | 132 | 22 | 9430 binary | Peer-reviewed articles, LLM+human |
4. Experimental Results and Diagnostic Findings
Experimental results on DRB-II reveal persistent gaps between state-of-the-art deep research agents and human-expert performance.
In (Yao et al., 2 Oct 2025), thirteen models, including five specialized deep research agents (DRAs) and multiple tool-augmented LLM baselines, were evaluated. The top-performing agents (e.g., Qwen-DeepResearch: 34.65, Sonar-DeepResearch: 33.47) outperformed GPT-5 and Claude 3.7 by margins of 5–15 points in IntegratedScore, indicating an advantage in multi-stage reasoning, task decomposition, and structured reporting. However, even the best agents showed weaknesses: frequent FDK intrusions (SemanticDrift ≈ 0.47), token inefficiency (e.g., o3-DeepResearch at roughly 25K tokens per report), and inconsistent citation of authoritative sources.
Results from (Li et al., 13 Jan 2026) further highlight the challenge: the leading model, OpenAI-GPT-o3 Deep Research, satisfied only 45.4% of expert rubrics overall. Information Recall was the primary bottleneck (≤ 40%), with Analysis in the 42–52% range, while Presentation exceeded 80%. This pattern indicates that current agents excel at report formatting but struggle to discover and verify core facts or to synthesize evidence into novel insight. Human–LLM rubric agreement, however, was high (over 90% accuracy with Gemini-2.5-Pro as the judge), which supports the reliability of DRB-II's LLM-based judging.
| Model | InfoRecall | Analysis | Presentation | Total |
|---|---|---|---|---|
| OpenAI-GPT-o3 Deep Research | 39.98 | 49.85 | 89.16 | 45.40 |
| Gemini-3-Pro Deep Research | 39.09 | 48.94 | 91.85 | 44.60 |
| Qwen3-Max Deep Research | 34.18 | 48.04 | 74.59 | 39.25 |
| Tongyi Deep Research | 22.95 | 35.89 | 86.13 | 29.89 |
A plausible implication is that current deep research architectures require advances in retrieval coverage, cross-source validation, and stepwise analytic reasoning to approach human reliability.
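The human–LLM agreement figure reported in this section can be read as plain accuracy over paired rubric verdicts. A minimal sketch, assuming each rubric receives one human label and one LLM-judge label (the function name and input format are hypothetical):

```python
def agreement_accuracy(human_labels, judge_labels):
    """Fraction of rubrics on which the LLM judge matches the human verdict."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("at least one labeled rubric is required")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


# Four rubrics, one disagreement on the third.
human = [True, False, True, True]
judge = [True, False, False, True]
accuracy = agreement_accuracy(human, judge)
```

Accuracy alone does not correct for chance agreement; when pass and fail verdicts are heavily imbalanced, a chance-corrected statistic may be a useful complement.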
5. Methodological Innovations and Comparative Advantages
DRB-II improves significantly upon prior benchmarks by:
- Extending beyond short answers to require comprehensive, report-style outputs with explicitly defined rubrics for every evaluation dimension.
- Embedding composite rubrics spanning factual coverage, analytic synthesis, structural coherence, and citation correctness (QSR + GRR + FAK/FDK + TSL in (Yao et al., 2 Oct 2025)).
- Deriving atomic, human-interpretable rubrics directly from expert-authored documents (Li et al., 13 Jan 2026), ensuring that task requirements and evaluation criteria are precisely mapped to ground-truth evidentiary standards.
- Introducing robust evaluation schemes that penalize topical drift (via FDK/FAK tracking), inappropriate citations, and stylistic lapses.
- Implementing human–LLM agreement studies and cross-language/ANOVA stability checks to validate benchmark integrity.
This multidimensional, granular approach yields fine-grained, interpretable diagnostics on agent performance, identifying subtleties in information recall, higher-order inference, and result presentation.
6. Implications, Limitations, and Extensions
DRB-II exposes core limitations of current research agents: gaps in credible retrieval, limited causal or comparative analysis, lack of effective source triangulation, and failures in audience adaptation. Addressing these would require:
- Architectural improvements such as adaptive search-depth controls (to limit redundant retrieval once core FAK coverage is achieved), decomposition constraints for domain and linguistic alignment at every step, and prioritized citation planners to maximize TrustworthyBoost (Yao et al., 2 Oct 2025).
- Integration of symbolic reasoning or causal-graph layers to complement LLM outputs with verifiable stepwise analytic frameworks (Li et al., 13 Jan 2026).
- User-adaptive presentation mechanisms extending evaluation criteria to reflect target audience background, with implications for user modeling and agent memory.
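The adaptive search-depth idea in the first item can be sketched as a retrieval loop that stops once a set of target keywords is covered. This is an illustrative assumption about how such a control might look, not a mechanism described in either paper; `search` is a hypothetical callable standing in for a real retrieval step, and the keyword list stands in for FAK-like anchors that an agent would have to derive from the query itself.

```python
def keyword_coverage(text, keywords):
    """Share of keywords that appear (case-insensitively) in the text."""
    if not keywords:
        return 1.0
    lowered = text.lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)


def retrieve_until_covered(search, keywords, target=1.0, max_rounds=5):
    """Call search(round_index) until coverage reaches target or rounds run out.

    Returns the collected documents and the final coverage value.
    """
    collected = []
    coverage = keyword_coverage("", keywords)
    rounds = 0
    while coverage < target and rounds < max_rounds:
        collected.append(search(rounds))
        rounds += 1
        coverage = keyword_coverage(" ".join(collected), keywords)
    return collected, coverage


# Toy retrieval backend: each round returns the next canned page.
pages = ["Solar subsidies in 2024", "Wind capacity grew", "Unrelated page"]
docs, coverage = retrieve_until_covered(lambda i: pages[i], ["solar", "wind"])
```

In this toy run the loop stops after the second page, because both keywords are covered and a third retrieval would add nothing toward the coverage target.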
Future directions include semi-automated rubric expansion for coverage beyond the current 214/132 tasks (while sustaining expert-level curation), support for multimodal queries (e.g., tables and figures), and development of open-source tools to operationalize DRB-II’s scoring pipeline (Yao et al., 2 Oct 2025). An important methodological note: to minimize “answer leakage,” open-source deployments should block access to direct answer documents during system evaluation (Li et al., 13 Jan 2026).
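One simple way to approximate the leakage safeguard is to filter retrieved URLs against a blocklist of the source articles. The sketch below is an assumption about how such a filter could work, not the authors' tooling; the function names are hypothetical, the URLs use the reserved `example.org` domain, and the normalization (dropping the scheme, a leading `www.`, and a trailing slash) is deliberately coarse.

```python
from urllib.parse import urlparse


def normalize(url):
    """Reduce a URL to host + path so trivially different forms compare equal."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")


def filter_results(result_urls, blocked_urls):
    """Drop any retrieved URL that points at a blocked answer document."""
    blocked = {normalize(u) for u in blocked_urls}
    return [u for u in result_urls if normalize(u) not in blocked]


results = [
    "https://www.example.org/survey/",
    "http://example.org/other",
]
allowed = filter_results(results, ["http://example.org/survey"])
```

A URL-level filter of this kind would not catch mirrors, preprints, or quoted excerpts of the same article, so it should be treated as a first line of defense only.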
7. Research Significance and Future Outlook
Deep Research Bench II establishes a new standard for comprehensive, human-aligned evaluation of research agents. By combining atomic, expert-derived rubrics with multidimensional scoring that explicitly penalizes topical drift and unreliable citation, DRB-II diagnoses failure modes precisely and benchmarks progress in system architecture. The persistent gap between automated agents and human-expert reliability highlights a rich research agenda in retrieval, reasoning, and adaptive presentation, for which DRB-II provides an extensible, reproducible testbed (Yao et al., 2 Oct 2025, Li et al., 13 Jan 2026).