
Deep Research Bench II

Updated 17 March 2026
  • Deep Research Bench II is a comprehensive evaluation suite for information-seeking AI systems, emphasizing multi-stage web retrieval, analytic synthesis, and structured reporting.
  • It uses expert-crafted rubric bundles and fine-grained binary checks to diagnose performance across diverse domains and ensure factual and analytic rigor.
  • The benchmark highlights gaps in current deep research agents such as retrieval precision, analytic inference, and citation reliability, guiding future system improvements.

Deep Research Bench II (DRB-II) is a comprehensive evaluation suite for deep research agents: open-ended, information-seeking AI systems tasked with multi-stage web retrieval, synthesis, and report generation. Unlike single-answer QA benchmarks, DRB-II is built to surface the strengths and limitations of systems that must analyze, reason, and communicate through long-form, evidence-based outputs. The benchmark defines atomic, verifiable rubrics and a multidimensional scoring protocol aligned with human expert standards. DRB-II has two principal instantiations in the literature: one centered on expert-constructed rubric bundles and multidimensional metrics for research-agent reports, and another grounded in fine-grained, binary rubrics reverse-engineered from professional investigative documents to enable atomic scoring of information recall, analytic inference, and presentation. Together, they provide a granular, methodologically rigorous framework for diagnosing, comparing, and driving progress in research agent architectures (Yao et al., 2 Oct 2025, Li et al., 13 Jan 2026).

1. Benchmark Construction and Task Design

DRB-II comprises two major implementations in state-of-the-art literature. The first, “A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports” (Yao et al., 2 Oct 2025), centers on 214 expert-crafted, high-complexity query-report tasks. Tasks span ten domains: Academia & Research, News & Current Affairs, Sports & Competitions, Commonsense & Education, Law & Politics, Business & Finance, Technology Intelligence, Environment & Sustainability, History & Social Sciences, and Health & Medicine. Each entry undergoes three rounds of human review with interleaved LLM-based audit for factual consistency, rubric validity, and stylistic uniformity. Expert teams construct reference bundles for every task, specifying:

  • Query-Specific Rubrics (QSRs): 8–15 binary/ternary checks per task, totaling 30 points, probing factual and causal completeness.
  • General-Report Rubrics (GRRs): 48 general indicators (structure, clarity, logic, citation, etc.) totaling 73 points.
  • Trustworthy-Source Links (TSLs): canonical URLs for citation verification.
  • Focus-Anchor Keywords (FAKs) and Focus-Deviation Keywords (FDKs): keywords to monitor topical completeness and drift.
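As an illustration of how such a reference bundle might be represented, the sketch below groups the four components into one record and checks the QSR budget stated above (8 to 15 checks totaling 30 points). The class names, field names, and `validate` method are hypothetical and are not taken from the benchmark's released code.

```python
from dataclasses import dataclass, field


@dataclass
class RubricItem:
    """One rubric check; `max_points` is the most it can contribute."""
    description: str
    max_points: int


@dataclass
class ReferenceBundle:
    """Hypothetical container for one task's expert-built reference material."""
    query: str
    qsrs: list[RubricItem] = field(default_factory=list)  # Query-Specific Rubrics
    grrs: list[RubricItem] = field(default_factory=list)  # General-Report Rubrics
    tsls: list[str] = field(default_factory=list)         # Trustworthy-Source Links
    faks: list[str] = field(default_factory=list)         # Focus-Anchor Keywords
    fdks: list[str] = field(default_factory=list)         # Focus-Deviation Keywords

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the QSR budget is met."""
        problems = []
        if not 8 <= len(self.qsrs) <= 15:
            problems.append(f"expected 8-15 QSRs, got {len(self.qsrs)}")
        qsr_total = sum(item.max_points for item in self.qsrs)
        if qsr_total != 30:
            problems.append(f"QSR points should total 30, got {qsr_total}")
        return problems
```

A bundle with ten 3-point QSRs passes `validate()`, while a bundle with a single QSR is flagged on both the count and the point total.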

A parallel instantiation, “DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report” (Li et al., 13 Jan 2026), contains 132 open-ended tasks rooted in 22 domains, derived from peer-reviewed, open-access surveys and systematic reviews. Across these tasks, a total of 9430 binary rubrics are extracted and refined through a four-stage LLM-plus-human workflow: LLM rubric extraction from expert articles, iterative self-evaluation for hallucination trimming, annotator revision for atomicity, and domain-expert review (>400 human-hours). Tasks are phrased to require data retrieval, analytic synthesis, and structured communication, with explicit temporal constraints where applicable.

2. Evaluation Rubrics and Scoring Protocols

DRB-II implements a multidimensional composite rubric framework. In the first instantiation (Yao et al., 2 Oct 2025), evaluation spans:

  1. Semantic Quality (Quality): Aggregation of QSR and GRR scores, linearly combined by

$$\mathrm{Quality} = \alpha\,\mathcal{N}_\mathrm{Ratio}[Q_{\mathrm{sum}}] + \beta\,\mathcal{N}_\mathrm{Ratio}[G_{\mathrm{sum}}]$$

where $\alpha+\beta=1$ (default $\alpha=\beta=0.5$).
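A minimal sketch of the Quality combination follows. It assumes that the normalization operator simply divides the earned points by the maximum (30 for QSRs, 73 for GRRs, as listed in the reference bundle) and clips to [0, 1]; the source does not spell out the operator, so `normalized_ratio` is an assumption.

```python
QSR_MAX = 30.0  # total QSR points per task, as stated in the bundle description
GRR_MAX = 73.0  # total GRR points, as stated in the bundle description


def normalized_ratio(score: float, max_score: float) -> float:
    """Assumed form of N_Ratio: earned points over the maximum, clipped to [0, 1]."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return min(max(score / max_score, 0.0), 1.0)


def quality(q_sum: float, g_sum: float, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Linear combination of normalized QSR and GRR totals; weights must sum to 1."""
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must sum to 1")
    return (alpha * normalized_ratio(q_sum, QSR_MAX)
            + beta * normalized_ratio(g_sum, GRR_MAX))
```

Under these assumptions, a report earning half of the available points on both rubric families (15 of 30 and 36.5 of 73) has a Quality of 0.5.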

  2. Topical Focus (SemanticDrift): Quantifies omission of FAKs and intrusion of FDKs:

$$\mathrm{FAK\_Drift} = 1 - \frac{1}{K}\sum_{k=1}^{K} \min\left(\frac{\mathrm{freq}^{(k)}}{\epsilon_+},1\right) \times \mathcal{N}_\mathrm{Ratio}(\mathrm{rele}^{(k)})$$

$$\mathrm{FDK\_Drift} = \frac{1}{L}\sum_{l=1}^{L} \min\left(\frac{\mathrm{freq}^{(l)}}{\epsilon_-},1\right) \times \mathcal{N}_\mathrm{Ratio}(\mathrm{rele}^{(l)})$$

The aggregated topical focus penalty is $\mathrm{SemanticDrift} = \lambda\,\mathrm{FAK\_Drift} + \mu\,\mathrm{FDK\_Drift}$ (default $\lambda=0.7,\;\mu=0.3$).
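The drift terms can be sketched as follows. Each keyword is given as a (frequency, relevance) pair, with relevance assumed to be already normalized to [0, 1]. The saturation thresholds `eps_plus` and `eps_minus` are placeholders, since the source does not state their values; only the weights 0.7 and 0.3 come from the text.

```python
from typing import Sequence, Tuple

Keyword = Tuple[float, float]  # (frequency in the report, relevance scaled to [0, 1])


def _mean_saturated(keywords: Sequence[Keyword], eps: float) -> float:
    """Average of min(freq / eps, 1) * relevance; 0.0 for an empty list."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not keywords:
        return 0.0
    return sum(min(freq / eps, 1.0) * rele for freq, rele in keywords) / len(keywords)


def fak_drift(faks: Sequence[Keyword], eps_plus: float = 3.0) -> float:
    """Penalty for missing anchor keywords: 1 minus their saturated coverage."""
    if not faks:
        raise ValueError("at least one focus-anchor keyword is required")
    return 1.0 - _mean_saturated(faks, eps_plus)


def fdk_drift(fdks: Sequence[Keyword], eps_minus: float = 3.0) -> float:
    """Penalty for deviation keywords that intrude into the report."""
    return _mean_saturated(fdks, eps_minus)


def semantic_drift(faks: Sequence[Keyword], fdks: Sequence[Keyword],
                   lam: float = 0.7, mu: float = 0.3,
                   eps_plus: float = 3.0, eps_minus: float = 3.0) -> float:
    """Weighted sum of the two drift terms."""
    return lam * fak_drift(faks, eps_plus) + mu * fdk_drift(fdks, eps_minus)
```

For example, if one of two equally relevant anchor keywords is fully covered and the other never appears, `fak_drift` is 0.5; a single deviation keyword with relevance 0.5 that appears at or above the threshold gives an `fdk_drift` of 0.5.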

  3. Retrieval Trustworthiness (TrustworthyBoost): Rewards correct citation of TSLs:

$$\mathrm{TrustworthyBoost} = 1 + \eta\left[\theta\,\mathrm{Rate}_{\mathrm{full\_hit}} + \kappa\,\mathrm{Rate}_{\mathrm{host\_hit}}\right]$$

with $\theta+\kappa=1$, $\eta=0.2$.
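The following sketch shows one way the two hit rates and the boost could be computed. Several details are assumptions rather than the paper's definitions: both rates are taken as fractions of the task's TSLs, a full hit means a cited URL matches a TSL on host and path, a host hit means some cited URL shares the TSL's hostname (so every full hit is also a host hit), and the defaults for `theta` and `kappa` are placeholders because the source states only that they sum to 1.

```python
from urllib.parse import urlsplit


def _normalize(url: str) -> tuple:
    """Reduce a URL to (lowercase host, path without trailing slash)."""
    parts = urlsplit(url.strip())
    return (parts.hostname or "").lower(), parts.path.rstrip("/")


def hit_rates(cited_urls, tsls):
    """Return (full_hit_rate, host_hit_rate) over the trustworthy-source links."""
    if not tsls:
        return 0.0, 0.0
    cited = {_normalize(u) for u in cited_urls}
    cited_hosts = {host for host, _ in cited}
    full = sum(1 for t in tsls if _normalize(t) in cited)
    host = sum(1 for t in tsls if _normalize(t)[0] in cited_hosts)
    return full / len(tsls), host / len(tsls)


def trustworthy_boost(rate_full: float, rate_host: float,
                      eta: float = 0.2, theta: float = 0.5, kappa: float = 0.5) -> float:
    """1 + eta * (theta * full-hit rate + kappa * host-hit rate)."""
    if abs(theta + kappa - 1.0) > 1e-9:
        raise ValueError("theta and kappa must sum to 1")
    return 1.0 + eta * (theta * rate_full + kappa * rate_host)
```

With these conventions, citing every TSL exactly yields the maximum boost of 1.2, which agrees with the upper end of the IntegratedScore range.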

The final integrated evaluation is multiplicative:

$$\mathrm{IntegratedScore} = \mathrm{Quality} \times (1-\mathrm{SemanticDrift}) \times \mathrm{TrustworthyBoost} \times 100 \in [0,120]$$
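A short worked example makes the multiplicative structure concrete. The function below only combines three already-computed components; the input values in the usage note are illustrative and are not results reported for any system.

```python
def integrated_score(quality: float, semantic_drift: float, trustworthy_boost: float) -> float:
    """Quality x (1 - SemanticDrift) x TrustworthyBoost x 100."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must lie in [0, 1]")
    if not 0.0 <= semantic_drift <= 1.0:
        raise ValueError("semantic_drift must lie in [0, 1]")
    if not 1.0 <= trustworthy_boost <= 1.2:
        raise ValueError("trustworthy_boost must lie in [1, 1.2] when eta = 0.2")
    return quality * (1.0 - semantic_drift) * trustworthy_boost * 100.0
```

With Quality 0.5, SemanticDrift 0.47, and TrustworthyBoost 1.15, the score is 0.5 × 0.53 × 1.15 × 100, or about 30.5. Because the terms multiply, a high drift penalty caps the score even for a report with strong rubric coverage, and the 120 ceiling is reached only with perfect Quality, zero drift, and the maximum boost.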

In the rubric-extraction instantiation (Li et al., 13 Jan 2026), each agent-produced report for a task is judged against all $R$ binary rubrics, yielding:

$$S = \frac{1}{R} \sum_{i=1}^{R} r_i$$

Dimension-specific scores aggregate rubric satisfaction for Information Recall, Analysis, and Presentation. Domain-wise and overall averages are reported.
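The binary-rubric scoring can be sketched as below for a single report. Each judgment is a (dimension, passed) pair; the dimension labels are illustrative, and how the paper pools scores across tasks and domains (per-rubric versus per-task averaging) is not specified here, so this sketch covers one task only.

```python
from collections import defaultdict


def rubric_scores(judgments):
    """Return (overall pass rate, per-dimension pass rates) for one report.

    `judgments` is a sequence of (dimension, passed) pairs, one per rubric.
    """
    if not judgments:
        raise ValueError("at least one rubric judgment is required")
    by_dim = defaultdict(list)
    for dimension, passed in judgments:
        by_dim[dimension].append(1 if passed else 0)
    overall = sum(sum(values) for values in by_dim.values()) / len(judgments)
    per_dimension = {dim: sum(values) / len(values) for dim, values in by_dim.items()}
    return overall, per_dimension
```

Because the overall score averages over rubrics rather than over dimensions, a dimension with many rubrics weighs more heavily in the total than one with few.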

3. Task Domains, Reference Bundles, and Ground Truth

The two leading versions of DRB-II differ in breadth and task seeding. The first provides 214 tasks across ten core thematic domains (from quantitative sciences to policy, law, and medicine), each designed to maximize semantic diversity and real-world relevance, structured to elicit reports instead of brief answers (Yao et al., 2 Oct 2025). For every task, human experts develop comprehensive reference bundles, enabling checks for factual accuracy, coverage, report structure, and citation reliability.

The second implementation draws 132 tasks from 22 domains, using open-access, peer-reviewed surveys as gold-standard answer sources (Li et al., 13 Jan 2026). Each task’s rubrics directly map to these articles, providing atomic, verifiable judgments on system outputs. This approach ties task ground truth to expert-authored, citable documentation, ensuring evaluation stability and reproducibility.

| DRB-II Benchmark | #Tasks | #Domains | Rubric Depth | Gold Source Construction |
|---|---|---|---|---|
| (Yao et al., 2 Oct 2025) | 214 | 10 | QSR+GRR+FAK/FDK/TSL | Human/LLM hybrid, reference bundle |
| (Li et al., 13 Jan 2026) | 132 | 22 | 9430 binary | Peer-reviewed articles, LLM+human |

4. Experimental Results and Diagnostic Findings

Experimental results on DRB-II reveal persistent gaps between state-of-the-art deep research agents and human-expert performance.

In (Yao et al., 2 Oct 2025), thirteen models—including five specialized DRAs and multiple tool-augmented LLM baselines—were evaluated. The top-performing agents (e.g., Qwen-DeepResearch: 34.65, Sonar-DeepResearch: 33.47) outperformed GPT-5 and Claude 3.7 by margins of 5–15 points in IntegratedScore, confirming superiority in multi-stage reasoning, task decomposition, and structured reporting. However, even the best agents demonstrated weaknesses: frequent FDK intrusions (SemanticDrift ≈ 0.47), token inefficiency (e.g., o3-DeepResearch ~ 25K tokens per report), and inconsistent citation of authoritative sources.

Results from (Li et al., 13 Jan 2026) further highlight the challenge: the leading model, OpenAI-GPT-o3 Deep Research, satisfied only 45.4% of expert rubrics overall. Information Recall was the primary bottleneck (≤ 40%), with Analysis mostly in the 42–52% range, while Presentation typically exceeded 80%. This pattern indicates that current agents excel at report formatting but struggle to discover and verify core facts or to synthesize evidence into novel insight. Human–LLM rubric agreement was nonetheless high (over 90% accuracy with Gemini-2.5-Pro as the judge), which supports the reliability of DRB-II's LLM-judged scoring.

| Model | InfoRecall | Analysis | Presentation | Total |
|---|---|---|---|---|
| OpenAI-GPT-o3 Deep Research | 39.98 | 49.85 | 89.16 | 45.40 |
| Gemini-3-Pro Deep Research | 39.09 | 48.94 | 91.85 | 44.60 |
| Qwen3-Max Deep Research | 34.18 | 48.04 | 74.59 | 39.25 |
| Tongyi Deep Research | 22.95 | 35.89 | 86.13 | 29.89 |
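The human–LLM agreement figure cited above is an accuracy over paired binary judgments. As a minimal sketch, assuming each rubric receives one human label and one LLM-judge label (the function name and input layout are illustrative):

```python
def agreement_accuracy(human_labels, judge_labels):
    """Fraction of rubrics on which the human and the LLM judge give the same verdict."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label sequences must have the same length")
    if not human_labels:
        raise ValueError("at least one labeled rubric is required")
    matches = sum(1 for h, j in zip(human_labels, judge_labels) if h == j)
    return matches / len(human_labels)
```

For instance, labels `[1, 0, 1, 1]` and `[1, 0, 0, 1]` agree on three of four rubrics, an accuracy of 0.75.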

A plausible implication is that current deep research architectures require advances in retrieval coverage, cross-source validation, and stepwise analytic reasoning to approach human reliability.

5. Methodological Innovations and Comparative Advantages

DRB-II improves significantly upon prior benchmarks by:

  • Extending beyond short answers to require comprehensive, report-style outputs with explicitly defined rubrics for every evaluation dimension.
  • Embedding composite rubrics spanning factual coverage, analytic synthesis, structural coherence, and citation correctness (QSR + GRR + FAK/FDK + TSL in (Yao et al., 2 Oct 2025)).
  • Deriving atomic, human-interpretable rubrics directly from expert-authored documents (Li et al., 13 Jan 2026), ensuring that task requirements and evaluation criteria are precisely mapped to ground-truth evidentiary standards.
  • Introducing robust evaluation schemes that penalize topical drift (via FDK/FAK tracking), inappropriate citations, and stylistic lapses.
  • Implementing human–LLM agreement studies and cross-language/ANOVA stability checks to validate benchmark integrity.

This multidimensional, granular approach yields fine-grained, interpretable diagnostics on agent performance, identifying subtleties in information recall, higher-order inference, and result presentation.

6. Implications, Limitations, and Extensions

DRB-II exposes core limitations of current research agents: gaps in credible retrieval, limited causal or comparative analysis, lack of effective source triangulation, and failures in audience adaptation. Addressing these would require:

  • Architectural improvements such as adaptive search-depth controls (to limit redundant retrieval once core FAK coverage is achieved), decomposition constraints for domain and linguistic alignment at every step, and prioritized citation planners to maximize TrustworthyBoost (Yao et al., 2 Oct 2025).
  • Integration of symbolic reasoning or causal-graph layers to complement LLM outputs with verifiable stepwise analytic frameworks (Li et al., 13 Jan 2026).
  • User-adaptive presentation mechanisms extending evaluation criteria to reflect target audience background, with implications for user modeling and agent memory.
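The adaptive search-depth idea in the first item can be illustrated with a simple stopping rule: keep issuing queries until the retrieved text covers enough of a set of anchor keywords. This is a sketch under stated assumptions, not a method from either paper. The `search` callable, the substring-based coverage test, and the `target` threshold are all hypothetical, and an agent would have to derive its own anchor keywords, since the benchmark's FAKs belong to the hidden reference bundle.

```python
from typing import Callable, Iterable, List, Sequence


def keyword_coverage(text: str, keywords: Sequence[str]) -> float:
    """Fraction of keywords that occur in the text (case-insensitive substring match)."""
    if not keywords:
        return 1.0
    lowered = text.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered) / len(keywords)


def retrieve_until_covered(queries: Iterable[str],
                           search: Callable[[str], str],
                           anchor_keywords: Sequence[str],
                           target: float = 0.8,
                           max_steps: int = 10) -> List[str]:
    """Run queries in order, stopping once coverage reaches `target` or the step budget ends."""
    collected: List[str] = []
    for step, query in enumerate(queries):
        if step >= max_steps:
            break
        collected.append(search(query))
        if keyword_coverage(" ".join(collected), anchor_keywords) >= target:
            break  # core topics covered; further retrieval is likely redundant
    return collected
```

The step budget bounds token use even when coverage never reaches the target, which addresses the token-inefficiency weakness noted in the experimental results.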

Future directions include semi-automated rubric expansion for coverage beyond the current 214/132 tasks (while sustaining expert-level curation), support for multimodal queries (e.g., tables and figures), and development of open-source tools to operationalize DRB-II’s scoring pipeline (Yao et al., 2 Oct 2025). An important methodological note: to minimize “answer leakage,” open-source deployments should block access to direct answer documents during system evaluation (Li et al., 13 Jan 2026).
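One simple way to approximate the leakage safeguard is to filter search results against a blocklist of the source documents from which each task's rubrics were derived. The sketch below is an assumption about how such a filter could look; the function names are hypothetical, and matching on host plus path is a deliberate simplification that would miss mirrors or preprint copies of the same article.

```python
from urllib.parse import urlsplit


def _url_key(url: str) -> tuple:
    """Reduce a URL to (host without 'www.', path without trailing slash)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower().removeprefix("www.")
    return host, parts.path.rstrip("/")


def filter_leaky_results(result_urls, blocked_urls):
    """Drop any search result that points at a blocked answer document."""
    blocked = {_url_key(u) for u in blocked_urls}
    return [u for u in result_urls if _url_key(u) not in blocked]
```

The filter would sit between the search tool and the agent, so that the agent never sees the gold article while the rest of the web stays reachable. Note that `str.removeprefix` requires Python 3.9 or later.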

7. Research Significance and Future Outlook

Deep Research Bench II sets a demanding standard for comprehensive, human-aligned evaluation of research agents. By grounding evaluation in atomic, expert-derived rubrics and multidimensional scoring with explicit penalties, DRB-II diagnoses failure modes precisely and benchmarks progress in system architecture. The persistent gap between automated agents and human-expert reliability highlights a rich research agenda in retrieval, reasoning, and adaptive presentation, for which DRB-II provides an extensible, reproducible testbed (Yao et al., 2 Oct 2025, Li et al., 13 Jan 2026).
