DeepResearch Benchmark

Updated 14 April 2026

DeepResearch Benchmark is an evolving suite of evaluation frameworks that rigorously measure autonomous, multi-step information retrieval, synthesis, and report generation.
It features a two-stage task structure where agents first synthesize key information and then generate detailed long-form reports from heterogeneous and multimodal sources.
The benchmarks employ granular evaluation protocols and diverse metrics to expose actionable failure modes and guide improvements in retrieval, reasoning, and report quality.

DeepResearch benchmarks comprise a rapidly evolving suite of evaluation frameworks designed to rigorously measure the capabilities of agentic systems that autonomously conduct complex, multi-step information retrieval, synthesis, and report generation. These benchmarks emphasize user-centric, open-ended, real-world tasks; objective and granular evaluation protocols; and explicit separation of reasoning, retrieval, and presentation competencies. Spanning both general and scientific domains (including web, enterprise, private heterogeneous, and multimodal sources), contemporary DeepResearch benchmarks expose actionable failure modes and enable reproducible, human-aligned comparison across agent architectures and toolchains.

1. Formal Definition and Task Structure

DeepResearch tasks are characterized by high levels of search and reasoning intensity, requiring an agent to issue tens of queries, process diverse information units, and synthesize atomic claims or findings into structured outputs. Recent benchmarks formalize a two-stage abstraction (Java et al., 6 Aug 2025):

Subtask 1: Information Synthesis. Given a corpus $\mathcal{C}$ and a query $q$, the system must discover a set of key claims $\mathcal{A} = [A_1, A_2, \ldots, A_m]$, often structured as nested dictionaries of sub-claims.
Subtask 2: Report Generation. Conditioned on $\mathcal{A}$, generate the long-form report.

A query qualifies as "deep research" if it requires processing many information units and at least some aspect of its search or synthesis demands non-trivial reasoning. The workflow can be formalized as a directed acyclic graph over the tuple $(q, \text{retrieved files}, \text{claims})$, whose edges represent search or aggregation actions.
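As an illustration of this abstraction, a task instance and its workflow graph might be represented as below. This is a minimal sketch, not any benchmark's released schema: the class `DeepResearchTask`, the helper `flatten_claims`, and the node labels are all hypothetical names chosen for the example. The ordering step uses Python's standard `graphlib` module.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


def flatten_claims(claims, prefix=""):
    """Flatten nested claim dictionaries into {dotted_key: leaf_value}."""
    flat = {}
    for key, value in claims.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_claims(value, path))
        else:
            flat[path] = value
    return flat


@dataclass
class DeepResearchTask:
    """Hypothetical container for one task (names are assumptions)."""
    query: str
    # Subtask 1 output: key claims, possibly nested into sub-claims.
    claims: dict = field(default_factory=dict)
    # Workflow DAG: each node maps to the set of nodes it depends on.
    workflow: dict = field(default_factory=dict)

    def execution_order(self):
        """Return one valid node ordering; raises CycleError if not a DAG."""
        return list(TopologicalSorter(self.workflow).static_order())


task = DeepResearchTask(
    query="q",
    claims={"A_1": {"method": "x", "result": "y"}, "A_2": "z"},
    workflow={
        "search:1": {"query"},
        "search:2": {"query"},
        "claim:A_1": {"search:1", "search:2"},  # aggregation over two searches
        "claim:A_2": {"search:2"},
    },
)

flat_claims = flatten_claims(task.claims)
order = task.execution_order()
```

Flattening the nested claims into dotted keys gives each sub-claim a stable identifier, which is convenient for any evaluation that aligns predicted and gold claims by key.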

Benchmarks such as LiveDRBench (Java et al., 6 Aug 2025), DeepResearch Bench (Du et al., 13 Jun 2025), and DRBench (Abaskohi et al., 30 Sep 2025) operationalize this framework across open-domain, enterprise, and scientific contexts. Tasks range from exhaustive synthesis (e.g., literature reviews, technical surveys, market analyses) to cross-modal and multimodal research (e.g., scientific data, images, enterprise documents).

2. Benchmark Construction and Domains

Construction pipelines are typically multi-stage, combining large-scale data curation with expert design:

Query Sourcing: Queries are sampled from real user logs, expert analysis, synthetic enterprise workflows, or diverse academic fields to ensure coverage and difficulty (Java et al., 6 Aug 2025, Wu et al., 1 Mar 2026, Abaskohi et al., 30 Sep 2025, Guo et al., 30 Nov 2025).
Corpus Design: Corpora are either fixed and frozen (to ensure reproducibility; e.g., ClueWeb22, FineWeb (Coelho et al., 25 May 2025), BrowseComp-Plus (Chen et al., 8 Aug 2025)), private/multimodal (IoDResearch (Shi et al., 2 Oct 2025)), or synthetic but cross-linked (HERB (Choubey et al., 29 Jun 2025)).
Gold Annotations: Each task is paired with ground-truth intermediate structures (claim lists, insight sets), reference reports, or detailed diagnostic rubrics extracted either semi-automatically or via intensive human curation (Han et al., 19 Dec 2025, Li et al., 13 Jan 2026).

Coverage spans both general (science, business, law, health, technology) and domain-specific (enterprise, scientific, multimodal, Chinese, private data) regimes, as summarized below:

Benchmark	#Tasks	Supported Domains	Output
DeepResearch Bench	100	22 (PhD-level, EN/ZH)	Long report
LiveDRBench	100	Science, Public Events	Claims/graph
BrowseComp-Plus	830	Broad (fixed corpus)	Multi-step QA
IoDResearch	200+	Law, Sci, Private Data	QA, Reports
HERB (Enterprise)	800+	Synthetic Enterprise	Multi-hop QA
DeepResearch-9K	9000	Multi-hop Open-Web QA	Trace + Answer
DEER	50	13 (expert domains)	Expert report

3. Evaluation Protocols and Metrics

DeepResearch benchmarks universally adopt formal, multi-axis evaluation protocols to separate retrieval, reasoning, and report quality:

Claim-level evaluation: Structured JSON outputs for synthesized claims; claim-precision/recall/F1 computed via key-based alignment (Java et al., 6 Aug 2025).
Rubric-based atomicity: Diagnostic rubrics (dozens per task) directly derived from expert reports (e.g., 9,430 in DeepResearch Bench II (Li et al., 13 Jan 2026); 130 in DEER (Han et al., 19 Dec 2025)); rubrics cover information recall, reasoning/analysis, and presentation.
Report quality: Composite metrics such as RACE (reference-based, adaptive, weighted criteria) and holistic LLM-as-judge schemes scoring coverage, depth, instruction-following, and clarity (Du et al., 13 Jun 2025, Fan et al., 9 Oct 2025).
Retrieval and citation accuracy: Document retrieval metrics (Recall@K, Precision@K, MRR, nDCG@K), citation recall/precision, and explicit scoring of faithfulness (fraction of claims correctly supported by cited sources) (Coelho et al., 25 May 2025, Chen et al., 8 Aug 2025).
Checklist and content coverage: Coverage defined as fraction of human-constructed unit items satisfied (Wang et al., 16 Oct 2025).
Trace and plan analysis: Branching factor, number of backtracks, and referenced sources for agent trajectories, measuring efficiency and exploration depth (Java et al., 6 Aug 2025, Wu et al., 1 Mar 2026).
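The claim-level and retrieval metrics listed above can be sketched in a few lines. This is a minimal sketch under stated assumptions: claims are flat dictionaries aligned by identical keys, and a claim counts as correct only on exact value equality. Actual benchmarks may use fuzzy or LLM-judged matching instead. The function names are illustrative and not taken from any benchmark's code.

```python
def claim_prf(predicted, gold, match=lambda p, g: p == g):
    """Precision, recall, and F1 over claims aligned by shared keys."""
    shared = predicted.keys() & gold.keys()
    correct = sum(1 for k in shared if match(predicted[k], gold[k]))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs) if runs else 0.0


# Toy data, invented for illustration only.
predicted = {"a": 1, "b": 2, "c": 9}
gold = {"a": 1, "b": 2, "d": 4, "e": 5}
precision, recall, f1 = claim_prf(predicted, gold)

ranking = ["d3", "d1", "d7", "d2"]
r_at_3 = recall_at_k(ranking, {"d1", "d2"}, k=3)
rr = reciprocal_rank(ranking, {"d1", "d2"})
```

In the toy example two of three predicted claims match two of four gold claims, so precision is 2/3, recall is 1/2, and F1 is 4/7. Passing a different `match` callable is one way to swap exact equality for a softer comparison.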

Metrics fall predominantly into three types: (1) binary satisfaction or precision/recall for atomic rubrics and claims; (2) weighted aggregation for rubric-based or reference-based scoring; and (3) LLM-judge alignment, measured via agreement statistics (κ, MAD, etc.) where applicable.
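Weighted aggregation of binary rubric judgments can be illustrated as follows. This is a sketch assuming each rubric carries a dimension label, a non-negative weight, and a binary `satisfied` flag. The field names, weights, and example rubrics are invented for illustration and do not reproduce any benchmark's scoring formula.

```python
from collections import defaultdict


def weighted_score(rubrics):
    """Weighted fraction of satisfied rubrics, in [0, 1]."""
    total = sum(r["weight"] for r in rubrics)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(r["weight"] for r in rubrics if r["satisfied"]) / total


def score_by_dimension(rubrics):
    """Compute the weighted score separately for each rubric dimension."""
    groups = defaultdict(list)
    for r in rubrics:
        groups[r["dimension"]].append(r)
    return {dim: weighted_score(items) for dim, items in groups.items()}


# Hypothetical judgments for one report (assumed data).
rubrics = [
    {"dimension": "recall", "weight": 2.0, "satisfied": True},
    {"dimension": "recall", "weight": 1.0, "satisfied": False},
    {"dimension": "analysis", "weight": 1.0, "satisfied": True},
    {"dimension": "presentation", "weight": 1.0, "satisfied": False},
]

overall = weighted_score(rubrics)
per_dimension = score_by_dimension(rubrics)
```

Reporting the per-dimension breakdown next to the overall score keeps the separation between recall, analysis, and presentation visible instead of collapsing it into a single number.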

4. Empirical Findings and Benchmark Insights

Experimental results consistently reveal a substantive gap between frontier agentic LLM systems and human expert outputs:

Coverage bottleneck: Key point recall (KPR) remains the hardest metric (typical range 30–72), with current deep research agents (DRAs) missing large fractions of user-relevant claims even when overall clarity and insightfulness are high (Coelho et al., 25 May 2025).
Retrieval and reasoning trade-off: Oracle evidence conditions (all gold docs supplied) boost agent accuracy to >90%, but real retrieval constraints drop performance by 30–50 points, especially in heterogeneous environments (Choubey et al., 29 Jun 2025, Chen et al., 8 Aug 2025).
Plan/reasoning as bottleneck: Isolated planning module evaluation (Dr.Mi-Bench (Guo et al., 30 Nov 2025), LiveDRBench (Java et al., 6 Aug 2025)) shows <30% F1 for decomposition accuracy, indicating that agents under-decompose open-ended tasks.
Citation calibration: Closed-weight DRAs (e.g., Gemini, OpenAI, Perplexity Deep Research) maximize effective citations but often with modest precision (80–90%), while retrieval-focused agents achieve higher precision but at reduced citation breadth (Du et al., 13 Jun 2025, Fan et al., 9 Oct 2025).
Report quality and human agreement: State-of-the-art methods (DeepResearcher Reflect Evolve (Prateek, 28 Jan 2026), Gemini DeepResearch) achieve RACE/overall scores in the 45–50% range on 100-task PhD-level benchmarks, well below the noise ceiling (Du et al., 13 Jun 2025, Prateek, 28 Jan 2026).

Empirical analysis of trace data finds that agents balancing broad search (high fan-out) with adaptive backtracking and plan refinement achieve superior F1 scores; both naive breadth (many sources, no feedback) and over-pruning degrade performance (Java et al., 6 Aug 2025).
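Trace statistics of this kind can be computed from a logged agent trajectory. The sketch below assumes a hypothetical trace format in which every step records its `id`, its `parent` step, an `action` label, and optionally a `source`. Neither the schema nor the definition of branching factor used here (mean number of children over steps that have children) is taken from a specific benchmark.

```python
from collections import defaultdict


def trace_statistics(steps):
    """Summarize fan-out, backtracking, and source usage of an agent trace."""
    children = defaultdict(list)
    for step in steps:
        if step["parent"] is not None:
            children[step["parent"]].append(step["id"])
    fan_outs = [len(kids) for kids in children.values()]
    sources = {s["source"] for s in steps if s.get("source")}
    return {
        "mean_branching_factor": sum(fan_outs) / len(fan_outs) if fan_outs else 0.0,
        "backtracks": sum(1 for s in steps if s["action"] == "backtrack"),
        "distinct_sources": len(sources),
    }


# Hypothetical trajectory: a plan step fans out into searches, one of which
# is abandoned via an explicit backtrack step.
trace = [
    {"id": 0, "parent": None, "action": "plan"},
    {"id": 1, "parent": 0, "action": "search", "source": "a.org"},
    {"id": 2, "parent": 0, "action": "search", "source": "b.org"},
    {"id": 3, "parent": 0, "action": "search", "source": "a.org"},
    {"id": 4, "parent": 2, "action": "backtrack"},
    {"id": 5, "parent": 0, "action": "search", "source": "c.org"},
]

stats = trace_statistics(trace)
```

Comparing such statistics against downstream F1 is one way to check whether high fan-out without backtracking corresponds to the naive-breadth failure pattern described above.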

5. Specialized and Multimodal Benchmarks

Beyond general web or scientific report scenarios, recent benchmarks extend evaluation to new modalities and settings:

Enterprise DeepResearch (DRBench, HERB): Emphasizes multi-modal, privacy-preserving search over realistic corporate artifacts (Slack, PRs, chats, private docs) with joint public/private scope and agent tool orchestration (Abaskohi et al., 30 Sep 2025, Choubey et al., 29 Jun 2025).
Private heterogeneous and FAIR-compliance (IoDResearch): Evaluates retrieval, QA, and report synthesis over multimodal data and atomic knowledge graphs, incorporating digital object encapsulation and multi-granularity access (Shi et al., 2 Oct 2025).
Vision-DeepResearch (VDR-Bench): Introduces visual-first search and multi-hop reasoning over image crops plus textual expansion; multi-round cropped search strategies are shown to be essential for solving visually grounded queries that cannot be shortcut by world knowledge (Zeng et al., 2 Feb 2026).
Language and region specialization (ADR-Bench, DeepResearch-9K): Focuses on Chinese legal, financial, and policy domains (ADR-Bench), and progressive difficulty scaling (L1-L3, DeepResearch-9K) to reveal ceiling effects and domain adaptation patterns (Hu et al., 23 Dec 2025, Wu et al., 1 Mar 2026).

6. Human Alignment and Evaluation Methodology

Current DeepResearch benchmarks routinely include extensive human curation for prompt design, rubric construction, and validation, and often rely on ensemble LLM-as-judge protocols as scalable human proxies:

Agreement between LLM judges and expert annotations is consistently reported at 80–90% or higher on key metrics (measured with Cohen’s κ or mean deviation), which supports robust, low-variance scoring (Fan et al., 9 Oct 2025, Coelho et al., 25 May 2025, Li et al., 13 Jan 2026).
Human preference studies and pairwise win rates, especially in open-ended report settings, anchor the validity of automated scoring frameworks and expose which methods or model families most closely track expert judgments (Du et al., 13 Jun 2025, Hu et al., 23 Dec 2025).
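For binary rubric judgments, agreement between an LLM judge and an expert annotator can be quantified with raw agreement and Cohen's κ, which corrects for agreement expected by chance. The sketch below uses the standard definition of κ. The label vectors are invented toy data, not results from any cited benchmark.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two raters give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Toy judgments (1 = rubric satisfied, 0 = not satisfied).
judge = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
expert = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

agreement = percent_agreement(judge, expert)
kappa = cohens_kappa(judge, expert)
```

In this toy example raw agreement is 0.80 while κ is roughly 0.58, which shows why it matters whether a reported agreement figure is a raw percentage or a chance-corrected statistic.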

Evaluation is typically stratified by dimension (retrieval, reasoning, presentation) and further down to subdomain, enabling fine-grained analysis of both system and architectural failure modes (e.g., low recall in Sports & Fitness per DeepResearch Bench II (Li et al., 13 Jan 2026), persistent citation hallucination in LiveResearchBench (Wang et al., 16 Oct 2025)).

7. Current Limitations and Future Directions

Despite advances in evaluation design, significant limitations remain:

Recall and depth ceilings: Leading web-scale DRAs regularly satisfy <50% of atomic rubrics in expert-curated benchmarks, with even lower rates on complex, multi-source review tasks (Li et al., 13 Jan 2026, Guo et al., 30 Nov 2025).
Claim grounding: Many systems can surface the right claim or “nugget” but fail to ground it in verifiable evidence, as shown by the gap between claim-level and citation-level results (Java et al., 6 Aug 2025, Han et al., 19 Dec 2025).
Modular bottlenecks: Planning and decomposition modules are the primary bottleneck for agent performance, with gold-plan ablations showing up to +12% improvement in downstream reasoning (Guo et al., 30 Nov 2025).
Domain disparity and modality: Scientific, legal, and multimodal reasoning remains considerably harder than general-domain reasoning, and human alignment in presentation and depth lags behind factual and structural metrics (Shi et al., 2 Oct 2025, Zeng et al., 2 Feb 2026).
Dynamic evidence and continual evolution: Most benchmarks are currently snapshot-based; efforts are underway to design protocols and corpora resilient to parametric knowledge drift, privacy constraints, and continual web changes (Coelho et al., 25 May 2025, Wang et al., 16 Oct 2025).

Future benchmarks and evaluation methodologies are expected to prioritize:

Integration of real-time fact-checking against evolving sources.
Deeper rubrics on interaction steps (e.g., search refinement, clarification dialogue).
Federated/multi-agent protocols with explicit role decomposition.
Long-horizon, multi-modal, and multi-lingual evaluation in dynamic web and private environments.

References: