Dr.Mi-Bench: Deep Research Agent Benchmark
- Dr.Mi-Bench is a modular-integrated benchmark designed to evaluate scientific deep research agents across planning, retrieval, and reasoning modules.
- It features a human-annotated dataset of 200 instances across 10 scientific disciplines, enabling granular diagnosis of agent performance and domain variability.
- Its dual-mode evaluation framework, with end-to-end and isolated analyses, highlights failure modes and directs improvements in agent planning and multi-source retrieval.
Dr.Mi-Bench is a modular-integrated benchmark specifically designed to evaluate scientific deep research (DR) agents. Addressing the limitations of prior benchmarks—which often focus solely on retrieval and general domains—Dr.Mi-Bench assesses the full deep-research stack: high-level planning, targeted retrieval, and multi-hop reasoning, with a focus on scientific literature. Its modular architecture and rigorous annotation pipeline enable granular diagnosis of DR agent capabilities and failure modes across diverse scientific disciplines (Guo et al., 30 Nov 2025).
1. Dataset Construction and Task Taxonomy
Dr.Mi-Bench comprises a human-annotated dataset of 200 research instances spanning 10 scientific disciplines: Materials Science, Finance, Chemistry, Computer Science, Medicine, Biology, Environmental Science, Energy, Building and Construction, and Earth Science. Each discipline contributes 20 instances, ensuring balanced cross-domain representation.
Tasks fall into two types: research-style and review-style. Research tasks focus on deep analysis of a single primary source, with a gold evidence set restricted to the source paper’s DOI. Review tasks require breadth, mandating multi-source retrieval where the gold evidence set consists of the full bibliography of the prompt paper (usually dozens of DOIs or arXiv identifiers). This dual taxonomy enables benchmarking of both depth-centric and breadth-centric agent behaviors.
The annotation schema is as follows:
- Each source paper’s abstract is translated into an open-ended, domain-abstracted research question.
- A high-level gold plan of 5–10 sub-tasks is drafted by one expert and adjudicated by another, producing a multi-step chain-of-thought (from background to conclusion).
- The gold evidence set is assembled by combining automated retrieval (CrossRef, Semantic Scholar) and manual curation.
- Diagnostic labels consist of 15–20 declarative boolean statements, each marked true/false by dual annotators (overall inter-annotator κ = 0.89).
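To make the schema concrete, a minimal sketch of how a single instance might be represented is shown below; the field names and types are illustrative assumptions, not the benchmark’s released format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DrMiInstance:
    """Illustrative layout of one Dr.Mi-Bench instance (field names are assumed)."""
    discipline: str                   # one of the 10 disciplines, e.g. "Materials Science"
    task_type: str                    # "research" (single-source) or "review" (multi-source)
    question: str                     # open-ended, domain-abstracted research question
    gold_plan: List[str]              # 5-10 expert-adjudicated sub-tasks
    gold_evidence: List[str]          # DOIs/arXiv IDs: one DOI (research) or the full bibliography (review)
    diagnostic_statements: List[str]  # 15-20 declarative statements
    diagnostic_labels: List[bool]     # dual-annotated true/false labels
```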
2. Modular-integrated Evaluation Framework (Dr.Mi-Eval)
Dr.Mi-Eval is a modular, two-mode evaluation suite that isolates and integrates the essential competencies of DR agents: planning (π), retrieval (ρ), and reasoning (σ).
Given a query q, a corpus C, a toolset T, and a budget B, a DR agent A executes R = A(q, C, T, B), with A decomposed as A = σ ∘ ρ ∘ π:
- π: transforms q into a sequence of sub-tasks Π (planning).
- ρ: selects and retrieves an evidence set E based on Π (retrieval).
- σ: produces the final report R from E (reasoning).
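A minimal Python sketch of this π/ρ/σ decomposition, assuming simple callable signatures for the three modules (the types and names are illustrative, not the paper’s interface):

```python
from typing import Callable, List, Sequence

Plan = List[str]       # ordered sub-tasks produced by the planner
Evidence = List[str]   # identifiers or passages produced by the retriever

def run_agent(
    query: str,
    corpus: Sequence[str],
    plan_fn: Callable[[str], Plan],                          # pi: query -> plan
    retrieve_fn: Callable[[Plan, Sequence[str]], Evidence],  # rho: plan -> evidence
    reason_fn: Callable[[str, Evidence], str],               # sigma: evidence -> report
) -> str:
    """End-to-end execution: report = sigma(rho(pi(query), corpus))."""
    plan = plan_fn(query)
    evidence = retrieve_fn(plan, corpus)
    return reason_fn(query, evidence)
```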
Two evaluation modes are employed:
- End-to-End: All modules operate freely (reflecting full agent performance).
- Isolated: Each module is evaluated in turn, holding inputs fixed to gold references, which disentangles error propagation.
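The isolated mode can be sketched as follows, reusing the module signatures above; `instance` is assumed to expose question, gold_plan, and gold_evidence fields, which is an assumption about the data layout rather than the benchmark’s actual API.

```python
def evaluate_isolated(instance, plan_fn, retrieve_fn, reason_fn, corpus):
    """Isolated mode: each module receives gold inputs so upstream errors cannot propagate."""
    pred_plan = plan_fn(instance.question)                              # pi, judged against instance.gold_plan
    pred_evidence = retrieve_fn(instance.gold_plan, corpus)             # rho, conditioned on the gold plan
    pred_report = reason_fn(instance.question, instance.gold_evidence)  # sigma, over gold evidence
    return pred_plan, pred_evidence, pred_report
```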
3. Metrics and Diagnostics
Performance is assessed via information retrieval metrics:
- For predicted set P and gold set G:
- True Positives (TP): |P ∩ G|
- False Positives (FP): |P \ G|
- False Negatives (FN): |G \ P|
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 Score: 2 · Precision · Recall / (Precision + Recall)
- Accuracy (Jaccard index): |P ∩ G| / |P ∪ G| (for planning and retrieval)
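These set-based metrics can be computed directly over identifier sets (e.g., predicted versus gold DOIs); the sketch below assumes plain string identifiers and is not the benchmark’s official scoring code.

```python
from typing import Dict, Iterable

def set_metrics(predicted: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Precision, recall, F1, and Jaccard accuracy over predicted vs. gold sets."""
    P, G = set(predicted), set(gold)
    tp, fp, fn = len(P & G), len(P - G), len(G - P)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = len(P & G) / len(P | G) if (P | G) else 0.0   # Jaccard index
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```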
Module-specific diagnostics:
- Planning (π): LLM-judge comparison of predicted and gold plans for coverage, redundancy, and structural integrity.
- Retrieval (ρ): Exact match on source DOI or a 20-character title prefix to validate provenance (a matching sketch follows this list).
- Reasoning (σ): Reports are scored by matching their assertions to the gold statements D, aggregating Accuracy, Precision, Recall, and F1.
- Efficiency: Latency-versus-accuracy trade-offs and token-count/throughput analyses.
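The retrieval diagnostic (exact DOI match or a 20-character title prefix) could be implemented roughly as below; the lower-casing and whitespace normalization are assumptions, not specified by the benchmark.

```python
def matches_gold(pred_doi: str, pred_title: str, gold_doi: str, gold_title: str) -> bool:
    """Provenance check: exact DOI match, otherwise matching 20-character title prefix."""
    def norm(s: str) -> str:
        return " ".join((s or "").lower().split())
    if norm(pred_doi) and norm(pred_doi) == norm(gold_doi):
        return True
    return bool(norm(pred_title)) and norm(pred_title)[:20] == norm(gold_title)[:20]
```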
4. Experimental Findings and Failure Analyses
Evaluation covered four commercial DR agents (OpenAI o3, Gemini-2.5-Pro DR, Perplexity Sonar Pro DR, xAI Grok 3), the Search-r1 baseline, and foundational LLMs (GPT-o3, Claude-3.7, Grok-4, Qwen, Llama-3), with GPT-4o as the automated judge (agreement 93% with Gemini-2.5-Pro).
| Model | Planning F1 (%) | Retrieval Acc. (%) | Reasoning F1 (%) |
|---|---|---|---|
| OpenAI | 25.33 | 31.35 | 47.45 |
| Gemini-DR | 25.33 | 36.77 | 47.66 |
| Perplexity | 26.45 | 34.82 | 39.13 |
| Grok | 25.09 | 39.81 | 30.10 |
| Search-r1 | – | 24.26 | 24.08 |
Principal observations:
- Planning: Uniformly low F1 scores for all agents (~25%), even among those with explicit planning subsystems.
- Retrieval: High for research tasks (62–76%) but drastically lower (<4%) for review-style multi-source retrieval.
- Reasoning: Middling scores for Gemini and OpenAI (≈59% accuracy, 40% recall); strong retrieval does not guarantee synthesis (e.g., Grok: ≈76% retrieval but ≈31% reasoning).
- Domain Variability: Accuracy differs by >20 percentage points between fields—highest in Medicine (≈68%), Biology, Environmental Science; lowest in Finance (≈45%), Materials (≈56%), Earth Science—suggesting pretrained model domain biases and intrinsic disciplinary complexity.
- Efficiency Trade-offs: Grok offers the best balance of retrieval speed and accuracy (≈76 s at 39.8% accuracy). For reasoning, OpenAI delivers the highest accuracy (59%) but at the highest latency; Gemini achieves comparable accuracy with greater token efficiency.
- Failure Modes: Most salient are poor sub-task planning, inability to retrieve multi-source evidence for reviews, and decreased generalization in underrepresented domains.
A crucial ablation (“Gold Plan Injection”) showed that substituting predicted plans with gold-standard plans (Π*) in the agent pipeline raises OpenAI’s reasoning accuracy from ≈59% to ≈71%, identifying high-level planning as a primary performance bottleneck.
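In terms of the pipeline sketched earlier, the ablation amounts to bypassing π and feeding the gold plan Π* downstream (an illustrative rendering, not the paper’s code):

```python
def run_with_gold_plan(instance, retrieve_fn, reason_fn, corpus):
    """Gold Plan Injection: skip the planner and condition rho and sigma on the gold plan."""
    evidence = retrieve_fn(instance.gold_plan, corpus)  # rho receives the gold plan instead of a predicted one
    return reason_fn(instance.question, evidence)       # sigma is unchanged
```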
5. Benchmark Implications and Recommendations
Dr.Mi-Bench enables actionable diagnostics for DR agent development:
- Emphasize Planning: Invest in dedicated prompt-tuning, plan-generation finetuning, and incorporate symbolic/structured planning backbones that can be cross-validated against paper outlines.
- Multi-Source Retrieval: Integrate citation-aware retrievers and iterative retrieval loops (plan → retrieve → plan-refine → retrieve; see the sketch after this list) to address the multi-source gap in review tasks.
- Domain Generalization: Expand training corpora and prompt libraries to account for less-represented fields (e.g., Finance, Materials) and improve cross-domain robustness.
- Modular Evaluation: Apply Dr.Mi-Eval’s isolated mode during model development to localize errors to planning, retrieval, or reasoning subsystems.
- Future Directions: Extend the benchmark to include less-cited pre-2024 works, add modules that support iterative self-correction (reflection), and scale diagnostics to multi-lingual and multimodal scientific content.
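The iterative retrieval loop suggested for review-style tasks might look like the following sketch; `refine_fn` is a hypothetical helper that revises the plan in light of evidence gathered so far.

```python
def iterative_retrieve(question, plan_fn, retrieve_fn, refine_fn, corpus, rounds=2):
    """Plan -> retrieve -> plan-refine -> retrieve loop for breadth-oriented (review) tasks."""
    plan = plan_fn(question)
    evidence = []
    for _ in range(rounds):
        evidence.extend(retrieve_fn(plan, corpus))   # gather evidence for the current plan
        plan = refine_fn(question, plan, evidence)   # revise the plan given what was found
    return plan, evidence
```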
6. Significance within Research Agent Development
By diagnosing granular failure modes in high-level planning, evidence retrieval, and reasoning, Dr.Mi-Bench provides a rigorous testbed and prescriptive roadmap for building reliable, domain-specialized scientific research agents. Its modular design facilitates both holistic and component-level evaluations, supporting both descriptive benchmarking and targeted system improvement. The benchmark’s focus on breadth (ten diverse domains) and depth (multi-step planning with human adjudication) establishes it as a reference point for the next generation of automated academic research assistants (Guo et al., 30 Nov 2025).