DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Published 16 Apr 2026 in cs.AI | (2604.14683v1)

Abstract: Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR$^{3}$-Agent based on multiple state-of-the-art LLMs demonstrate that DR$^{3}$-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.

Abstract PDF Upgrade to Chat

Authors (19)

First 10 authors:

Summary

The paper introduces DR3-Eval, a benchmark that rigorously evaluates deep research agents in realistic, multimodal research tasks using static sandboxes.
It employs a divergent-convergent process to generate task-specific corpora with supportive, distractor, and noise documents, ensuring fine-grained evaluation.
Experimental results reveal persistent issues such as evidence hallucination and instruction-following challenges, highlighting the need for robust multimodal synthesis.

DR $^{3}$ -Eval: Advancing Multimodal, Realistic, and Reproducible Deep Research Evaluation

Motivation and Problem Formulation

Current state-of-the-art Deep Research Agents (DRAs) are expected to autonomously complete complex, long-horizon research workflows that extend far beyond traditional QA, encompassing strategic planning, multimodal information retrieval, iterative evidence evaluation, and structured, citation-grounded report generation. However, existing evaluation protocols for DRAs are either brittle (due to reliance on dynamic web sources and ambiguous queries) or insufficiently realistic (clean, text-only, or synthetic tasks), thus failing to simultaneously capture real-world complexity, reproducibility, and verifiability.

DR $^{3}$ -Eval seeks to resolve this by introducing a benchmark that is both grounded in authentic, user-provided, multimodal research artifacts and built upon a per-task, static research sandbox. This framework allows for rigorous, repeatable, fine-grained evaluation of DRAs’ abilities to retrieve, synthesize, and ground information amidst controlled distractors and noise.

DR $^{3}$ -Eval Framework and Architecture

The benchmark is constructed with three interlocking pillars: (1) task generation from real-world multimodal user files and queries devised via a divergent-convergent keyword mechanism, (2) building for each task an associated static sandbox corpus with realistic distributions of supportive, distractor, and noise documents, and (3) a multidimensional evaluation suite capturing both evidence acquisition and analytical report quality.

Figure 1: Overview of DR $^{3}$ -Eval, showing data construction, hierarchical multi-agent DRA design, and the evaluation protocol.

Data Construction and Task Grounding

The dataset is built by recruiting domain-diverse participants (undergraduate/graduate students) to provide multimodal research resources (text, pdf, spreadsheets, images, video, audio), sanitized for privacy. The topics span a broad disciplinary range across Technology, Economy, and Humanities, decomposed into 13 sub-domains, creating 100 tasks (50 English, 50 Chinese).

Figure 2: Distributions of domains, modalities, and file counts in DR $^{3}$ -Eval tasks.

A distinctive divergent-convergent process is used for search path engineering: high-level, conceptually varied “signal” (core evidence) and “noise” (distractor) keywords are generated. The retrieval corpus for each task is then populated via web-scale crawling using these keywords, followed by stringent manual curation and tripartite classification: supportive evidence, contextually misleading distractors, and noise. This enables controlled signal-to-noise adjustment and verifiable solution paths.

Sandbox Corpus and Semantic Distribution

The sandbox architecture provides a configurable, context-length-scaled evaluation environment enabling systematic analysis of DRA robustness under realistic information overload and ambiguity.

Figure 3: t-SNE projection demonstrates semantic diversity and clustering among supportive, distractor, and noise web pages.

Task Design and Quality Control

Queries are constructed in reverse from ground-truth documents, ensuring every research task admits a unique, fully evidence-grounded solution. Explicit controls eliminate tasks with shortcut solvability, ambiguous objectives, or direct answer retrievability—resulting in a high-purity, low-ambiguity benchmark.

Hierarchical Multi-Agent Architecture for DR $^{3}$ -Eval

The reference DRA system (DR $^{3}$ -Agent) is operationalized atop the MiroFlow multi-agent framework, with a perception-augmented main agent and lightweight sub-agents orchestrating sandbox retrieval (iterative dense retrieval with query refinement using ReAct-style prompting) and fine-grained multimodal file parsing.

This architecture is necessary to bridge the cross-modal, cross-source grounding required by DR $^{3}$ -Eval, allowing fused reasoning across user files and evidence in the sandbox corpus. The sub-agent memory separation and summarization mitigate main context bottlenecks, supporting efficient hierarchical decomposition of retrieval and synthesis subtasks.

Multidimensional Evaluation Protocol

DR $^{3}$ -Eval introduces a five-metric suite:

Information Recall (IR-UF/IR-SC): Granular coverage of atomic, manually validated insights from (UF) user files and (SC) sandbox corpus.
Citation Coverage (CC): Proportion of gold-standard required sources (supportive docs and user files) explicitly and correctly cited.
Factual Accuracy (FA): Automated, claim-by-claim verification against ground evidence (textual and non-textual).
Instruction Following (IF): Rubric-based checklist compliance derived from the explicit and implicit requirements of the query.
Depth Quality (DQ): LLM-expert-judged analytic and logical depth of the generated report.

Strong alignment between LLM-as-judge outcomes and human expert judgments is empirically demonstrated (Pearson $r=0.78$ , Spearman $^{3}$ 0).

Experimental Results and Failure Mode Analysis

Experiments benchmark competitive proprietary and open-source LLM-based DRAs (Claude Sonnet 4, GLM-4.7, Gemini-2.5-Pro, GPT-4.1, Qwen3-series, among others). Notable outcomes include:

High task difficulty and low model saturation: Even the best-performing model (Claude Sonnet 4) achieves only moderate scores, with significant performance degradation as context length (noise) increases.
Scaling law persistence: Larger model variants consistently outperform smaller counterparts.
Instruction following–factual accuracy decoupling: Several models (e.g., Qwen3-235B-A22B, GPT-4.1) “please” the rubric yet hallucinate or fail to retrieve sufficient supporting evidence, a clear indicator of shallow evidence integration.

Failure-type attribution reveals persistent hallucination as the dominant error, followed by retrieval and reasoning failures, highlighting the ongoing brittleness in LLMs’ evidence adherence, especially under complex, distractor-rich reasoning chains.

Benchmark Stability and Design Validity

Evaluation stability is confirmed through variance analysis, bootstrap re-ranking, and cross-judge consistency, all showing low volatility and high ranking robustness. Citation and evidence utilization in sandbox-only and live web settings are closely matched, validating the “closed world” design as a faithful proxy for real-web complexity.

Implications and Future Directions

DR $^{3}$ 1-Eval constitutes a step-change in DRA benchmarking by:

Enabling controlled, repeatable, and fine-grained evaluation of retrieval and synthesis under real-world multimodal, noisy, and ambiguous research contexts.
Illuminating critical failure modes—especially evidence hallucination—that are obscured by text-only or QA-style tasks.
Providing a framework extensible to other research domains, modalities, and agentic architectures, catalyzing the robust, transparent development of research agents.
Supporting the development and assessment of new agent architectures for context management, evidence routing, and cross-modal reasoning—areas where current approaches show persistent failure.

DR $^{3}$ 2-Eval also advances the conversation on environmental and privacy aspects of LLM evaluation, encouraging sustainable, static benchmarking practices and rigorous data de-identification in multimodal settings.

Conclusion

DR $^{3}$ 3-Eval (2604.14683) sets a new bar for realistic, reproducible, and multimodal benchmarks for DRAs in the report-generation paradigm. Its hybrid of user-grounded, authentically complex tasks, static yet semantically-rich sandboxes, and comprehensive evaluation signals critical research directions for agent robustness, hallucination mitigation, and cross-modal reasoning. It will serve as a foundation for advancing the evaluation and reliable deployment of truly autonomous research systems.