MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Published 30 Mar 2026 in cs.AI and cs.CL | (2603.28407v1)

Abstract: Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.

Abstract PDF Upgrade to Chat

Authors (22)

First 10 authors:

Summary

The paper introduces a comprehensive evaluation framework for multimodal research agents that benchmarks both process and outcome using real-user and LLM-generated tasks.
It employs a three-layer evaluation pipeline assessing synthesis quality, factuality via multimodal evidence verification, and process transparency through atomic operation tracking.
Empirical results reveal that process metrics strongly correlate with overall performance and highlight the challenges posed by multimodality in maintaining report specificity.

MiroEval: A Comprehensive Evaluation Framework for Multimodal Deep Research Agents

Motivation and Benchmark Design

The proliferation of agentic research systems—embodied by autonomous LLM-based agents capable of long-context reasoning, iterative information acquisition, and multimodal evidence integration—necessitates diagnostic benchmarks that rigorously evaluate both outcome and process. Prevailing benchmarks focus predominantly on static, rubric-based assessment of terminal outputs, often restricting their scope to text-only contexts or synthetic queries, thus failing to measure the core competencies demanded by real users: rigorous, factually grounded, auditable research workflows capable of leveraging text and multimodal information sources.

MiroEval introduces a benchmark suite and evaluation paradigm addressing these limitations through three axes: (i) real-user, continuously updatable task suite construction, (ii) multidimensional, agentic evaluation encompassing synthesis quality, factuality, and process transparency, and (iii) comprehensive support for multimodal research tasks, including attachment-grounded analysis.

Task Suite Construction and Coverage

To ensure task diversity, authenticity, and temporal relevance, MiroEval employs a dual-path pipeline for query generation (Figure 1):

Figure 1: Query construction pipeline depicting the paths for user-derived and auto-generated queries, privacy-preserving rewriting, and multi-stage filtering.

The benchmark comprises 100 deep research tasks (70 text-only, 30 multimodal) spanning 12 domains and 10 task archetypes (Figure 2). The first path curates 65 queries via privacy-preserving abstraction and rewriting of authentic user interactions, stratified by difficulty and evaluation feature representation. The second path generates 35 queries using LLM-guided distillation of emergent real-world trends—anchored in live web headlines and filtered for research necessity, information-seeking complexity, and parametric knowledge limitations.

Figure 2: (a) Joint domain×task-type distribution, (b) domain distribution, (c) task-type distribution for all 100 tasks.

Tasks encompass major verticals (technology, finance, science, engineering, healthcare, etc.) and research patterns (comparative analysis, causal explanation, data analysis, code synthesis, etc.). Rigorous human verification achieves 92.0% majority-vote precision, and construction protocols enable periodic refresh, preventing benchmark staleness.

Multi-Dimensional Agentic Evaluation Pipeline

MiroEval formalizes evaluation as a three-layer process, decoupling report artifact assessment from research trajectory audit (Figure 3):

Figure 3: The evaluation pipeline—spanning synthesis quality, factual verification, and process-centric audit.

1. Adaptive Synthesis Quality Evaluation.

Task-specific rubrics are generated dynamically, augmenting four universal dimensions (coverage, insight, instruction-following, clarity) with context-grounded axes (e.g., attachment-grounded specificity, evidence-based analytical rigor). For multimodal queries, the framework enforces alignment with extracted key facts from attachments, penalizing decontextualized or invented detail. Dimension and criterion weights are learned per task, and report scores are aggregated over this weighted multidimensional space.

2. Agentic Factuality Verification.

Reports are decomposed into atomic claims via agent-driven extraction. Each claim is checked against both external web resources and task-provided multimodal evidence (PDFs, images, spreadsheets), employing both native multimodal model reasoning and retrieval-augmented chunk analysis. Consistency is labeled as RIGHT, WRONG, CONFLICT (cross-source contradiction), or UNKNOWN, explicitly tracking cross-modal and inter-source discrepancies. This enables isolation of factuality failure modes not detectable in purely text-based settings.

3. Process-Centric Evaluation.

Beyond output, MiroEval audits research procedures. Raw system trajectories are abstracted into structured process graphs: atomic operations (retrieval, reading, synthesis, hypothesis generation, contradiction handling), dependencies, and surfaced findings. Intrinsic process metrics include search breadth, analytical depth, iterative refinement, critical thinking, and redundancy minimization. The framework assesses process–report alignment bidirectionally: all report conclusions must be traceable to procedural findings (R→P) and key procedural findings must appear in the final report (P→R). Contradiction handling is separately measured.

Systematic Results and Empirical Insights

Figure 4 visualizes performance on 70 representative text-only tasks across Synthesis, Factuality, and Process:

Figure 4: Comparative model performance on 70 text-only tasks across three evaluation dimensions.

Key empirical findings include:

Non-Redundancy Across Dimensions. System rankings are non-uniform across Synthesis, Factuality, and Process. Notably, Kimi-K2.5 achieves the highest non-MiroThinker Synthesis but ranks near the bottom in Factuality, while the reverse occurs for Manus-1.6-Max Wide Research.
Process Quality as a Predictor. Strong alignment is observed between process-centric scores and overall system outcome (Pearson's ρ = 0.88), underscoring the necessity of process-based evaluation for robust agent assessment.
Multimodality Amplifies Challenge. Introduction of attachments causes a consistent 3–10 point drop in overall scores across evaluated systems, mainly driven by declines in report specificity and process depth, not factuality.

The MiroThinker series demonstrates balanced performance, with H1 achieving the highest overall and subdimension scores. Noteworthy is the series' capacity to maintain high claim volume and low error rates simultaneously, breaking the traditionally observed precision–volume tradeoff (Figure 5):

Figure 5: Left—synthesis quality vs. factuality, showing weak correlation. Right—statement count vs. right ratio, indicating a negative correlation except for top-tier systems.

Further analyses reveal:

User-Derived vs. Auto-Generated Tasks: User-derived queries are systematically more difficult, yet relative system rankings are stable, supporting robust, extensible benchmark construction.
Systematic Failure Modes: The main bottlenecks are lack of report specificity, shallow analytical reasoning in the research process, and traceability gaps (unlinked report claims).
Evaluation Robustness: Human and alternative-LM judges reproduce ranking with Kendall’s τ = 0.91, and cross-configuration evaluation yields <2 point standard deviations.

Implications and Future Directions

MiroEval defines a comprehensive standard for benchmarking deep research agents, enabling phase-aligned optimization and fine-grained diagnosis. Multimodal research challenges remain acute—integration of attachment-grounded reasoning, process-traceable synthesis, and contradiction resolution over heterogeneous and evolving corpora are open problems.

Practically, MiroEval provides actionable diagnostics for model development pipelines, guiding improvements in multimodal understanding, procedural transparency, and fact-grounded synthesis. The benchmark’s live-refresh feature suggests a pathway for model–benchmark co-evolution, reducing overfitting and ensuring continued relevance to emergent user needs.

Theoretically, the work exposes fundamental deficiencies in current agent architectures—particularly around process–output decoupling, factual discipline, and alignment under ambiguity—directing future research toward architectures that optimize not just for output quality but transparent, auditable research workflows. Methodological advances are still needed in multi-agent conflict adjudication, adaptive rubric construction, and agentic long-context trajectory modeling.

Conclusion

MiroEval establishes a rigorous agentic evaluation paradigm for deep research, integrating user-aligned, multimodal, and temporally adaptive benchmarks with multidimensional process/outcome auditing. The frameworks and empirical baselines presented provide essential infrastructure for advancing research agent capabilities, system transparency, and trustworthiness at the intersection of language, reasoning, and multimodal understanding (2603.28407).

Markdown Report Issue