MiroEval Benchmark Framework

Updated 3 July 2026

MiroEval is a comprehensive benchmarking framework that evaluates multimodal deep research agents by jointly assessing synthesis quality, factuality, and process efficiency.
It employs a dual-path task generation strategy combining user-derived and auto-generated tasks with periodic refresh to maintain real-world relevance.
Comparative findings indicate that while synthesis and process quality correlate, multimodal tasks present unique challenges in grounding and factuality.

MiroEval is a comprehensive benchmarking and evaluation framework for multimodal deep research agents, designed to address key deficiencies in prior evaluation protocols by jointly assessing process quality, factuality, and adaptive, task-specific synthesis. Its architecture is grounded in real user needs, supports periodic task refresh for temporal relevancy, and offers a multidimensional diagnostic instrument for the research community developing agentic, retrieval-augmented, or multimodal LLMs (Ye et al., 30 Mar 2026).

1. Benchmark Construction and Data Sourcing

MiroEval consists of 100 deep-research tasks (70 text-only, 30 multimodal), generated through a dual-path pipeline to ensure both authenticity and updateability.

User-Derived Path: Tasks are sourced from anonymized MiroMind query patterns. A privacy-preserving rewriting stage combines rule-based and model-assisted filters to remove sensitive content, supplemented by automatic entity anonymization. Each rewritten query is classified by an LLM across seven axes (attachment type, domain, complexity, and eight evaluation features, e.g., Factuality, Multimodal Understanding), then routed via six difficulty-stratified rewrite strategies to achieve broad capability coverage and quota balancing.

Auto-Generated Path: This path leverages web trends for topic coverage. For each of twelve topics (with three subtopics each), recent headlines are fetched via the Serper API. A model generates 180 candidate queries, which undergo (1) search validation (requiring ≥3 search results from ≥2 domains), (2) LLM-estimated necessity of external research, and (3) inverse quality filtering to discard queries answerable via parametric knowledge. The retained set is then curated by experts for diversity.
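The search-validation check in step (1) can be sketched as follows. This is a minimal illustration, not the benchmark's code: the function name and its parameters are hypothetical, and a "domain" is approximated by the URL's network location, so `www.example.com` and `example.com` would count as different domains.

```python
from urllib.parse import urlparse


def passes_search_validation(result_urls, min_results=3, min_domains=2):
    """Return True if a candidate query has enough search results
    spread over enough distinct domains (thresholds from the text)."""
    # Collect distinct hosts; URLs without a host are ignored for the domain count.
    domains = {urlparse(url).netloc.lower() for url in result_urls}
    domains.discard("")
    return len(result_urls) >= min_results and len(domains) >= min_domains


# Illustrative inputs only; these URLs are placeholders.
ok = passes_search_validation(
    ["https://a.example/1", "https://a.example/2", "https://b.example/x"]
)
single_domain = passes_search_validation(
    ["https://a.example/1", "https://a.example/2", "https://a.example/3"]
)
```

Here `ok` is accepted, while `single_domain` is rejected because all three results come from one domain.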

Task Refresh: Both pipelines are periodically re-executed as user logs grow or web trends shift. This ensures MiroEval remains robust against overfitting and static benchmarking.

2. Multidimensional Evaluation Suite

MiroEval evaluates deep research agents along three orthogonal axes:

Adaptive Synthesis Quality: For each query $Q=(I, A)$, the scoring rubric $D$ is the union of $D_{\mathrm{fixed}}$ (Coverage, Insight, Instruction-Following, Clarity) and task-conditioned $D_{\mathrm{dynamic}}(Q)$ (typically, 1–3 domain expertise dimensions or, for multimodal tasks, “Grounding” and expertise). Multimodal attachments are parsed for atomic key facts, guiding precise grounding criteria. An LLM assigns mutually justified dimension weights $W_d$ and per-criterion weights $w_{d, c}$ under each dimension. Synthesis criteria $s_{d, c}$ for a report $R$ on $Q$ are scored in $[0, 10]$, aggregated as

$$S(R \mid Q) = \sum_{d \in D} W_d \sum_{c} w_{d, c}\, s_{d, c}$$
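A minimal sketch of the two-level weighted aggregation of criterion scores, using the dimension weights $W_d$, criterion weights $w_{d, c}$, and scores $s_{d, c}$ defined above. The function name, the dictionary layout, and the choice to normalize both weight levels so they sum to 1 are assumptions made for illustration.

```python
def synthesis_score(dim_weights, crit_weights, scores):
    """Aggregate criterion scores s[d][c] in [0, 10] into one score.

    dim_weights:  {dimension: W_d}
    crit_weights: {dimension: {criterion: w_dc}}
    scores:       {dimension: {criterion: s_dc}}
    Weights are normalized here (an assumption), so the result stays in [0, 10].
    """
    total_dim_weight = sum(dim_weights.values())
    total = 0.0
    for dim, dim_weight in dim_weights.items():
        weights = crit_weights[dim]
        weight_sum = sum(weights.values())
        # Weighted mean of the criteria within this dimension.
        dim_score = sum(w * scores[dim][c] for c, w in weights.items()) / weight_sum
        total += (dim_weight / total_dim_weight) * dim_score
    return total


# Illustrative numbers only, not taken from the benchmark.
example = synthesis_score(
    {"Coverage": 0.5, "Insight": 0.5},
    {"Coverage": {"c1": 0.5, "c2": 0.5}, "Insight": {"c1": 1.0}},
    {"Coverage": {"c1": 8, "c2": 6}, "Insight": {"c1": 9}},
)
```

In this example Coverage averages to 7 and Insight to 9, so the equally weighted result is 8.0.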

Agentic Factuality Verification: Outputs are decomposed into verifiable statements $\{c_i\}_{i=1}^{N}$. Evidence $E_i$ is retrieved both from web search and from the provided attachments; attachments are processed either natively (for images, PDFs, and plain text, using multimodal LLMs) or via retrieval-augmented chunking/indexing (for non-native formats, e.g., spreadsheets). Each $c_i$ receives a label $y_i \in \mathcal{Y}$. Factuality is tallied as the share of statements carrying the confirming label $y^{+} \in \mathcal{Y}$:

$$\mathrm{Fact}(R) = \frac{\left|\{\, i : y_i = y^{+} \,\}\right|}{N}$$
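A minimal sketch of tallying factuality from per-statement labels. It assumes the score is the percentage of statements whose label confirms them; the label names (`supported`, `refuted`, `unverifiable`) and the 0–100 scale are hypothetical and are not taken from the source.

```python
from collections import Counter


def factuality_score(labels, positive="supported"):
    """Percentage of statements whose label equals the confirming label."""
    if not labels:
        return 0.0  # no verifiable statements; a convention chosen for this sketch
    counts = Counter(labels)
    return 100.0 * counts[positive] / len(labels)


# Illustrative labels for four extracted statements.
demo_labels = ["supported", "supported", "refuted", "unverifiable"]
demo_score = factuality_score(demo_labels)
```

With two of four statements confirmed, `demo_score` is 50.0.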

Process-Centric Evaluation: Raw agent logs are parsed into atomic units (search, read, analyze, plan, revise), their interdependencies, and process findings. “Intrinsic quality” ($P_{\mathrm{int}}$) averages scores over Search Breadth, Analytical Depth, Progressive Refinement, Critical Thinking, and Efficiency. “Alignment quality” ($P_{\mathrm{align}}$) takes the quadratic mean of three alignment scores: Process→Report coverage, Report→Process traceability, and Contradiction detection. Overall process score is

$$P = \lambda\, P_{\mathrm{int}} + (1 - \lambda)\, P_{\mathrm{align}}$$

for $\lambda \in [0, 1]$.
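The process score can be sketched as below, under stated assumptions: intrinsic quality is an arithmetic mean, alignment quality is a quadratic mean (root mean square), and the two are combined by a mixing weight. The function names and the default weight `lam=0.5` are illustrative choices, not values from the source.

```python
import math


def intrinsic_quality(scores):
    """Arithmetic mean over the five intrinsic dimensions."""
    return sum(scores) / len(scores)


def alignment_quality(scores):
    """Quadratic mean (root mean square) over the three alignment scores."""
    return math.sqrt(sum(s * s for s in scores) / len(scores))


def process_score(intrinsic, alignment, lam=0.5):
    """Convex combination of the two parts; lam is an assumed mixing weight."""
    return lam * intrinsic_quality(intrinsic) + (1 - lam) * alignment_quality(alignment)


# Illustrative scores only.
p_int = intrinsic_quality([8, 6, 7, 9, 5])
p_align = alignment_quality([6, 8, 10])
p_total = process_score([8, 6, 7, 9, 5], [6, 8, 10])
```

The quadratic mean is never smaller than the arithmetic mean of the same scores, so it rewards a high alignment score slightly more than a plain average would.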

3. Implementation and Operational Infrastructure

Retrieval Tools: MiroEval directly integrates with live web search APIs and supports direct multimodal LLM analysis for native-format attachments. Non-native attachments are indexed as text using chunking and BM25-style vector search for targeted retrieval.
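A minimal sketch of indexing a non-native attachment as text: split it into overlapping word windows, then rank the chunks against a query. It uses the standard lexical BM25 scoring function; the source does not specify the exact retriever, and the chunk sizes, `k1`, `b`, and function names are assumptions.

```python
import math
from collections import Counter


def chunk_text(text, size=50, overlap=10):
    """Split text into overlapping word windows (requires size > overlap)."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def bm25_rank(query, chunks, k1=1.5, b=0.75):
    """Return chunk indices sorted by descending BM25 score."""
    docs = [chunk.lower().split() for chunk in chunks]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of chunks containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Illustrative chunks, e.g. rows exported from a spreadsheet.
demo_chunks = [
    "revenue grew in the third quarter",
    "the office moved to a new building",
    "quarterly revenue by region",
]
demo_ranking = bm25_rank("revenue quarter", demo_chunks)
```

The first chunk matches both query terms and ranks first; the second matches neither and ranks last.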

Agentic Loop for Factuality: For each claim, the agentic factuality component iteratively generates queries, fetches evidence, updates its belief, and stops based on a confidence threshold or step cap.
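The loop described above can be sketched as follows. The three callables stand in for the query generator, the retriever, and the belief update, which in practice would be LLM or search calls; their names, the neutral starting belief of 0.5, and the default threshold and step cap are all hypothetical.

```python
def verify_claim(claim, generate_query, fetch_evidence, update_belief,
                 threshold=0.9, max_steps=5):
    """Iterate query -> evidence -> belief update until confident or capped.

    Returns (belief, steps_used). belief is the estimated probability that
    the claim is true; the loop stops once it is confidently high or low.
    """
    belief = 0.5  # neutral prior (an assumption)
    evidence = []
    for step in range(1, max_steps + 1):
        query = generate_query(claim, evidence)
        new_evidence = fetch_evidence(query)
        evidence.extend(new_evidence)
        belief = update_belief(belief, new_evidence)
        # Stop early when the belief is decisive in either direction.
        if belief >= threshold or belief <= 1 - threshold:
            return belief, step
    return belief, max_steps


# Stub components for illustration: each round of evidence raises belief by 0.2.
confident = verify_claim(
    "example claim",
    generate_query=lambda claim, seen: claim,
    fetch_evidence=lambda query: ["snippet"],
    update_belief=lambda belief, ev: min(1.0, belief + 0.2),
    threshold=0.85,
)
# With uninformative evidence the loop runs until the step cap.
capped = verify_claim(
    "example claim",
    generate_query=lambda claim, seen: claim,
    fetch_evidence=lambda query: [],
    update_belief=lambda belief, ev: belief,
)
```

In the first call the belief crosses the threshold on the second step; in the second call it never moves, so the step cap ends the loop.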

Process Structuring: Log sequence parsing leverages LLM classification to segment actions, extract operation metadata (type, timestamp, I/O), assemble a dependency graph over steps, and extract high-novelty “analyze” findings.
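One way to represent the parsed log is a list of typed steps with explicit dependencies, from which a dependency-respecting order and the “analyze” findings can be read off. This is a sketch under assumptions: the `Step` fields and helper names are hypothetical, and the LLM classification that would produce these records is not shown.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Step:
    step_id: str
    kind: str            # one of: search, read, analyze, plan, revise
    timestamp: float
    depends_on: list = field(default_factory=list)
    output: str = ""


def execution_order(steps):
    """Order step ids so that every step follows the steps it depends on."""
    graph = {step.step_id: set(step.depends_on) for step in steps}
    return list(TopologicalSorter(graph).static_order())


def analyze_findings(steps):
    """Collect the outputs of 'analyze' steps as process findings."""
    return [step.output for step in steps if step.kind == "analyze" and step.output]


# Illustrative log with four steps.
demo_steps = [
    Step("s1", "plan", 0.0),
    Step("s2", "search", 1.0, depends_on=["s1"]),
    Step("s3", "read", 2.0, depends_on=["s2"]),
    Step("s4", "analyze", 3.0, depends_on=["s2", "s3"], output="Source X contradicts source Y"),
]
demo_order = execution_order(demo_steps)
demo_findings = analyze_findings(demo_steps)
```

`TopologicalSorter` raises `CycleError` if the extracted dependencies are circular, which is a cheap sanity check on the parsed log.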

Evaluation Automation: Synthesis, process, and factuality are independently judged using large proprietary models (e.g., GPT-5.1, GPT-5.2, GPT-5-mini), ensuring consistency and scalability.

4. Experimental Findings and Comparative Results

Thirteen systems were evaluated on both text-only and multimodal settings, with three main findings:

The three evaluation axes are complementary: “Kimi-K2.5” scored highly on Insight (79.8) but trailed on factuality (65.4), while “Manus-1.6-Max” reversed this pattern, indicating that synthesis and factuality capture distinct model behaviors.
Process quality is a reliable proxy for holistic agent capability, as measured by the Pearson correlation $r$ between overall outcome and process scores. Systems ranking top on process (MiroThinker-H1, OpenAI Deep Research) also rank top on overall outcome, even if they do not always lead in single dimensions.
Multimodal tasks are systematically harder: moving from text-only to multimodal tasks lowered synthesis scores by about 6 points on average and overall scores by 3–10 points. Factuality, however, declined by only 0.2 points, indicating that attachment “grounding” and multimodal reasoning are the core challenge.

Representative overall results (averaged across synthesis, factuality, and process dimensions):

Model	Text-Only (Overall)	Multimodal (Overall)	Aggregate
MiroThinker-H1	77.5	74.5	76.6
MiroThinker-1.7	74.3	71.6	74.3
OpenAI Deep Research	74.8	70.2	74.8
Gemini-3.1-Pro Deep Research	69.3	68.1	69.3
Claude-Opus-4.6 Research	67.3	66.4	67.3

5. Reproducibility, Human Validation, and Robustness

MiroEval’s benchmark construction and LLM-based scoring are validated via multiple experiments:

Human Verification: Three annotators evaluated 50 queries. Agreement was measured with Fleiss’ κ for both validity and non-triviality, and majority-vote precision was 92.0%, supporting construct validity.
Stability: Repeated runs with the same GPT judge (3×) yield an identical ranking; prompt variants change scores by at most 2 points and preserve the ranking; switching judges (Gemini vs. GPT) shifts absolute scores by roughly 13–17 points but leaves the ranking unchanged (Kendall’s τ = 1). Human and MiroEval system rankings agree with Kendall’s τ = 0.91.
Periodic Refresh: Both the user-derived and auto-generated task pipelines can be re-executed on new log data or trend topics. Transparent LLM-based filtering prevents staleness and preserves benchmark relevance.
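Ranking agreement of the kind reported above is measured with Kendall’s τ. A minimal sketch for two rankings of the same systems without ties (the tau-a variant) is shown here; the function name and the placeholder system names are illustrative.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties).

    Each ranking is a list of item names ordered from best to worst,
    and must contain at least two items.
    """
    position_b = {name: index for index, name in enumerate(rank_b)}
    concordant = discordant = 0
    # combinations() yields pairs (x, y) with x ranked above y in rank_a.
    for x, y in combinations(rank_a, 2):
        if position_b[x] < position_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Placeholder system names, not the evaluated systems.
same = kendall_tau(["A", "B", "C", "D"], ["A", "B", "C", "D"])
one_swap = kendall_tau(["A", "B", "C", "D"], ["A", "C", "B", "D"])
reversed_order = kendall_tau(["A", "B", "C", "D"], ["D", "C", "B", "A"])
```

Identical rankings give τ = 1 and fully reversed rankings give τ = −1. A single adjacent swap among four systems flips one of six pairs, giving (5 − 1) / 6 ≈ 0.67.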

6. Implications, Limitations, and Future Directions

The multidimensionality of MiroEval precludes reliance on a single metric; polished synthesis often does not correspond to factual reliability, and process audits expose latent weaknesses unobserved in static report evaluation. As multimodal task complexity rises, robust grounding and fact retrieval become primary bottlenecks.

Three implications follow directly:

Benchmarks must jointly assess both process and outcome to prevent overconfident deployment of systems that can generate plausible but ungrounded outputs.
Deep research agents should integrate multimodal input early, using attachment parsing to constrain downstream retrieval and synthesis.
Analytical depth, efficiency, and advanced multimodal understanding are critical targets for future agent development to close the observed multimodal performance gap.

A plausible implication is that continual refresh—enabled by both log-derived and web-trend-driven task generation—will remain essential as system capabilities and application domains evolve. MiroEval’s design as a live, evolving benchmark mitigates the risk of overfitting and ensures that agent evaluation aligns with real-world research workflows (Ye et al., 30 Mar 2026).


References (1)

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome (2026)
