DeepResearch-ReportEval Framework

Updated 14 October 2025
  • The paper introduces DeepResearch-ReportEval, a framework that measures research report synthesis, analytical reasoning, and factual accuracy using quantitative metrics.
  • It employs a multi-dimensional scoring system assessing report quality, redundancy, and factuality through LLM-as-a-judge protocols.
  • The framework enables systematic comparison of advanced research agents, providing actionable insights to enhance autonomous, evidence-based research.

DeepResearch-ReportEval is a standardized evaluation framework and benchmark suite for assessing the capabilities of DeepResearch agents—multi-tool, LLM-powered systems that autonomously conduct open-ended research, synthesize evidence, and generate comprehensive research reports. Unlike conventional LLM evaluation benchmarks that focus on isolated sub-skills, DeepResearch-ReportEval is tailored to the holistic measurement of multi-source synthesis, structured analytical reasoning, and the factuality of long-form, citation-supported research outputs. It provides a standardized methodology, quantitative metrics, and curated queries that reflect the end-to-end research objectives and practical challenges encountered in advanced scientific, technical, and societal research settings.

1. Framework Rationale and Scope

DeepResearch-ReportEval addresses the unique evaluation gap that arises when transitioning from short-form or factoid LLM tasks to realistic research-report generation. Whereas traditional metrics (e.g., for MMLU or open-domain QA) fail to capture synthesis and integration, DeepResearch-ReportEval uses the research report itself as the core unit of assessment, targeting capabilities such as:

  • Evidence-based synthesis across heterogeneous sources
  • Analytical depth (beyond paraphrase/aggregation)
  • Coherence, report structure, and clarity
  • Redundancy minimization and avoidance of superficial repetition
  • Supported, source-grounded claims with proper citation

These integrated demands reflect the requirements of real-world research partners, as opposed to mere retrieval or single-turn reasoning outputs (Fan et al., 9 Oct 2025). The framework is designed to support systematic comparison across systems and foster the evolution of DeepResearch agents from information assistants toward autonomous, trustworthy research collaborators.

2. Evaluation Dimensions and Metrics

The DeepResearch-ReportEval framework decomposes evaluation into three principal dimensions, each with formal scoring mechanisms:

2.1 Report Quality

Quality is scored by assessing comprehensiveness, coherence, clarity, insightfulness, and overall writing impression. Each sub-criterion is rated on a 0–4 scale using LLM-as-a-Judge prompt templates. For example, a score of 4 for comprehensiveness requires exhaustive coverage of major aspects with integrated analysis, while a 0 reflects severe omissions or incoherence. The process explicitly solicits justifications to aid transparency and expert alignment.
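
To make the rubric concrete, the following minimal Python sketch shows one way such 0–4 sub-criterion scoring with solicited justifications could be orchestrated. The `call_judge` helper, the prompt wording, and the simple mean aggregation are illustrative assumptions, not the paper's actual templates.

```python
# Minimal sketch of rubric-based quality scoring with an LLM judge.
# `call_judge`, the prompt wording, and the mean aggregation are illustrative
# assumptions; the paper's exact prompt templates are not reproduced here.
from statistics import mean

QUALITY_CRITERIA = [
    "comprehensiveness",
    "coherence",
    "clarity",
    "insightfulness",
    "overall writing impression",
]

RUBRIC_PROMPT = (
    "You are an expert reviewer. Rate the report below on {criterion} using an "
    "integer from 0 (severe deficiency) to 4 (excellent), then justify the score.\n\n"
    "Report:\n{report}\n\n"
    "Answer exactly as:\nSCORE: <0-4>\nJUSTIFICATION: <one paragraph>"
)


def call_judge(prompt: str) -> str:
    """Placeholder for a call to the judge LLM; assumed to return the raw completion text."""
    raise NotImplementedError


def parse_score(judge_output: str) -> int:
    """Extract the integer score from the judge's structured answer."""
    for line in judge_output.splitlines():
        if line.upper().startswith("SCORE:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no score found in judge output")


def score_report_quality(report: str) -> dict:
    """Score each sub-criterion on the 0-4 scale and keep justifications for auditing."""
    results = {}
    for criterion in QUALITY_CRITERIA:
        output = call_judge(RUBRIC_PROMPT.format(criterion=criterion, report=report))
        results[criterion] = {"score": parse_score(output), "justification": output}
    results["quality_mean"] = mean(results[c]["score"] for c in QUALITY_CRITERIA)
    return results
```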

2.2 Redundancy

Redundancy quantifies repetitive or semantically overlapping content within reports. The report is segmented into paragraphs $r = (p_1, p_2, \ldots, p_k)$; each of the $O(k^2)$ paragraph pairs is compared by an LLM-as-a-Judge and scored 0–4 for repetition. Overall redundancy is aggregated as:

$$\mathrm{Score}_R(r) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Score}_R(\mathrm{Pair}_i)$$

where $n$ is the number of paragraph pairs. This approach exposes overuse of boilerplate, lack of effective information consolidation, and superficial elaboration.
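
A minimal Python sketch of this aggregation follows, assuming a naive blank-line paragraph split and a placeholder `judge_pair_repetition` call in place of the actual LLM-as-a-Judge prompt.

```python
# Sketch of the redundancy score: split the report into paragraphs, have a judge
# rate every pair 0-4 for repetition, and average over all pairs.
# `judge_pair_repetition` stands in for the LLM-as-a-Judge call and is assumed here.
from itertools import combinations


def judge_pair_repetition(paragraph_a: str, paragraph_b: str) -> int:
    """Placeholder: ask the judge LLM how repetitive the two paragraphs are (0-4)."""
    raise NotImplementedError


def redundancy_score(report: str) -> float:
    """Score_R(r): mean pairwise repetition score over all paragraph pairs."""
    paragraphs = [p.strip() for p in report.split("\n\n") if p.strip()]
    pairs = list(combinations(paragraphs, 2))  # O(k^2) pairs for k paragraphs
    if not pairs:
        return 0.0
    return sum(judge_pair_repetition(a, b) for a, b in pairs) / len(pairs)
```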

2.3 Factuality

Factuality restricts the evaluation domain to claims that are linked with explicit sources. For each claim–source pair $(c_i, s_i)$, the LLM-as-a-Judge examines the degree of source support: $1$ (fully supported), $0$ (partially supported), or $-1$ (unsupported). Metrics include:

  • Average support score: $\mathrm{Score}_{F1}(r) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Score}_F(c_i, s_i)$
  • Strong support rate: proportion of pairs $(c_i, s_i)$ with full support ($\mathrm{Score}_F(c_i, s_i) = 1$)

Explicit source checking distinguishes warranted insights from hallucinations or speculative synthesis.
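
A small Python sketch of the two factuality metrics is given below; `judge_support` is a placeholder assumption standing in for the LLM-as-a-Judge call, not the paper's implementation.

```python
# Sketch of the two factuality metrics over claim-source pairs.
# Each pair is judged 1 (fully supported), 0 (partially supported), or -1 (unsupported);
# `judge_support` is a placeholder assumption for the LLM-as-a-Judge call.
from typing import List, Tuple


def judge_support(claim: str, source: str) -> int:
    """Placeholder: return 1, 0, or -1 for how well the cited source supports the claim."""
    raise NotImplementedError


def factuality_metrics(pairs: List[Tuple[str, str]]) -> dict:
    """Compute Score_F1 (mean support) and the strong support rate over m pairs."""
    scores = [judge_support(claim, source) for claim, source in pairs]
    m = len(scores)
    if m == 0:
        return {"average_support": 0.0, "strong_support_rate": 0.0}
    return {
        "average_support": sum(scores) / m,
        "strong_support_rate": sum(s == 1 for s in scores) / m,
    }
```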

3. LLM-as-a-Judge Paradigm and Expert Alignment

The evaluation pipeline leverages a peer-LLM as a judge, invoked with structured prompts covering each dimension and subcriterion. Key principles include:

  • Iterative prompt refinement: Templates are iteratively tuned to optimize mean absolute deviation (MAD) from expert scores on a calibration set.
  • Justified scoring: For every rating, LLMs produce supporting explanations, enabling post-hoc audit by human experts.
  • Statistical validation: Ranking experiments show agreement between LLM and expert rankings (e.g., ~61.1% exact ranking match; Fan et al., 9 Oct 2025).

This alignment protocol yields reliable, scalable report evaluation—critical as report length and multi-source integration complexity grow.
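
For illustration, the two alignment statistics could be computed as in the sketch below; the data layouts and function names are assumptions, not the paper's code.

```python
# Sketch of the two alignment statistics: mean absolute deviation (MAD) between judge
# and expert scores on a calibration set, and the fraction of calibration queries on
# which the judge reproduces the expert ranking exactly. Data layouts are assumptions.
from typing import List, Sequence


def mean_absolute_deviation(judge_scores: Sequence[float], expert_scores: Sequence[float]) -> float:
    """MAD between the judge's and the experts' scores; prompts are tuned to reduce this."""
    assert len(judge_scores) == len(expert_scores)
    return sum(abs(j - e) for j, e in zip(judge_scores, expert_scores)) / len(judge_scores)


def exact_ranking_match_rate(judge_rankings: List[List[str]], expert_rankings: List[List[str]]) -> float:
    """Fraction of queries on which the judge's system ranking matches the experts' exactly."""
    assert len(judge_rankings) == len(expert_rankings)
    matches = sum(j == e for j, e in zip(judge_rankings, expert_rankings))
    return matches / len(expert_rankings)
```

Prompt templates would be revised until the MAD is minimized on the calibration set, then frozen for benchmark-wide evaluation.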

4. Benchmark Structure and Query Diversity

The DeepResearch-ReportEval benchmark consists of 100 manually curated open-ended queries spanning 12 categories:

  • Science & Technology
  • Health & Medicine
  • Economy & Business
  • Politics & Society
  • History & Culture
  • Art, Music & Literature
  • Entertainment & Fashion
  • Sports & Fitness
  • Education
  • Environment & Nature
  • Lifestyle
  • Other

Queries were selected to reflect the breadth of real-world DeepResearch deployments, ensuring challenge diversity. Each system’s reports are compared across these domains to systematically expose strengths and weaknesses.

5. Comparative Analysis of Modern DeepResearch Systems

Application of DeepResearch-ReportEval to four leading commercial systems (OpenAI, Perplexity, Gemini, Qwen) reveals orthogonal strengths and performance trade-offs:

  • Conciseness vs. Insight: Perplexity generates concise, clear reports with high coherence but underperforms in depth and analytical insightfulness.
  • Length and Comprehensiveness: Qwen and OpenAI variants output lengthier, more analytically dense reports, achieving higher comprehensiveness and citation reliability but sometimes at the expense of increased semantic redundancy.
  • Factuality: Claim–source support rates stratify the systems, with Qwen and OpenAI consistently achieving high support rates, reflecting robust evidence grounding.

These distinct behaviors correspond to divergent architectural and training philosophies, such as system-level routines for evidence collection, chunk-wise synthesis versus plan-then-write, and optimization strategies for coverage or brevity (Fan et al., 9 Oct 2025).

6. Technical Summary: Mathematical Formalization

The core scoring pipeline may be formally summarized as follows:

  • For each report $r$, quality, redundancy, and factuality are assessed using LLM-judge-assigned scores, following:
    • Pairwise redundancy averaging over the $k$ report paragraphs
    • Average and strong support rates over the $m$ claim–source pairs
    • Integrated report quality aggregated from the sub-criterion scores $Q_i$, $i = 1, \ldots, N$

This scoring suite is designed to resemble the structured multidimensional rubrics employed in specialist peer review of research outputs.
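
The following LaTeX block consolidates these quantities in one place. The redundancy and factuality formulas restate those given in Section 2; treating report quality as a plain mean of the $N$ sub-criterion scores $Q_i$ is an assumption made for illustration, since the exact aggregation is not specified above.

```latex
% Consolidated scoring summary for a report r.
% Treating quality as the plain mean of the N sub-criterion scores Q_i is an
% assumption for illustration; the other two formulas restate those given above.
\begin{align*}
  \mathrm{Score}_Q(r)    &= \frac{1}{N} \sum_{i=1}^{N} Q_i,
      & Q_i &\in \{0, 1, 2, 3, 4\} \\
  \mathrm{Score}_R(r)    &= \frac{1}{n} \sum_{i=1}^{n} \mathrm{Score}_R(\mathrm{Pair}_i) \\
  \mathrm{Score}_{F1}(r) &= \frac{1}{m} \sum_{i=1}^{m} \mathrm{Score}_F(c_i, s_i),
      & \mathrm{Score}_F(c_i, s_i) &\in \{-1, 0, 1\}
\end{align*}
```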

7. Implications and Future Directions

The emergence of DeepResearch-ReportEval as a benchmarking tool marks a paradigm shift in the automated assessment of LLM-powered research agents:

  • It enables not only precise measurement of core capabilities, but also the identification of trade-offs between coverage, conciseness, and factual support.
  • The unified, multi-dimensional approach advocated in DeepResearch-ReportEval anticipates future system development by providing actionable signals for optimization (e.g., targeting specific dimensions such as redundancy reduction or citation accuracy).
  • Extension to more fine-grained process-level feedback (e.g., step- or span-level judgment), integration with domain-specific rubrics, and broader community involvement in expert alignment are plausible and anticipated next steps.

These foundations position DeepResearch-ReportEval as a core framework underpinning comparative system assessment, ablation diagnostics, and the training/evaluation of future, more autonomous and reliable research AI (Fan et al., 9 Oct 2025).
