ResearchRubrics: Benchmark for Evaluating DR Agents

Updated 12 November 2025
  • ResearchRubrics is a structured framework that uses expert-crafted, multi-axis rubrics to evaluate deep research agents on complex, evidence-backed queries.
  • It organizes evaluation criteria into six key axes including synthesis, explicit requirements, and communication quality to ensure comprehensive assessment.
  • It employs a multi-axis complexity framework and LLM-as-judge grading to diagnose performance gaps, guiding improvements in state-of-the-art DR agents.

A research rubric is a structured, criterion-referenced framework for evaluating the quality and completeness of open-ended responses by LLM-driven agents on complex research queries. ResearchRubrics (Sharma et al., 10 Nov 2025) is a rigorously engineered benchmark for this purpose, pairing realistic, domain-diverse prompts with thousands of meticulously constructed expert rubrics and supporting a reproducible, multi-axis evaluation pipeline for Deep Research (DR) agents.

1. Definition and Benchmark Scope

ResearchRubrics operationalizes the evaluation of Deep Research agents—autonomous LLM-based systems designed to synthesize evidence and generate long-form, multi-source, evidence-backed answers. Unlike standard QA or short-form assessment, DR tasks require aggregation of knowledge across documents, multi-step reasoning, and explicit citation of supporting evidence. ResearchRubrics directly addresses deficiencies in previous benchmarks that favor short, atomistic outputs or rely on LLM-generated/self-referential rubrics and static corpora, which are poorly suited to the dynamic and subjective nature of open-ended research tasks.

The benchmark comprises:

  • 101 open-ended, single-turn prompts spanning nine categories: business, technical, consumer, historical, creative, current events, AI/ML, STEM, and hypotheticals/philosophy.
  • Each prompt is accompanied by a set of 20–43 human-written rubric criteria (average 25.7 per task; total 2,593), independently reviewed and iteratively refined by three distinct STEM-trained experts, with no LLM seeding.

2. Rubric Construction and Organization

Human Annotation Protocol and Weighting Scheme

Rubric construction adheres to a strict protocol:

  • Criteria authoring is entirely human-led, with each rubric iteratively reviewed (draft → peer → independent review).
  • Every criterion is classified as either Mandatory (weight magnitude 4 or 5) or Optional (magnitude 1 to 3). Weights span {−5, …, +5} and map to a six-level preference scale ranging from Critically Detrimental to Critically Important (a minimal encoding of this scheme is sketched after this list).
  • Negative weights penalize errors (e.g., factual inaccuracies, off-topic content); positive weights reward relevance and correctness.
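As a concrete illustration of this weighting scheme, the sketch below encodes a single criterion in Python. The class and field names are hypothetical (the benchmark releases criteria as data, not code), but the mandatory/optional split by weight magnitude and the sign convention follow the protocol above.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One rubric criterion with its signed weight (illustrative structure)."""
    text: str    # human-written criterion statement
    weight: int  # signed weight in {-5, ..., -1, +1, ..., +5}

    @property
    def is_mandatory(self) -> bool:
        # Mandatory criteria carry weight magnitude 4 or 5; Optional ones 1-3.
        return abs(self.weight) >= 4

    @property
    def is_penalty(self) -> bool:
        # Negative weights penalize errors such as factual inaccuracies.
        return self.weight < 0

# Example usage with two made-up criteria:
c1 = Criterion("Cites a primary source for each key quantitative claim.", weight=5)
c2 = Criterion("Includes off-topic promotional content.", weight=-4)
print(c1.is_mandatory, c1.is_penalty)  # True False
print(c2.is_mandatory, c2.is_penalty)  # True True
```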

Rubric Axes

Each prompt’s criteria are partitioned into six high-level axes:

  • Explicit Requirements: Addressing all points directly stated in the prompt.
  • Implicit Requirements: Covering points expected by knowledgeable readers (side effects, risks, cost, etc.).
  • Synthesis of Information: Integration across sources, penalizing pure listing.
  • Use of References: Citation specificity, correctness, and contextual relevance.
  • Communication Quality: Clarity, structure, tone, and audience fit.
  • Instruction Following: Adherence to explicit constraints (format, exclusions).

This organization enables granular failure-mode analysis and supports interpretability for both agent developers and research users.

3. Complexity Framework for Task Characterization

To systematically characterize how demanding DR prompts are, ResearchRubrics introduces a three-axis complexity framework:

Each axis is graded at one of three levels:

  • Conceptual Breadth: Simple (1 domain), Moderate (2–5 subtopics), High (>5 disjoint domains)
  • Logical Nesting Depth: Shallow (1 step), Intermediate (2–3 nested steps), Deep (≥4 hierarchical steps)
  • Exploration: Low (fully specified), Medium (1–2 ambiguous factors), High (≥3 factors requiring framing)

Prompt lengths range from 13 to 315 words (mean ≈88), reflecting graded complexity; longer prompts generally correspond to greater difficulty along these axes.
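Purely as an illustration of this taxonomy, a prompt's position on the three axes can be recorded as a small typed annotation; the enum names and dictionary layout below are assumptions, not the released data format.

```python
from enum import Enum

class Breadth(Enum):
    SIMPLE = "simple"        # single domain
    MODERATE = "moderate"    # 2-5 subtopics
    HIGH = "high"            # >5 disjoint domains

class NestingDepth(Enum):
    SHALLOW = "shallow"            # 1-step reasoning
    INTERMEDIATE = "intermediate"  # 2-3 nested steps
    DEEP = "deep"                  # >=4 hierarchical steps

class Exploration(Enum):
    LOW = "low"        # fully specified task
    MEDIUM = "medium"  # 1-2 ambiguous factors
    HIGH = "high"      # >=3 factors requiring framing by the agent

# Hypothetical annotation for one prompt:
prompt_complexity = {
    "conceptual_breadth": Breadth.MODERATE,
    "logical_nesting_depth": NestingDepth.DEEP,
    "exploration": Exploration.MEDIUM,
}
print(prompt_complexity)
```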

4. Scoring and Evaluation Protocols

LLM-as-Judge Grading

For each agent response and rubric criterion $r_i$, the grader assigns a satisfaction label $m_{r_i} \in \{0\ \text{(Not Satisfied)},\ 0.5\ \text{(Partially Satisfied)},\ 1\ \text{(Satisfied)}\}$. The aggregate normalized score for a task $k$ is

$$S_k = \frac{\sum_{r_i \in C} w_{r_i}\, m_{r_i}}{\sum_{r_i \in C,\ w_{r_i} > 0} w_{r_i}},$$

where $w_{r_i}$ is the criterion's weight and $C$ is the full set of criteria for the prompt.
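A minimal Python transcription of this scoring rule, assuming per-criterion judge labels in {0, 0.5, 1} paired with signed weights; this is a sketch of the formula, not the released evaluation code.

```python
def task_score(weighted_marks):
    """Weighted, normalized rubric score S_k for one task.

    weighted_marks: list of (weight, mark) pairs, where weight is the signed
    criterion weight and mark is the judge label in {0.0, 0.5, 1.0}.
    """
    numerator = sum(w * m for w, m in weighted_marks)
    positive_mass = sum(w for w, _ in weighted_marks if w > 0)
    return numerator / positive_mass if positive_mass else 0.0

# Example: two positively weighted criteria (weights 5 and 2) and one
# negatively weighted criterion (-4) judged Not Satisfied (mark 0).
print(round(task_score([(5, 1.0), (2, 0.5), (-4, 0.0)]), 3))  # 6 / 7 ~= 0.857
```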

Fine-Grained Failure Analysis

Failure rates are decomposed by rubric category:

$$\overline{F}_c = \frac{1}{|T_c|} \sum_{t \in T_c} \frac{n_{\mathrm{fail},c,t}}{n_{\mathrm{fail},t}},$$

where $n_{\mathrm{fail},c,t}$ is the number of Not Satisfied criteria in category $c$ on task $t$, $n_{\mathrm{fail},t}$ is the total number of failures on task $t$, and $T_c$ is the set of tasks that include category $c$.
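A sketch of this decomposition, assuming per-task failure counts keyed by rubric category; the input format is an assumption, and categories a task covers without any failures should still be listed with a count of 0.

```python
from collections import defaultdict

def mean_failure_share(task_failures):
    """Average share of Not Satisfied verdicts attributable to each category.

    task_failures: one dict per task mapping category -> number of Not
    Satisfied criteria in that category on that task.
    """
    shares = defaultdict(list)
    for per_category in task_failures:
        total_failures = sum(per_category.values())
        if total_failures == 0:
            continue  # no failures on this task, so no shares to attribute
        for category, n_fail in per_category.items():
            shares[category].append(n_fail / total_failures)
    # F_bar_c: mean share over the tasks in which category c appears
    return {c: sum(v) / len(v) for c, v in shares.items()}

# Example with two hypothetical tasks:
print(mean_failure_share([
    {"synthesis": 3, "explicit_requirements": 1, "implicit_requirements": 0},
    {"synthesis": 1, "implicit_requirements": 1},
]))
# {'synthesis': 0.625, 'explicit_requirements': 0.25, 'implicit_requirements': 0.25}
```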

Human–Model Alignment

Agreement between human and LLM-based judges is measured by macro-$F_1$ per rubric label; collapsing "Partially Satisfied" into "Not Satisfied" raises macro-$F_1$ to ≈0.76, improving alignment by ≈20 percentage points.
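This agreement measure can be reproduced with scikit-learn's macro-averaged F1. The labels below are made up for illustration, and the collapse of "Partially" into "Not Satisfied" is the straightforward mapping rather than a confirmed detail of the released pipeline.

```python
from sklearn.metrics import f1_score

# Hypothetical per-criterion verdicts from a human grader and an LLM judge:
# 0 = Not Satisfied, 0.5 = Partially Satisfied, 1 = Satisfied.
human = [1.0, 0.5, 0.0, 1.0, 0.5, 0.0, 1.0, 0.5]
model = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.5]

ternary_f1 = f1_score(human, model, average="macro")

def collapse(labels):
    """Map the three-way scale to binary: Partially (0.5) becomes Not (0)."""
    return [1.0 if y == 1.0 else 0.0 for y in labels]

binary_f1 = f1_score(collapse(human), collapse(model), average="macro")
print(round(ternary_f1, 3), round(binary_f1, 3))
```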

5. Experimental Evaluation and Failure Modes

System Performance

Average rubric-satisfaction scores under ternary and strict binary grading:

  • Gemini DR: 0.677 (ternary), 0.615 (strict binary)
  • OpenAI DR: 0.664 (ternary), 0.597 (strict binary)
  • Perplexity DR: 0.566 (ternary), 0.487 (strict binary)

Even state-of-the-art DR agents satisfy fewer than 68% of rubric criteria on average, with performance limited most by handling of implicit context and by synthesis reasoning.

Diagnostic Insights

  • Failure Decomposition: Implicit reasoning and synthesis account for roughly 45–50% of failures; the explicit-requirements and communication axes account for under 20%.
  • Criteria Impact: Failures among mandatory criteria drive explicit and synthesis deficits; optional criteria relate mainly to missed implicit requirements.
  • Complexity Sensitivity: Performance degrades monotonically with logical nesting depth (a drop of roughly 20–25 points from shallow to deep tasks), with greater conceptual breadth, and with more exploration/ambiguity.
  • Length and Citation Tradeoffs: Longer responses correlate only modestly with rubric satisfaction (r ≈ 0.24–0.28). Citation breadth trades off against citation precision: Gemini DR cites ≈111 sources per response at ≈81% citation accuracy, whereas Perplexity DR cites ≈31 at ≈90%.

6. Resources, Implementation, and Extensibility

ResearchRubrics releases all data and templates (101 prompts, 2,593 criteria, weights, evaluation code, judge-prompt scaffolding) at https://scale.com/research/researchrubrics for public use.

Research applications enabled include:

  • Systematic benchmarking of new DR agents under multi-axis, expert-vetted rubrics.
  • Dimensional analysis of agent failures (by complexity or rubric type).
  • Extension of rubric sets to novel domains.
  • Deployment of fully automated LLM-based judging or scalable human-in-the-loop evaluation.

7. Impact, Limitations, and Future Directions

ResearchRubrics establishes an expert-crafted, quantitatively precise, and extensible benchmark for deep research agent evaluation:

  • Scalable Measurement: Supports nuanced, multi-dimensional evaluation at scale for long-form, cross-document agent responses.
  • Diagnostic Utility: Provides insight into agent reasoning, factual grounding, synthesis shortcomings, and implicit inference gaps—information crucial for targeted system improvement.
  • Limitations: The benchmark utilizes single-turn prompts; evolving DR agent architectures (e.g., multi-agent, tool-augmented, dynamically interactive) may require methodological adaptation. The evaluation pipeline relies on LLM-based judges whose bias patterns require ongoing monitoring.

The release sets a new standard for evaluating evidence-backed reasoning and synthesis in research assistants, promoting rigorous, domain-diverse, and interpretable agent assessment.
