ResearchRubrics: Benchmark for Evaluating DR Agents
- ResearchRubrics is a structured framework that uses expert-crafted, multi-axis rubrics to evaluate deep research agents on complex, evidence-backed queries.
- It organizes evaluation criteria into six key axes including synthesis, explicit requirements, and communication quality to ensure comprehensive assessment.
- It employs a multi-axis complexity framework and LLM-as-judge grading to diagnose performance gaps, guiding improvements in state-of-the-art DR agents.
A research rubric is a structured, criterion-referenced framework for evaluating the quality and completeness of open-ended responses by LLM-driven agents on complex research queries. ResearchRubrics (Sharma et al., 10 Nov 2025) is a rigorously engineered benchmark for this purpose, pairing realistic, domain-diverse prompts with thousands of meticulously constructed expert rubrics and supporting a reproducible, multi-axis evaluation pipeline for Deep Research (DR) agents.
1. Definition and Benchmark Scope
ResearchRubrics operationalizes the evaluation of Deep Research agents—autonomous LLM-based systems designed to synthesize evidence and generate long-form, multi-source, evidence-backed answers. Unlike standard QA or short-form assessment, DR tasks require aggregation of knowledge across documents, multi-step reasoning, and explicit citation of supporting evidence. ResearchRubrics directly addresses deficiencies in previous benchmarks that favor short, atomistic outputs or rely on LLM-generated/self-referential rubrics and static corpora, which are poorly suited to the dynamic and subjective nature of open-ended research tasks.
The benchmark comprises:
- 101 open-ended, single-turn prompts spanning nine categories: business, technical, consumer, historical, creative, current events, AI/ML, STEM, and hypotheticals/philosophy.
- Each prompt is accompanied by a set of 20–43 human-written rubric criteria (average 25.7 per task; total 2,593), independently reviewed and iteratively refined by three distinct STEM-trained experts, with no LLM seeding.
2. Rubric Construction and Organization
Human Annotation Protocol and Weighting Scheme
Rubric construction adheres to a strict protocol:
- Criteria authoring is entirely human-led, with each rubric iteratively reviewed (draft → peer → independent review).
- Every criterion is classified as either Mandatory (±4 or ±5 weight) or Optional (±1 to ±3). Weights span {–5,…, +5} and are mapped to a six-level preference scale (from Critically Detrimental to Critically Important).
- Negative weights penalize errors (e.g., factual inaccuracies, off-topic content); positive weights reward relevance and correctness.
Rubric Axes
Each prompt’s criteria are partitioned into six high-level axes:
- Explicit Requirements: Addressing all points directly stated in the prompt.
- Implicit Requirements: Covering points expected by knowledgeable readers (side effects, risks, cost, etc.).
- Synthesis of Information: Integration across sources, penalizing pure listing.
- Use of References: Citation specificity, correctness, and contextual relevance.
- Communication Quality: Clarity, structure, tone, and audience fit.
- Instruction Following: Adherence to explicit constraints (format, exclusions).
This organization enables granular failure-mode analysis and supports interpretability for both agent developers and research users.
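As a concrete illustration of the weighting scheme and axis labels described above, a minimal sketch follows; the class, field, and label names are hypothetical and do not reflect the benchmark's released data schema.

```python
from dataclasses import dataclass
from enum import Enum

class Axis(Enum):
    # The six rubric axes defined by the benchmark.
    EXPLICIT = "Explicit Requirements"
    IMPLICIT = "Implicit Requirements"
    SYNTHESIS = "Synthesis of Information"
    REFERENCES = "Use of References"
    COMMUNICATION = "Communication Quality"
    INSTRUCTION = "Instruction Following"

@dataclass
class RubricCriterion:
    text: str    # the human-written requirement
    axis: Axis   # which of the six axes it belongs to
    weight: int  # in -5..-1 or +1..+5; sign separates penalties from rewards

    @property
    def mandatory(self) -> bool:
        # Mandatory criteria carry weight magnitude 4-5; optional ones 1-3.
        return abs(self.weight) >= 4

# Example (invented) criterion:
c = RubricCriterion(
    text="Cites at least one peer-reviewed source for each cost estimate",
    axis=Axis.REFERENCES,
    weight=4,
)
assert c.mandatory
```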
3. Complexity Framework for Task Characterization
To systematically characterize the difficulty of DR prompts, ResearchRubrics introduces a three-axis complexity framework:
| Complexity Axis | Levels |
|---|---|
| Conceptual Breadth | Simple (1 domain) / Moderate (2–5 subtopics) / High (>5 disjoint domains) |
| Logical Nesting Depth | Shallow (1-step) / Intermediate (2–3 nested) / Deep (≥4 hierarchical steps) |
| Exploration | Low (fully specified) / Medium (1–2 ambiguous factors) / High (≥3, require framing) |
Prompt lengths of 13–315 words (mean ≈88) reflect graded complexity, with longer prompts generally corresponding to greater task complexity along these axes.
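A small sketch of how a prompt could be tagged along these three axes using the level boundaries from the table; the function names and exact thresholds are illustrative assumptions, and in the benchmark the levels come from expert annotation rather than automatic measurement.

```python
# Illustrative tagging of a prompt along the three complexity axes.
def conceptual_breadth(num_subtopics: int) -> str:
    return "Simple" if num_subtopics <= 1 else ("Moderate" if num_subtopics <= 5 else "High")

def logical_nesting(depth: int) -> str:
    return "Shallow" if depth <= 1 else ("Intermediate" if depth <= 3 else "Deep")

def exploration(ambiguous_factors: int) -> str:
    return "Low" if ambiguous_factors == 0 else ("Medium" if ambiguous_factors <= 2 else "High")

# e.g. a prompt touching 3 subtopics, needing 2 nested steps, with 1 open choice:
profile = (conceptual_breadth(3), logical_nesting(2), exploration(1))
# -> ("Moderate", "Intermediate", "Medium")
```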
4. Scoring and Evaluation Protocols
LLM-as-Judge Grading
For each agent response and rubric criterion $c$, graders assign a verdict $v_c \in \{\text{Satisfied}, \text{Partially Satisfied}, \text{Not Satisfied}\}$, which is mapped to a score $s_c$ (full, partial, or no credit). The aggregate normalized score for a task is

$$S \;=\; \frac{\sum_{c \in \mathcal{C}} w_c\, s_c}{\sum_{c \in \mathcal{C}} \lvert w_c \rvert},$$

where $w_c$ is the criterion's weight and $\mathcal{C}$ is the full set of criteria on the prompt.
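A minimal sketch of this weighted aggregation, assuming partial credit of 0.5 and normalization by total weight magnitude; the names, the verdict-to-score mapping, and the exact normalization are assumptions consistent with the formula above, not the released implementation.

```python
# Assumed verdict-to-score mapping for ternary grading.
VERDICT_SCORE = {"Satisfied": 1.0, "Partially Satisfied": 0.5, "Not Satisfied": 0.0}

def task_score(criteria, verdicts, strict=False):
    """criteria: iterable of (criterion_id, weight); verdicts: dict id -> verdict label.

    strict=True collapses 'Partially Satisfied' to 'Not Satisfied' (binary scoring).
    """
    weighted, norm = 0.0, 0.0
    for cid, weight in criteria:
        s = VERDICT_SCORE[verdicts[cid]]
        if strict and s == 0.5:
            s = 0.0
        weighted += weight * s
        norm += abs(weight)
    return weighted / norm if norm else 0.0
```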
Fine-Grained Failure Analysis
Failure rates are decomposed by rubric category as

$$\mathrm{FailRate}(a) \;=\; \frac{\sum_{t \in \mathcal{T}_a} n_{a,t}}{\sum_{t \in \mathcal{T}_a} N_t},$$

where $n_{a,t}$ is the count of Not Satisfied criteria for category $a$ in task $t$, $N_t$ is the total number of failures in task $t$, and $\mathcal{T}_a$ is the set of tasks including category $a$.
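A sketch of the corresponding computation; the input structure (task id mapped to per-criterion (category, verdict) pairs) is an assumption for illustration.

```python
from collections import defaultdict

def failure_rates(graded):
    """graded: dict task_id -> list of (category, verdict) pairs for that task."""
    n_cat = defaultdict(int)    # sum over tasks of n_{a,t}: Not Satisfied counts per category
    n_total = defaultdict(int)  # sum over the same tasks of N_t: total failures in the task
    for results in graded.values():
        failures = [cat for cat, verdict in results if verdict == "Not Satisfied"]
        for cat in {cat for cat, _ in results}:  # only tasks that include category a
            n_cat[cat] += failures.count(cat)
            n_total[cat] += len(failures)
    return {cat: (n_cat[cat] / n_total[cat] if n_total[cat] else 0.0) for cat in n_cat}
```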
Human–Model Alignment
Agreement between human and LLM-based judges is measured by Macro-$F_1$ per rubric label; collapsing "Partially Satisfied" to "Not Satisfied" raises Macro-$F_1$ to ≈0.76, improving alignment by ≈20 percentage points.
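The agreement check amounts to a standard Macro-F1 over verdict labels, sketched below with scikit-learn; the label strings and function name are assumptions.

```python
from sklearn.metrics import f1_score

def judge_agreement(human_labels, judge_labels, collapse_partial=False):
    """Macro-F1 between human and LLM-judge verdicts over rubric criteria."""
    def norm(labels):
        return ["Not Satisfied" if (collapse_partial and lab == "Partially Satisfied") else lab
                for lab in labels]
    return f1_score(norm(human_labels), norm(judge_labels), average="macro")

# Collapsing "Partially Satisfied" into "Not Satisfied" makes the task binary,
# the setting reported to reach Macro-F1 of roughly 0.76.
```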
5. Experimental Evaluation and Failure Modes
System Performance
| Metric | Gemini DR | OpenAI DR | Perplexity DR |
|---|---|---|---|
| Avg Score (Ternary) | 0.677 | 0.664 | 0.566 |
| Avg Score (Strict Binary) | 0.615 | 0.597 | 0.487 |
On average, even state-of-the-art DR agents satisfy fewer than 68% of rubric criteria, with performance limited most notably by implicit-context handling and synthesis reasoning.
Diagnostic Insights
- Failure Decomposition: Implicit Reasoning and Synthesis account for ~45–50% of failures, while the explicit-requirement and communication axes each account for under 20%.
- Criteria Impact: Failures among mandatory criteria drive explicit and synthesis deficits; optional criteria relate mainly to missed implicit requirements.
- Complexity Sensitivity: Performance degrades monotonically with logical depth (a 20–25 point drop from shallow to deep tasks), with increased conceptual breadth, and with greater exploration/ambiguity.
- Length and Citation Tradeoffs: Longer responses correlate only modestly with rubric satisfaction (r ≈ 0.24–0.28). Citation breadth and precision trade off: Gemini cites ≈111 sources at ≈81% accuracy, whereas Perplexity cites ≈31 at ≈90%, so high citation volume is not achieved simultaneously with high citation precision.
6. Resources, Implementation, and Extensibility
ResearchRubrics releases all data and templates (101 prompts, 2,593 criteria, weights, evaluation code, judge-prompt scaffolding) at https://scale.com/research/researchrubrics for public use.
Research applications enabled include:
- Systematic benchmarking of new DR agents under multi-axis, expert-vetted rubrics.
- Dimensional analysis of agent failures (by complexity or rubric type).
- Extension of rubric sets to novel domains.
- Deployment of fully automated LLM-based judging or scalable human-in-the-loop evaluation.
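As one way such automated judging could be wired up, the sketch below builds a per-criterion judge prompt; the template wording and function names are placeholders and do not reproduce the benchmark's released judge-prompt scaffolding.

```python
# Placeholder judge-prompt scaffold: illustrative only, not the released templates.
JUDGE_TEMPLATE = """You are grading a deep-research report against a single rubric criterion.

Criterion (weight {weight}): {criterion}

Report:
{report}

Respond with exactly one label: Satisfied, Partially Satisfied, or Not Satisfied."""

def build_judge_prompt(criterion: str, weight: int, report: str) -> str:
    return JUDGE_TEMPLATE.format(criterion=criterion, weight=weight, report=report)
```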
7. Impact, Limitations, and Future Directions
ResearchRubrics establishes an expert-crafted, quantitatively precise, and extensible benchmark for deep research agent evaluation:
- Scalable Measurement: Supports nuanced, multi-dimensional evaluation at scale for long-form, cross-document agent responses.
- Diagnostic Utility: Provides insight into agent reasoning, factual grounding, synthesis shortcomings, and implicit inference gaps—information crucial for targeted system improvement.
- Limitations: The benchmark utilizes single-turn prompts; evolving DR agent architectures (e.g., multi-agent, tool-augmented, dynamically interactive) may require methodological adaptation. The evaluation pipeline relies on LLM-based judges whose bias patterns require ongoing monitoring.
The release sets a new standard for evaluating evidence-backed reasoning and synthesis in research assistants, promoting rigorous, domain-diverse, and interpretable agent assessment.