ResearchRubrics: Benchmark for Evaluating DR Agents
- ResearchRubrics is a structured framework that uses expert-crafted, multi-axis rubrics to evaluate deep research agents on complex, evidence-backed queries.
- It organizes evaluation criteria into six key axes including synthesis, explicit requirements, and communication quality to ensure comprehensive assessment.
- It employs a multi-axis complexity framework and LLM-as-judge grading to diagnose performance gaps, guiding improvements in state-of-the-art DR agents.
A research rubric is a structured, criterion-referenced framework for evaluating the quality and completeness of open-ended responses by LLM-driven agents on complex research queries. ResearchRubrics (Sharma et al., 10 Nov 2025) is a rigorously engineered benchmark for this purpose, pairing realistic, domain-diverse prompts with thousands of meticulously constructed expert rubrics and supporting a reproducible, multi-axis evaluation pipeline for Deep Research (DR) agents.
1. Definition and Benchmark Scope
ResearchRubrics operationalizes the evaluation of Deep Research agents—autonomous LLM-based systems designed to synthesize evidence and generate long-form, multi-source, evidence-backed answers. Unlike standard QA or short-form assessment, DR tasks require aggregation of knowledge across documents, multi-step reasoning, and explicit citation of supporting evidence. ResearchRubrics directly addresses deficiencies in previous benchmarks that favor short, atomistic outputs or rely on LLM-generated/self-referential rubrics and static corpora, which are poorly suited to the dynamic and subjective nature of open-ended research tasks.
The benchmark comprises:
- 101 open-ended, single-turn prompts spanning nine categories: business, technical, consumer, historical, creative, current events, AI/ML, STEM, and hypotheticals/philosophy.
- Each prompt is accompanied by a set of 20–43 human-written rubric criteria (average 25.7 per task; total 2,593), independently reviewed and iteratively refined by three distinct STEM-trained experts, with no LLM seeding.
2. Rubric Construction and Organization
Human Annotation Protocol and Weighting Scheme
Rubric construction adheres to a strict protocol:
- Criteria authoring is entirely human-led, with each rubric iteratively reviewed (draft → peer → independent review).
- Every criterion is classified as either Mandatory (±4 or ±5 weight) or Optional (±1 to ±3). Weights span {–5,…, +5} and are mapped to a six-level preference scale (from Critically Detrimental to Critically Important).
- Negative weights penalize errors (e.g., factual inaccuracies, off-topic content); positive weights reward relevance and correctness.
Rubric Axes
Each prompt’s criteria are partitioned into six high-level axes:
- Explicit Requirements: Addressing all points directly stated in the prompt.
- Implicit Requirements: Covering points expected by knowledgeable readers (side effects, risks, cost, etc.).
- Synthesis of Information: Integration across sources, penalizing pure listing.
- Use of References: Citation specificity, correctness, and contextual relevance.
- Communication Quality: Clarity, structure, tone, and audience fit.
- Instruction Following: Adherence to explicit constraints (format, exclusions).
This organization enables granular failure-mode analysis and supports interpretability for both agent developers and research users.
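As a concrete illustration of the weighting scheme and axis labels described above, a minimal sketch follows; the class, field, and label names are hypothetical and do not reflect the benchmark's released data schema.

```python
from dataclasses import dataclass
from enum import Enum

class Axis(Enum):
    # The six rubric axes defined by the benchmark.
    EXPLICIT = "Explicit Requirements"
    IMPLICIT = "Implicit Requirements"
    SYNTHESIS = "Synthesis of Information"
    REFERENCES = "Use of References"
    COMMUNICATION = "Communication Quality"
    INSTRUCTION = "Instruction Following"

@dataclass
class RubricCriterion:
    text: str    # the human-written requirement
    axis: Axis   # which of the six axes it belongs to
    weight: int  # in -5..-1 or +1..+5; sign separates penalties from rewards

    @property
    def mandatory(self) -> bool:
        # Mandatory criteria carry weight magnitude 4-5; optional ones 1-3.
        return abs(self.weight) >= 4

# Example (invented) criterion:
c = RubricCriterion(
    text="Cites at least one peer-reviewed source for each cost estimate",
    axis=Axis.REFERENCES,
    weight=4,
)
assert c.mandatory
```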
3. Complexity Framework for Task Characterization
To systematically characterize the difficulty of DR prompts, ResearchRubrics introduces a three-axis complexity framework:
| Complexity Axis | Levels |
|---|---|
| Conceptual Breadth | Simple (1 domain) / Moderate (2–5 subtopics) / High (>5 disjoint domains) |
| Logical Nesting Depth | Shallow (1-step) / Intermediate (2–3 nested) / Deep (≥4 hierarchical steps) |
| Exploration | Low (fully specified) / Medium (1–2 ambiguous factors) / High (≥3, require framing) |
Prompt lengths of 13–315 words (mean ≈88) reflect graded complexity, with longer prompts generally corresponding to greater task complexity along these axes.
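A small sketch of how a prompt could be tagged along these three axes using the level boundaries from the table; the function names and exact thresholds are illustrative assumptions, and in the benchmark the levels come from expert annotation rather than automatic measurement.

```python
# Illustrative tagging of a prompt along the three complexity axes.
def conceptual_breadth(num_subtopics: int) -> str:
    return "Simple" if num_subtopics <= 1 else ("Moderate" if num_subtopics <= 5 else "High")

def logical_nesting(depth: int) -> str:
    return "Shallow" if depth <= 1 else ("Intermediate" if depth <= 3 else "Deep")

def exploration(ambiguous_factors: int) -> str:
    return "Low" if ambiguous_factors == 0 else ("Medium" if ambiguous_factors <= 2 else "High")

# e.g. a prompt touching 3 subtopics, needing 2 nested steps, with 1 open choice:
profile = (conceptual_breadth(3), logical_nesting(2), exploration(1))
# -> ("Moderate", "Intermediate", "Medium")
```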
4. Scoring and Evaluation Protocols
LLM-as-Judge Grading
For each agent response and rubric criterion $c$, graders assign a verdict $v_c \in \{\text{Satisfied}, \text{Partially Satisfied}, \text{Not Satisfied}\}$, which is mapped to a score $s_c$ (full, partial, or no credit). The aggregate normalized score for a task is

$$S \;=\; \frac{\sum_{c \in \mathcal{C}} w_c\, s_c}{\sum_{c \in \mathcal{C}} \lvert w_c \rvert},$$

where $w_c$ is the criterion's weight and $\mathcal{C}$ is the full set of criteria on the prompt.
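A minimal sketch of this weighted aggregation, assuming partial credit of 0.5 and normalization by total weight magnitude; the names, the verdict-to-score mapping, and the exact normalization are assumptions consistent with the formula above, not the released implementation.

```python
# Assumed verdict-to-score mapping for ternary grading.
VERDICT_SCORE = {"Satisfied": 1.0, "Partially Satisfied": 0.5, "Not Satisfied": 0.0}

def task_score(criteria, verdicts, strict=False):
    """criteria: iterable of (criterion_id, weight); verdicts: dict id -> verdict label.

    strict=True collapses 'Partially Satisfied' to 'Not Satisfied' (binary scoring).
    """
    weighted, norm = 0.0, 0.0
    for cid, weight in criteria:
        s = VERDICT_SCORE[verdicts[cid]]
        if strict and s == 0.5:
            s = 0.0
        weighted += weight * s
        norm += abs(weight)
    return weighted / norm if norm else 0.0
```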
Fine-Grained Failure Analysis
Failure rates are decomposed by rubric category as

$$\mathrm{FailRate}(a) \;=\; \frac{\sum_{t \in \mathcal{T}_a} n_{a,t}}{\sum_{t \in \mathcal{T}_a} N_t},$$

where $n_{a,t}$ is the count of Not Satisfied criteria for category $a$ in task $t$, $N_t$ is the total number of failures in task $t$, and $\mathcal{T}_a$ is the set of tasks including category $a$.
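A sketch of the corresponding computation; the input structure (task id mapped to per-criterion (category, verdict) pairs) is an assumption for illustration.

```python
from collections import defaultdict

def failure_rates(graded):
    """graded: dict task_id -> list of (category, verdict) pairs for that task."""
    n_cat = defaultdict(int)    # sum over tasks of n_{a,t}: Not Satisfied counts per category
    n_total = defaultdict(int)  # sum over the same tasks of N_t: total failures in the task
    for results in graded.values():
        failures = [cat for cat, verdict in results if verdict == "Not Satisfied"]
        for cat in {cat for cat, _ in results}:  # only tasks that include category a
            n_cat[cat] += failures.count(cat)
            n_total[cat] += len(failures)
    return {cat: (n_cat[cat] / n_total[cat] if n_total[cat] else 0.0) for cat in n_cat}
```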
Human–Model Alignment
Agreement between human and LLM-based judges is measured by Macro-$F_1$ per rubric label; collapsing "Partially Satisfied" to "Not Satisfied" raises Macro-$F_1$ to ≈0.76, improving alignment by ≈20 percentage points.
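The agreement check amounts to a standard Macro-F1 over verdict labels, sketched below with scikit-learn; the label strings and function name are assumptions.

```python
from sklearn.metrics import f1_score

def judge_agreement(human_labels, judge_labels, collapse_partial=False):
    """Macro-F1 between human and LLM-judge verdicts over rubric criteria."""
    def norm(labels):
        return ["Not Satisfied" if (collapse_partial and lab == "Partially Satisfied") else lab
                for lab in labels]
    return f1_score(norm(human_labels), norm(judge_labels), average="macro")

# Collapsing "Partially Satisfied" into "Not Satisfied" makes the task binary,
# the setting reported to reach Macro-F1 of roughly 0.76.
```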
5. Experimental Evaluation and Failure Modes
System Performance
| Metric | Gemini DR | OpenAI DR | Perplexity DR |
|---|---|---|---|
| Avg Score (Ternary) | 0.677 | 0.664 | 0.566 |
| Avg Score (Strict Binary) | 0.615 | 0.597 | 0.487 |
On average, even state-of-the-art DR agents satisfy fewer than 68% of rubric criteria, with performance limited most notably by implicit-context handling and synthesis reasoning.
Diagnostic Insights
- Failure Decomposition: Implicit Reasoning and Synthesis account for ~45–50% of failures, while the explicit-requirement and communication axes each account for under 20%.
- Criteria Impact: Failures among mandatory criteria drive explicit and synthesis deficits; optional criteria relate mainly to missed implicit requirements.
- Complexity Sensitivity: Performance degrades monotonically with logical depth (a 20–25 point drop from shallow to deep tasks), with increased conceptual breadth, and with greater exploration/ambiguity.
- Length and Citation Tradeoffs: Longer responses correlate only modestly with rubric satisfaction (r ≈ 0.24–0.28). Citation breadth and precision trade off: Gemini cites ≈111 sources at ≈81% accuracy, whereas Perplexity cites ≈31 at ≈90%, so high citation volume is not achieved simultaneously with high citation precision.
6. Resources, Implementation, and Extensibility
ResearchRubrics releases all data and templates (101 prompts, 2,593 criteria, weights, evaluation code, judge-prompt scaffolding) at https://scale.com/research/researchrubrics for public use.
Research applications enabled include:
- Systematic benchmarking of new DR agents under multi-axis, expert-vetted rubrics.
- Dimensional analysis of agent failures (by complexity or rubric type).
- Extension of rubric sets to novel domains.
- Deployment of fully automated LLM-based judging or scalable human-in-the-loop evaluation.
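As one way such automated judging could be wired up, the sketch below builds a per-criterion judge prompt; the template wording and function names are placeholders and do not reproduce the benchmark's released judge-prompt scaffolding.

```python
# Placeholder judge-prompt scaffold: illustrative only, not the released templates.
JUDGE_TEMPLATE = """You are grading a deep-research report against a single rubric criterion.

Criterion (weight {weight}): {criterion}

Report:
{report}

Respond with exactly one label: Satisfied, Partially Satisfied, or Not Satisfied."""

def build_judge_prompt(criterion: str, weight: int, report: str) -> str:
    return JUDGE_TEMPLATE.format(criterion=criterion, weight=weight, report=report)
```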
7. Impact, Limitations, and Future Directions
ResearchRubrics establishes an expert-crafted, quantitatively precise, and extensible benchmark for deep research agent evaluation:
- Scalable Measurement: Supports nuanced, multi-dimensional evaluation at scale for long-form, cross-document agent responses.
- Diagnostic Utility: Provides insight into agent reasoning, factual grounding, synthesis shortcomings, and implicit inference gaps—information crucial for targeted system improvement.
- Limitations: The benchmark utilizes single-turn prompts; evolving DR agent architectures (e.g., multi-agent, tool-augmented, dynamically interactive) may require methodological adaptation. The evaluation pipeline relies on LLM-based judges whose bias patterns require ongoing monitoring.
The release sets a new standard for evaluating evidence-backed reasoning and synthesis in research assistants, promoting rigorous, domain-diverse, and interpretable agent assessment.