
RACE: Adaptive Reference-Based Evaluation Framework

Updated 9 February 2026
  • RACE is an adaptive framework that generates dynamic evaluation criteria and weights for assessing long-form LLM research outputs.
  • It employs a multi-phase methodology including dimension weighting, criterion generation, and reference scoring to benchmark candidate reports.
  • Empirical evaluations show RACE’s robust alignment with human judgments, outperforming static and ablated scoring baselines.

RACE (Reference-based Adaptive Criteria-driven Evaluation with Dynamic Weighting) is an automated framework designed to deliver human-aligned and task-sensitive evaluation of long-form research outputs generated by LLM-based research agents. RACE adaptively determines which quality dimensions are most critical for each task, dynamically generates granular sub-criteria, and evaluates candidate reports relative to validated high-quality references. Its distinguishing capability lies in leveraging both adaptive criteria generation and dynamic weighting to address the complexity and variability of open-ended research tasks, providing a robust metric that aligns well with expert human judgment (Du et al., 13 Jun 2025).

1. Formalization and Core Principles

RACE is explicitly defined to meet the need for principled, fair, and reproducible assessment of long-form outputs in scenarios where conventional static rubrics or binary checklists fail to capture granular differences in quality. The objectives are threefold: (a) Infer dimension weights that reflect the relative importance of distinct quality aspects (“Comprehensiveness,” “Insight/Depth,” “Instruction-Following,” “Readability”) for each research task, (b) Generate task-specific evaluation criteria within those dimensions, each with an internal sub-weight, and (c) Produce a holistic, relative quality score by benchmarking the candidate report against a carefully curated reference solution. This tri-level adaptivity ensures both domain alignment and discrimination between nuanced research agent outputs.

2. Mathematical Structure

Let $t$ denote the specific research task, $R_{\mathrm{tgt}}$ the candidate (target) research report, and $R_{\mathrm{ref}}$ a corresponding high-quality reference. RACE proceeds through the following sequential stages:

a. Dimension Weighting:

For each of four orthogonal dimensions $d \in \{\mathrm{Comp}, \mathrm{Depth}, \mathrm{Inst}, \mathrm{Read}\}$, $T$ independent LLM-based judgments yield weights $w_d^{(j)}$. Final dimension weights are

$$W_d = \frac{1}{T} \sum_{j=1}^{T} w_d^{(j)}, \qquad \sum_d W_d = 1.$$
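The averaging step can be sketched in a few lines of Python. The judge function below is a hypothetical stub standing in for one Judge-LLM trial (a real trial would prompt the model with the task); only the averaging logic reflects the formula above.

```python
import random

DIMENSIONS = ["Comp", "Depth", "Inst", "Read"]

def judge_weights(task: str, seed: int) -> dict:
    """Stub for one Judge-LLM trial: returns normalized dimension weights.
    (Hypothetical placeholder; a real trial would prompt the LLM with the task.)"""
    rng = random.Random(seed)
    raw = {d: rng.uniform(0.1, 1.0) for d in DIMENSIONS}
    total = sum(raw.values())
    return {d: v / total for d, v in raw.items()}

def average_weights(task: str, T: int = 5) -> dict:
    """W_d = (1/T) * sum_j w_d^(j); an average of normalized weight vectors
    is itself normalized, so sum_d W_d = 1 holds by construction."""
    trials = [judge_weights(task, seed=j) for j in range(T)]
    return {d: sum(t[d] for t in trials) / T for d in DIMENSIONS}
```

Because each trial's weights already sum to 1, no renormalization is needed after averaging.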

b. Criterion Generation and Scoring:

For each dimension $d$, the Judge LLM generates $K_d$ task-specific criteria $\{c_{d,k}\}$ with associated weights $w_{d,k}$, normalized such that

$$\sum_{k=1}^{K_d} w_{d,k} = 1.$$

The union of criteria across all dimensions forms the set of task criteria $\mathcal{C}_t$. Each report $R$ obtains a score $s_{R,c}$ for every $c \in \mathcal{C}_t$.

c. Aggregation:

For each dimension, aggregate report scores as

$$S_d(R) = \sum_{k=1}^{K_d} w_{d,k}\, s_{R, c_{d,k}}.$$

Global integration across dimensions:

$$S_{\mathrm{int}}(R) = \sum_d W_d\, S_d(R).$$

The normalized relative score, expressing target quality against the reference, is:

$$S_{\mathrm{final}}(R_{\mathrm{tgt}}) = \frac{S_{\mathrm{int}}(R_{\mathrm{tgt}})}{S_{\mathrm{int}}(R_{\mathrm{tgt}}) + S_{\mathrm{int}}(R_{\mathrm{ref}})}, \qquad S_{\mathrm{final}} \in [0, 1].$$
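A minimal sketch of the aggregation and normalization, with illustrative scores and weights (two dimensions only, for brevity; all numbers are hypothetical):

```python
def s_int(scores, crit_weights, dim_weights):
    """S_int(R) = sum_d W_d * S_d(R), where S_d(R) = sum_k w_{d,k} * s_{R,c_{d,k}}."""
    return sum(
        W * sum(w * s for w, s in zip(crit_weights[d], scores[d]))
        for d, W in dim_weights.items()
    )

def s_final(tgt, ref):
    """Normalized relative score in [0, 1]; 0.5 means parity with the reference."""
    return tgt / (tgt + ref)

# Illustrative numbers:
dim_W = {"Comp": 0.6, "Depth": 0.4}
crit_w = {"Comp": [0.5, 0.5], "Depth": [1.0]}
tgt_scores = {"Comp": [8, 6], "Depth": [7]}
ref_scores = {"Comp": [9, 7], "Depth": [8]}

tgt = s_int(tgt_scores, crit_w, dim_W)  # 0.6*(0.5*8 + 0.5*6) + 0.4*7 = 7.0
ref = s_int(ref_scores, crit_w, dim_W)  # 0.6*(0.5*9 + 0.5*7) + 0.4*8 = 8.0
print(round(s_final(tgt, ref), 3))      # 7/15 ≈ 0.467
```

Because the final score is a ratio against the reference, it is insensitive to the absolute scale of the raw criterion scores.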

3. Reference Report Selection and Scoring Protocol

Reference reports $R_{\mathrm{ref}}$ originate from a rigorously validated corpus, specifically Gemini-2.5-Pro Deep Research outputs (April 2025 snapshot), which were independently verified for quality. Before evaluation, citation markers are removed via a cleaning prompt, ensuring that the Judge LLM does not anchor scoring on superficial citation frequency or format. Both $R_{\mathrm{tgt}}$ and $R_{\mathrm{ref}}$, along with the task description and generated criteria, are simultaneously presented to the Judge LLM in a single prompt, ensuring direct pairwise criterion-level scoring.
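As a rough illustration of the citation-cleaning step, the regex below strips numeric bracketed markers such as `[3]` or `[1, 2]`. Note this is only an approximation for intuition: the framework itself uses an LLM cleaning prompt, not a regex.

```python
import re

def strip_citation_markers(text: str) -> str:
    """Remove numeric bracketed citation markers like [12] or [3, 7], along with
    any whitespace immediately before them. (Illustrative approximation; RACE
    performs this cleaning with an LLM prompt rather than a regex.)"""
    return re.sub(r"\s*\[\d+(?:,\s*\d+)*\]", "", text)
```

This keeps the surrounding prose intact so the judge evaluates content rather than citation density.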

4. Algorithmic Workflow

The RACE framework operates as follows:

Input: Task t, target report R_tgt, reference report R_ref,
       JudgeLLM model, #trials T.
Output: Relative quality score S_final(R_tgt).

1. Dimension Weighting:
   For j in 1..T:
       Prompt JudgeLLM to assign weights w_d^(j) for d in {Comp, Depth, Inst, Read}.
   Compute W_d = (1/T) * sum_j w_d^(j).

2. Criterion Generation:
   For each dimension d:
       Prompt JudgeLLM to produce K_d criteria {c_{d,k}} with weights w_{d,k}, sum to 1.

3. Reference-based Scoring:
   Build full criterion list C_t = union of all c_{d,k}.
   Prompt JudgeLLM to score (R_tgt, R_ref) on each c in C_t,
       yielding {s_{tgt,c}} and {s_{ref,c}}.

4. Score Aggregation:
   For any report R (tgt or ref):
       For each d: S_d(R) = sum_k w_{d,k} * s_{R, c_{d,k}}.
       S_int(R) = sum_d W_d * S_d(R).
   S_final(R_tgt) = S_int(R_tgt) / [S_int(R_tgt) + S_int(R_ref)].

Return S_final(R_tgt)
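The pseudocode above can be wired into a minimal runnable sketch. The `StubJudge` below hard-codes placeholder outputs where a real implementation would prompt the Judge LLM; the weight values and criterion names are hypothetical.

```python
# Minimal end-to-end sketch of the RACE workflow with a stubbed Judge LLM.
DIMS = ["Comp", "Depth", "Inst", "Read"]

class StubJudge:
    def weights(self, task):
        # Placeholder for one dimension-weighting trial (normalized weights).
        return {"Comp": 0.30, "Depth": 0.35, "Inst": 0.20, "Read": 0.15}

    def criteria(self, task, dim):
        # Placeholder criteria: [(name, weight)], weights summing to 1 per dimension.
        return [(f"{dim}-crit-1", 0.6), (f"{dim}-crit-2", 0.4)]

    def score(self, task, criteria, r_tgt, r_ref):
        # Placeholder for the single pairwise call scoring both reports (0-10).
        s_tgt = {(d, c): 7 for d in criteria for c, _ in criteria[d]}
        s_ref = {(d, c): 8 for d in criteria for c, _ in criteria[d]}
        return s_tgt, s_ref

def run_race(task, r_tgt, r_ref, judge, T=5):
    trials = [judge.weights(task) for _ in range(T)]           # 1. dimension weighting
    W = {d: sum(t[d] for t in trials) / T for d in DIMS}
    criteria = {d: judge.criteria(task, d) for d in DIMS}      # 2. criterion generation
    s_tgt, s_ref = judge.score(task, criteria, r_tgt, r_ref)   # 3. pairwise scoring
    def s_int(s):                                              # 4. aggregation
        return sum(W[d] * sum(w * s[(d, c)] for c, w in criteria[d]) for d in DIMS)
    return s_int(s_tgt) / (s_int(s_tgt) + s_int(s_ref))

score = run_race("EV charging feasibility", "target text", "reference text", StubJudge())
print(round(score, 3))  # all criteria 7 vs. 8 -> 7/15 ≈ 0.467
```

Swapping `StubJudge` for a class that issues real LLM calls (with the published prompt templates) recovers the full pipeline.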

5. Hyperparameters and Practical Considerations

Key tunable parameters include:

  • $T$ (number of weight-inference trials), typically $T=5$ to balance stability and compute cost,
  • $K_d$ (criteria per dimension), generally 3–5, adaptively specified via Judge LLM prompting,
  • LLM choice and cost–performance tradeoff: Gemini-2.5-Pro for RACE, Gemini-2.5-Flash for related FACT benchmarking,
  • Prompt temperature is kept low to minimize variability, and max-token limits are strictly set to ensure uniform LLM behavior.

The scoring pipeline is governed by published prompt templates for dimension weighting, criterion generation, citation cleaning, and score collection.
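For intuition, a pairwise scoring prompt in this style might be assembled as below. The field names and layout are purely illustrative; the published templates in the repository are authoritative.

```python
# Hypothetical skeleton of the pairwise scoring prompt (illustrative only).
SCORING_PROMPT = """\
Task description:
{task}

Evaluation criteria (score each report 0-10 on every criterion):
{criteria}

Report A (target):
{target}

Report B (reference):
{reference}

Return JSON: {{"A": {{"<criterion>": <score>}}, "B": {{"<criterion>": <score>}}}}
"""

prompt = SCORING_PROMPT.format(
    task="Assess the feasibility of investing in EV charging infrastructure.",
    criteria="- Market Demand Analysis (weight 0.6)\n"
             "- Regulatory Environment Coverage (weight 0.4)",
    target="<cleaned target report>",
    reference="<cleaned reference report>",
)
```

Presenting both reports in one prompt, as the protocol specifies, lets the judge score them side by side on identical criteria.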

6. Case Studies and Empirical Performance

Exemplar tasks, such as evaluating the "feasibility of investing in EV charging infrastructure," illustrate RACE’s workflow: Judge LLM assigns nuanced global weights (e.g., Insight 0.35, Comprehensiveness 0.30), generates granular criteria (e.g., “Market Demand Analysis,” “Regulatory Environment Coverage”), and appropriately sub-weights them within the dimensions. Quantitative experiments based on 50 Chinese-language tasks revealed RACE achieves a Pairwise Agreement Rate (PAR) of 71.33% with human judgments, surpassing average inter-human agreement of 68.44%. In the main DeepResearch Bench leaderboard, RACE places Gemini-2.5-Pro Deep Research at an overall score of 48.88—slightly ahead of OpenAI Deep Research (46.98).

RACE was found to outperform ablation baselines lacking criterion weights, dimension weights, or reference-based benchmarking. Correspondence with human assessments was validated via Pearson correlation (99.54%, versus 98.9% for a vanilla LLM judge) and Spearman correlation (59.12% versus 43.8%). These results demonstrate both robust overall and discriminative agreement with expert evaluations.

7. Implementation, Reproducibility, and Open Resources

RACE is fully open source, with code, reference reports, evaluation scripts, and prompt templates provided at https://github.com/Ayanami0730/deep_research_bench. The Judge LLM and workflow API configurations are explicitly documented to support consistent reproduction. All referenced hyperparameters, scoring templates, and task loader mechanisms are accessible. The modular pipeline facilitates extension, embedding into agent training, or further meta-evaluations of LLM-based research workflows (Du et al., 13 Jun 2025).
