ADR-Bench: Chinese Deep Research Eval

Updated 24 December 2025
  • ADR-Bench is a benchmark suite that evaluates long-horizon, open-ended deep research tasks in Chinese by leveraging diverse query sets and dual scoring protocols.
  • The evaluation protocol employs pairwise human comparisons and automated expert rubric scoring to assess metrics such as information completeness, content depth, and requirement fitness.
  • Ablation studies confirm that mid-training on atomic capability data significantly enhances performance, improving overall WinRate and response depth.

ADR-Bench is a benchmark suite designed to rigorously evaluate long-horizon, open-ended, deep research capabilities of LLMs, with a particular focus on real-user, cross-domain, and expert-driven tasks in the Chinese language. Developed in the context of autonomous agentic LLM research, it addresses substantial gaps in previous academic benchmarks for research-level tasks by introducing a structurally diverse query set, multi-faceted evaluation protocols, and robust rubrics that emphasize both breadth and task-specific depth (Hu et al., 23 Dec 2025).

1. Motivation, Context, and Gap Analysis

ADR-Bench emerged from a systematic analysis of existing “deep research” LLM benchmarks, such as BrowseComp and HLE, which primarily emphasize short-horizon, fact-based retrieval or closed-book examination modes. These benchmarks do not model the complex, multi-stage workflows, multi-source synthesis, and nuanced report writing representative of actual research processes. Further limitations included a strong English-language bias, insufficient task diversity, and the absence of scalable, user-centered evaluation frameworks yielding meaningful comparative signals.

To address these deficiencies, ADR-Bench was developed with three stated objectives:

  • Chinese-language coverage spanning both general and specialized domains;
  • Dual evaluation mechanisms: human preference-based and automated expert rubric scoring;
  • Use of pairwise (Elo-based) comparative methods to yield preference-aligned model rankings (Hu et al., 23 Dec 2025).

2. Dataset Construction and Domain Scope

ADR-Bench comprises 110 tasks partitioned across nine domains, reflecting both typical and niche research use cases. Specialized domains comprise Law (20 items) and Finance & Business (20 items), formulated and rubric-validated by domain specialists. Seven additional domains (Computer & IT, Education, Science & Engineering, Social Life, Literature & Arts, Healthcare, and Politics) contribute 10 items each, focusing on queries extracted from actual high-quality user logs.

All tasks are authored in Chinese and engineered as multi-stage “deep research” challenges. For each task, especially in specialized domains, a detailed, atomic, and verifiable rubric is provided to enable precise evaluation of the agent’s response.

Table 1. ADR-Bench Domain Distribution

| Domain | Items |
|---|---|
| Law | 20 |
| Finance & Business | 20 |
| Computer & IT | 10 |
| Education | 10 |
| Science & Engineering | 10 |
| Social Life | 10 |
| Literature & Arts | 10 |
| Healthcare | 10 |
| Politics | 10 |

3. Task Typology and Evaluation Rubrics

Each ADR-Bench item requires demonstrating the “atomic capabilities” essential to open-ended research (a schematic sketch of how these might compose into an agent workflow follows the list):

  • Intent understanding and task decomposition;
  • Long-horizon planning and decision-making;
  • Tool-mediated information gathering and multi-source verification;
  • Reflection and self-correction;
  • Synthesis into a structured, coherent report.
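
The following is a minimal, hypothetical sketch of how these atomic capabilities might compose into a single agent loop. All names (ResearchState, run_deep_research) and heuristics are illustrative placeholders under stated assumptions; they do not reproduce the Step-DeepResearch implementation or the ADR-Bench harness.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchState:
    query: str
    plan: list[str] = field(default_factory=list)      # open sub-tasks
    evidence: list[str] = field(default_factory=list)  # gathered findings
    report: str = ""


def run_deep_research(query: str, max_steps: int = 8) -> str:
    state = ResearchState(query=query)
    # (1) Intent understanding and task decomposition (placeholder heuristic).
    state.plan = [f"sub-question {i} about: {query}" for i in range(1, 4)]
    for _ in range(max_steps):
        if not state.plan:
            break
        # (2) Long-horizon planning: choose the next open sub-task.
        subtask = state.plan.pop(0)
        # (3) Tool-mediated gathering: a real agent would call search/browse tools here.
        finding = f"[source] stub finding for: {subtask}"
        # (4) Reflection / self-correction: re-queue the sub-task if evidence looks thin.
        if len(finding) < 10:
            state.plan.append(subtask)
        else:
            state.evidence.append(finding)
    # (5) Synthesis into a structured, coherent report.
    state.report = "\n".join(["# Research Report", f"Query: {query}", *state.evidence])
    return state.report


if __name__ == "__main__":
    print(run_deep_research("对比主流开源深度研究代理的评测方法"))
```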

Two complementary evaluation regimens are utilized:

  • General-domain evaluation (70 items): Blind, pairwise human comparisons across four dimensions: Information Completeness, Content Depth, Requirement Fitness (alignment and accuracy), and Readability & Presentation. Scoring is ordinal and comparative (“left/right/both good/fair/poor”).
  • Specialized-domain evaluation (40 items): Each response is evaluated by an LLM judge against an expert-authored rubric, with each atomic requirement assigned a binary score $s_i$ and a weight $w_i$. The total rubric score is

$$S_{\rm rubric} = \sum_{i=1}^{N} w_i s_i, \qquad \hat S = \frac{S_{\rm rubric}}{\sum_i |w_i|} \in [0,1]$$

Strict zeroing penalties are implemented for “unpardonable” errors.
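
A minimal sketch of how this rubric score and its normalization could be computed, assuming a simple representation of rubric items; the RubricItem structure, the `unpardonable` flag modeling the strict zeroing penalty, and the example weights are assumptions for illustration, not artifacts released with ADR-Bench.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    weight: float               # w_i, expert-assigned importance
    satisfied: bool             # s_i in {0, 1}, as judged by the LLM evaluator
    unpardonable: bool = False  # violating this requirement zeroes the whole score


def rubric_score(items: list[RubricItem]) -> tuple[float, float]:
    """Return (S_rubric, normalized S_hat in [0, 1])."""
    # Strict zeroing penalty: any violated "unpardonable" requirement nullifies the score.
    if any(it.unpardonable and not it.satisfied for it in items):
        return 0.0, 0.0
    s_rubric = sum(it.weight * it.satisfied for it in items)
    s_hat = s_rubric / sum(abs(it.weight) for it in items)
    return s_rubric, s_hat


# Example: three atomic requirements, one unmet, one marked unpardonable but satisfied.
items = [RubricItem(2.0, True), RubricItem(1.0, False), RubricItem(3.0, True, unpardonable=True)]
print(rubric_score(items))  # -> (5.0, ~0.833)
```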

4. Evaluation Protocol

General-domain results are aggregated as (Wins, Ties, Losses) from comparative judgments and can be mapped to Elo ratings using:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

with overall WinRate defined as

$$\text{WinRate} = \frac{\#\text{Wins}}{\#\text{Wins} + \#\text{Losses}}$$

Specialized-domain scores are absolute, with each item’s rubric total reported. All rubric requirements are designed to be highly objective and atomically granular.
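
A sketch, under the definitions above, of how blind pairwise judgments could be tallied into (Wins, Ties, Losses), summarized as a WinRate, and mapped to Elo ratings via the expected-score formula. The judgment labels and the K-factor are assumptions for illustration, not values specified by ADR-Bench.

```python
from collections import Counter


def win_rate(wins: int, losses: int) -> float:
    """Ties are excluded, matching the WinRate definition above."""
    return wins / (wins + losses)


def elo_expected(r_a: float, r_b: float) -> float:
    """E_A = 1 / (1 + 10^((R_B - R_A)/400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 16.0) -> tuple[float, float]:
    """score_a: 1.0 for a win by A, 0.5 for a tie, 0.0 for a loss; k is an assumed K-factor."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))


# Example: tally blind pairwise judgments ("A", "B", or "tie") for one model pair.
judgments = ["A", "A", "tie", "B", "A"]
tally = Counter(judgments)
print(win_rate(tally["A"], tally["B"]))          # 0.75
print(elo_update(1500.0, 1500.0, score_a=1.0))   # A gains rating, B loses the same amount
```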

5. Key Experimental Results

Step-DeepResearch (StepDR), a medium-sized LLM agent trained with atomic-capability mid-training and progressive reinforcement, was benchmarked on ADR-Bench alongside state-of-the-art (SOTA) closed-source and open-source models.

Table 2. General-Domain, Pairwise Win/Tie/Loss

| Comparison | Wins | Ties | Losses |
|---|---|---|---|
| StepDR vs. no-midtrain | 30 | 19 | 21 |
| StepDR vs. Gemini | 22 | 12 | 36 |
| StepDR vs. OpenAI DR | 25 | 11 | 34 |

StepDR achieves a cumulative Win+Tie rate of 67.1% against top-tier closed-source agents and a direct WinRate of approximately 63% in the general-domain subset.

In specialized domains, rubric-based expert LLM evaluation stratifies performance as follows:

Table 3. Specialized-Domain (Finance & Law) Tiers

| Tier | Score Range | Systems |
|---|---|---|
| 1 | 25–35 | Gemini DeepResearch |
| 2 | 15–25 | Step-DeepResearch, Kimi, K2, OpenAI DR |
| 3 | 0–15 | Qwen DR, MiniMax-M2/Pro, GLM-4.6 |

Fine-grained human dimension-wise WinRates for StepDR are: Information Completeness 58%, Content Depth 54%, Requirement Fitness 61%, Presentation 57%.

6. Ablation Studies and Qualitative Analysis

Ablation studies demonstrate that mid-training on atomic-capability data with a 32K-token context window is critical to high performance. Removing it reduces WinRate from 63% to 48%, with major declines in Completeness and Depth. Qualitative case studies show that StepDR with full mid-training produces more comprehensive, intent-aligned, tabular analyses covering more works per query than the ablated variant.

In temporal reasoning, explicit filtering of hallucinated past-timestamp content improved timeliness by 20%. Additionally, refining the writing style (minimizing fragmentary lists) increased Content Depth ratings by 4%.

Table 4. Case Study: Research Task Coverage

| Dimension | Full StepDR | w/o Mid-Training |
|---|---|---|
| Fitness | Full, prioritized pipelines | Partial, misses key pipelines |
| Completeness | 12+ works listed | 6 works listed |
| Depth | Concrete code/tables | Narrative only |

7. Significance and Distinguishing Features

ADR-Bench delivers a multi-faceted Chinese deep research evaluation platform that surpasses prior work in practical coverage, domain diversity, and evaluative rigor. Its dual evaluation strategy, combining comparative human judgments with atomic, rubric-based expert scoring, yields robust, actionable insights for LLM development. ADR-Bench is the first to (a) supply a large-scale, real-user, open-ended deep research test suite in Chinese, (b) close the gap between evaluation and user preference via Elo-style pairwise aggregation, and (c) quantitatively demonstrate the efficacy of atomic-capability mid-training for autonomous LLM researchers (Hu et al., 23 Dec 2025).
