ADR-Bench: Chinese Deep Research Eval
- ADR-Bench is a benchmark suite that evaluates long-horizon, open-ended deep research tasks in Chinese by leveraging diverse query sets and dual scoring protocols.
- The evaluation protocol employs pairwise human comparisons and automated expert rubric scoring to assess metrics such as information completeness, content depth, and requirement fitness.
- Ablation studies confirm that mid-training on atomic capability data significantly enhances performance, improving overall WinRate and response depth.
ADR-Bench is a benchmark suite designed to rigorously evaluate long-horizon, open-ended, deep research capabilities of LLMs, with a particular focus on real-user, cross-domain, and expert-driven tasks in the Chinese language. Developed in the context of autonomous agentic LLM research, it addresses substantial gaps in previous academic benchmarks for research-level tasks by introducing a structurally diverse query set, multi-faceted evaluation protocols, and robust rubrics that emphasize both breadth and task-specific depth (Hu et al., 23 Dec 2025).
1. Motivation, Context, and Gap Analysis
ADR-Bench emerged from a systematic analysis of existing “deep research” LLM benchmarks, such as BrowseComp and HLE, which primarily emphasize short-horizon, fact-based retrieval or closed-book examination modes. These benchmarks do not model the complex, multi-stage workflows, multi-source synthesis, and nuanced report writing representative of actual research processes. Further limitations included a strong English-language bias, insufficient task diversity, and the absence of scalable, user-centered evaluation frameworks yielding meaningful comparative signals.
To address these deficiencies, ADR-Bench was developed with three stated objectives:
- Chinese-language coverage spanning both general and specialized domains;
- Dual evaluation mechanisms: human preference-based and automated expert rubric scoring;
- Use of pairwise (Elo-based) comparative methods to yield preference-aligned model rankings (Hu et al., 23 Dec 2025).
2. Dataset Construction and Domain Scope
ADR-Bench comprises 110 tasks partitioned across nine domains, reflecting both typical and niche research use cases. Specialized domains comprise Law (20 items) and Finance & Business (20 items), formulated and rubric-validated by domain specialists. Seven additional domains (Computer & IT, Education, Science & Engineering, Social Life, Literature & Arts, Healthcare, and Politics) contribute 10 items each, focusing on queries extracted from actual high-quality user logs.
All tasks are authored in Chinese and engineered as multi-stage “deep research” challenges. For each task, especially in specialized domains, a detailed, atomic, and verifiable rubric is provided to enable precise evaluation of the agent’s response.
Table 1. ADR-Bench Domain Distribution
| Domain | Items (Count) |
|---|---|
| Law | 20 |
| Finance & Business | 20 |
| Computer & IT | 10 |
| Education | 10 |
| Science & Engineering | 10 |
| Social Life | 10 |
| Literature & Arts | 10 |
| Healthcare | 10 |
| Politics | 10 |
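To make the task and rubric structure described above concrete, the following is a minimal, hypothetical schema sketch in Python; the class names, field names, and example content are illustrative assumptions rather than the released ADR-Bench format.

```python
# Hypothetical sketch of how an ADR-Bench item might be represented.
# Field names and example content are illustrative, not the released schema.
from dataclasses import dataclass, field


@dataclass
class RubricItem:
    requirement: str            # atomic, verifiable requirement authored by a domain expert
    weight: float               # relative importance of the requirement
    unpardonable: bool = False  # if violated, the whole response is zeroed


@dataclass
class ADRBenchTask:
    task_id: str
    domain: str                 # e.g. "Law", "Finance & Business", "Healthcare"
    query: str                  # Chinese, multi-stage deep research prompt
    rubric: list[RubricItem] = field(default_factory=list)  # empty for general-domain items


# Illustrative example (not an actual benchmark item):
example = ADRBenchTask(
    task_id="law-007",
    domain="Law",
    query="梳理近五年数据跨境流动的监管框架，并比较中欧合规路径。",
    rubric=[
        RubricItem("Cites the governing statute and its effective date", weight=2.0),
        RubricItem("Contains no fabricated case citations", weight=1.0, unpardonable=True),
    ],
)
```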
3. Task Typology and Evaluation Rubrics
Each ADR-Bench item requires demonstrating the “atomic capabilities” essential to open-ended research (a minimal, hypothetical loop exercising them is sketched after this list):
- Intent understanding and task decomposition;
- Long-horizon planning and decision-making;
- Tool-mediated information gathering and multi-source verification;
- Reflection and self-correction;
- Synthesis into a structured, coherent report.
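The sketch below illustrates, in toy form, the research loop these atomic capabilities imply: decomposition, planning over sub-questions, tool-mediated gathering, verification, reflection (retrying until evidence suffices), and report synthesis. Every helper is a hypothetical stand-in, not StepDR's actual implementation.

```python
# Toy, hypothetical sketch of a deep research loop; all helpers are stand-ins.
def decompose(query):            # intent understanding and task decomposition
    return [f"{query}: sub-question {i}" for i in (1, 2)]

def search(sub_question):        # tool-mediated information gathering (stubbed)
    return [f"source A on '{sub_question}'", f"source B on '{sub_question}'"]

def verify(snippets):            # multi-source verification (stubbed filter)
    return [s for s in snippets if s.startswith("source")]

def write_report(findings):      # synthesis into a structured report
    return "## Findings\n" + "\n".join(f"- {f}" for f in findings)

def deep_research(query, max_rounds=3):
    findings = []
    for sub_q in decompose(query):        # long-horizon plan over sub-questions
        for _ in range(max_rounds):       # reflection: retry until evidence is found
            evidence = verify(search(sub_q))
            if evidence:
                findings.extend(evidence)
                break
    return write_report(findings)

print(deep_research("近五年数据跨境流动监管框架"))
```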
Two complementary evaluation regimens are utilized:
- General-domain evaluation (70 items): Blind, pairwise human comparisons across four dimensions: Information Completeness, Content Depth, Requirement Fitness (alignment and accuracy), and Readability & Presentation. Scoring is ordinal and comparative (left better / right better / both good / both fair / both poor).
- Specialized-domain evaluation (40 items): Each response is evaluated by an LLM judge against an expert-authored rubric, with each atomic requirement $i$ assigned a binary score $s_i \in \{0, 1\}$ and a weight $w_i$. The total rubric score is $S = \sum_i w_i s_i$.
Strict zeroing penalties are applied for “unpardonable” errors; a minimal scoring sketch follows below.
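The following minimal sketch computes the rubric score just described, assuming the judge supplies the binary scores $s_i$ and the expert rubric supplies the weights $w_i$; the function and argument names are illustrative.

```python
# Minimal sketch of the specialized-domain rubric score S = sum_i w_i * s_i,
# with the strict zeroing penalty for "unpardonable" errors.
def rubric_score(weights, binary_scores, unpardonable_violated=False):
    """weights: list of w_i; binary_scores: list of s_i in {0, 1}."""
    if unpardonable_violated:       # strict zeroing penalty
        return 0.0
    return sum(w * s for w, s in zip(weights, binary_scores))

# Illustrative example: three requirements with weights 2, 1, 1; the second is unmet.
print(rubric_score([2.0, 1.0, 1.0], [1, 0, 1]))  # -> 3.0
```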
4. Evaluation Protocol
General-domain results are aggregated as (Wins, Ties, Losses) from comparative judgments and can be mapped to Elo ratings using the standard expected-score update
$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A \leftarrow R_A + K\,(S_A - E_A),$$
where $S_A$ is 1 for a win, 0.5 for a tie, and 0 for a loss, with overall WinRate defined as the fraction of pairwise comparisons won:
$$\mathrm{WinRate} = \frac{W}{W + T + L}.$$
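A minimal sketch of this aggregation, assuming the standard Elo expected-score update; the K-factor, initial ratings, and exact tie handling are illustrative choices rather than the paper's reported configuration.

```python
# Sketch: map pairwise (win/tie/loss) outcomes to Elo ratings and summary rates.
def expected(r_a, r_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, a, b, outcome, k=32):
    """outcome: 1.0 if a wins, 0.5 for a tie, 0.0 if a loses (illustrative K=32)."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += k * (outcome - e_a)
    ratings[b] += k * ((1.0 - outcome) - (1.0 - e_a))

def win_rate(wins, ties, losses):
    """Fraction of pairwise comparisons won outright (the 'direct' WinRate)."""
    return wins / (wins + ties + losses)

def win_tie_rate(wins, ties, losses):
    """Fraction of comparisons won or tied (the cumulative Win+Tie rate)."""
    return (wins + ties) / (wins + ties + losses)

ratings = {"StepDR": 1000.0, "Gemini": 1000.0}
update_elo(ratings, "StepDR", "Gemini", outcome=1.0)   # one pairwise win for StepDR
print(win_rate(30, 19, 21), win_tie_rate(30, 19, 21))  # e.g. Table 2's first row
```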
Specialized-domain scores are absolute, with each item’s rubric total reported. All rubric requirements are written to be highly objective and atomically granular.
5. Key Experimental Results
Step-DeepResearch (StepDR), a medium-sized LLM agent trained with atomic-capability mid-training and progressive reinforcement, was benchmarked on ADR-Bench alongside state-of-the-art (SOTA) closed-source and open-source models.
Table 2. General-Domain, Pairwise Win/Tie/Loss
| Comparison | Wins | Ties | Losses |
|---|---|---|---|
| StepDR vs. no-midtrain | 30 | 19 | 21 |
| StepDR vs. Gemini | 22 | 12 | 36 |
| StepDR vs. OpenAI DR | 25 | 11 | 34 |
StepDR achieves a cumulative Win+Tie rate of 67.1% against top-tier closed-source agents and a direct WinRate of approximately 63% in the general-domain subset.
In specialized domains, rubric-based expert LLM evaluation stratifies performance as follows:
Table 3. Specialized-Domain (Finance & Law) Tiers
| Tier | Score Range | Systems |
|---|---|---|
| 1 | 25–35 | Gemini DeepResearch |
| 2 | 15–25 | Step-DeepResearch, Kimi, K2, OpenAI DR |
| 3 | 0–15 | Qwen DR, MiniMax-M2/Pro, GLM-4.6 |
Fine-grained human dimension-wise WinRates for StepDR are: Information Completeness 58%, Content Depth 54%, Requirement Fitness 61%, Presentation 57%.
6. Ablation Studies and Qualitative Analysis
Ablation studies demonstrate that mid-training on atomic-capability data within a 32K-token context window is critical for high performance. Removing it reduces WinRate from 63% to 48%, with major declines in Completeness and Depth. Qualitative case studies show that StepDR with full mid-training produces more comprehensive, intent-aligned, tabular analyses covering more works per query than the ablated variant.
In temporal reasoning, explicit filtering of hallucinated past-timestamp content improved timeliness by 20%. Refining the writing style (minimizing fragmentary lists) increased Content Depth ratings by 4%.
Table 4. Case Study: Research Task Coverage
| Dimension | Full StepDR | w/o Mid-Training |
|---|---|---|
| Fitness | Full, prioritized pipelines | Partial, misses key pipelines |
| Completeness | 12+ works listed | 6 works listed |
| Depth | Concrete code/tables | Narrative only |
7. Significance and Distinguishing Features
ADR-Bench delivers a multi-faceted, Chinese deep research evaluation platform superior in practical coverage, domain diversity, and evaluative rigor when compared with prior work. Its dual evaluation strategy, combining comparative human judgments with atomic rubric-based expert scoring, yields robust, actionable insights for LLM development. ADR-Bench is the first to (a) supply a large-scale, real-user, open-ended deep research test suite in Chinese, (b) close the gap between evaluation signals and user preferences via Elo-style pairwise aggregation, and (c) quantitatively demonstrate the efficacy of atomic-capability mid-training for autonomous LLM researchers (Hu et al., 23 Dec 2025).