DeepResearch-9K: Deep Web Exploration Benchmark
- DeepResearch-9K is a benchmark designed for LLM-based deep-research agents to autonomously perform long-horizon web exploration and multi-step reasoning.
- It utilizes a four-stage synthesis pipeline with progressive entity obfuscation and controlled reasoning-chain depth to create three distinct difficulty levels (L1–L3).
- Integrated with the DeepResearch-R1 framework, it supports reinforcement learning paradigms and teacher-guided trajectory supervision for verifiable answer production.
DeepResearch-9K is a benchmark and training resource for deep-research agents: LLM-based systems that autonomously perform long-horizon web exploration, repeated tool use, and multi-step reasoning to answer difficult, obfuscated questions. Introduced together with the DeepResearch-R1 framework, it contains 9,000 questions spanning three difficulty levels from L1 to L3, high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, and verifiable answers (Wu et al., 1 Mar 2026).
1. Definition and task model
In the DeepResearch-9K formulation, a deep-research agent is an LLM-based agent that interacts with the live or simulated web via search tools, repeatedly and adaptively; breaks a complex, often obfuscated query into subproblems and plans multi-step search trajectories; integrates evidence across multiple sources, verifies hypotheses, and converges on a precise, verifiable answer; and maintains an explicit internal reasoning chain via <Think> blocks that guides external actions (Wu et al., 1 Mar 2026).
This task framing is more demanding than standard multi-hop QA over a static corpus. In datasets such as HotpotQA or MuSiQue, a system can often succeed with a small number of retrieval steps from a fixed Wikipedia-like collection. DeepResearch-9K instead targets open, noisy, and incomplete search environments in which agents must cope with ambiguous descriptions, missing explicit names, and long chains of related entities spread across many pages (Wu et al., 1 Mar 2026).
The benchmark is also explicitly tied to tool-use difficulty. Difficulty is not treated as a purely semantic property of the question; it is operationalized through reasoning-chain depth, entity obfuscation, and the number of required search queries. This design places DeepResearch-9K in the same general family as newer deep-research benchmarks, but with an explicit emphasis on trajectory-supervised agent training and verifiable answer production (Wu et al., 1 Mar 2026).
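The interaction pattern described in this section can be sketched as a simple loop that alternates reasoning and search and counts tool calls. This is a minimal illustration, not the paper's implementation: `Step`, `run_episode`, `search_count`, and the toy policy and search functions are hypothetical names, and a real agent would call an LLM and a search API instead of the stand-ins used here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One turn of a trajectory (hypothetical structure)."""
    thought: str                   # internal reasoning for this turn
    query: Optional[str] = None    # search query issued, if any
    observation: str = ""          # text returned by the search tool
    answer: Optional[str] = None   # set only on the final turn


# A policy maps (question, history) to (thought, query, answer).
Policy = Callable[[str, List[Step]], Tuple[str, Optional[str], Optional[str]]]


def run_episode(question: str, policy: Policy,
                search: Callable[[str], str], max_turns: int = 8) -> List[Step]:
    """Alternate reasoning and search until the policy answers or turns run out."""
    history: List[Step] = []
    for _ in range(max_turns):
        thought, query, answer = policy(question, history)
        if answer is not None:
            history.append(Step(thought=thought, answer=answer))
            break
        observation = search(query) if query else ""
        history.append(Step(thought=thought, query=query, observation=observation))
    return history


def search_count(history: List[Step]) -> int:
    """Number of search-tool invocations in a trajectory."""
    return sum(1 for step in history if step.query is not None)


# Toy stand-ins for an LLM policy and a search backend (assumptions).
def toy_policy(question, history):
    if len(history) < 2:
        turn = len(history) + 1
        return (f"Need more evidence (turn {turn})", f"lookup {turn}", None)
    return ("Evidence is sufficient", None, "Beijing")


def toy_search(query: str) -> str:
    return f"results for: {query}"


trajectory = run_episode("Which city is the capital of China?", toy_policy, toy_search)
print(search_count(trajectory), trajectory[-1].answer)
```

In this toy run the policy searches twice and then answers, so the trajectory has three steps and a search count of two.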
2. Autonomous construction pipeline and difficulty structure
DeepResearch-9K is built from three open-source multi-hop QA datasets: 2WikiMultihopQA, HotpotQA, and MuSiQue. The construction process samples 1,000 instances from each source dataset, producing an initial pool of 3,000 seeds. Using DeepSeek-V3, the pipeline extracts key entities from questions and contexts—names, locations, dates, and other salient identifiers—and assembles them into a multi-hop relational graph from which paths are selected to control reasoning horizon and difficulty (Wu et al., 1 Mar 2026).
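To make the role of the relational graph concrete, the sketch below enumerates simple paths with a fixed number of hops in a small entity graph. It is an illustration under assumptions, not the pipeline's actual code: the adjacency-dict representation, the function name `paths_with_hops`, and the toy entities are all hypothetical.

```python
from typing import Dict, List


def paths_with_hops(graph: Dict[str, List[str]], start: str, hops: int) -> List[List[str]]:
    """Return all simple paths (no repeated entity) from `start` with exactly `hops` edges."""
    results: List[List[str]] = []

    def extend(path: List[str]) -> None:
        if len(path) - 1 == hops:
            results.append(list(path))
            return
        for neighbor in graph.get(path[-1], []):
            if neighbor not in path:  # keep the chain free of cycles
                path.append(neighbor)
                extend(path)
                path.pop()

    extend([start])
    return results


# Toy entity graph; edges stand for extracted relations (illustrative data only).
toy_graph = {
    "film": ["director", "studio"],
    "director": ["birthplace", "film"],
    "birthplace": ["country"],
    "studio": ["country"],
    "country": [],
}

short_chains = paths_with_hops(toy_graph, "film", hops=2)
long_chains = paths_with_hops(toy_graph, "film", hops=3)
print(short_chains)
print(long_chains)
```

Choosing a larger `hops` value yields longer chains, which is one simple way a path-selection step could control the reasoning horizon.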
The synthesis pipeline has four stages. First, seed entity identification and extraction builds the entity graph. Second, progressive reasoning chain construction defines the task structure for L1, L2, and L3. Third, progressive entity obfuscation replaces explicit names, dates, and locations with increasingly indirect descriptions. Fourth, rule-based quality assurance enforces construction guidelines, chaining constraints, and format consistency. The reported construction cost is approximately $200 in Serper API fees, plus approximately 8,064 A100 hours for teacher-agent generation and evaluation (Wu et al., 1 Mar 2026).
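The specific rules used in the fourth stage are not listed here. As a hedged illustration of what one such rule could look like, the sketch below flags an obfuscated question that still contains the surface form of an entity it was supposed to hide, or an explicit four-digit year. The function name `find_leaks`, the regular expression, and the example strings are assumptions made for this sketch.

```python
import re
from typing import List

# Crude detector for explicit years between 1000 and 2099 (assumption).
YEAR_PATTERN = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")


def find_leaks(question: str, hidden_entities: List[str]) -> List[str]:
    """Return hidden entity names and explicit years that still appear in the question."""
    leaks: List[str] = []
    lowered = question.lower()
    for entity in hidden_entities:
        # Deliberately simple case-insensitive substring match.
        if entity.lower() in lowered:
            leaks.append(entity)
    leaks.extend(match.group(0) for match in YEAR_PATTERN.finditer(question))
    return leaks


clean = "Which river flows through the capital of the most populous East Asian country?"
leaky = "Which river flows through Beijing, the capital since 1949?"

print(find_leaks(clean, ["Beijing", "China"]))
print(find_leaks(leaky, ["Beijing", "China"]))
```

A production check would need more than substring matching (aliases, transliterations, partial dates), so this should be read only as the shape of a rule, not as the rule set itself.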
The three difficulty levels are structurally distinct rather than merely graded by answer rarity.
| Level | Construction pattern | Average tool calls |
|---|---|---|
| L1 | Direct Attribute Mapping; light obfuscation | 4.30 |
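| L2 | Multi-hop Relational Bridging with chain $A \rightarrow B \rightarrow C$; obfuscation of 1–2 critical entities | |
| L3 | Extended chain $A \rightarrow B \rightarrow C \rightarrow D \rightarrow E \rightarrow \text{Target}$; hard link rule; full-narrative obfuscation | 20.23 |
At L3, the chain extends to $A \rightarrow B \rightarrow C \rightarrow D \rightarrow E \rightarrow \text{Target}$. The construction imposes a hard link rule requiring a new independent search query at each hop, and enforces that no single knowledge source may contain more than two consecutive entities (Wu et al., 1 Mar 2026).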
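The constraint that no single knowledge source may contain more than two consecutive chain entities can be checked mechanically. The sketch below shows one possible reading of that constraint, assuming each source is summarized as the set of entities it mentions; the function name `hard_link_violations` and the page names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def hard_link_violations(chain: List[str],
                         sources: Dict[str, Set[str]]) -> List[Tuple[str, Tuple[str, ...]]]:
    """Find sources that mention three consecutive chain entities.

    If a single page covers three consecutive entities, two hops could be
    resolved with one lookup, which the constraint is meant to rule out.
    """
    violations = []
    for i in range(len(chain) - 2):
        window = tuple(chain[i:i + 3])
        for name, entities in sources.items():
            if set(window) <= entities:
                violations.append((name, window))
    return violations


chain = ["A", "B", "C", "D", "E", "Target"]

# Each page links exactly two neighboring entities, so every hop needs its own search.
ok_sources = {
    "page1": {"A", "B"},
    "page2": {"B", "C"},
    "page3": {"C", "D"},
    "page4": {"D", "E"},
    "page5": {"E", "Target"},
}
# Adding a page that covers B, C, and D creates a shortcut.
bad_sources = dict(ok_sources, page6={"B", "C", "D"})

print(hard_link_violations(chain, ok_sources))
print(hard_link_violations(chain, bad_sources))
```

The first call reports no violations, while the second flags `page6` because it would let an agent skip a hop.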
Obfuscation likewise scales by level. L1 uses light substitution such as replacing “Beijing” with “the capital of China.” L2 prohibits proper names, exact dates, and specific locations for 1–2 critical entities. L3 applies obfuscation across the entire narrative, using no proper names, no exact dates, and no exact locations, while encoding the final question as a dense paragraph with nested descriptions. This suggests that DeepResearch-9K is designed not merely to test retrieval breadth, but to test semantic decoding of indirect referents under web-search constraints (Wu et al., 1 Mar 2026).
3. Trajectories, verifiable answers, and evaluation protocol
Each DeepResearch-9K instance includes a search trajectory with reasoning chains generated by Tongyi-DeepResearch-30B-A3B. These trajectories interleave reasoning steps with search-tool invocations.
Final answers are specific, concrete entities and are verified with an LLM-as-judge mechanism using DeepSeek-V3. Correctness is determined by semantic equivalence rather than exact string match. The same judging paradigm is used in evaluation, where the principal metrics are Accuracy and Search Count. Accuracy is the proportion of questions whose final answers are judged correct by DeepSeek-V3, and Search Count is the number of search-tool invocations in the trajectory (Wu et al., 1 Mar 2026).
Teacher-model results establish the dataset’s difficulty profile. Tongyi-DeepResearch-30B-A3B achieves 72.47% on L1, 71.33% on L2, 23.73% on L3, and 55.84% overall. On BrowseComp-Plus, the same teacher achieves 24.94%. The near match between L3 accuracy and BrowseComp-Plus accuracy is used to argue that L3 approximates the difficulty of a hard public deep-research benchmark (Wu et al., 1 Mar 2026). The dataset also defines a hard subset, DeepResearch-Hard, containing the 3,974 questions on which Tongyi-DeepResearch-30B-A3B fails.
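The two evaluation metrics, Accuracy and Search Count, reduce to simple aggregates once each question has a judge verdict and a count of search calls. The sketch below is illustrative only: the `Result` record and the function names are hypothetical, and the judge verdict is taken as a given boolean rather than produced by calling DeepSeek-V3.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    judged_correct: bool  # verdict of the LLM judge (semantic equivalence)
    search_count: int     # search-tool invocations in the trajectory


def accuracy(results: List[Result]) -> float:
    """Proportion of questions whose final answer was judged correct."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.judged_correct for r in results) / len(results)


def mean_search_count(results: List[Result]) -> float:
    """Average number of search-tool invocations per question."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.search_count for r in results) / len(results)


# Toy records (made-up values, for illustration only).
toy_results = [Result(True, 4), Result(False, 21), Result(True, 5), Result(False, 18)]
print(accuracy(toy_results), mean_search_count(toy_results))

# Reported teacher accuracies per level, in percent.
level_accuracy = {"L1": 72.47, "L2": 71.33, "L3": 23.73}
# Unweighted mean; matches the overall figure if the levels are equally sized (assumption).
overall = round(sum(level_accuracy.values()) / len(level_accuracy), 2)
print(overall)
```

The unweighted mean of the three reported per-level accuracies comes to 55.84, which agrees with the reported overall figure and is consistent with, though not proof of, equally sized difficulty levels.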
The DeepResearch-Hard subset isolates the region where current strong agents remain unreliable, especially on long chains of obfuscated entities, situations with multiple partially matching candidates, and tasks where evidence is dispersed across many pages (Wu et al., 1 Mar 2026).
4. DeepResearch-R1: open training framework
DeepResearch-R1 is the accompanying open-source training framework for deep-research agents. It is built on top of Search-R1 and is designed to support multi-turn web interactions, different reinforcement learning approaches, and different reward models such as rule-based outcome reward and LLM-as-judge feedback (Wu et al., 1 Mar 2026).
The framework models an episode as a multi-turn sequence in which the environment presents a question and the agent responds over successive turns. DeepResearch-R1 supports PPO and GRPO. The paper describes PPO with the standard clipped policy-gradient objective and characterizes GRPO as a group-relative variant that normalizes rewards within groups or batches. In practice, DeepSeek-V3 functions as the main reward model, judging the correctness of final answers and converting that judgment into scalar rewards for RL (Wu et al., 1 Mar 2026).
Two training paradigms are emphasized. In Zero-RL, the system starts from a base model such as Qwen2.5-3B or Llama3.2-3B and applies RL directly on the DeepResearch-9K training set. In SFT+RL, the framework first performs supervised fine-tuning on teacher trajectories and then refines the policy with RL. The paper reports 5,026 teacher-correct trajectories, adds a random subset of 2,200 teacher-incorrect samples to improve RL robustness, uses the resulting 7,226 instances for training, and reserves the remaining 1,774 of the 9,000 questions as a test split (Wu et al., 1 Mar 2026).
This design sets DeepResearch-9K apart from evaluation-only benchmarks: it is not only a testbed but also a trajectory-bearing training corpus with an explicit RL environment.
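As a rough illustration of the group-relative idea, the sketch below turns judge verdicts for several rollouts of the same question into scalar rewards and standardizes them within the group. It follows a common formulation of GRPO-style advantages (subtract the group mean, divide by the group standard deviation). The binary reward mapping, the exact normalization, and the epsilon value are assumptions here rather than details confirmed for DeepResearch-R1.

```python
from statistics import mean, pstdev
from typing import List


def judge_to_reward(judged_correct: bool) -> float:
    """Map a judge verdict to a scalar reward (binary mapping is an assumption)."""
    return 1.0 if judged_correct else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one question, two judged correct and two judged incorrect.
verdicts = [True, False, False, True]
rewards = [judge_to_reward(v) for v in verdicts]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# A group in which every rollout receives the same reward.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
print(uniform)
```

Under this formulation, a group whose rollouts all receive the same reward produces zero advantages, so only questions on which the policy sometimes succeeds and sometimes fails contribute a learning signal.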
A plausible implication of this training-oriented design is that the dataset is intended to study how search policies, reasoning traces, and reward design interact, rather than only how a frozen model scores on final-answer accuracy.
5. Empirical findings and training implications
The evaluation on the 1,774-question DeepResearch-9K test split shows that difficulty remains substantial even after training. DeepSeek-V3, used here as an inference-only baseline, reaches 20.18% accuracy. For Qwen2.5-3B, Zero-RL gives approximately 12–13% accuracy, while SFT+RL pushes performance above 20%. For Llama3.2-3B, Zero-RL with PPO reaches 22.50%, the best reported result among the student-model settings, while the GRPO and SFT+RL variants reach approximately 21% (Wu et al., 1 Mar 2026).
Several conclusions are stated directly. First, supervised trajectories are essential for Qwen2.5-3B: imitation grounding is the difference between roughly 12% and above 20%. Second, Zero-RL can nevertheless be unexpectedly strong: Llama3.2-3B with PPO in the Zero-RL regime surpasses DeepSeek-V3 on the same test split despite far fewer parameters. Third, even the best reported student models remain only in the 20–22% range, indicating considerable headroom, especially on L3 and the DeepResearch-Hard subset (Wu et al., 1 Mar 2026).
The teacher-model results add a second layer of interpretation. Tongyi-DeepResearch-30B-A3B is strong on L1 and L2, but its performance collapses from approximately 72% on those levels to approximately 24% on L3. This sharp degradation, combined with the jump from 4.30 to 20.23 average tool calls, indicates that the benchmark’s main bottleneck is not ordinary multi-hop reasoning but long-horizon, search-intensive, anti-shortcut deep research (Wu et al., 1 Mar 2026).
6. Position within the deep-research ecosystem
DeepResearch-9K occupies a particular niche within the broader deep-research literature.
DeepResearch Bench evaluates 100 doctoral-level research tasks across 22 fields and focuses on long-form report quality under the RACE framework, while FS-Researcher studies report generation on DeepResearch Bench and DeepConsult through a file-system-based dual-agent architecture (Prateek, 28 Jan 2026; Zhu et al., 2 Feb 2026). DeepResearch-ReportEval, in turn, evaluates research reports through quality, redundancy, and factuality across 100 curated queries in 12 categories (Fan et al., 9 Oct 2025). By contrast, DeepResearch-9K centers on 9,000 search-intensive, obfuscated questions with explicit trajectories and verifiable final answers (Wu et al., 1 Mar 2026).
This suggests a complementary division of labor across benchmarks. DeepResearch Bench and DeepResearch-ReportEval emphasize long-form synthesis and report evaluation, whereas DeepResearch-9K emphasizes search-intensive agent training, trajectory supervision, and tool-grounded final-answer verification (Prateek, 28 Jan 2026; Fan et al., 9 Oct 2025). DuMate-DeepResearch’s state-of-the-art results on DeepResearch Bench and DeepResearch Bench II further highlight the importance of graph-based planning, recursive search, and rubric-grounded synthesis in adjacent benchmark families (Yan et al., 5 Jun 2026).
Other multimodal or agentic-search systems are conceptually close but do not directly use DeepResearch-9K. Skywork-R1V4 does not mention DeepResearch-9K and instead evaluates on MMSearch, FVQA, and BrowseComp-VL, despite being explicitly designed as a multimodal “DeepResearch-style” agent (Zhang et al., 2 Dec 2025). MM-DeepResearch likewise does not mention DeepResearch-9K and instead introduces Hyper-Search-3K and DR-TTS-10K while evaluating on SimpleVQA, MMSearch, LiveVQA, FVQA, InfoSeek, and BrowseComp-VL (Yao et al., 1 Mar 2026).
Tongyi DeepResearch, by contrast, is directly linked to the benchmark: it targets long-horizon, web-browsing, multi-step research problems of the same general type, and Tongyi-DeepResearch-30B-A3B is the teacher that provides the DeepResearch-9K trajectories (Team et al., 28 Oct 2025).
The benchmark also has explicit limitations. Its construction depends heavily on the Serper API and DeepSeek-V3, so changes in search backends or judges may alter instance difficulty and evaluation reliability. The source datasets are English and mostly Wikipedia-centric, leading to English-only and domain biases. The queries are synthetic narratives, which means that some aspects of real-world research, such as noisy, contradictory sources or multimodal evidence, are not fully captured. The use of DeepSeek-V3 as both judge and reward model introduces possible judge bias and preference leakage, and the paper notes contamination risk because teacher and judge models may have seen underlying facts during web pretraining (Wu et al., 1 Mar 2026).
Taken together, DeepResearch-9K is best understood as a large-scale, trajectory-bearing bridge between classic multi-hop QA and contemporary deep-research agents. Its distinctive contribution is not only scale (9,000 instances across L1, L2, and L3) but also the coupling of controlled search difficulty, teacher-generated reasoning traces, and an open RL training framework.