DeepResearch-9K: Deep Web Exploration Benchmark

Updated 4 July 2026

DeepResearch-9K is a benchmark for LLM-based deep-research agents that autonomously perform long-horizon web exploration and multi-step reasoning.
It uses a four-stage synthesis pipeline with progressive entity obfuscation and controlled reasoning-chain depth to create three distinct difficulty levels (L1–L3).
Together with the DeepResearch-R1 framework, it supports reinforcement learning and teacher-guided trajectory supervision with verifiable final answers.

DeepResearch-9K is a benchmark and training resource for deep-research agents: LLM-based systems that autonomously perform long-horizon web exploration, repeated tool use, and multi-step reasoning to answer difficult, obfuscated questions. Introduced together with the DeepResearch-R1 framework, it contains 9,000 questions spanning three difficulty levels from L1 to L3, high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, and verifiable answers (Wu et al., 1 Mar 2026).

1. Definition and task model

In the DeepResearch-9K formulation, a deep-research agent is an LLM-based agent that interacts with the live or simulated web via search tools, repeatedly and adaptively; breaks a complex, often obfuscated query into subproblems and plans multi-step search trajectories; integrates evidence across multiple sources, verifies hypotheses, and converges on a precise, verifiable answer; and maintains an explicit internal reasoning chain via <Think> blocks that guides external actions (Wu et al., 1 Mar 2026).

This task framing is more demanding than standard multi-hop QA over a static corpus. In datasets such as HotpotQA or MuSiQue, a system can often succeed with a small number of retrieval steps from a fixed Wikipedia-like collection. DeepResearch-9K instead targets open, noisy, and incomplete search environments in which agents must cope with ambiguous descriptions, missing explicit names, and long chains of related entities spread across many pages (Wu et al., 1 Mar 2026).

The benchmark is also explicitly tied to tool-use difficulty. Difficulty is not treated as a purely semantic property of the question; it is operationalized through reasoning-chain depth, entity obfuscation, and the number of required search queries. This design places DeepResearch-9K in the same general family as newer deep-research benchmarks, but with an explicit emphasis on trajectory-supervised agent training and verifiable answer production (Wu et al., 1 Mar 2026).

2. Autonomous construction pipeline and difficulty structure

DeepResearch-9K is built from three open-source multi-hop QA datasets: 2WikiMultihopQA, HotpotQA, and MuSiQue. The construction process samples 1,000 instances from each source dataset, producing an initial pool of 3,000 seeds. Using DeepSeek-V3, the pipeline extracts key entities from questions and contexts—names, locations, dates, and other salient identifiers—and assembles them into a multi-hop relational graph from which paths are selected to control reasoning horizon and difficulty (Wu et al., 1 Mar 2026).
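As a rough illustration of how paths through an entity graph can control reasoning depth, the sketch below builds an adjacency list from relation triples and enumerates simple paths with a fixed number of hops. The triple format, the function names `build_graph` and `paths_of_length`, and the toy entities are all assumptions made for this example; the source does not specify how the pipeline represents or traverses its graph.

```python
from collections import defaultdict

def build_graph(triples):
    """Build an adjacency list from (head, relation, tail) triples."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph

def paths_of_length(graph, start, hops):
    """Return all simple paths from `start` that use exactly `hops` edges."""
    results = []

    def walk(node, path):
        if len(path) - 1 == hops:
            results.append(list(path))
            return
        for _relation, nxt in graph.get(node, []):
            if nxt not in path:  # keep paths simple (no repeated entity)
                path.append(nxt)
                walk(nxt, path)
                path.pop()

    walk(start, [start])
    return results

# Hypothetical toy triples; real entities would come from the seed QA items.
triples = [
    ("A", "directed", "B"),
    ("B", "born_in", "C"),
    ("C", "capital_of", "D"),
    ("A", "spouse", "E"),
]
graph = build_graph(triples)
two_hop = paths_of_length(graph, "A", 2)    # candidate L2-style chain
three_hop = paths_of_length(graph, "A", 3)  # a longer chain
print(two_hop, three_hop)
```

Selecting longer paths from the same graph is one simple way to lengthen the reasoning horizon without changing the underlying facts.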

The synthesis pipeline has four stages. First, seed entity identification and extraction builds the entity graph. Second, progressive reasoning chain construction defines the task structure for L1, L2, and L3. Third, progressive entity obfuscation replaces explicit names, dates, and locations with increasingly indirect descriptions. Fourth, rule-based quality assurance enforces construction guidelines, chaining constraints, and format consistency. The reported construction cost is approximately $200 in Serper API fees, plus approximately 8,064 A100 hours for teacher-agent generation and evaluation (Wu et al., 1 Mar 2026).

The three difficulty levels are structurally distinct rather than merely graded by answer rarity.

Level	Construction pattern	Average tool calls
L1	Direct Attribute Mapping; light obfuscation	4.30
L2	Multi-hop Relational Bridging with chain $A \rightarrow B \rightarrow C$	10.74
L3	Deep-Research chain of 5–6 entities with hard link rule	20.23

For L1, the pipeline starts from a seed entity and extracts 3 factual attributes, with at least one using a less common synonym or technical term; the resulting QA is designed to be solvable with 1–2 search steps. For L2, the question hides the final target $C$ behind intermediate entities $A$ and $B$, forcing explicit relational bridging. For L3, the construction uses a “relay race” chain of 5–6 entities $A \rightarrow B \rightarrow C \rightarrow D \rightarrow E \rightarrow \text{Target}$, imposes a hard link rule requiring a new independent search query at each hop, and enforces that no single knowledge source may contain more than two consecutive entities (Wu et al., 1 Mar 2026).
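The constraint that no single knowledge source may contain more than two consecutive chain entities can be checked mechanically. The following is a minimal sketch under stated assumptions: each source is modeled as a set of entity names it mentions, and the function name `violates_source_rule` and the page names are hypothetical. The source does not describe how the pipeline's rule-based checks are implemented.

```python
def violates_source_rule(chain, sources, max_consecutive=2):
    """Return True if any single source mentions more than `max_consecutive`
    consecutive entities of the chain.

    chain:   ordered list of entity names, e.g. ["A", "B", ..., "Target"]
    sources: dict mapping a source name to the set of entities it mentions
             (an assumed representation for this sketch)
    """
    window = max_consecutive + 1
    for i in range(len(chain) - window + 1):
        segment = set(chain[i:i + window])
        for entities in sources.values():
            if segment <= entities:  # one source covers the whole window
                return True
    return False

chain = ["A", "B", "C", "D", "E", "Target"]

# Each page links only two neighbouring entities: the rule holds.
ok_sources = {
    "page1": {"A", "B"},
    "page2": {"B", "C"},
    "page3": {"C", "D"},
    "page4": {"D", "E"},
    "page5": {"E", "Target"},
}
# One page covers three consecutive entities: a shortcut exists.
bad_sources = dict(ok_sources, page6={"B", "C", "D"})

print(violates_source_rule(chain, ok_sources))
print(violates_source_rule(chain, bad_sources))
```

A chain that passes this check cannot be resolved across three hops from one page, which is the anti-shortcut property the rule is meant to enforce.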

Obfuscation likewise scales by level. L1 uses light substitution such as replacing “Beijing” with “the capital of China.” L2 prohibits proper names, exact dates, and specific locations for 1–2 critical entities. L3 applies obfuscation across the entire narrative, using no proper names, no exact dates, and no exact locations, while encoding the final question as a dense paragraph with nested descriptions. This suggests that DeepResearch-9K is designed not merely to test retrieval breadth, but to test semantic decoding of indirect referents under web-search constraints (Wu et al., 1 Mar 2026).

3. Trajectories, verifiable answers, and evaluation protocol

Each DeepResearch-9K instance includes a search trajectory with reasoning chains generated by Tongyi-DeepResearch-30B-A3B. These trajectories are sequences of <Think> blocks, <tool_call> blocks such as search([...]) and google_scholar([...]), <tool_response> blocks containing summaries of search results, and a final answer. The trajectory representation is intended both as training supervision and as an analysis resource for agentic reasoning and tool-use behavior (Wu et al., 1 Mar 2026).

Final answers are specific, concrete entities and are verified with an LLM-as-judge mechanism using DeepSeek-V3. Correctness is determined by semantic equivalence rather than exact string match. The same judging paradigm is used in evaluation, where the principal metrics are Accuracy and Search Count. Accuracy is the proportion of questions whose final answers are judged correct by DeepSeek-V3, and Search Count is the number of search-tool invocations in the trajectory (Wu et al., 1 Mar 2026).
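To make the two metrics concrete, the sketch below parses a trajectory string into tagged blocks and computes Search Count and Accuracy. It assumes that each block is closed by a matching tag (for example `</Think>`), that every `<tool_call>` block counts as one search invocation, and that judge verdicts are already available as booleans. These serialization details and the function names are assumptions, not the dataset's documented format.

```python
import re

# Assumed serialization: <Think>...</Think>, <tool_call>...</tool_call>,
# <tool_response>...</tool_response>.
BLOCK_RE = re.compile(r"<(Think|tool_call|tool_response)>(.*?)</\1>", re.DOTALL)

def parse_trajectory(text):
    """Return a list of (block_type, content) pairs in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in BLOCK_RE.finditer(text)]

def search_count(blocks):
    """Count tool invocations, treating each tool_call block as one call."""
    return sum(1 for kind, _ in blocks if kind == "tool_call")

def accuracy(verdicts):
    """Fraction of questions judged correct; verdicts is a list of booleans."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Hypothetical two-search trajectory.
example = """
<Think>Identify the director first.</Think>
<tool_call>search(["film X director"])</tool_call>
<tool_response>Directed by Y.</tool_response>
<Think>Now find where Y was born.</Think>
<tool_call>search(["Y birthplace"])</tool_call>
<tool_response>Born in Z.</tool_response>
"""
blocks = parse_trajectory(example)
print(search_count(blocks))
print(accuracy([True, False, True, True]))
```

In a real evaluation the boolean verdicts would come from the LLM judge's semantic-equivalence decision rather than from string comparison.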

Teacher-model results establish the dataset’s difficulty profile. Tongyi-DeepResearch-30B-A3B achieves 72.47% on L1, 71.33% on L2, 23.73% on L3, and 55.84% overall. On BrowseComp-Plus, the same teacher achieves 24.94%. The near match between L3 accuracy and BrowseComp-Plus accuracy is used to argue that L3 approximates the difficulty of a hard public deep-research benchmark (Wu et al., 1 Mar 2026).
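The overall figure can be sanity-checked from the per-level accuracies. The sketch below assumes the three levels are equally sized at 3,000 questions each; that split is inferred from 3,000 seeds and 9,000 total questions and is not stated outright in the text.

```python
# Reported teacher accuracies per level, in percent.
level_accuracy = {"L1": 72.47, "L2": 71.33, "L3": 23.73}

# ASSUMPTION: equal level sizes (3,000 each), inferred from 3,000 seeds
# and 9,000 questions in total.
level_size = {"L1": 3000, "L2": 3000, "L3": 3000}

def weighted_accuracy(acc, size):
    """Size-weighted mean of per-level accuracies."""
    total = sum(size.values())
    return sum(acc[level] * size[level] for level in acc) / total

overall = weighted_accuracy(level_accuracy, level_size)
print(f"{overall:.2f}")  # 55.84
```

Under the equal-size assumption the weighted mean reproduces the reported 55.84% overall accuracy.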

The dataset also defines a hard subset, DeepResearch-Hard, containing 3,974 questions on which Tongyi-DeepResearch-30B-A3B fails. This subset isolates the region where current strong agents remain unreliable, especially on long chains of obfuscated entities, situations with multiple partially matching candidates, and tasks where evidence is dispersed across many pages (Wu et al., 1 Mar 2026).

4. DeepResearch-R1: open training framework

DeepResearch-R1 is the accompanying open-source training framework for deep-research agents. It is built on top of Search-R1 and is designed to support multi-turn web interactions, different reinforcement learning approaches, and different reward models such as rule-based outcome reward and LLM-as-judge feedback (Wu et al., 1 Mar 2026).

The framework models an episode as a multi-turn sequence in which the environment presents a question, the agent emits a <Think> block and one or more <tool_call> actions, the environment executes those calls and returns <tool_response> observations, and the process continues until the agent emits a final answer or a maximum-step threshold is reached. vLLM is used for efficient rollout and inference during RL training, and the experiments use Serper API as the search backend (Wu et al., 1 Mar 2026).
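The episode structure described above can be sketched as a simple loop. This is an illustrative outline only: the `Step` dataclass, `run_episode`, the scripted policy, and the fake search tool are hypothetical stand-ins and do not reflect the DeepResearch-R1 code base, which the source describes only at a high level.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    """One agent turn: a reasoning note plus tool calls or a final answer."""
    think: str
    tool_calls: List[str] = field(default_factory=list)
    final_answer: Optional[str] = None

def run_episode(question, policy, execute_tool, max_steps=8):
    """Alternate agent turns and tool observations until the agent answers
    or the step budget is exhausted. Returns (answer_or_None, history)."""
    history = []  # list of (Step, observations) pairs
    for _ in range(max_steps):
        step = policy(question, history)
        if step.final_answer is not None:
            history.append((step, []))
            return step.final_answer, history
        observations = [execute_tool(call) for call in step.tool_calls]
        history.append((step, observations))
    return None, history  # budget reached without a final answer

# Hypothetical stand-ins for the LLM policy and the search backend.
def scripted_policy(question, history):
    if len(history) < 2:
        return Step(think="need more evidence",
                    tool_calls=[f"search: {question} #{len(history)}"])
    return Step(think="enough evidence", final_answer="stub answer")

def fake_search(call):
    return f"summary for {call!r}"

answer, history = run_episode("who?", scripted_policy, fake_search)
print(answer, len(history))
```

Returning `None` when the budget runs out keeps unanswered episodes distinguishable from wrong answers, which matters when the outcome is later converted into a reward.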

DeepResearch-R1 supports PPO and GRPO. The paper describes PPO with the standard clipped policy-gradient objective and characterizes GRPO as a group-relative variant that normalizes rewards within groups or batches. In practice, DeepSeek-V3 functions as the main reward model, judging the correctness of final answers and converting that judgment into scalar rewards for RL (Wu et al., 1 Mar 2026).
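A minimal sketch of the two ideas named above: group-relative reward normalization and the clipped surrogate term. It assumes binary judge rewards (1 for correct, 0 for incorrect), uses the population standard deviation, and picks a clip range of 0.2. These are common choices rather than values confirmed by the source, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same question:
    advantage_i = (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO-style clipped objective term:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)

# ASSUMPTION: the judge's verdict is mapped to a 0/1 scalar reward.
rewards = [1.0, 0.0, 0.0, 1.0]  # four rollouts for one question
advantages = group_relative_advantages(rewards)
print(advantages)
print(clipped_surrogate(1.5, 1.0), clipped_surrogate(0.5, -1.0))
```

When every rollout in a group receives the same reward, the normalized advantages are all zero, so such groups contribute no learning signal under this scheme.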

Two training paradigms are emphasized. In Zero-RL, the system starts from a base model such as Qwen2.5-3B or Llama3.2-3B and applies RL directly on the DeepResearch-9K training set. In SFT+RL, the framework first performs supervised fine-tuning on teacher trajectories and then refines the policy with RL. The paper reports 5,026 teacher-correct trajectories; adds a random subset of 2,200 incorrect samples to improve RL robustness; uses 7,226 instances for training; and reserves the remaining 1,774 as a test split (Wu et al., 1 Mar 2026).
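The split sizes follow from simple arithmetic on the reported counts, which the short check below reproduces. The variable names are illustrative; the numbers are those given in the text.

```python
TOTAL = 9000              # questions in DeepResearch-9K
TEACHER_CORRECT = 5026    # trajectories the teacher answered correctly
SAMPLED_INCORRECT = 2200  # teacher-incorrect samples added to training

hard_subset = TOTAL - TEACHER_CORRECT              # teacher failures
train_size = TEACHER_CORRECT + SAMPLED_INCORRECT   # training instances
test_size = TOTAL - train_size                     # held-out test split
teacher_overall = 100 * TEACHER_CORRECT / TOTAL    # percent correct

print(hard_subset, train_size, test_size, f"{teacher_overall:.2f}")
```

The results match the reported 3,974-question DeepResearch-Hard subset, the 7,226 training instances, the 1,774-question test split, and the teacher's 55.84% overall accuracy. The numbers are also consistent with the test split consisting of teacher-failed questions that were not sampled for training (3,974 − 2,200 = 1,774), although the text does not state this explicitly.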

This design sets DeepResearch-9K apart from evaluation-only benchmarks. It is not only a testbed but also a trajectory-bearing training corpus with an explicit RL environment. A plausible implication is that the dataset is intended to study how search policies, reasoning traces, and reward design interact, rather than only how a frozen model scores on final-answer accuracy.

5. Empirical findings and training implications

The evaluation on the 1,774-question DeepResearch-9K test split shows that difficulty remains substantial even after training. DeepSeek-V3, used here as an inference-only baseline, reaches 20.18% accuracy. For Qwen2.5-3B, Zero-RL gives approximately 12–13% accuracy, while SFT+RL pushes performance above 20%. For Llama3.2-3B, Zero-RL with PPO reaches 22.50%, the best reported result among the student-model settings, while the GRPO and SFT+RL variants reach approximately 21% (Wu et al., 1 Mar 2026).

Several conclusions are stated directly. First, supervised trajectories are essential for Qwen2.5-3B: imitation grounding is the difference between roughly 12% and above 20%. Second, Zero-RL can nevertheless be unexpectedly strong: Llama3.2-3B with PPO in the Zero-RL regime surpasses DeepSeek-V3 on the same test split despite far fewer parameters. Third, even the best reported student models remain only in the 20–22% range, indicating considerable headroom, especially on L3 and the DeepResearch-Hard subset (Wu et al., 1 Mar 2026).

The teacher-model results add a second layer of interpretation. Tongyi-DeepResearch-30B-A3B is strong on L1 and L2, but its performance collapses from approximately 72% on those levels to approximately 24% on L3. This sharp degradation, combined with the jump from 4.30 to 20.23 average tool calls, indicates that the benchmark’s main bottleneck is not ordinary multi-hop reasoning but long-horizon, search-intensive, anti-shortcut deep research (Wu et al., 1 Mar 2026).

6. Position within the deep-research ecosystem

DeepResearch-9K occupies a particular niche within the broader deep-research literature. DeepResearch Bench evaluates 100 doctoral-level research tasks across 22 fields and focuses on long-form report quality under the RACE framework, while FS-Researcher studies report generation on DeepResearch Bench and DeepConsult through a file-system-based dual-agent architecture (Prateek, 28 Jan 2026; Zhu et al., 2 Feb 2026). DeepResearch-ReportEval, in turn, evaluates research reports through quality, redundancy, and factuality across 100 curated queries in 12 categories (Fan et al., 9 Oct 2025). By contrast, DeepResearch-9K centers on 9,000 search-intensive, obfuscated questions with explicit trajectories and verifiable final answers (Wu et al., 1 Mar 2026).

This suggests a complementary division of labor across benchmarks. DeepResearch Bench and DeepResearch-ReportEval emphasize long-form synthesis and report evaluation, whereas DeepResearch-9K emphasizes search-intensive agent training, trajectory supervision, and tool-grounded final-answer verification (Prateek, 28 Jan 2026; Fan et al., 9 Oct 2025). DuMate-DeepResearch’s state-of-the-art results on DeepResearch Bench and DeepResearch Bench II further highlight the importance of graph-based planning, recursive search, and rubric-grounded synthesis in adjacent benchmark families (Yan et al., 5 Jun 2026).

Other multimodal or agentic-search systems are conceptually close but do not directly use DeepResearch-9K. Skywork-R1V4 does not mention DeepResearch-9K and instead evaluates on MMSearch, FVQA, and BrowseComp-VL, despite being explicitly designed as a multimodal “DeepResearch-style” agent (Zhang et al., 2 Dec 2025). MM-DeepResearch likewise does not mention DeepResearch-9K and instead introduces Hyper-Search-3K and DR-TTS-10K while evaluating on SimpleVQA, MMSearch, LiveVQA, FVQA, InfoSeek, and BrowseComp-VL (Yao et al., 1 Mar 2026). Tongyi DeepResearch is directly aimed at long-horizon, web-browsing, multi-step research problems of the same general type, and Tongyi-DeepResearch-30B-A3B is the teacher that provides DeepResearch-9K trajectories (Team et al., 28 Oct 2025).

The benchmark also has explicit limitations. Its construction depends heavily on Serper API and DeepSeek-V3, so changes in search backends or judges may alter instance difficulty and evaluation reliability. The source datasets are English and mostly Wikipedia-centric, leading to English-only and domain biases. The queries are synthetic narratives, which means that some aspects of real-world research—such as noisy, contradictory sources or multimodal evidence—are not fully captured. The use of DeepSeek-V3 as both judge and reward model introduces possible judge bias and preference leakage, and the paper notes contamination risk because teacher and judge models may have seen underlying facts during web pretraining (Wu et al., 1 Mar 2026).

Taken together, DeepResearch-9K is best understood as a large-scale, trajectory-bearing bridge between classic multi-hop QA and contemporary deep-research agents. Its distinctive contribution is not only scale—9,000 instances across L1, L2, and L3—but the coupling of controlled search difficulty, teacher-generated reasoning traces, and an open RL training framework.

References

1. DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent (2026)
2. Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve) (2026)
3. FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents (2026)
4. Understanding DeepResearch via Reports (2025)
5. DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning (2026)
6. Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch (2025)
7. MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline (2026)
8. Tongyi DeepResearch Technical Report (2025)