DeepResearch-R1: Open-Source Research Agent Framework
- DeepResearch-R1 is an open-source framework for training research agents, enabling multi-turn web interactions and tool-augmented reasoning.
- It integrates plug-and-play reinforcement learning algorithms, rule-based and LLM-as-judge rewards, and a structured curriculum across difficulty levels.
- The framework spans domain-specific and evaluative applications, including medical agents with specialized retrieval and report-quality evaluation systems.
DeepResearch-R1 is a polysemous designation in the recent deep-research literature. In its most concrete and fully specified use, it denotes an open-source training framework for end-to-end deep-research agents introduced alongside the DeepResearch-9K dataset, with explicit support for multi-turn web interaction, plug-and-play reinforcement-learning algorithms, and both rule-based and LLM-as-judge reward models (Wu et al., 1 Mar 2026). The same label also appears as an alternative name for MedResearcher-R1, a medical deep research agent built on a ReAct backbone with medical-specific retrieval and knowledge-informed trajectory synthesis (Yu et al., 20 Aug 2025), and as an expository label for the DeepResearch-ReportEval framework, which evaluates research reports along quality, redundancy, and factuality dimensions (Fan et al., 9 Oct 2025). Taken together, these usages place DeepResearch-R1 at the intersection of agent training, domain specialization, and report-level evaluation.
1. Terminological scope and disambiguation
The literature does not use the string “DeepResearch-R1” as a single unambiguous proper noun. Instead, the label is attached to multiple artifacts that share a common problem setting: long-horizon, tool-augmented research over heterogeneous evidence.
| Usage | Description | Source |
|---|---|---|
| DeepResearch-R1 framework | Open-source agent-training framework released with DeepResearch-9K | (Wu et al., 1 Mar 2026) |
| MedResearcher-R1 / DeepResearch-R1 | Medical deep research agent with KISA and private medical retrieval | (Yu et al., 20 Aug 2025) |
| DeepResearch-ReportEval (“DeepResearch-R1”) | Report-centric evaluation framework for DeepResearch systems | (Fan et al., 9 Oct 2025) |
The most substantive technical treatment of DeepResearch-R1 as a framework appears in the DeepResearch-9K work. There, DeepResearch-R1 is defined as an environment and training stack for “deep research,” understood as multi-step, tool-augmented reasoning with web interaction, search iteration, and terminal answer generation (Wu et al., 1 Mar 2026). The alternative medical usage preserves the same general agentic template but replaces the generic retrieval substrate with a domain-specific toolset and training pipeline (Yu et al., 20 Aug 2025). The evaluative usage shifts the term from agent construction to agent assessment, using reports as the principal object of measurement (Fan et al., 9 Oct 2025).
This multiplicity of meanings suggests that “DeepResearch-R1” functions less as a uniquely fixed system name than as a marker of a broader research program centered on iterative search, reasoning, and synthesis.
2. DeepResearch-R1 as an open-source training framework
In the DeepResearch-9K line of work, DeepResearch-R1 is an open-source framework designed for end-to-end training of deep-research agents (Wu et al., 1 Mar 2026). Its stated goals are threefold: to provide a fully featured environment for multi-turn web search and reasoning, to support a plug-and-play set of RL algorithms such as PPO and GRPO, and to enable multiple reward formulations, including rule-based outcome rewards and LLM-as-judge feedback.
The environment interface exposes three action types: <Think>, <tool_call>(query), and Evaluate(answer) (Wu et al., 1 Mar 2026). At each step, the agent observes the original question, its previous <Think> blocks, and the raw search results returned by the last tool call. The state is therefore a growing transcript of reasoning steps, tool invocations, and tool responses. On the policy side, the agent is a transformer-based LLM extended with two heads: a token-generation head for natural-language planning or final answers, and a discrete action head for formatting and emitting search queries. The framework describes this design for backbones such as Qwen-2.5-3B and Llama-3.2-3B.
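The interaction loop described above can be sketched as a small transcript-keeping environment. This is an illustrative sketch only: the class name `ResearchEpisode`, the `search_fn` callback, and the dictionary layout of the observation are assumptions made for the example and are not taken from the framework's code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Tags for the three action types named in the text. All other names here
# (ResearchEpisode, search_fn, "tool_response") are hypothetical.
THINK, TOOL_CALL, EVALUATE = "think", "tool_call", "evaluate"


@dataclass
class ResearchEpisode:
    question: str
    search_fn: Callable[[str], str]  # assumed search backend: query -> raw results
    transcript: List[Tuple[str, str]] = field(default_factory=list)
    final_answer: Optional[str] = None

    def step(self, action: str, payload: str) -> bool:
        """Apply one action and return True once the episode has terminated."""
        if self.final_answer is not None:
            raise RuntimeError("episode already terminated")
        if action not in (THINK, TOOL_CALL, EVALUATE):
            raise ValueError(f"unknown action: {action}")
        self.transcript.append((action, payload))
        if action == TOOL_CALL:
            # The raw tool response is appended to the growing transcript.
            self.transcript.append(("tool_response", self.search_fn(payload)))
        elif action == EVALUATE:
            self.final_answer = payload
        return self.final_answer is not None

    def observation(self) -> dict:
        """Question, earlier Think blocks, and results of the last tool call."""
        thoughts = [p for a, p in self.transcript if a == THINK]
        responses = [p for a, p in self.transcript if a == "tool_response"]
        return {
            "question": self.question,
            "thoughts": thoughts,
            "last_results": responses[-1] if responses else None,
        }
```

A policy would alternate `step(THINK, ...)` and `step(TOOL_CALL, ...)` calls, read `observation()` between steps, and end the episode with `step(EVALUATE, answer)`.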
The architecture is notable for making web interaction a first-class training primitive rather than a post hoc wrapper around a static corpus. That choice aligns the framework with a view of deep research as sequential decision-making over a mutable transcript rather than one-shot retrieval-augmented generation.
3. Data synthesis, trajectory construction, and curriculum design
DeepResearch-R1 is tightly coupled to DeepResearch-9K, a dataset of 9,000 questions spanning three difficulty levels, L1 through L3, paired with high-quality search trajectories and verifiable answers (Wu et al., 1 Mar 2026). The dataset construction pipeline begins by extracting 1,000 instances each from HotpotQA, 2WikiMultihopQA, and MuSiQue. It then performs entity graph construction via the DeepSeek-V3 API, uses a chain-building prompt to generate relay chains under level-specific constraints, rewrites them into narrative QA prompts with progressive obfuscation, and runs Tongyi-DeepResearch-30B-A3B to collect expert trajectories and final answers. A subsequent LLM-as-judge verification stage flags a “hard” subset in which the teacher fails.
The curriculum is explicit. L1 corresponds to “Light Obfuscation,” with single-hop or direct attribute queries and 1–2 tool calls; L2 to “Moderate Obfuscation,” with explicit two-hop chains and 4–8 tool calls; L3 to “High Obfuscation,” with long chains, at least 15 tool calls, and dense narrative phrasing without names or dates (Wu et al., 1 Mar 2026). Training batches are sampled across these levels so that the agent gradually encounters harder problems.
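One way to picture level-mixed batch sampling is shown below. The tool-call ranges come from the level descriptions above; the linear weighting schedule in `level_weights` and the `pools` structure are assumptions chosen for illustration, since the text does not specify how the sampling proportions change over training.

```python
import random

# Tool-call ranges per level as stated in the text; None means no upper bound.
LEVELS = {
    "L1": {"obfuscation": "light", "tool_calls": (1, 2)},
    "L2": {"obfuscation": "moderate", "tool_calls": (4, 8)},
    "L3": {"obfuscation": "high", "tool_calls": (15, None)},
}


def level_weights(progress: float) -> dict:
    """Hypothetical schedule: move sampling mass from L1 to L3 as progress goes 0 -> 1."""
    p = min(max(progress, 0.0), 1.0)
    raw = {"L1": 1.0 - 0.8 * p, "L2": 0.5, "L3": 0.2 + 0.8 * p}
    total = sum(raw.values())
    return {name: raw[name] / total for name in LEVELS}


def sample_batch(pools: dict, batch_size: int, progress: float, rng: random.Random) -> list:
    """Draw a batch whose level mix follows level_weights(progress)."""
    weights = level_weights(progress)
    names = list(LEVELS)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [rng.choice(pools[name]) for name in picks]
```

With this schedule, early batches are dominated by L1 questions and late batches by L3 questions, while L2 keeps a constant share.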
The framework also defines two training paradigms. In the “Zero-RL Stage,” a base LLM without supervised fine-tuning is trained directly by RL on 7,226 trajectories, including 5,026 correct and 2,200 incorrect ones. In the “SFT+RL Stage,” the model is first supervised on the 5,026 correct trajectories and then refined by RL on the full 7,226-instance set (Wu et al., 1 Mar 2026). This separation makes the framework suitable for studying cold-start RL versus supervised warm-starting under otherwise matched conditions.
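The two paradigms differ only in which trajectories feed the supervised phase. A minimal sketch of that split follows; the record layout (a dict with a boolean `"correct"` field) and the stage names are assumptions for the example.

```python
def build_stages(trajectories: list) -> dict:
    """Partition trajectories into per-stage SFT and RL sets.

    Each trajectory is assumed to be a dict with a boolean "correct" flag.
    """
    correct = [t for t in trajectories if t["correct"]]
    return {
        # Zero-RL: no supervised phase, RL over correct and incorrect trajectories.
        "zero_rl": {"sft": [], "rl": list(trajectories)},
        # SFT+RL: supervise on correct trajectories, then RL over the full set.
        "sft_rl": {"sft": correct, "rl": list(trajectories)},
    }


# Counts reported in the text: 5,026 correct + 2,200 incorrect = 7,226 total.
REPORTED_CORRECT, REPORTED_INCORRECT = 5026, 2200
REPORTED_TOTAL = REPORTED_CORRECT + REPORTED_INCORRECT
```

Because both stages share the same RL set, any difference in final accuracy can be attributed to the supervised warm start rather than to the RL data.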
4. Optimization objectives and benchmark results
DeepResearch-R1 supports both PPO and GRPO within a unified rollout-and-update loop (Wu et al., 1 Mar 2026). The supervised phase uses a loss computed over the correct teacher trajectories. For PPO, the framework specifies a clipped objective governed by a clipping threshold, a value-loss weight, an entropy bonus, and a learning rate, with a batch size of 16 and four epochs per update. A single discount factor is used for all RL experiments. GRPO is described as using the same form as PPO but with vector rewards and group-wise advantage normalization.
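For readers unfamiliar with the two objectives, the following sketch shows the standard PPO clipped surrogate for a single sample and a group-wise advantage normalization of the kind GRPO relies on. It is a generic illustration, not the framework's implementation: `clip_eps` is left as a caller-supplied parameter because the source's value is not reproduced here, and rewards are treated as one scalar per rollout rather than as vectors.

```python
import math
from statistics import mean, pstdev


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, clip_eps: float) -> float:
    """Standard PPO clipped surrogate for one sample (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum makes the objective pessimistic about large policy shifts.
    return min(ratio * advantage, clipped_ratio * advantage)


def group_normalized_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize rewards within a group of rollouts sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Group normalization removes the need for a learned value baseline: each rollout is scored relative to its siblings for the same prompt.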
Two reward families are implemented. The rule-based outcome reward gives +1 for an exactly correct terminal answer and 0 otherwise, with an optional per-tool-call penalty to discourage gratuitous queries (Wu et al., 1 Mar 2026). The LLM-as-judge reward feeds the full trajectory to a judge model such as DeepSeek-V3, which returns a scalar score between 0 and 1. When GRPO is used, rule-based and judge-derived rewards can be combined as a reward vector.
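The two reward families can be sketched as follows. The function names, the whitespace stripping before the exact-match check, and the `judge_fn` callback are assumptions for the example; the penalty size is a parameter defaulting to zero because no value is fixed here.

```python
from typing import Callable, Tuple


def rule_based_reward(predicted: str, gold: str,
                      num_tool_calls: int, tool_penalty: float = 0.0) -> float:
    """+1 for an exact match, 0 otherwise, minus an optional per-call penalty."""
    exact = 1.0 if predicted.strip() == gold.strip() else 0.0  # stripping is an assumption
    return exact - tool_penalty * num_tool_calls


def reward_vector(predicted: str, gold: str, num_tool_calls: int,
                  trajectory: str, judge_fn: Callable[[str], float],
                  tool_penalty: float = 0.0) -> Tuple[float, float]:
    """Pair the rule-based reward with a judge score clamped to [0, 1]."""
    judge_score = min(max(float(judge_fn(trajectory)), 0.0), 1.0)
    return (rule_based_reward(predicted, gold, num_tool_calls, tool_penalty), judge_score)
```

In practice `judge_fn` would wrap a call to a judge model; here it is any callable that maps a trajectory string to a number.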
Empirically, the teacher Tongyi-DeepResearch-30B-A3B reaches 72.47% on L1, 71.33% on L2, 23.73% on L3, and 55.84% overall on the DeepResearch-9K test split; on BrowseComp-Plus it attains 24.94% (Wu et al., 1 Mar 2026). Among trained small models, Qwen-2.5-3B with PPO SFT+RL reaches 21.09%, Qwen-2.5-3B with GRPO SFT+RL 20.31%, and Llama-3.2-3B with PPO Zero-RL 22.50%, which is identified as the best result among those reported. The paper’s own summary emphasizes that supervised warm-starting is critical for Qwen-2.5-3B, that well-aligned rewards can let small models surpass large closed-source baselines, and that performance near 20–22% still reflects the extreme difficulty of L3 instances (Wu et al., 1 Mar 2026).
These numbers situate DeepResearch-R1 as both a training framework and a stress test for open-source deep-research agents. Its low absolute scores on the hardest settings are not incidental; they are part of the framework’s intended function as a difficult benchmark.
5. Domain-specific and evaluative variants
A distinct but related usage appears in MedResearcher-R1, which is also called DeepResearch-R1 in the paper’s detailed exposition (Yu et al., 20 Aug 2025). MedResearcher-R1 is an LLM-based agent on a ReAct backbone, extended with a dual toolset, a knowledge-informed trajectory synthesis framework called KISA, and a two-stage training paradigm consisting of supervised fine-tuning and online reinforcement learning. Its state has three components: the dialogue context, an accumulated knowledge graph built from retrievals, and a reasoning history. Tool selection is part of the policy, with general-purpose tools such as WebSearch and DocumentRead, and medical-specific tools including PrivateMedicalRetriever and ClinicalReasoningEngine (Yu et al., 20 Aug 2025).
KISA constructs medical knowledge graphs around rare medical entities, extracts medically valid longest paths, masks entities in gold triple chains, and forces the model to reconstruct the chain via alternating reasoning and tool use (Yu et al., 20 Aug 2025). The system generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Its PrivateMedicalRetriever combines sparse recall with dense reranking and a separately weighted authority term that explicitly favors authoritative sources. On MedBrowseComp, the reported scores are 19.0% for o3-search, 25.5% for o3-deepresearch, and 27.5% for MedResearcher-R1-32B; on GAIA it reports 53.4, and on XBench-DeepSearch 54.0 (Yu et al., 20 Aug 2025). Ablations show drops of 5.8 percentage points without medical tools, 7.1 percentage points without KISA, and 2.2 percentage points from removing the RL stage.
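The retrieval pattern described for PrivateMedicalRetriever, sparse recall followed by dense reranking with an authority bonus, can be illustrated with a toy two-stage ranker. Everything concrete here is an assumption: the term-overlap recall, the additive scoring rule, the `authority` field, and the `authority_weight` parameter stand in for details the text does not give.

```python
from typing import Callable, Iterable, List


def rank_documents(query_terms: Iterable[str], docs: List[dict],
                   dense_score: Callable[[dict], float],
                   authority_weight: float, recall_k: int = 3) -> List[dict]:
    """Toy two-stage retrieval: sparse recall, then dense + authority reranking."""
    terms = {t.lower() for t in query_terms}

    def sparse(doc: dict) -> int:
        # Stage 1 score: number of query terms present in the document text.
        return len(terms & set(doc["text"].lower().split()))

    recalled = [d for d in sorted(docs, key=sparse, reverse=True)[:recall_k]
                if sparse(d) > 0]

    def rerank(doc: dict) -> float:
        # Stage 2 score: dense relevance plus a weighted authority bonus (assumed additive).
        return dense_score(doc) + authority_weight * doc["authority"]

    return sorted(recalled, key=rerank, reverse=True)
```

Raising `authority_weight` lets a guideline-grade source outrank a slightly more similar but less authoritative one, which is the behavior the text attributes to the retriever.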
Another variant of the label is evaluative rather than generative. DeepResearch-ReportEval, described in one exposition as “DeepResearch-R1,” scores reports along three axes: Quality, Redundancy, and Factuality (Fan et al., 9 Oct 2025). Quality comprises five sub-dimensions scored on a 0–4 scale: comprehensiveness, coherence, clarity, insightfulness, and overall quality. Redundancy is defined by pairwise paragraph scoring and averaged across all paragraph pairs. Factuality is computed over claim–source pairs with scores of 1 for full support, 0 for partial support, and -1 for no support. The benchmark contains 100 manually selected queries from over 150,000 real user queries, grouped into 12 categories. In validation against human experts on 120 query–report pairs, all mean absolute deviation values are below 0.8, and the framework’s rankings by overall quality plus redundancy match expert rankings 61.11% of the time (Fan et al., 9 Oct 2025).
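The Redundancy and Factuality aggregations can be sketched in a few lines. The label names and the use of a plain mean for Factuality are assumptions, since the text states only the per-pair scores; `pair_score` is a hypothetical stand-in for the judge that rates a pair of paragraphs.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, List

# Per-pair support scores as stated in the text; the label strings are assumed.
SUPPORT_SCORES = {"full": 1, "partial": 0, "none": -1}


def factuality(support_labels: List[str]) -> float:
    """Average support score over claim-source pairs (averaging is an assumption)."""
    return mean(SUPPORT_SCORES[label] for label in support_labels)


def redundancy(paragraphs: List[str],
               pair_score: Callable[[str, str], float]) -> float:
    """Mean pairwise score across all unordered paragraph pairs."""
    pairs = list(combinations(paragraphs, 2))
    if not pairs:
        raise ValueError("need at least two paragraphs")
    return mean(pair_score(a, b) for a, b in pairs)
```

For a report with n paragraphs, `redundancy` evaluates n(n-1)/2 pairs, so the cost of judge-based scoring grows quadratically with report length.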
The coexistence of these variants shows that DeepResearch-R1 can refer either to a training framework for action-selection policies or to a rubric for assessing the long-form outputs those policies produce.
6. Position within the broader deep-research literature
The framework-centered meaning of DeepResearch-R1 sits inside a rapidly expanding design space for deep-research agents. Yunque DeepResearch defines a four-module hierarchy consisting of a Main Agent, Context Manager, Atomic Capability Pool, and Supervisor, with structured memory units and a context-folding mechanism that reduces context complexity (Cai et al., 27 Jan 2026). DeepResearcher formulates real-web deep research as an episodic MDP with search, browse, and answer actions, and trains a Qwen2.5-7B-Instruct backbone with GRPO in authentic web environments, reporting average gains of +28.9 points over prompt-engineering baselines and +7.2 points over RAG-based RL agents (Zheng et al., 4 Apr 2025). PokeeResearch-7B instead uses annotation-free RLAIF, a research–verification cycle, and Research Threads Synthesis (RTS), reaching 36.9 on GAIA and 15.2 on HLE before RTS gains (Wan et al., 17 Oct 2025).
Other works refine the RL recipe itself. R1-Searcher uses a two-stage outcome-based RL scheme that first rewards correct search invocation and format, then rewards answer quality via F1 and penalizes format violations, outperforming ReARTeR and other baselines on HotpotQA and 2Wiki (Song et al., 7 Mar 2025). Search-R1++ systematically studies prompt template, reward function, and optimizer, finding that the Fast Thinking template is more stable than Slow Thinking, that F1-based reward needs action-level penalties to avoid answer avoidance, and that REINFORCE outperforms PPO and GRPO while using fewer search actions (Xu et al., 23 Feb 2026). SFR-DeepResearch moves in a different direction, emphasizing a native autonomous single-agent setup with tools such as search_internet, browse_page, code_interpreter, and clean_memory, and reports 28.7% on Humanity’s Last Exam for SFR-DR-20B (Nguyen et al., 8 Sep 2025).
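An F1-based answer reward with a format penalty, of the kind attributed to R1-Searcher above, might look like the following. Token-level F1 is the standard QA metric; the whitespace tokenization, the lowercase normalization, and the `format_penalty` parameter are assumptions for the example rather than the paper's settings.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_reward(prediction: str, gold: str,
                  format_ok: bool, format_penalty: float) -> float:
    """F1 reward, reduced by a hypothetical penalty when the output format is violated."""
    reward = token_f1(prediction, gold)
    return reward if format_ok else reward - format_penalty
```

Unlike exact match, F1 gives partial credit for overlapping answers, which is why such rewards are typically paired with explicit penalties that keep the policy from gaming the metric.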
The literature also broadens deep research beyond text-only web QA. DeepResearch$^{\text{Eco}}$ implements recursive depth- and breadth-controlled exploration for ecological scientific synthesis and reports a 14.9-fold increase in sources integrated per 1,000 words (D'Souza et al., 14 Jul 2025). MM-DeepResearch extends the paradigm to multimodal search with Hyper-Search, DR-TTS, and an offline search engine, improving average accuracy across six benchmarks from 46.2 to 63.0 for the 8B setting and from 34.1 to 56.1 for the 7B setting (Yao et al., 1 Mar 2026). DuMate-DeepResearch further emphasizes auditable traces, graph-based dynamic planning, recursive two-level execution, and rubric-based test-time optimization, reporting 58.03% on DeepResearch Bench and 61.95% on DeepResearch Bench II (Yan et al., 5 Jun 2026).
This broader context indicates that DeepResearch-R1 is best understood not as a closed, singular architecture but as one node in a research lineage concerned with the same core technical problems: long-horizon planning, noisy retrieval, reward design, context management, auditability, and grounded synthesis.