SynPlanResearch-R1: Deep-Research Agent Framework

Updated 4 July 2026

The paper introduces a cold-start SFT stage with synthetic plan-guided trajectories to address RL exploration failures.
It employs Group Relative Policy Optimization (GRPO) to refine tool-use strategies and enhance answer quality across benchmarks.
Empirical results demonstrate up to 6.0% relative F1 improvements on multi-hop and open-web QA over state-of-the-art baselines.

SynPlanResearch-R1 is a framework for training deep-research agents that use external tools to answer complex, knowledge-intensive questions through multi-turn interaction. It was introduced as a response to a specific failure mode in reinforcement learning from verifiable rewards (RLVR): agents trained end-to-end often terminate too early and overuse a single familiar tool, which limits evidence gathering and constrains multi-step exploration. The framework addresses this cold-start bottleneck by synthesizing plan-guided tool-use trajectories for supervised fine-tuning (SFT), then refining the resulting policy with RLVR using Group Relative Policy Optimization (GRPO). On seven multi-hop and open-web benchmarks, it improves average F1 to 0.580 on Qwen3-8B and 0.529 on Qwen3-4B, corresponding to relative gains of up to 6.0% and 5.8% over state-of-the-art baselines (Zeng et al., 9 Mar 2026).

1. Research setting and motivation

SynPlanResearch-R1 is situated in the setting of research agents: LLMs equipped with tools such as web_search and crawl_webpage. These systems must dynamically interleave internal reasoning with tool use in order to collect evidence from the web and synthesize answers to multi-hop or open-domain questions (Zeng et al., 9 Mar 2026).

The framework is motivated by two exploration failures observed under pure RLVR. The first is premature termination, in which the agent makes only a small number of tool calls before producing a final answer. The second is biased tool usage, in which an agent in a multi-tool environment relies heavily on one tool while neglecting others that may provide richer information. Because RLVR is on-policy, weak initialization can trap the policy in these behaviors: the agent remains near its initial strategy and does not reliably discover longer or more varied tool-use sequences (Zeng et al., 9 Mar 2026).

This formulation places exploration quality, rather than answer generation alone, at the center of training. A plausible implication is that deep-research performance depends not only on whether a model can call tools, but also on whether its initial policy encourages sufficiently broad search trajectories for RL to improve.

2. Framework architecture

SynPlanResearch-R1 has two stages. The first stage is cold-start SFT with plan-guided synthetic data. The second stage is RLVR using GRPO to further refine the policy (Zeng et al., 9 Mar 2026).

The synthetic-data pipeline begins by sampling a random tool plan

$P=(a_1,\dots,a_L),$

where $L\in[L_{\min},L_{\max}]$, $a_1=\text{web\_search}$, and later actions are sampled uniformly from $\{\text{web\_search},\text{crawl\_webpage}\}$. During rollout, the system injects action-specific cues such as “Let me think of the best query for web_search” at the start of each <think> block, thereby softly steering a large reasoning model to follow the sampled plan. The trajectory alternates among reasoning, tool calls, and tool responses for up to $L$ steps, and then emits a final <answer> (Zeng et al., 9 Mar 2026).
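The plan-sampling step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the uniform choice of plan length, and the wording of the crawl_webpage cue are assumptions; only the web_search cue is quoted in the text.

```python
import random

TOOLS = ("web_search", "crawl_webpage")

# Only the web_search cue is quoted in the article; the other wording is assumed.
CUES = {
    "web_search": "Let me think of the best query for web_search",
    "crawl_webpage": "Let me pick the most promising page for crawl_webpage",
}


def sample_tool_plan(l_min, l_max, rng=random):
    """Sample a plan P = (a_1, ..., a_L) with a_1 fixed to web_search."""
    length = rng.randint(l_min, l_max)  # assumption: lengths drawn uniformly
    # Later actions are drawn uniformly from the two available tools.
    return ["web_search"] + [rng.choice(TOOLS) for _ in range(length - 1)]


def cue_for_step(plan, step):
    """Return the cue injected at the start of the reasoning block for a step."""
    return CUES[plan[step]]


if __name__ == "__main__":
    plan = sample_tool_plan(3, 8, random.Random(0))
    for i, action in enumerate(plan):
        print(i + 1, action, "|", cue_for_step(plan, i))
```

Passing a seeded `random.Random` instance keeps the sampled plans reproducible, which is convenient when the same question is rolled out several times.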

Only trajectories that satisfy strict filtering criteria are retained. The JSON format must be valid, and the final answer must either exactly match the gold answer or match it under F1-overlap. Each retained trajectory is then passed through a thought-rewriting step using Claude-3.7-Sonnet in order to remove cue artifacts while preserving the tool-use structure. The filtered and rewritten trajectories form the corpus $\mathcal{D}_{\mathrm{sft}}$ used for cold-start imitation learning (Zeng et al., 9 Mar 2026).
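The filtering rule can be illustrated with a hedged sketch. The token-level F1 below follows the common QA convention; the threshold `f1_threshold`, the tokenizer, and the function names are assumptions, since the text only states that answers must match exactly or under F1-overlap.

```python
import json
import re
from collections import Counter


def _tokens(text):
    """Lowercase word tokens; a simplification of typical QA normalization."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer."""
    pred, ref = _tokens(prediction), _tokens(gold)
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def keep_trajectory(tool_call_payloads, final_answer, gold, f1_threshold=0.8):
    """Keep a trajectory only if every tool call is valid JSON and the answer matches.

    f1_threshold is a hypothetical value, not one reported in the paper.
    """
    for payload in tool_call_payloads:
        try:
            json.loads(payload)
        except json.JSONDecodeError:
            return False
    exact = _tokens(final_answer) == _tokens(gold)
    return exact or token_f1(final_answer, gold) >= f1_threshold
```

Rejecting on the first malformed payload mirrors the strictness described above: a trajectory with a single invalid tool call is dropped even if its final answer is correct.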

The supervised objective is

$L_{\text{sup}} = -\sum_{(q,y)\in\mathcal{D}_{\mathrm{sft}}} \log \pi_\theta(y\mid q).$

The paper explicitly notes that there is no separate exploration-bonus term in SFT; instead, exploration is shaped implicitly by conditioning on diverse plans and cue-guided trajectories (Zeng et al., 9 Mar 2026).
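Because the policy is autoregressive, $\log \pi_\theta(y\mid q)$ decomposes into a sum of per-token log-probabilities, so the supervised objective is a summed negative log-likelihood. A minimal sketch, assuming the per-token log-probabilities have already been computed by some model (the function name and input layout are hypothetical):

```python
def sft_loss(token_logprobs_per_example):
    """Summed negative log-likelihood over the SFT corpus.

    Each inner list holds log pi(y_t | q, y_<t) for one trajectory y.
    No exploration bonus is added, matching the objective described above.
    """
    return -sum(sum(logprobs) for logprobs in token_logprobs_per_example)


if __name__ == "__main__":
    corpus = [[-0.5, -0.25], [-1.0]]  # two toy trajectories
    print(sft_loss(corpus))  # 1.75
```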

3. Training procedure and optimization

The cold-start SFT stage initializes from a pre-trained backbone such as Qwen3-8B, fine-tunes for 2.5 epochs with learning rate $5\times 10^{-6}$, and uses a maximum sequence length of 32K tokens. Synthetic data is produced by a fixed large reasoning model, Qwen3-32B. For each question, $N=16$ plan-guided rollouts are generated using tool-plan lengths in $[3,8]$. The SFT corpus itself is built from 8K questions mixed from multi-hop QA, SuperGPQA, and WebWalker-Silver (Zeng et al., 9 Mar 2026).
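For reference, the reported cold-start settings can be collected into a single configuration object. This is only an organizational sketch: the class and field names are hypothetical, "32K" is read as 32,768 tokens, and "8K" as 8,000 questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColdStartSFTConfig:
    backbone: str = "Qwen3-8B"
    epochs: float = 2.5
    learning_rate: float = 5e-6
    max_seq_len_tokens: int = 32 * 1024  # assumption: "32K" means 32,768
    teacher_model: str = "Qwen3-32B"     # fixed model that generates rollouts
    rollouts_per_question: int = 16      # N in the text
    plan_length_range: tuple = (3, 8)    # [L_min, L_max]
    num_questions: int = 8000            # assumption: "8K" means 8,000

    @property
    def max_candidate_trajectories(self):
        """Upper bound on rollouts generated before filtering."""
        return self.num_questions * self.rollouts_per_question


if __name__ == "__main__":
    print(ColdStartSFTConfig().max_candidate_trajectories)  # 128000
```

Under these readings, at most 128,000 candidate trajectories are generated before filtering; the size of the retained corpus after filtering is not stated here.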

After SFT converges, the framework switches to RLVR with GRPO. For each prompt, the model samples a group of on-policy rollouts at a fixed temperature with top-$p$ sampling. Rollouts allow a maximum of 8 tool-calling turns, input length 16K tokens, and response length 4K tokens. The reward function combines answer overlap with format validity and distinguishes three cases, each with its own reward value: valid answers with positive F1, valid answers with zero F1, and invalid-format trajectories or those with F1 below a threshold (Zeng et al., 9 Mar 2026).

The GRPO objective is a per-token policy-gradient surrogate in which each term is weighted by the group-normalized advantage of its rollout and multiplied by a mask that excludes tool-response tokens and void trajectories. RL is run for 80 steps at a fixed learning rate. Invalid or truncated trajectories are masked from the policy gradient, but their rewards are still used in group-relative advantage computation (Zeng et al., 9 Mar 2026).
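The interaction between masking and group normalization can be made concrete with a small sketch. It assumes the commonly used clipped GRPO surrogate without a KL term; the clipping range `clip_eps`, the token-level averaging, and all function names are assumptions rather than details confirmed by the paper.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group.

    Rewards of void trajectories are included here, even though their
    tokens are later masked out of the gradient.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_grpo_surrogate(logp_new, logp_old, rewards, token_masks, clip_eps=0.2):
    """Clipped surrogate averaged over unmasked tokens (to be maximized).

    logp_new / logp_old: per-rollout lists of per-token log-probabilities.
    token_masks: 1 for policy-generated tokens, 0 for tool-response tokens
    or for every token of a void trajectory.
    """
    advantages = group_advantages(rewards)
    total, count = 0.0, 0
    for new_seq, old_seq, adv, mask in zip(logp_new, logp_old, advantages, token_masks):
        for new, old, keep in zip(new_seq, old_seq, mask):
            if not keep:
                continue
            ratio = math.exp(new - old)
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            total += min(ratio * adv, clipped * adv)
            count += 1
    return total / count if count else 0.0
```

In this sketch a void trajectory still shifts the group mean and standard deviation, so it changes the advantages of its valid siblings while contributing no gradient of its own.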

The paper notes that a joint objective combining the supervised loss with the RL objective is possible in principle, but the reported experiments apply SFT and RL sequentially rather than jointly (Zeng et al., 9 Mar 2026).

4. Benchmarks and empirical performance

Evaluation covers seven tasks spanning multi-hop QA and advanced open-web QA (Zeng et al., 9 Mar 2026).

Benchmark family	Tasks	Scale
Multi-Hop QA	HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle	HotpotQA and 2WikiMultihopQA downsampled to 5K; MuSiQue and Bamboogle full
Advanced Open-Web QA	GPQA, WebWalkerQA, GAIA	As reported

The model backbones are Qwen3-8B and Qwen3-4B. Baselines include Direct Inference, RAG, Search-o1, Rejection Sampling, Search-R1, and SimpleDeepSearcher. On the seven-benchmark average, SynPlanResearch-R1 attains 0.580 F1 on Qwen3-8B, compared with 0.547 for SimpleDeepSearcher, and 0.529 F1 on Qwen3-4B, compared with 0.500. These correspond to relative gains of 6.0% and 5.8%, respectively. On multi-hop QA alone, the reported improvement reaches up to +5.1% absolute; on advanced QA, up to +8.7% (Zeng et al., 9 Mar 2026).
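The relative gains follow directly from the reported averages. A short check, using only the four F1 values quoted above (the helper name is illustrative):

```python
def relative_gain(score, baseline):
    """Relative improvement of score over baseline, as a fraction."""
    return (score - baseline) / baseline


if __name__ == "__main__":
    reported = [("Qwen3-8B", 0.580, 0.547), ("Qwen3-4B", 0.529, 0.500)]
    for backbone, score, baseline in reported:
        rel = 100 * relative_gain(score, baseline)
        print(f"{backbone}: {rel:.1f}% relative, {score - baseline:+.3f} absolute")
    # Qwen3-8B: 6.0% relative, +0.033 absolute
    # Qwen3-4B: 5.8% relative, +0.029 absolute
```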

These results are presented as evidence that cold-start exploration shaping matters at both model scales. The reported gains are not confined to a single dataset family, but extend across both canonical multi-hop QA and more demanding open-web settings.

5. Tool-use behavior, ablations, and training dynamics

A central empirical claim of SynPlanResearch-R1 is that improved answers are associated with deeper exploration rather than merely with longer outputs. Tracking checkpoints at 20, 40, 60, and 80 RL steps shows that the framework maintains both higher average tool calls and higher scores than baselines, which the paper interprets as confirmation that deeper exploration leads to better answers (Zeng et al., 9 Mar 2026).

Training dynamics further support this view. The framework retains higher policy entropy throughout RL, enabling continued exploration. Although its early reward is lower, it climbs to a higher plateau, whereas naïve RL from a shallow SFT baseline converges prematurely. This makes exploration quality a measurable part of optimization, not simply a by-product (Zeng et al., 9 Mar 2026).

The ablation results isolate several components. Removing cue injection drops multi-hop QA by approximately 0.31 and GAIA by approximately 0.38. Restricting the system to web_search only degrades GAIA by 0.80, highlighting the importance of crawl_webpage. Shortening the maximum tool budget or lowering rollout temperature also hurts performance significantly (Zeng et al., 9 Mar 2026).

Pre-training statistics on tool-use structure make the same point numerically. Standard ReAct yields 1.40 tool calls per question with no plan adherence. Adding a tool plan raises this to 2.05 tool calls with 28.3% plan adherence. Adding both tool plan and cue injection yields 4.36 tool calls with 76.9% plan adherence. This suggests that the synthetic-planning mechanism primarily acts by changing the agent’s exploration prior before RL begins (Zeng et al., 9 Mar 2026).
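Statistics of this kind could be computed along the following lines. The sketch assumes that plan adherence means the executed tool sequence exactly matches the sampled plan; the paper's precise definition is not given here, and the function name and record layout are hypothetical.

```python
def tool_use_stats(records):
    """Compute (average tool calls, plan adherence) over trajectories.

    records: list of (plan, executed) pairs, where each element is a list
    of tool names and plan is None when no plan was sampled.
    Assumption: adherence is exact sequence match with the plan.
    """
    if not records:
        return 0.0, 0.0
    avg_calls = sum(len(executed) for _, executed in records) / len(records)
    planned = [(p, e) for p, e in records if p is not None]
    if not planned:
        return avg_calls, 0.0
    adherence = sum(list(p) == list(e) for p, e in planned) / len(planned)
    return avg_calls, adherence


if __name__ == "__main__":
    toy = [
        (["web_search", "crawl_webpage"], ["web_search", "crawl_webpage"]),
        (["web_search", "web_search"], ["web_search"]),
    ]
    print(tool_use_stats(toy))  # (1.5, 0.5)
```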

6. Relation to AI-powered research planning

SynPlanResearch-R1 belongs to a broader line of work that treats planning as a first-class problem for scientific or research-oriented LLM systems, but it targets a different object of planning than several adjacent efforts.

“Idea2Plan: Exploring AI-Powered Research Planning” defines research planning as the transformation from a concise scientific idea into a five-section structured research plan, and introduces Idea2Plan Bench and Idea2Plan JudgeEval to measure that capability (Huang et al., 28 Oct 2025). In that setting, the output is a document-like plan with five sections: an introduction; key literature; methods; experimental design; and resources, compliance, and ethical considerations. By contrast, SynPlanResearch-R1 operates at the level of tool-use trajectories for answering multi-hop and open-web questions rather than at the level of abstract research proposals (Zeng et al., 9 Mar 2026).

“Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward” proposes DecomposeR, which represents research plans as typed directed acyclic graphs and assigns explicit planner rewards to graph structure, search quality, and synthesis behavior (Hussain et al., 29 May 2026). SynPlanResearch-R1 differs in its mechanism: instead of rewarding an explicit plan object such as a typed DAG, it synthesizes trajectories that encourage longer and more diverse sequences of tool calls and uses those trajectories to shape the initialization for RL (Zeng et al., 9 Mar 2026).

Taken together, these works outline distinct but complementary formulations of research planning. Idea2Plan evaluates the transition from idea to executable plan (Huang et al., 28 Oct 2025); DecomposeR makes planning explicit and rewardable through graph structure (Hussain et al., 29 May 2026); SynPlanResearch-R1 addresses exploration failure in tool-using research agents through synthetic plan-guided imitation and subsequent RL refinement (Zeng et al., 9 Mar 2026).

7. Limitations and future directions

The paper’s analysis emphasizes that RLVR alone yields only marginal gains over imitation learning when the starting policy explores poorly. SynPlanResearch-R1 is therefore presented not as a replacement for RL, but as a mechanism for improving the initial exploration distribution that RL can subsequently refine (Zeng et al., 9 Mar 2026).

The reported future directions remain closely tied to that premise. They include scaling to a greater variety of tools such as databases and calculators, automating cue and plan generation, integrating off-policy RL methods to better exploit synthetic trajectories, and applying the framework to non-QA domains including code synthesis, data analysis, and multimodal research (Zeng et al., 9 Mar 2026). A plausible implication is that the framework’s main contribution is methodological rather than domain-specific: it provides a way to control exploration structure before on-policy RL begins.

The code and data are publicly available at the project repository, reflecting the framework’s emphasis on reproducible agent-training pipelines (Zeng et al., 9 Mar 2026).
