DeepDive-32B: Deep Search LLM Agent
- DeepDive-32B is an open-source 32B-parameter LLM that leverages knowledge graph-based question synthesis and multi-turn reinforcement learning for deep web search.
- It employs controlled random walks over knowledge graphs and LLM-based entity obfuscation to generate complex, multi-hop QA pairs, which are then filtered for difficulty with a frontier model.
- The system enhances search accuracy by scaling tool call budgets and using parallel sampling, demonstrated through competitive benchmarks like BrowseComp.
DeepDive-32B is an open-source, 32B-parameter LLM specifically designed to function as a deep search agent, integrating knowledge graph-derived data synthesis and multi-turn reinforcement learning (RL) to advance the capabilities of autonomous browsing agents. Distinct from prior systems, DeepDive-32B synthesizes complex question–answer (QA) pairs from multi-hop paths in open knowledge graphs (KGs), couples these with long-horizon multi-turn RL in a web environment, and demonstrates competitive performance on challenging benchmarks such as BrowseComp. The system's architecture enables scaling of tool calls at inference and leverages parallel sampling for improved robustness and efficiency. All datasets, models, and code are publicly released for further research and adoption (Lu et al., 12 Sep 2025).
1. Core Framework and System Design
DeepDive-32B implements a two-stage pipeline to enhance deep search in LLMs:
- Knowledge Graph–Based Data Synthesis: The system automatically generates complex, reasoning-intensive QA pairs via random walks on attribute-rich open KGs (e.g., KILT and AMiner). Paths longer than five hops are selected to ensure logical depth, with node attribute enrichment and deliberate content obfuscation via an auxiliary LLM (to create “blurry entities”). To filter out trivial instances, a frontier model is used for automated difficulty calibration by discarding questions that are too easily solved.
- End-to-End Multi-Turn RL Agent: DeepDive-32B is equipped with a multi-turn, web-interactive agent. At each step $t$, the agent generates a chain-of-thought $c_t$, selects a browsing action $a_t$ (e.g., search/click/open), and receives an observation $o_t$, continuing iteratively until termination (see the loop sketch below). The agent is optimized with Group Relative Policy Optimization (GRPO), a policy-gradient RL method, under a strict reward scheme: a trajectory receives +1 only if every step (formatting and final answer) is correct; otherwise it is assigned 0.
This design enables the agent to blend internal reasoning with external tool-based exploration in a web environment, systematically addressing long-horizon information retrieval and logical reasoning.
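A minimal sketch of this think/act/observe loop is given below. The `policy` and `browser` objects, the `generate_step` and `execute` methods, and the default tool call budget are illustrative assumptions; none of the names come from the released DeepDive code.

```python
# Minimal sketch of the multi-turn deep-search loop described above.
# `policy` and `browser` are hypothetical stand-ins for the LLM and the web
# environment; the names do not come from the released DeepDive code.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "search", "click", "open", or "answer"
    content: str   # query string, URL, or final answer text

def run_episode(policy, browser, question: str, max_tool_calls: int = 32):
    """Iterate think -> act -> observe until the agent emits a final answer."""
    history = [("question", question)]
    for _ in range(max_tool_calls):
        # The policy produces a chain-of-thought c_t and an action a_t.
        thought, action = policy.generate_step(history)
        history.append(("thought", thought))
        history.append(("action", action))
        if action.kind == "answer":              # termination: final answer emitted
            return action.content, history
        observation = browser.execute(action)    # observation o_t from the web tool
        history.append(("observation", observation))
    return None, history                         # tool call budget exhausted
```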
2. Automated Question Synthesis and Data Pipeline
The question generation pipeline is as follows:
- Controlled Random Walks: Each start node in the KG is expanded via a random walk to produce a multi-hop path of linked entities $e_0 \to e_1 \to \dots \to e_k$; node attributes enrich the path context.
- Blurring and Obfuscation: For each node, an LLM obfuscates key attributes to prevent direct lookup and induce ambiguity. The final QA pair is constructed with the answer drawn from the terminal node $e_k$ and with context cues masked or generalized.
- Difficulty Filtering: Each QA pair is attempted multiple times by a strong reference LLM (e.g., GPT-4o); only pairs that remain unsolved across these initial attempts (i.e., "hard-to-find" targets) are retained.
This process, sketched below, yields a dataset with multi-hop logical complexity and non-trivial ambiguity, reflecting the demands encountered in real-world web search scenarios.
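The following sketch illustrates the synthesis pipeline end to end. The graph library, the LLM wrapper methods (`blur`, `compose_question`, `answer`), the hop thresholds, and the retry count are assumptions for illustration, not the released implementation.

```python
# Illustrative sketch of KG-based QA synthesis: controlled random walk,
# attribute obfuscation, and difficulty filtering. All helper names and
# thresholds are assumptions, not the released DeepDive pipeline.
import random
import networkx as nx

def controlled_random_walk(kg: nx.DiGraph, start, min_hops: int = 5, max_hops: int = 8):
    """Walk the KG from `start`; keep the path only if it is at least min_hops long."""
    path = [start]
    while len(path) - 1 < max_hops:
        neighbors = list(kg.successors(path[-1]))
        if not neighbors:
            break
        path.append(random.choice(neighbors))
    return path if len(path) - 1 >= min_hops else None

def synthesize_qa(kg, start, obfuscator_llm, reference_llm, attempts: int = 4):
    path = controlled_random_walk(kg, start)
    if path is None:
        return None
    # Blur node attributes ("blurry entities") so the answer cannot be found by lookup.
    blurred = [obfuscator_llm.blur(kg.nodes[node]) for node in path[:-1]]
    question = obfuscator_llm.compose_question(blurred)
    answer = kg.nodes[path[-1]].get("name")      # answer taken from the terminal node
    # Difficulty filter: keep only questions the strong reference model fails to solve.
    solved = any(reference_llm.answer(question) == answer for _ in range(attempts))
    return None if solved else (question, answer)
```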
3. Multi-Turn Reinforcement Learning Strategy
End-to-end RL is central to DeepDive-32B’s ability to execute extended web interaction sequences:
- Trajectory Structure: The agent's execution on a query $q$ is modeled as a trajectory $\tau = (c_1, a_1, o_1, \dots, c_T, a_T, \hat{y})$, where $c_t$, $a_t$, and $o_t$ denote the chain-of-thought, action, and observation at step $t$, and $\hat{y}$ is the final answer.
- RL Optimization Objective: For each group of $G$ sampled trajectories $\{\tau_i\}_{i=1}^{G}$ with rewards $\{R_i\}$, the normalized advantage is $\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\})}{\operatorname{std}(\{R_j\})}$. The policy update maximizes

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i\Big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\right],$$

  where $r_i(\theta) = \pi_\theta(\tau_i \mid q) / \pi_{\theta_{\mathrm{old}}}(\tau_i \mid q)$ is the importance sampling ratio and $\pi_{\mathrm{ref}}$ is a reference policy.
- Reward Shaping: Only fully correct multi-turn trajectories receive +1; any formatting or answer errors result in an immediate 0, enforcing strict compliance and favoring robust, high-precision search strategies.
Empirically, RL steadily increases the agent’s depth of reasoning, as quantified by the average number of tool calls per question and overall accuracy on targeted benchmarks.
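The strict reward and the group-relative advantage described above can be summarized in a few lines. The reward flags, group size, and epsilon term are illustrative assumptions of this sketch, not the paper's exact configuration.

```python
# Sketch of the all-or-nothing reward and the GRPO group-normalized advantage.
# Reward flags, group size, and eps are illustrative assumptions.
import statistics

def trajectory_reward(all_steps_well_formatted: bool, answer_correct: bool) -> float:
    """+1 only if every step is correctly formatted AND the final answer is right."""
    return 1.0 if (all_steps_well_formatted and answer_correct) else 0.0

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts for one question; only the second is fully correct.
rewards = [trajectory_reward(*flags) for flags in
           [(True, False), (True, True), (False, True), (True, False)]]
print(group_advantages(rewards))   # only the fully correct rollout gets a positive advantage
```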
4. Performance Evaluation and Benchmarks
Key experimental findings include:
| Model | BrowseComp Accuracy (%) | Notable Benchmark Comparison | Tool Call Scaling |
|---|---|---|---|
| DeepDive-32B | 14.8 | Outperforms WebSailor, DeepSeek-R1-Browse, and Search-o1 | Accuracy improves with more allowed tool calls |
- Benchmarks: DeepDive-32B is evaluated on BrowseComp, BrowseComp-ZH, Xbench-DeepSearch, and SEAL-0.
- Metrics: Accuracy (fraction of questions answered correctly), pass@1, and the average number of tool calls per question (a measure of search horizon).
- Findings: RL-trained DeepDive-32B surpasses several proprietary and open baseline models; accuracy improves monotonically with increased tool call budgets.
This suggests that scaling tool call budgets at inference is an effective mechanism for improving deep search performance in LLM agents.
5. Tool Call Budget and Parallel Sampling Strategies
DeepDive-32B incorporates two test-time scaling techniques:
- Extended Tool Call Budgets: Increasing the cap on permissible tool calls per question systematically raises the likelihood of answer correctness, as the agent explores a deeper search tree and retrieves more relevant context.
- Parallel Sampling: Multiple independent multi-turn trajectories are generated for each input. Instead of selecting the answer by majority voting, the answer produced with the fewest tool calls is chosen, which empirically yields the highest accuracy and reduces over-searching (see the selection sketch below); this nearly doubles accuracy compared to a single sampled trajectory.
A plausible implication is that these techniques introduce robustness to uncertainty and provide an efficient means to exploit ensemble strategies within agentic systems.
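The minimum-tool-call selection rule can be sketched as follows, reusing the hypothetical `run_episode` loop from Section 1; the sample count and tie-breaking behavior are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch of parallel sampling with minimum-tool-call answer selection, reusing
# the hypothetical run_episode loop sketched in Section 1.

def parallel_sample(policy, browser, question: str,
                    n_samples: int = 8, max_tool_calls: int = 32):
    """Run several independent trajectories; return the answer that needed the fewest tool calls."""
    candidates = []
    for _ in range(n_samples):
        answer, history = run_episode(policy, browser, question, max_tool_calls)
        if answer is None:
            continue                              # budget exhausted, no answer produced
        tool_calls = sum(1 for kind, _ in history if kind == "observation")
        candidates.append((tool_calls, answer))
    if not candidates:
        return None
    # Select the cheapest successful trajectory rather than majority-voting.
    return min(candidates, key=lambda c: c[0])[1]
```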
6. Datasets, Models, and Resource Availability
All resources for DeepDive-32B are publicly accessible:
- Synthetic QA Corpora: Automatically generated multi-hop, deep-search QA datasets in English and Chinese, derived from open KGs and comparable in difficulty to BrowseComp and BrowseComp-ZH.
- Model Artefacts: Weights for DeepDive-32B and related models (e.g., built on QwQ-32B, GLM-Z1-9B-0414), together with RL-trained agent-specific checkpoints.
- Software: The full framework for data synthesis, supervised fine-tuning, RL training pipelines, inference (with tool management and parallel decoding), and evaluation benchmarks is available at https://github.com/THUDM/DeepDive.
This open release enables rigorous reproducibility, comparative analysis, and straightforward adoption by the research community for exploration and further innovation in deep search agents.
7. Impact and Future Directions
DeepDive-32B establishes a new standard for aligning LLMs with external knowledge retrieval and web automation:
- Provides an effective blueprint for integrating knowledge graphs into training data pipelines and using RL to enhance long-horizon, multi-step agentic behavior.
- Demonstrates that test-time adaptation via tool call scaling and parallel sampling can lead to substantial gains in open-domain search tasks.
- Opens avenues for continued research into programmatic data augmentation, multi-agent collaboration, and the coordination of autonomous search with other external plugins and modalities.
The methodology and resources surrounding DeepDive-32B position it as a foundation for ongoing work in search-centric LLMs, knowledge graph reasoning, and deployed agentic applications (Lu et al., 12 Sep 2025).