DeepDive-32B: Deep Search LLM Agent
- DeepDive-32B is an open-source 32B-parameter LLM that leverages knowledge graph-based question synthesis and multi-turn reinforcement learning for deep web search.
- It employs controlled random walks over knowledge graphs and LLM-based entity obfuscation to generate complex, multi-hop QA pairs, which are then filtered for difficulty with a frontier model.
- The system enhances search accuracy by scaling tool call budgets and using parallel sampling, demonstrated through competitive benchmarks like BrowseComp.
DeepDive-32B is an open-source, 32B-parameter LLM specifically designed to function as a deep search agent, integrating knowledge graph-derived data synthesis and multi-turn reinforcement learning (RL) to advance the capabilities of autonomous browsing agents. Distinct from prior systems, DeepDive-32B synthesizes complex question–answer (QA) pairs from multi-hop paths in open knowledge graphs (KGs), couples these with long-horizon multi-turn RL in a web environment, and demonstrates competitive performance on challenging benchmarks such as BrowseComp. The system's architecture enables scaling of tool calls at inference and leverages parallel sampling for improved robustness and efficiency. All datasets, models, and code are publicly released for further research and adoption (Lu et al., 12 Sep 2025).
1. Core Framework and System Design
DeepDive-32B implements a two-stage pipeline to enhance deep search in LLMs:
- Knowledge Graph–Based Data Synthesis: The system automatically generates complex, reasoning-intensive QA pairs via random walks on attribute-rich open KGs (e.g., KILT and AMiner). Paths longer than five hops are selected to ensure logical depth, with node attribute enrichment and deliberate content obfuscation via an auxiliary LLM (to create “blurry entities”). To filter out trivial instances, a frontier model is used for automated difficulty calibration by discarding questions that are too easily solved.
- End-to-End Multi-Turn RL Agent: DeepDive-32B is equipped with a multi-turn, web-interactive agent. At each step $t$, the agent generates a chain-of-thought $c_t$, selects a browsing action $a_t$ (e.g., search/click/open), and receives an observation $o_t$, continuing iteratively until termination (see the loop sketch below). The agent is optimized with Group Relative Policy Optimization (GRPO), a policy-gradient RL method, under a strict reward scheme: a trajectory receives +1 only if every step (formatting and final answer) is correct; otherwise it is assigned 0.
This design enables the agent to blend internal reasoning with external tool-based exploration in a web environment, systematically addressing long-horizon information retrieval and logical reasoning.
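A minimal sketch of this think/act/observe loop is given below. The `policy` and `browser` objects, the `generate_step` and `execute` methods, and the default tool call budget are illustrative assumptions; none of the names come from the released DeepDive code.

```python
# Minimal sketch of the multi-turn deep-search loop described above.
# `policy` and `browser` are hypothetical stand-ins for the LLM and the web
# environment; the names do not come from the released DeepDive code.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "search", "click", "open", or "answer"
    content: str   # query string, URL, or final answer text

def run_episode(policy, browser, question: str, max_tool_calls: int = 32):
    """Iterate think -> act -> observe until the agent emits a final answer."""
    history = [("question", question)]
    for _ in range(max_tool_calls):
        # The policy produces a chain-of-thought c_t and an action a_t.
        thought, action = policy.generate_step(history)
        history.append(("thought", thought))
        history.append(("action", action))
        if action.kind == "answer":              # termination: final answer emitted
            return action.content, history
        observation = browser.execute(action)    # observation o_t from the web tool
        history.append(("observation", observation))
    return None, history                         # tool call budget exhausted
```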
2. Automated Question Synthesis and Data Pipeline
The question generation pipeline is as follows:
- Controlled Random Walks: Each start node in the KG is expanded via a random walk to produce a multi-hop path of linked entities $e_0 \to e_1 \to \dots \to e_k$; node attributes enrich the path context.
- Blurring and Obfuscation: For each node, an LLM obfuscates key attributes to prevent direct lookup and induce ambiguity. The final QA pair is constructed with the answer drawn from the terminal node $e_k$ and with context cues masked or generalized.
- Difficulty Filtering: Each QA pair is attempted multiple times by a strong reference LLM (e.g., GPT-4o); only pairs that remain unsolved across these initial attempts (i.e., "hard-to-find" targets) are retained.
This process, sketched below, yields a dataset with multi-hop logical complexity and non-trivial ambiguity, reflecting the demands encountered in real-world web search scenarios.
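The following sketch illustrates the synthesis pipeline end to end. The graph library, the LLM wrapper methods (`blur`, `compose_question`, `answer`), the hop thresholds, and the retry count are assumptions for illustration, not the released implementation.

```python
# Illustrative sketch of KG-based QA synthesis: controlled random walk,
# attribute obfuscation, and difficulty filtering. All helper names and
# thresholds are assumptions, not the released DeepDive pipeline.
import random
import networkx as nx

def controlled_random_walk(kg: nx.DiGraph, start, min_hops: int = 5, max_hops: int = 8):
    """Walk the KG from `start`; keep the path only if it is at least min_hops long."""
    path = [start]
    while len(path) - 1 < max_hops:
        neighbors = list(kg.successors(path[-1]))
        if not neighbors:
            break
        path.append(random.choice(neighbors))
    return path if len(path) - 1 >= min_hops else None

def synthesize_qa(kg, start, obfuscator_llm, reference_llm, attempts: int = 4):
    path = controlled_random_walk(kg, start)
    if path is None:
        return None
    # Blur node attributes ("blurry entities") so the answer cannot be found by lookup.
    blurred = [obfuscator_llm.blur(kg.nodes[node]) for node in path[:-1]]
    question = obfuscator_llm.compose_question(blurred)
    answer = kg.nodes[path[-1]].get("name")      # answer taken from the terminal node
    # Difficulty filter: keep only questions the strong reference model fails to solve.
    solved = any(reference_llm.answer(question) == answer for _ in range(attempts))
    return None if solved else (question, answer)
```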
3. Multi-Turn Reinforcement Learning Strategy
End-to-end RL is central to DeepDive-32B’s ability to execute extended web interaction sequences:
- Trajectory Structure: The agent's execution on a query $q$ is modeled as a trajectory $\tau = (c_1, a_1, o_1, \dots, c_T, a_T, \hat{y})$, where $c_t$, $a_t$, and $o_t$ denote the chain-of-thought, action, and observation at step $t$, and $\hat{y}$ is the final answer.
- RL Optimization Objective: For each group of $G$ sampled trajectories $\{\tau_i\}_{i=1}^{G}$ with rewards $\{R_i\}$, the normalized advantage is $\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\})}{\operatorname{std}(\{R_j\})}$. The policy update maximizes

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i\Big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\right],$$

  where $r_i(\theta) = \pi_\theta(\tau_i \mid q) / \pi_{\theta_{\mathrm{old}}}(\tau_i \mid q)$ is the importance sampling ratio and $\pi_{\mathrm{ref}}$ is a reference policy.
- Reward Shaping: Only fully correct multi-turn trajectories receive +1; any formatting or answer errors result in an immediate 0, enforcing strict compliance and favoring robust, high-precision search strategies.
Empirically, RL steadily increases the agent’s depth of reasoning, as quantified by the average number of tool calls per question and overall accuracy on targeted benchmarks.
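The strict reward and the group-relative advantage described above can be summarized in a few lines. The reward flags, group size, and epsilon term are illustrative assumptions of this sketch, not the paper's exact configuration.

```python
# Sketch of the all-or-nothing reward and the GRPO group-normalized advantage.
# Reward flags, group size, and eps are illustrative assumptions.
import statistics

def trajectory_reward(all_steps_well_formatted: bool, answer_correct: bool) -> float:
    """+1 only if every step is correctly formatted AND the final answer is right."""
    return 1.0 if (all_steps_well_formatted and answer_correct) else 0.0

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts for one question; only the second is fully correct.
rewards = [trajectory_reward(*flags) for flags in
           [(True, False), (True, True), (False, True), (True, False)]]
print(group_advantages(rewards))   # only the fully correct rollout gets a positive advantage
```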
4. Performance Evaluation and Benchmarks
Key experimental findings include:
| Model | BrowseComp Accuracy (%) | Notable Benchmark Comparison | Tool Call Scaling |
|---|---|---|---|
| DeepDive-32B | 14.8 | Outperforms WebSailor, DeepSeek-R1-Browse, and Search-o1 | Accuracy improves with more allowed tool calls |
- Benchmarks: DeepDive-32B is evaluated on BrowseComp, BrowseComp-ZH, Xbench-DeepSearch, and SEAL-0.
- Metrics: Accuracy (fraction of questions answered correctly), pass@1, and the average number of tool calls per question (a measure of search horizon).
- Findings: RL-trained DeepDive-32B surpasses several proprietary and open baseline models; accuracy improves monotonically with increased tool call budgets.
This suggests that scaling tool call budgets at inference is an effective mechanism for improving deep search performance in LLM agents.
5. Tool Call Budget and Parallel Sampling Strategies
DeepDive-32B incorporates two test-time scaling techniques:
- Extended Tool Call Budgets: Increasing the cap on permissible tool calls per question systematically raises the likelihood of answer correctness, as the agent explores a deeper search tree and retrieves more relevant context.
- Parallel Sampling: Multiple independent multi-turn trajectories are generated for each input. Instead of selecting the answer by majority voting, the answer produced with the fewest tool calls is chosen, which empirically yields the highest accuracy and reduces over-searching (see the selection sketch below); this nearly doubles accuracy compared to a single sampled trajectory.
A plausible implication is that these techniques introduce robustness to uncertainty and provide an efficient means to exploit ensemble strategies within agentic systems.
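The minimum-tool-call selection rule can be sketched as follows, reusing the hypothetical `run_episode` loop from Section 1; the sample count and tie-breaking behavior are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch of parallel sampling with minimum-tool-call answer selection, reusing
# the hypothetical run_episode loop sketched in Section 1.

def parallel_sample(policy, browser, question: str,
                    n_samples: int = 8, max_tool_calls: int = 32):
    """Run several independent trajectories; return the answer that needed the fewest tool calls."""
    candidates = []
    for _ in range(n_samples):
        answer, history = run_episode(policy, browser, question, max_tool_calls)
        if answer is None:
            continue                              # budget exhausted, no answer produced
        tool_calls = sum(1 for kind, _ in history if kind == "observation")
        candidates.append((tool_calls, answer))
    if not candidates:
        return None
    # Select the cheapest successful trajectory rather than majority-voting.
    return min(candidates, key=lambda c: c[0])[1]
```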
6. Datasets, Models, and Resource Availability
All resources for DeepDive-32B are publicly accessible:
- Synthetic QA Corpora: Automatically generated multi-hop, deep-search QA datasets in English and Chinese, derived from open KGs and comparable in difficulty to BrowseComp and BrowseComp-ZH.
- Model Artefacts: Weights for DeepDive-32B and related models (e.g., built on QwQ-32B, GLM-Z1-9B-0414), together with RL-trained agent-specific checkpoints.
- Software: The full framework for data synthesis, supervised fine-tuning, RL training pipelines, inference (with tool management and parallel decoding), and evaluation benchmarks is available at https://github.com/THUDM/DeepDive.
This open release enables rigorous reproducibility, comparative analysis, and straightforward adoption by the research community for exploration and further innovation in deep search agents.
7. Impact and Future Directions
DeepDive-32B establishes a new standard for aligning LLMs with external knowledge retrieval and web automation:
- Provides an effective blueprint for integrating knowledge graphs into training data pipelines and using RL to enhance long-horizon, multi-step agentic behavior.
- Demonstrates that test-time adaptation via tool call scaling and parallel sampling can lead to substantial gains in open-domain search tasks.
- Opens avenues for continued research into programmatic data augmentation, multi-agent collaboration, and the coordination of autonomous search with other external plugins and modalities.
The methodology and resources surrounding DeepDive-32B position it as a foundation for ongoing work in search-centric LLMs, knowledge graph reasoning, and deployed agentic applications (Lu et al., 12 Sep 2025).