OpenSeeker-v2: Data-Centric LLM Search Agent
- The paper introduces OpenSeeker-v2, a 30-billion parameter model trained exclusively with supervised fine-tuning on synthetic, multi-hop search trajectories.
- It employs the ReAct framework, integrating planning and diverse tool actions within a single autoregressive decoder with a 256k-token context window.
- The system outperforms prior models on multiple benchmarks, suggesting that rigorous data synthesis can drive state-of-the-art performance with modest compute.
OpenSeeker-v2 is a 30-billion-parameter LLM-based search agent developed using only supervised fine-tuning (SFT) on a dataset of high-difficulty, informative trajectories. Designed around the ReAct paradigm and achieving state-of-the-art (SOTA) performance across multiple benchmarks, OpenSeeker-v2 demonstrates that rigorous data synthesis, rather than multi-stage industrial training pipelines, can yield frontier web search agents with a modest compute and data regime (Du et al., 5 May 2026).
1. Model Architecture and Paradigm
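OpenSeeker-v2 implements the ReAct framework, where the agent alternates between textual reasoning traces ($r_t$, e.g., “Thought:” or “Plan:” statements), tool call actions ($a_t$), and tool-generated observations ($o_t$) until a final answer ($y$) is produced. This agent is realized as a single autoregressive decoder (Qwen3-30B-A3B-Thinking) supporting a 256k-token context window, enabling substantial intermixed planning and action sequences. Each inference session may involve up to $T$ tool calls. Given the question $q$ and the history $h_{t-1}$ of earlier steps, each step represents the composition:
$$(r_t, a_t) \sim \pi_\theta(\cdot \mid q, h_{t-1}), \qquad o_t = \mathrm{Tool}(a_t), \qquad h_t = h_{t-1} \oplus (r_t, a_t, o_t)$$
The overall multi-step trajectory, with $n \le T$ tool calls, is formalized as:
$$\tau = (r_1, a_1, o_1, \ldots, r_n, a_n, o_n, y)$$
The training objective is standard autoregressive cross-entropy over teacher-forced trajectories, where $x_i$ denotes the $i$-th token of the serialized trajectory $\tau$:
$$\mathcal{L}(\theta) = -\sum_{i} \log \pi_\theta(x_i \mid q, x_{<i})$$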
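As a rough numeric illustration of this objective, the sketch below computes the mean token-level negative log-likelihood of one teacher-forced sequence. The function name `sft_loss`, the toy probabilities, and the optional `mask` argument are assumptions made for illustration; in particular, the source does not say whether tool-observation tokens are excluded from the loss.

```python
import math
from typing import Optional, Sequence


def sft_loss(target_probs: Sequence[float],
             mask: Optional[Sequence[int]] = None) -> float:
    """Mean negative log-likelihood over the supervised tokens.

    target_probs[i] is the probability the model assigns to the ground-truth
    token at position i, given the ground-truth prefix (teacher forcing).
    mask[i] == 1 keeps position i in the loss (hypothetical option).
    """
    if mask is None:
        mask = [1] * len(target_probs)
    if len(mask) != len(target_probs):
        raise ValueError("mask and target_probs must have the same length")
    kept = [p for p, m in zip(target_probs, mask) if m]
    if not kept:
        raise ValueError("no supervised tokens")
    return -sum(math.log(p) for p in kept) / len(kept)


# Toy sequence of four tokens; the third is treated as an observation token.
probs = [0.5, 0.25, 0.9, 0.5]
full = sft_loss(probs)                        # all tokens supervised
masked = sft_loss(probs, mask=[1, 1, 0, 1])   # observation token excluded
```

Excluding the confidently predicted third token raises the mean loss here, since only the harder tokens remain in the average.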
This supervised fine-tuning protocol tightly integrates planning ("where next to look") and execution ("which tool to invoke") within a single LLM backbone (Du et al., 5 May 2026).
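The alternation of reasoning, tool calls, and observations can be sketched as a simple control loop. This is a minimal sketch under stated assumptions, not the released inference code: `run_react`, the `Policy` signature, the `"finish"` action name, and the scripted policy are all hypothetical stand-ins for the model and its tool interface.

```python
from typing import Callable, Dict, List, Tuple

# A policy maps the history so far to (thought, action_name, action_arg).
# The hypothetical action name "finish" carries the final answer in action_arg.
Policy = Callable[[List[str]], Tuple[str, str, str]]
Tool = Callable[[str], str]


def run_react(question: str, policy: Policy, tools: Dict[str, Tool],
              max_tool_calls: int) -> Tuple[str, List[str]]:
    """Alternate thought -> action -> observation until an answer or the budget."""
    history = [f"Question: {question}"]
    calls = 0
    while True:
        thought, name, arg = policy(history)
        history.append(f"Thought: {thought}")
        if name == "finish":
            history.append(f"Answer: {arg}")
            return arg, history
        if calls >= max_tool_calls:
            history.append("Answer: [tool budget exhausted]")
            return "", history
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        calls += 1
        history.append(f"Action: {name}({arg})")
        history.append(f"Observation: {observation}")


def scripted_policy(history: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: search once, then answer."""
    seen = sum(line.startswith("Observation:") for line in history)
    if seen == 0:
        return "I should search first.", "search", "capital of France"
    return "The observation answers the question.", "finish", "Paris"


toy_tools = {"search": lambda query: "Paris is the capital of France."}
answer, trace = run_react("What is the capital of France?",
                          scripted_policy, toy_tools, max_tool_calls=5)
```

In the real system a single decoder would play the role of `scripted_policy`, emitting the thought and the tool call as one continuation of the context.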
2. Data Synthesis Pipeline and Modifications
OpenSeeker-v2's core advance lies in the generation of challenging synthetic multi-turn search trajectories, produced via three distinct interventions:
2.1. Knowledge-Graph Scaling
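Given a global web graph $G = (V, E)$, each synthetic scenario is generated by expanding a local subgraph from a seed node $v_0$, previously using a small radius $r_{\mathrm{v1}}$. In v2, the expansion budget increases to $r_{\mathrm{v2}} > r_{\mathrm{v1}}$, expanding the subgraph's node and edge count by a scaling factor $\alpha > 1$, resulting in:
$$|V_{\mathrm{v2}}| \approx \alpha\,|V_{\mathrm{v1}}|$$
$$|E_{\mathrm{v2}}| \approx \alpha\,|E_{\mathrm{v1}}|$$
The synthetic question $q$ is drawn from the distribution $P(q \mid G_{\mathrm{v2}})$ conditioned on the enlarged subgraph, ensuring that questions require deeper, multi-hop reasoning (i.e., evidence aggregation).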
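To make the effect of a larger expansion budget concrete, the sketch below grows a local subgraph around a seed node and compares two radii. Breadth-first expansion, the function name `expand_subgraph`, and the toy graph are assumptions for illustration; the source does not specify the expansion procedure.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]  # adjacency list: page -> outgoing links


def expand_subgraph(graph: Graph, seed: str,
                    radius: int) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Return nodes within `radius` hops of `seed` and the edges among them."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == radius:
            continue  # do not expand beyond the budget
        for neighbor in graph.get(node, []):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    nodes = set(depth)
    edges = {(u, v) for u in nodes for v in graph.get(u, []) if v in nodes}
    return nodes, edges


# Hypothetical toy web graph.
web = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"],
       "D": ["F"], "E": [], "F": []}

small_nodes, small_edges = expand_subgraph(web, "A", radius=1)
large_nodes, large_edges = expand_subgraph(web, "A", radius=2)
```

With the larger radius the subgraph contains more entities and more links between them, so a question built from it can chain more hops of evidence.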
2.2. Expanded Tool-Set
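The tool set is augmented from the original $\mathcal{T}_{\mathrm{v1}}$ by adding $k$ new complementary primitives $t_1, \ldots, t_k$ (e.g., "scroll," "extract_table," "summarize_page"), yielding:
$$\mathcal{T}_{\mathrm{v2}} = \mathcal{T}_{\mathrm{v1}} \cup \{t_1, \ldots, t_k\}$$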
This forces the agent to learn richer, more diverse action sequences through exposure to a broader range of tool interactions.
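A tool-set expansion of this kind amounts to taking the union of two registries. In the sketch below, the base tool names `search` and `visit` are assumptions (the source does not name the original tools), the three added names come from the examples above, and every tool body is a placeholder stub.

```python
from typing import Callable, Dict

Tool = Callable[[str], str]

# Assumed original tools; the names are illustrative only.
base_tools: Dict[str, Tool] = {
    "search": lambda query: f"[results for {query}]",
    "visit": lambda url: f"[content of {url}]",
}

# Added primitives, named after the examples in the text; bodies are stubs.
extra_tools: Dict[str, Tool] = {
    "scroll": lambda url: f"[next screen of {url}]",
    "extract_table": lambda url: f"[tables found on {url}]",
    "summarize_page": lambda url: f"[summary of {url}]",
}

# The expanded tool set is the union of both registries.
expanded_tools: Dict[str, Tool] = {**base_tools, **extra_tools}


def dispatch(tools: Dict[str, Tool], name: str, arg: str) -> str:
    """Route a tool call by name, rejecting names outside the registry."""
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    return tools[name](arg)


observation = dispatch(expanded_tools, "extract_table", "https://example.org")
```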
2.3. Strict Low-Step Filtering
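To enforce higher task difficulty, trajectories $\tau$ with tool-call length $\ell(\tau)$ below a threshold $\ell_{\min}$ are discarded:
$$\mathcal{D}_{\mathrm{filtered}} = \{(q, \tau) \in \mathcal{D} : \ell(\tau) \ge \ell_{\min}\}$$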
This filtering excludes straightforward one- or two-step searches, elevating the minimum reasoning horizon and preventing overfitting to shallow tasks.
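The filter itself is a one-line predicate over trajectory length. The `Trajectory` record, the function name `filter_low_step`, and the threshold value used in the example are hypothetical; the source does not report the actual threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    question: str
    tool_calls: List[str]  # one entry per tool invocation, in order


def filter_low_step(trajectories: List[Trajectory],
                    min_tool_calls: int) -> List[Trajectory]:
    """Keep only trajectories with at least `min_tool_calls` tool calls."""
    return [t for t in trajectories if len(t.tool_calls) >= min_tool_calls]


candidates = [
    Trajectory("easy lookup", ["search"]),
    Trajectory("two-step lookup", ["search", "visit"]),
    Trajectory("multi-hop question",
               ["search", "visit", "extract_table", "search", "visit"]),
]

# Illustrative threshold only.
kept = filter_low_step(candidates, min_tool_calls=3)
```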
Collectively, these modifications yield a compact dataset of 10.6k synthesized question–trajectory pairs, each requiring sustained, multi-step reasoning (Du et al., 5 May 2026).
3. Training Protocol
OpenSeeker-v2 is trained exclusively via SFT on the aforementioned dataset. No continual pre-training (CPT), reinforcement learning (RL), or data augmentation beyond the default Qwen3 SFT settings is used. Specific hyperparameters—such as batch size, learning rate schedule, and the number of epochs—are not detailed and inherit their values from the Qwen3 SFT regime.
This minimalist training approach demonstrates that with high-difficulty, multi-hop trajectories, SFT alone can drive specialized LLM agents to SOTA capabilities (Du et al., 5 May 2026).
4. Benchmarks and Evaluation
OpenSeeker-v2 is evaluated on four deep-research agentic benchmarks, each employing exact-match accuracy as the primary metric:
| Benchmark | Language | Description | Metric |
|---|---|---|---|
| BrowseComp | English | Multi-hop, open-ended web browsing | Exact-match accuracy |
| BrowseComp-ZH | Chinese | As above, in Chinese | Exact-match accuracy |
| Humanity’s Last Exam | English | Long-horizon, multi-stage “exam” queries | Exact-match (whole question) |
| xbench-DeepSearch | Mixed | Heterogeneous set: table extraction, synthesis | Macro-averaged exact-match |
Accuracy is calculated as the percentage of total test queries for which the model produces an exactly correct answer (Du et al., 5 May 2026).
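A minimal sketch of this metric follows, together with the macro-average used when scores are first computed per category. The normalization step (lowercasing and whitespace collapsing) and the example data are assumptions; each benchmark defines its own answer-matching rules.

```python
from typing import Dict, Sequence


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions: Sequence[str],
                         references: Sequence[str]) -> float:
    """Percentage of queries whose normalized prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("empty evaluation set")
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


def macro_average(per_category: Dict[str, float]) -> float:
    """Unweighted mean of per-category accuracies."""
    return sum(per_category.values()) / len(per_category)


preds = ["Paris", " paris ", "Lyon", "Rome"]
golds = ["Paris", "Paris", "Paris", "Rome"]
overall = exact_match_accuracy(preds, golds)

# Hypothetical per-category scores.
macro = macro_average({"table_extraction": 50.0, "synthesis": 100.0})
```

Macro-averaging weights every category equally regardless of how many queries it contains, which differs from pooling all queries into one accuracy figure.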
5. Experimental Results and Comparisons
The following summarizes OpenSeeker-v2’s empirical results versus the previous SOTA Tongyi DeepResearch system (both are 30B ReAct agents):
| Model (Training Strategy) | BrowseComp | BrowseComp-ZH | HLE | xbench |
|---|---|---|---|---|
| Tongyi DeepResearch (CPT+SFT+RL) | 43.4 | 46.7 | 32.9 | 75.0 |
| OpenSeeker-v2 (10.6k SFT) | 46.0 | 58.1 | 34.6 | 78.0 |
OpenSeeker-v2 outperforms Tongyi DeepResearch by +2.6 points (BrowseComp), +11.4 (BrowseComp-ZH), +1.7 (HLE), and +3.0 (xbench) despite using only SFT and a significantly smaller training dataset, with no CPT or RL (Du et al., 5 May 2026). The report presents no formal statistical tests or confidence intervals, and it provides no isolated ablations for the individual data modifications. Relative to OpenSeeker-v1, which used 11.7k samples, v2 gains +16.5, +9.7, and +4.0 points on BrowseComp, BrowseComp-ZH, and xbench, respectively.
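The reported margins can be checked directly against the table. The snippet below recomputes the per-benchmark differences and, as a derived quantity not stated in the source, the OpenSeeker-v1 scores implied by the reported gains.

```python
# Scores copied from the results table.
tongyi = {"BrowseComp": 43.4, "BrowseComp-ZH": 46.7, "HLE": 32.9, "xbench": 75.0}
openseeker_v2 = {"BrowseComp": 46.0, "BrowseComp-ZH": 58.1, "HLE": 34.6, "xbench": 78.0}

# Margin of OpenSeeker-v2 over Tongyi DeepResearch, rounded to one decimal.
margins = {name: round(openseeker_v2[name] - tongyi[name], 1) for name in tongyi}

# Gains over OpenSeeker-v1 as quoted in the text (no HLE figure is given).
gains_over_v1 = {"BrowseComp": 16.5, "BrowseComp-ZH": 9.7, "xbench": 4.0}

# Derived, not reported: the v1 scores these gains would imply.
implied_v1 = {name: round(openseeker_v2[name] - gain, 1)
              for name, gain in gains_over_v1.items()}

for name, margin in margins.items():
    print(f"{name}: +{margin}")
```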
6. Theoretical and System Design Aspects
OpenSeeker-v2’s modular design can be contextualized within the framework of near-decomposability as articulated in the InfoSeeker blueprint (Lee et al., 3 Apr 2026). Complex agentic systems can be partitioned into semi-autonomous modules, allowing short-run independence among "Worker" entities executing atomic tool calls and long-run dependence via a "Host" coordinating high-level planning. The InfoSeeker-inspired architecture involves three tiers:
- Host: Maintains overall context, issues subqueries, and aggregates step outputs.
- Managers: Domain-specialist components responsible for decomposing tasks, reflecting on incomplete subtasks, and aggregating subtask results.
- Workers: Perform atomic tool-based operations in parallel, isolated from the Host and each other except via their subtask output.
Manager-level aggregation, reflection, and strict context isolation enforce error boundaries and prevent cascading failures. Parallel execution among Workers reduces overall latency by exploiting concurrency, and related frameworks report empirical speed-ups over baselines (Lee et al., 3 Apr 2026).
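The three-tier pattern can be sketched with a thread pool: Workers run concurrently and see only their own subtask, while the Manager contains each failure to its own slot before the Host aggregates. All function names and the string formats are hypothetical, and this is a toy illustration of the pattern rather than the InfoSeeker implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def worker(subtask: str) -> str:
    """Hypothetical atomic tool call; it receives only its own subtask."""
    if "bad" in subtask:
        raise RuntimeError("tool error")
    return f"result[{subtask}]"


def manager(subtasks: List[str], run: Callable[[str], str],
            max_workers: int = 4) -> List[str]:
    """Run workers concurrently; a failing worker affects only its own slot."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run, s) for s in subtasks]
        outputs = []
        for subtask, future in zip(subtasks, futures):
            try:
                outputs.append(future.result())
            except Exception as exc:  # error boundary around one worker
                outputs.append(f"failed[{subtask}]: {exc}")
    return outputs


def host(question: str, parts: List[str]) -> str:
    """Decompose the question, delegate to the manager, aggregate the outputs."""
    subtasks = [f"{question}:{part}" for part in parts]
    return " | ".join(manager(subtasks, worker))


report = host("q", ["a", "bad", "c"])
```

The failed subtask is reported in place while the other two results survive, which is the error-boundary behavior the isolation is meant to provide.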
7. Implications, Limitations, and Future Directions
OpenSeeker-v2 serves as evidence that SFT on highly informative, multi-hop synthetic trajectories is sufficient for training competitive frontier search agents at the 30B-parameter scale, without recourse to continual pre-training or reinforcement learning. The system’s performance gains indicate that careful design of data generation pipelines (specifically knowledge-graph scaling, tool-set enrichment, and stringent difficulty filtering) can substitute for expensive multi-stage industrial training protocols.
A plausible implication is the significant democratization of SOTA search-agent research. Academic and open-source teams with limited compute resources can now obtain comparable results by investing in data-centric rather than model-centric engineering. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent at this scale and paradigm developed by a purely academic team using SFT only (Du et al., 5 May 2026).
Limitations include the lack of formal statistical assessment in the reported results and the absence of modular ablation studies isolating the effect of each intervention. Full adoption of the near-decomposable architectural stack (Host/Manager/Worker), dynamic worker pool allocation, adaptive subtask decomposition, and cost-aware scheduling as proposed in the InfoSeeker blueprint constitute promising directions for both scalability and robustness (Lee et al., 3 Apr 2026).
OpenSeeker-v2’s open-sourced model weights and the documented data synthesis approach provide a foundation for future research in data-centric and modular agent design.