Fathom-Search-4B DeepSearch Agent
- Fathom-Search-4B is a Transformer-based deep search agent built on Qwen3-4B that integrates multi-turn reinforcement learning with verifiable rewards for enhanced web querying.
- It leverages a curated DuetQA dataset and specialized tokens to orchestrate in-context tool calls and maintain long-horizon search trajectories.
- The system sustains extended tool use through advanced RL techniques and outperforms prior open-source and several proprietary search agents on DeepSearch benchmarks.
Fathom-Search-4B is a 4-billion-parameter Transformer-based DeepSearch agent, developed as a key component of the Fathom-DeepResearch system for agentic information seeking. Built on the Qwen3-4B backbone, it performs evidence-based live web search and targeted page querying within a 40,960-token context window. Its policy is trained via multi-turn Reinforcement Learning with Verifiable Rewards (RLVR), combining novel RL stabilization techniques, a curated QA dataset that depends on web search, and a steerable step-level reward. The result is reliable long-horizon tool use (exceeding 20 tool calls when justified) and benchmark performance above that of prior open-source and several proprietary LLM-powered search agents (Singh et al., 28 Sep 2025).
1. Model Architecture and Tool Integration
Fathom-Search-4B uses the Qwen3-4B foundation model, which has 36 Transformer layers, a hidden size of 2,560, 32 attention heads, and a 40,960-token context length. The architecture retains the backbone's SwiGLU activations, rotary positional embeddings (with YaRN scaling applied only in the synthesizer module), and standard pre-norm attention. The only modifications introduced for RLVR are two special vocabulary tokens, <tool_call> and <tool_response>, and an expanded output head that accepts argument slots for two integrated web tools: search_urls (open-ended web search) and query_url (targeted retrieval from a specific page). These modifications enable the in-context tool orchestration characteristic of DeepSearch agents.
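As an illustration of in-context tool orchestration, the sketch below extracts tool calls from generated text and wraps a tool result for the next turn. The text above names only the two special tokens and the two tools; the closing tags, the JSON payload with `name` and `arguments` keys, and the helper names are assumptions made for this example.

```python
import json
import re
from dataclasses import dataclass

# Assumption: each call is a JSON object enclosed by <tool_call>...</tool_call>.
# The closing tag and the payload schema are illustrative, not from the source.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

# The two tools named in the text.
KNOWN_TOOLS = {"search_urls", "query_url"}


@dataclass
class ToolCall:
    name: str
    arguments: dict


def extract_tool_calls(generated_text: str) -> list[ToolCall]:
    """Return every well-formed call to a known tool, in order of appearance."""
    calls = []
    for match in TOOL_CALL_RE.finditer(generated_text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of failing the rollout
        if not isinstance(payload, dict):
            continue
        if payload.get("name") in KNOWN_TOOLS:
            calls.append(ToolCall(payload["name"], payload.get("arguments", {})))
    return calls


def wrap_tool_response(result: str) -> str:
    """Wrap a tool result so it can be appended to the context for the next turn."""
    return f"<tool_response>{result}</tool_response>"
```

Calls to unknown tools and malformed payloads are dropped silently here; a training pipeline might instead penalize them through a format score.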
2. DuetQA Dataset: Search Dependency and Grounding
To enforce genuine dependence on web search and heterogeneous information sourcing, Fathom-Search-4B is trained on the DuetQA dataset. DuetQA comprises 4,988 samples generated through multi-agent self-play: two search-enabled models (M₁: o3, M₂: o4-mini) act as generators and verifiers, while a non-search LLM (M₃: GPT-4o) supplies paraphrase obfuscation and negative filtering. For each question, thematic nodes are sampled from a taxonomy, and generation follows either a mixture-of-themes protocol (chained multi-hop fact retrieval) or a seeded-question protocol (paraphrased integration of post-2024 facts). An example is retained only if both search-enabled models converge on the answer via live querying while the non-search model fails, which ensures that the question cannot be answered from parametric knowledge alone. DuetQA questions routinely require multiple search calls (mean ≈6, max 32), and each is constructed so that it cannot be reduced to a surface-level or single-source (e.g., Wikipedia) query.
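The retention rule can be sketched as a simple filter. The `Candidate` fields and the normalized exact-match comparison are assumptions for illustration; the text does not specify how answer agreement is judged.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    reference_answer: str
    search_model_answers: tuple[str, str]  # answers from the two search-enabled models
    no_search_answer: str                  # answer from the non-search model


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace. Assumption: exact match after normalization."""
    return " ".join(answer.lower().split())


def keep_candidate(c: Candidate) -> bool:
    """Keep a sample only if both search models are right and the non-search model is wrong."""
    ref = normalize(c.reference_answer)
    both_search_correct = all(normalize(a) == ref for a in c.search_model_answers)
    no_search_fails = normalize(c.no_search_answer) != ref
    return both_search_correct and no_search_fails
```

A sample that the non-search model answers correctly is discarded even when both search models agree, since it does not demonstrate search dependence.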
3. Reinforcement Learning with RAPO and GRPO
The policy of Fathom-Search-4B is optimized with Group-Relative Proximal Policy Optimization (GRPO), a trajectory-centric PPO variant. The objective is a clipped PPO-style surrogate in which rewards are normalized within each group of rollouts for the same prompt, which controls gradient scale. The vanilla reward combines a format score (adherence to ReAct-style tagging) with an answer-correctness score (as judged by an LLM). RAPO (Reward-Aware Policy Optimization) introduces three modifications: (a) curriculum pruning discards prompts whose solve accuracy within an epoch passes a threshold, (b) reward-aware advantage scaling amplifies advantages from “good” groups, and (c) per-prompt replay buffers inject the last high-reward trajectory if all current rollouts fail, preserving gradient variance. These adaptations prevent policy collapse and stabilize RL in the low-variance regime typical of multi-turn tool use.
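The group normalization and two of the RAPO mechanisms can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the success threshold, the choice to overwrite the first rollout, and all function and class names are hypothetical. Reward-aware advantage scaling is omitted because its scaling rule is not given here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group (zero mean, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class ReplayBuffer:
    """Keeps the most recent high-reward trajectory per prompt."""

    def __init__(self, success_threshold=1.0):
        # Assumption: a trajectory counts as a success at or above this reward.
        self.success_threshold = success_threshold
        self._last_success = {}

    def update(self, prompt_id, trajectories, rewards):
        for traj, reward in zip(trajectories, rewards):
            if reward >= self.success_threshold:
                self._last_success[prompt_id] = (traj, reward)

    def patch_group(self, prompt_id, trajectories, rewards):
        """If every rollout failed and a stored success exists, swap it in for one rollout."""
        all_failed = all(r < self.success_threshold for r in rewards)
        if all_failed and prompt_id in self._last_success:
            traj, reward = self._last_success[prompt_id]
            return [traj] + list(trajectories[1:]), [reward] + list(rewards[1:])
        return list(trajectories), list(rewards)


def prune_curriculum(prompt_accuracy, threshold):
    """Return the prompt ids to keep: those solved with accuracy below the threshold."""
    return [pid for pid, acc in prompt_accuracy.items() if acc < threshold]
```

When all rewards in a group are equal, every advantage is zero and the group contributes no gradient; patching in a stored success restores a nonzero spread, which is the variance-preserving effect described above.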
4. Steerable Step-Level Reward and Trajectory Control
The steerable step-level reward aims to mitigate reward hacking, encouraging exploration without redundant tool use and enabling explicit control over the breadth, depth, and horizon of the search trajectory. Each search_urls call is labeled by an LLM judge as UniqueSearch or RedundantSearch; each query_url call is labeled Exploration, Verification, or RedundantQuery. These labels are aggregated into three quantities: a redundancy ratio, a novel-search delta, and a useful-page-query delta. The per-trajectory reward is piecewise, with one expression applied when the final answer is correct and another otherwise, and it is parameterized by a small set of coefficients, including a cap on the number of verifications per claim. Adjusting these coefficients (exposed as reward “knobs”) modulates the agent’s willingness to query deeply, broadly, or over long horizons. This approach extends the average trajectory from ≈12 calls before stalling (with the vanilla reward) to >20 calls (maximum 32) under the steerable regime.
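The label-aggregation step can be illustrated with a small tally over judged tool calls. The label names come from the text; the `redundant_fraction` statistic (redundant calls divided by all calls) is an assumed stand-in and is not claimed to match the redundancy ratio used in the actual reward, whose formula is not reproduced here.

```python
from collections import Counter

# Labels the LLM judge may assign to each tool, as listed in the text.
ALLOWED_LABELS = {
    "search_urls": {"UniqueSearch", "RedundantSearch"},
    "query_url": {"Exploration", "Verification", "RedundantQuery"},
}


def summarize_labels(steps):
    """Tally judge labels for one trajectory.

    `steps` is a list of (tool_name, label) pairs in call order.
    """
    counts = Counter()
    for tool, label in steps:
        if label not in ALLOWED_LABELS.get(tool, set()):
            raise ValueError(f"unexpected label {label!r} for tool {tool!r}")
        counts[label] += 1
    total = sum(counts.values())
    redundant = counts["RedundantSearch"] + counts["RedundantQuery"]
    return {
        "counts": dict(counts),
        "total_calls": total,
        # Assumption: a simple share of redundant calls, for illustration only.
        "redundant_fraction": redundant / total if total else 0.0,
    }
```

A reward function built on such a summary could penalize a high redundant share while still crediting Exploration and Verification calls, which is the behavior the knobs are described as controlling.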
5. Training Procedure and Infrastructure
Fathom-Search-4B is trained in two reinforcement learning stages. Stage 1 uses only DuetQA (4,988 examples, 10 epochs) with the vanilla reward and group-based PPO optimization; each batch comprises 32 prompts with five rollouts per prompt, and each rollout is capped at 32 steps with at most 8,192 tokens per step. Stage 2 introduces a mixed, adversarially filtered pool (DuetQA, Stage 1 math, MuSiQue; 5,077 examples, 2 epochs) and employs the steerable step-level reward. Both stages use the Adam optimizer and are implemented on the ReCall infrastructure, using external search and extraction services (Serper, Jina, Trafilatura, Crawl4AI) and a single node of 8× NVIDIA H100 GPUs.
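The stated batch sizes imply a rough training budget, worked through below. The figures for examples, epochs, prompts per batch, and rollouts per prompt come from the text; reusing the Stage 1 batch shape for Stage 2 and keeping the final partial batch are assumptions, and `StageConfig` is a hypothetical container.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    num_examples: int
    epochs: int
    reward: str
    prompts_per_batch: int = 32    # stated for Stage 1; assumed unchanged for Stage 2
    rollouts_per_prompt: int = 5   # stated for Stage 1; assumed unchanged for Stage 2

    @property
    def rollouts_per_batch(self) -> int:
        return self.prompts_per_batch * self.rollouts_per_prompt

    @property
    def batches_per_epoch(self) -> int:
        # Assumption: the last partial batch is kept rather than dropped.
        return math.ceil(self.num_examples / self.prompts_per_batch)

    @property
    def total_batches(self) -> int:
        return self.batches_per_epoch * self.epochs


STAGE_1 = StageConfig("stage1", num_examples=4988, epochs=10, reward="vanilla")
STAGE_2 = StageConfig("stage2", num_examples=5077, epochs=2, reward="steerable")

for stage in (STAGE_1, STAGE_2):
    print(stage.name, stage.rollouts_per_batch, stage.batches_per_epoch, stage.total_batches)
```

Under these assumptions each batch contains 160 rollouts, with Stage 1 running 156 batches per epoch and Stage 2 running 159.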
6. Performance on Benchmarks and Comparative Analysis
Fathom-Search-4B is evaluated with Pass@1 accuracy using a GPT-4.1-mini judge (temperature 1). On DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue), Stage 1 achieves [88.1, 57.2, 39.0, 19.8, 31.3; Avg 47.1], outperforming all open-weights baselines. Stage 2 achieves [90.0, 64.8, 50.0, 22.5, 33.2; Avg 52.1], exceeding the Qwen3-4B+search average (27.5) by more than 20 points and outperforming closed-source GPT-4o+search (46.5) on most tasks. On DeepResearch-Bench, the integrated Fathom-DeepResearch system yields a RACE-style “Overall” score of 45.47% and a citation accuracy of 56.1%, leading open-source systems and remaining competitive with proprietary counterparts.
| Model | DeepSearch Avg | DeepResearch-Bench RACE Overall | FACT Citation Accuracy / Effective Citations |
|---|---|---|---|
| Fathom-Search-4B (Stage 2; RACE and FACT from the integrated Fathom-DeepResearch system) | 52.1 | 45.47% | 56.1% / 38.3% |
| Qwen3-4B+search | 27.5 | — | — |
| GPT-4o+search (closed) | 46.5 | — | — |
| Kimi-Researcher | — | 44.64% | — |
This table summarizes key performance statistics as reported.
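The reported averages can be checked directly from the per-benchmark scores. The sketch below recomputes them and the gaps to the two baselines, and shows how Pass@1 reduces to a mean over binary judge verdicts; the function names are illustrative.

```python
BENCHMARKS = ("SimpleQA", "FRAMES", "WebWalker", "Seal0", "MuSiQue")

# Per-benchmark Pass@1 scores as reported in the text.
STAGE_SCORES = {
    "Stage 1": (88.1, 57.2, 39.0, 19.8, 31.3),
    "Stage 2": (90.0, 64.8, 50.0, 22.5, 33.2),
}

# Baseline averages as reported in the table.
BASELINE_AVG = {"Qwen3-4B+search": 27.5, "GPT-4o+search": 46.5}


def pass_at_1(judgments):
    """Pass@1 as a percentage: one attempt per question, each judged correct or not."""
    return 100.0 * sum(judgments) / len(judgments)


def macro_average(scores):
    """Unweighted mean across benchmarks, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


stage_avg = {stage: macro_average(scores) for stage, scores in STAGE_SCORES.items()}
gaps = {name: round(stage_avg["Stage 2"] - avg, 1) for name, avg in BASELINE_AVG.items()}

print(stage_avg)
print(gaps)
```

The recomputed averages match the reported 47.1 and 52.1, and the Stage 2 margin is 24.6 points over Qwen3-4B+search and 5.6 points over GPT-4o+search.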
7. Limitations and Prospective Directions
The primary stabilization mechanism—RAPO—relies on per-prompt replay buffers and curriculum pruning, which may anchor the agent to low-entropy success trajectories. This can limit adaptation to increasingly difficult prompts and results in policy saturation before reaching the maximum allowed trajectory length under vanilla rewards. The current synchronous two-stage RL pipeline is efficient for moderate-scale training but becomes brittle and suboptimal at larger scales. Identified future directions include asynchronous, prioritized trajectory sampling, fine-grained continuous reward models for improved tool-parameter diversity, meta-RL strategies for query-specific reward adaptation, and expansion of tool sets to include modalities beyond web search (e.g., code execution, database queries) (Singh et al., 28 Sep 2025).
A plausible implication is that the steerable step-level reward framework could generalize to other multi-tool LLM agents, allowing explicit, RL-based regulation of evidence trajectory construction, though further empirical validation on out-of-domain benchmarks remains to be demonstrated.