Fathom-Search-4B DeepSearch Agent

Updated 3 July 2026
  • Fathom-Search-4B is a Transformer-based deep search agent built on Qwen3-4B that integrates multi-turn reinforcement learning with verifiable rewards for enhanced web querying.
  • It leverages a curated DuetQA dataset and specialized tokens to orchestrate in-context tool calls and maintain long-horizon search trajectories.
  • With extended tool use and advanced RL techniques, the system outperforms prior open-source search agents and several proprietary ones on search benchmarks.

Fathom-Search-4B is a 4-billion-parameter Transformer-based DeepSearch agent developed as a key component of the Fathom-DeepResearch system, an agentic information-seeking architecture. Built on the Qwen3-4B backbone, it performs evidence-based live web search and targeted page querying within a 40,960-token context window. Its policy is trained via multi-turn Reinforcement Learning with Verifiable Rewards (RLVR), combining RL stabilization techniques, a curated web-search-dependent QA dataset, and a steerable step-level reward system. The result is reliable long-horizon tool use (exceeding 20 tool calls when justified) and benchmark performance that surpasses prior open-source and several proprietary LLM-powered search agents (Singh et al., 28 Sep 2025).

1. Model Architecture and Tool Integration

Fathom-Search-4B uses the Qwen3-4B foundation, which includes 32 Transformer layers, hidden size $d=4096$, 32 attention heads, and a context length $L=40\,960$. The architecture maintains the original Gaussian-Error Linear Units (GELU) activations, rotary positional embeddings (with YaRN scaling applied in the synthesizer module only), and standard pre-norm attention mechanisms. The sole architectural modifications introduced during RLVR are the addition of two special vocabulary tokens, <tool_call> and <tool_response>, and the expansion of the output head to accept argument slots for two integrated web tools: search_urls (open-ended web search) and query_url (targeted retrieval from a specific page). These modifications facilitate the in-context tool orchestration characteristic of DeepSearch agents.
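The in-context tool orchestration described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each call is emitted as a JSON object between <tool_call> and </tool_call> and that results are returned between <tool_response> tags; the payload schema, the `goal` argument, and the stub tool bodies are hypothetical and are not taken from the source.

```python
import json
import re
from typing import Callable, Dict, List

# Assumption: a call is a JSON object wrapped in <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def search_urls(query: str) -> str:
    """Stub for open-ended web search (no network access in this sketch)."""
    return f"[stub results for: {query}]"


def query_url(url: str, goal: str) -> str:
    """Stub for targeted retrieval from one page; `goal` is a hypothetical argument."""
    return f"[stub answer about '{goal}' from {url}]"


TOOLS: Dict[str, Callable[..., str]] = {
    "search_urls": search_urls,
    "query_url": query_url,
}


def run_tool_calls(model_output: str) -> List[str]:
    """Execute every tool call found in the text and wrap each result in tags."""
    responses = []
    for payload in TOOL_CALL_RE.findall(model_output):
        call = json.loads(payload)
        result = TOOLS[call["name"]](**call["arguments"])
        responses.append(f"<tool_response>{result}</tool_response>")
    return responses


if __name__ == "__main__":
    text = (
        "I should look this up. "
        '<tool_call>{"name": "search_urls", "arguments": {"query": "Qwen3-4B"}}</tool_call>'
    )
    print(run_tool_calls(text))
```

In a multi-turn loop, each returned <tool_response> string would be appended to the context before the model generates its next step.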

2. DuetQA Dataset: Search Dependency and Grounding

To enforce genuine web search dependence and heterogeneous information sourcing, Fathom-Search-4B is trained using the DuetQA dataset. DuetQA comprises 4,988 samples generated through multi-agent self-play: two search-enabled models (M₁: o3, M₂: o4-mini) act as generators and verifiers, while a non-search LLM (M₃: GPT-4o) supplies paraphrase obfuscation and negative filtering. For each question, $k \in \{5, 6, 7\}$ thematic nodes are sampled from a taxonomy of more than 200 nodes, proceeding via a mixture-of-themes (chained multi-hop fact retrieval) or seeded-question (paraphrased post-2024 fact integration) protocol. Examples are retained if both search-enabled models converge on the answer via live querying while the non-search model fails, enforcing that $\mathbb{P}(a \mid q, M_\text{no-search}) \ll \mathbb{P}(a \mid q, M_\text{search})$. DuetQA questions routinely require multiple (mean ≈6, max 32) search calls, and each is constructed to be irreducible to surface-level or single-source (e.g., Wikipedia) queries.
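The retention rule can be sketched as a simple filter. The following is a minimal illustration, not the authors' pipeline: the three answerers are passed in as plain callables, and exact string match after normalization stands in for whatever answer-matching procedure was actually used (both are assumptions).

```python
from typing import Callable

Answerer = Callable[[str], str]  # maps a question to a predicted answer


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule)."""
    return " ".join(answer.lower().split())


def keep_example(
    question: str,
    gold: str,
    search_a: Answerer,
    search_b: Answerer,
    no_search: Answerer,
) -> bool:
    """Keep the example only if both search models succeed and the non-search model fails."""
    target = normalize(gold)
    both_search_correct = (
        normalize(search_a(question)) == target
        and normalize(search_b(question)) == target
    )
    no_search_fails = normalize(no_search(question)) != target
    return both_search_correct and no_search_fails


if __name__ == "__main__":
    with_search = lambda q: "Paris"
    without_search = lambda q: "unknown"
    print(keep_example("capital?", "paris", with_search, with_search, without_search))
```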

3. Reinforcement Learning with RAPO and GRPO

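The policy $\pi_\theta$ for Fathom-Search-4B is optimized via Group Relative Policy Optimization (GRPO), a trajectory-centric PPO variant. For a group of $G$ rollouts with lengths $T_i$, token-level probability ratios $r_{i,t}$, and advantage estimates $\hat{A}_{i,t}$, the loss is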

$$L_{\text{GRPO}} = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{T_i} \sum_{t=1}^{T_i} \min \left[ r_{i,t}\,\hat{A}_{i,t},\ \text{clip}(r_{i,t}, 1-\epsilon, 1+\epsilon)\,\hat{A}_{i,t} \right]$$

with reward normalization within groups to control gradient scale. The vanilla reward comprises a format score ($R^{\text{format}}_i \in \{0, 1\}$, ReAct tagging) and answer correctness ($R^{\text{answer}}_i \in \{0, 1\}$, as judged by an LLM). RAPO (Reward-Aware Policy Optimization) introduces three modifications: (a) curriculum pruning discards prompts solved with more than 90% accuracy in an epoch, (b) reward-aware advantage scaling amplifies advantages from “good” groups, and (c) per-prompt replay buffers inject the last high-reward trajectory if all current rollouts fail, preserving gradient variance. These adaptations prevent policy collapse and stabilize RL in the low-variance regime typical of multi-turn tool use.
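To make the low-variance problem concrete, here is a minimal pure-Python sketch of group-normalized advantages, the clipped per-token term, and two of the RAPO ideas (pruning and replay injection). Several details are assumptions rather than the source's specification: advantages are normalized by the group mean and standard deviation, pruning uses the mean reward of the current group as a proxy for per-epoch accuracy, and the replayed trajectory replaces the last rollout instead of being appended. Reward-aware advantage scaling is omitted because the source does not give its form.

```python
import math
from typing import List, Optional


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of rollouts."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """One token's contribution: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


def should_prune(rewards: List[float], threshold: float = 0.9) -> bool:
    """Curriculum pruning proxy: drop prompts that are already solved too often."""
    return sum(rewards) / len(rewards) > threshold


def inject_replay(rewards: List[float], replay_reward: Optional[float]) -> List[float]:
    """If every rollout failed and a past success is buffered, swap it in."""
    if replay_reward is not None and max(rewards) == 0.0:
        return rewards[:-1] + [replay_reward]
    return rewards


if __name__ == "__main__":
    failed_group = [0.0, 0.0, 0.0, 0.0, 0.0]
    print(group_advantages(failed_group))                      # all zeros: no learning signal
    print(group_advantages(inject_replay(failed_group, 1.0)))  # signal restored
```

When all rollouts in a group receive the same reward, every normalized advantage is zero and the group contributes no gradient, which is the situation the replay buffer is described as addressing.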

4. Steerable Step-Level Reward and Trajectory Control

The steerable step-level reward aims to mitigate reward hacking, encouraging exploration without redundant tool use and enabling explicit control over search trajectory breadth, depth, and horizon. Each search_urls call is labeled by an LLM judge as UniqueSearch or RedundantSearch; query_url calls are labeled Exploration, Verification, or RedundantQuery. Reward aggregates include a redundancy ratio, a novel-search delta, and a useful-page-query delta. The per-trajectory reward is defined piecewise:

  • If correct: [expression lost in extraction]
  • Else: [expression lost in extraction]

The reward coefficients [values lost in extraction] include a cap on the maximum number of verifications per claim. Adjusting these coefficients (exposed as reward “knobs”) modulates the agent’s willingness to query deeply, broadly, or over long horizons. This approach extends the average trajectory from ≈12 calls before stalling (with vanilla reward) to >20 calls (maximum 32) under the steerable regime.
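As an illustration of how the judge's step labels might be aggregated, the sketch below counts labels per trajectory and computes a redundancy ratio. The definition used here (redundant calls divided by total calls) is an assumption made for the example; the reward expression and coefficients themselves are not reproduced.

```python
from collections import Counter
from typing import Dict, List

SEARCH_LABELS = {"UniqueSearch", "RedundantSearch"}
QUERY_LABELS = {"Exploration", "Verification", "RedundantQuery"}


def summarize_labels(labels: List[str]) -> Dict[str, float]:
    """Aggregate per-step judge labels into simple trajectory statistics."""
    unknown = set(labels) - SEARCH_LABELS - QUERY_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    redundant = counts["RedundantSearch"] + counts["RedundantQuery"]
    return {
        "total_calls": total,
        "unique_searches": counts["UniqueSearch"],
        "explorations": counts["Exploration"],
        "verifications": counts["Verification"],
        "redundant_calls": redundant,
        # Assumed definition: share of tool calls judged redundant.
        "redundancy_ratio": redundant / total if total else 0.0,
    }


if __name__ == "__main__":
    trajectory = ["UniqueSearch", "Exploration", "RedundantSearch", "Verification"]
    print(summarize_labels(trajectory))
```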

5. Training Procedure and Infrastructure

Fathom-Search-4B is trained in two reinforcement learning stages. Stage 1 uses only DuetQA (4,988 examples, 10 epochs), with vanilla reward and group-based PPO optimization; each batch comprises 32 prompts with five rollouts per prompt, with a cap of 32 steps per rollout and a maximum of 8,192 tokens per rollout step. Stage 2 introduces a mixed, adversarially filtered pool (DuetQA, Stage 1 math, MuSiQue; 5,077 examples, 2 epochs), employing the steerable step-level reward. Both stages use the Adam optimizer (learning rate [value lost in extraction]) and are implemented on ReCall infrastructure using external search and extraction tools (Serper, Jina, Trafilatura, Crawl4AI) and a single node of 8× NVIDIA H100 GPUs.
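The Stage 1 figures imply a rough rollout budget, worked out in the short sketch below. It assumes every prompt is visited once per epoch with no curriculum pruning and that the final partial batch is kept, so the totals are an upper-bound illustration, not reported numbers.

```python
import math

# Values quoted in the text for Stage 1.
STAGE1_EXAMPLES = 4988
STAGE1_EPOCHS = 10
PROMPTS_PER_BATCH = 32
ROLLOUTS_PER_PROMPT = 5


def batches_per_epoch(num_examples: int, prompts_per_batch: int) -> int:
    """Number of batches needed to cover the dataset once (last batch may be partial)."""
    return math.ceil(num_examples / prompts_per_batch)


rollouts_per_batch = PROMPTS_PER_BATCH * ROLLOUTS_PER_PROMPT
stage1_batches = batches_per_epoch(STAGE1_EXAMPLES, PROMPTS_PER_BATCH)
total_rollouts = STAGE1_EXAMPLES * ROLLOUTS_PER_PROMPT * STAGE1_EPOCHS

if __name__ == "__main__":
    print(f"rollouts per batch: {rollouts_per_batch}")
    print(f"batches per epoch:  {stage1_batches}")
    print(f"total rollouts:     {total_rollouts}")
```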

6. Performance on Benchmarks and Comparative Analysis

Fathom-Search-4B is evaluated with Pass@1 accuracy using a GPT-4.1-mini judge (temperature [value lost in extraction]). On DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue), Stage 1 achieves [88.1, 57.2, 39.0, 19.8, 31.3; Avg 47.1], outperforming all open-weights baselines. Stage 2 achieves [90.0, 64.8, 50.0, 22.5, 33.2; Avg 52.1], exceeding Qwen3-4B+search (Avg 27.5) by more than 20 points on average and outperforming closed-source GPT-4o+search (Avg 46.5) on most tasks. On DeepResearch-Bench, the integrated Fathom-DeepResearch system yields a RACE-style “Overall” score of 45.47% and citation accuracy of 56.1%, leading open-source systems and remaining competitive with proprietary counterparts.

Model/Stage | DeepSearch Avg (Stage 2) | DeepResearch-Bench RACE Overall | FACT-C Acc / E Cit
Fathom-Search-4B | 52.1 | 45.47% | 56.1% / 38.3%
Qwen3-4B+search | 27.5 | - | -
GPT-4o+search (closed) | 46.5 | - | -
Kimi-Researcher | - | 44.64% | -

This table summarizes key performance statistics as reported.
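The averages quoted in this section can be checked directly from the per-benchmark scores. The snippet below recomputes them and the per-benchmark gain from Stage 1 to Stage 2; the dictionary layout and function names are illustrative only.

```python
from typing import Dict

STAGE1: Dict[str, float] = {
    "SimpleQA": 88.1, "FRAMES": 57.2, "WebWalker": 39.0, "Seal0": 19.8, "MuSiQue": 31.3,
}
STAGE2: Dict[str, float] = {
    "SimpleQA": 90.0, "FRAMES": 64.8, "WebWalker": 50.0, "Seal0": 22.5, "MuSiQue": 33.2,
}


def mean_score(scores: Dict[str, float]) -> float:
    """Unweighted mean over benchmarks, rounded to one decimal place."""
    return round(sum(scores.values()) / len(scores), 1)


def gains(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-benchmark improvement in percentage points."""
    return {name: round(after[name] - before[name], 1) for name in before}


if __name__ == "__main__":
    print("Stage 1 avg:", mean_score(STAGE1))
    print("Stage 2 avg:", mean_score(STAGE2))
    print("Gains:", gains(STAGE1, STAGE2))
```

The largest Stage 2 gains fall on WebWalker and FRAMES, while SimpleQA and MuSiQue move by under two points.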

7. Limitations and Prospective Directions

The primary stabilization mechanism, RAPO, relies on per-prompt replay buffers and curriculum pruning, which may anchor the agent to low-entropy success trajectories. This can limit adaptation to increasingly difficult prompts and, under vanilla rewards, cause the policy to saturate before reaching the maximum allowed trajectory length. The current synchronous two-stage RL pipeline is efficient for moderate-scale training but becomes brittle and suboptimal at larger scales. Identified future directions include asynchronous, prioritized trajectory sampling; fine-grained continuous reward models for improved tool-parameter diversity; meta-RL strategies for query-specific reward adaptation; and expansion of the tool set to modalities beyond web search (e.g., code execution, database queries) (Singh et al., 28 Sep 2025).

A plausible implication is that the steerable step-level reward framework could generalize to other multi-tool LLM agents, allowing explicit, RL-based regulation of how evidence trajectories are constructed, though this has yet to be validated empirically on out-of-domain benchmarks.
