
SearchGym: Simulation for Fact-Based Search Agents

Updated 22 January 2026
  • SearchGym is a simulation environment defined by a verified knowledge graph and synthetic document corpus that enables reproducible, fact-grounded reasoning.
  • It models search tasks as a deterministic MDP with clearly defined state, action, and reward functions to benchmark agent performance.
  • Its curriculum learning and sim-to-real generalization methods significantly improve accuracy and sample efficiency on complex, multi-hop reasoning tasks.

SearchGym is a high-fidelity simulation environment purpose-built to enable robust, reproducible training of search agents for open-ended, knowledge-intensive reasoning tasks. Motivated by the systemic instability in reinforcement learning (RL) stemming from cost-prohibitive commercial API interactions and data misalignment in static web snapshots, SearchGym introduces a fully generative pipeline for constructing factually grounded, strictly solvable reasoning benchmarks. Complemented by a staged curriculum learning framework and demonstrated sim-to-real generalization, SearchGym establishes a new methodological paradigm for training and evaluating real-world search agents at scale (Zhang et al., 21 Jan 2026).

1. System Architecture and Generative Pipeline

The architectural foundation of SearchGym is a programmatically generated “synthetic world” \mathcal{W} = \langle \mathcal{G}, \mathcal{D} \rangle, consisting of a verified knowledge graph \mathcal{G} and an aligned document corpus \mathcal{D}. The knowledge graph \mathcal{G} = (\mathcal{V}, \mathcal{E}) is constructed atop a schema \mathcal{S} that specifies entity types (e.g., Person, City, Country) and relation signatures with explicit cardinalities (1–1, 1–n, n–1). Approximately 3,600 entities are instantiated by sampling from attribute distributions (e.g., uniform birth year in [1900, 2000], Zipf-style degree distributions for relational edges).

Edges e = (u \to v) are selected probabilistically: a relation type r is drawn from P(r), and candidate targets v are sampled uniformly among compatible nodes, with structural constraints (acyclicity, cardinality) enforced.
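The edge-sampling step above can be sketched as follows; the schema, relation names, and rejection-sampling loop are illustrative assumptions, not the paper's implementation:

```python
import random

# Illustrative sketch of SearchGym-style edge sampling (schema and names assumed).
# A relation type r is drawn from P(r); a compatible target is sampled uniformly;
# edges violating cardinality or acyclicity are rejected.

SCHEMA = {
    # relation: (source type, target type, cardinality)
    "born_in": ("Person", "City", "n-1"),
    "capital_of": ("City", "Country", "1-1"),
}

def would_create_cycle(edges, u, v):
    """DFS from v: if u is reachable, adding u -> v would close a cycle."""
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(t for (s, _, t) in edges if s == node)
    return False

def sample_edge(nodes, edges, relation_probs, rng):
    """Draw r ~ P(r), then a uniform compatible target; enforce constraints."""
    r = rng.choices(list(relation_probs), weights=list(relation_probs.values()))[0]
    src_type, tgt_type, card = SCHEMA[r]
    sources = [n for n, t in nodes.items() if t == src_type]
    targets = [n for n, t in nodes.items() if t == tgt_type]
    if not sources or not targets:
        return None
    u, v = rng.choice(sources), rng.choice(targets)
    if card in ("1-1", "n-1") and any(s == u and rel == r for (s, rel, _) in edges):
        return None  # source already has its single outgoing r-edge
    if would_create_cycle(edges, u, v):
        return None  # would break acyclicity
    return (u, r, v)
```

Rejected draws are simply retried, so the accepted edge set always satisfies the schema's cardinality and acyclicity constraints by construction.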

Each node v \in \mathcal{V} is paired with a synthetic, Wikipedia-style document d_v = M_{\mathrm{gen}}(v, \mathcal{N}_v), where M_{\mathrm{gen}} is a frozen LLM invoked on a prompt containing v's facts and its 1-hop neighborhood \mathcal{N}_v. Each d_v is assigned a unique URL to enable atomic Search and Access actions.
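A minimal sketch of how the document-generation prompt and URL assignment might look; the prompt wording, the `make_url` scheme, and the data shapes are assumptions (the actual M_gen prompt is not given in the source):

```python
# Hypothetical prompt builder for the frozen generator M_gen: the prompt carries
# the node's facts plus its 1-hop neighborhood, and instructs the model to stay
# grounded in those facts. Prompt text and data shapes are assumptions.

def build_doc_prompt(entity, facts, neighbors):
    """Assemble the generation prompt from an entity's facts and 1-hop neighbors."""
    lines = [
        f"Write a Wikipedia-style article about {entity}.",
        "Ground every sentence in the facts below; do not invent new facts.",
        "Facts:",
    ]
    lines += [f"- {entity} {rel} {obj}" for rel, obj in facts]
    lines.append("1-hop neighborhood:")
    lines += [f"- {name}: {desc}" for name, desc in neighbors.items()]
    return "\n".join(lines)

def make_url(entity):
    """Assign a unique, stable URL per document (URL scheme is an assumption)."""
    return "https://searchgym.local/wiki/" + entity.replace(" ", "_")
```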

Factual alignment is enforced by retrieval-based verification: for every edge e, a set \mathcal{Q}_e of 15 automatically generated queries is submitted to a retrieval engine \mathcal{R} over \mathcal{D}. The alignment score

\operatorname{align}(e) = \left| \left\{ q \in \mathcal{Q}_e : d_v \in \mathrm{Top}\text{-}K\left( \mathcal{R}(q) \right) \right\} \right|

filters edges into a verified subgraph \mathcal{G}^* via thresholding (\operatorname{align}(e) \geq 5), ensuring that all reasoning paths used for downstream RL are fully discoverable and free of retrieval artifacts. This generative pipeline decouples environment stochasticity from spurious document-alignment errors, producing a verifiable world model (Zhang et al., 21 Jan 2026).
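The alignment filter can be sketched as follows, with hypothetical `queries_for`, `retrieve`, and `doc_of` interfaces; only the top-K membership count and the align(e) ≥ 5 threshold come from the source:

```python
# Sketch of the retrieval-based verification pass (interfaces assumed):
# an edge survives iff enough of its queries surface the target document.

def verified_subgraph(edges, queries_for, retrieve, doc_of, k=5, threshold=5):
    """Keep edge e iff at least `threshold` of its generated queries retrieve
    the target's document within the top-k results, i.e. align(e) >= threshold."""
    kept = []
    for e in edges:
        target_doc = doc_of(e)
        align = sum(1 for q in queries_for(e) if target_doc in retrieve(q)[:k])
        if align >= threshold:
            kept.append(e)
    return kept
```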

2. Environment Design: State, Actions, and Transitions

SearchGym environments are cast as Markov decision processes (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma).

  • State (s_t): At turn t, the state comprises the question Q, the set of visited URLs, a buffer of retrieved snippets, and the agent's internal step counter.
  • Action space (\mathcal{A}): The atomic actions are
    • \texttt{Search}(q): issues a free-form query q, returning top-5 snippet summaries with URLs.
    • \texttt{Access}(u): fetches the complete document d_u at URL u.
    • \texttt{Answer}(a): terminates the episode with answer a.

State transitions are deterministic given (\mathcal{D}, \mathcal{R}): \mathcal{T}(s_t, \texttt{Search}(q)) extends the snippet buffer, \mathcal{T}(s_t, \texttt{Access}(u)) augments the set of seen documents, and \texttt{Answer}(a) immediately ends the episode.
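A minimal gym-style sketch of this MDP, with assumed interfaces (the official environment API is not shown in the source); for brevity the terminal reward here is exact match rather than the paper's token-level F1:

```python
class SearchGymEnv:
    """Deterministic search MDP sketch (interface assumed, not the official API).

    State: (question, visited URLs, snippet buffer); actions are tuples
    ("search", q), ("access", u), ("answer", a)."""

    def __init__(self, question, answer, corpus, retrieve, max_steps=20):
        self.question, self.answer = question, answer
        self.corpus, self.retrieve = corpus, retrieve
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.visited, self.snippets, self.t, self.done = set(), [], 0, False
        return (self.question, frozenset(), tuple())

    def step(self, action):
        kind, arg = action
        self.t += 1
        reward = 0.0
        if kind == "search":
            # deterministic: top-5 (snippet, url) results appended to the buffer
            self.snippets.extend(self.retrieve(arg)[:5])
        elif kind == "access":
            # fetch the complete document at URL `arg`
            self.visited.add(arg)
            self.snippets.append((self.corpus.get(arg, ""), arg))
        elif kind == "answer":
            # terminal; the paper uses token-level F1, exact match here for brevity
            reward = float(arg.strip().lower() == self.answer.strip().lower())
            self.done = True
        if self.t >= self.max_steps:
            self.done = True
        state = (self.question, frozenset(self.visited), tuple(self.snippets))
        return state, reward, self.done
```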

  • Reward (R(\mathcal{T})): A sparse, terminal-only reward is implemented as the normalized token-level F1 between the agent's answer \hat{A} and the ground truth A:

P = \frac{|\mathrm{tokens}(\hat{A}) \cap \mathrm{tokens}(A)|}{|\mathrm{tokens}(\hat{A})|}, \quad \mathrm{Rec} = \frac{|\mathrm{tokens}(\hat{A}) \cap \mathrm{tokens}(A)|}{|\mathrm{tokens}(A)|}, \quad R = F_1 = \frac{2\,P\,\mathrm{Rec}}{P + \mathrm{Rec}}

No intermediate shaping rewards are used, focusing credit assignment on successful reasoning chains.
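The terminal reward can be sketched as a token-level F1 with multiset overlap (whitespace tokenization is an assumption):

```python
from collections import Counter

def token_f1(pred, gold):
    """Token-level F1 between predicted and gold answers (sparse terminal reward).
    Overlap is counted as a multiset intersection of lowercased tokens."""
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    if not p_toks or not g_toks:
        return 0.0
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)
```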

By construction, every question–answer pair is sampled along a path \mathcal{P} \subset \mathcal{G}^*, enforcing temporal consistency, linguistic clarity, and factual completeness. Every task is strictly solvable within the synthetic corpus (Zhang et al., 21 Jan 2026).

3. SearchGym-RL: Curriculum and Policy Optimization

Policy optimization in SearchGym is conducted via Group Relative Policy Optimization (GRPO), embedded in a two-stage curriculum:

  • Stage 1 (Foundational Skills): Training is limited to Simple QA instances (1–6 hops) until validation Pass@1 exceeds a threshold \delta_1.
  • Stage 2 (Advanced Reasoning): The curriculum linearly increases the share of Parallel and Combo QA (6–12 hops), targeting complex, long-horizon synthetic tasks.
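The two-stage schedule might be sketched as a task sampler; the ramp length and passing the Stage 1 gate as a boolean are assumptions:

```python
import random

def sample_task(step, stage1_done, simple_pool, advanced_pool, rng, ramp_steps=1000):
    """Two-stage curriculum sampler (ramp length and pool shapes are assumptions).
    Stage 1 draws only Simple QA (1-6 hops); once the Pass@1 gate is passed, the
    share of Parallel/Combo QA (6-12 hops) ramps linearly to 100%."""
    advanced_share = min(1.0, step / ramp_steps) if stage1_done else 0.0
    pool = advanced_pool if rng.random() < advanced_share else simple_pool
    return rng.choice(pool)
```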

The GRPO update samples N trajectories, computes standardized reward advantages \hat{A}_i, and applies the clipped-surrogate RL objective

\mathcal{L}(\theta) = \mathbb{E}_i \left[ \min\left( \rho_i(\theta) \hat{A}_i,\ \mathrm{clip}(\rho_i(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_i \right) - \beta\, D_{\mathrm{KL}}\left[ \pi_\theta(\mathcal{T}_i) \,\|\, \pi_{\mathrm{ref}}(\mathcal{T}_i) \right] \right]

with \epsilon = 0.4, \beta = 0, and \rho_i = \pi_\theta / \pi_{\theta_{\mathrm{old}}}.
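The group-standardized advantages and the clipped surrogate (with \beta = 0, so no KL term) can be sketched as:

```python
import math

def grpo_advantages(rewards):
    """Standardize terminal rewards within a group of N sampled trajectories."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n) or 1.0  # guard std=0
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(ratios, advantages, eps=0.4):
    """Mean per-trajectory clipped objective; beta = 0 drops the KL penalty."""
    terms = [min(rho * a, max(min(rho, 1 + eps), 1 - eps) * a)
             for rho, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

In practice the ratios \rho_i are per-token likelihood ratios under \pi_\theta and \pi_{\theta_{\mathrm{old}}}; scalars are used here to keep the sketch self-contained.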

Ablation experiments indicate that omitting the advanced curriculum stage (Stage 2) results in large performance drops (e.g., GAIA Pass@4 falls from 42.72 to 28.16), confirming its necessity for training long-horizon planners (Zhang et al., 21 Jan 2026).

4. Sim-to-Real Generalization Protocol

Agents trained solely within SearchGym's simulated environment are evaluated in two real-world settings without RL fine-tuning:

  • Local Wikipedia: a 2018 dump indexed with dense passage retrievers.
  • Live Web: live queries through the Google Search API.

Evaluation spans 10 established question-answering and research-oriented benchmarks, including NQ, TriviaQA, and PopQA (single-hop), HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle (multi-hop), and GAIA and xbench-DeepSearch (deep research), plus the synthetic SearchGymBench suite. Reported metrics are Pass@1 (standard QA), Pass@4 (deep research), and efficiency statistics (#Search, #Access, tokens/query). Notably, the Qwen-2.5-7B-Base agent trained in SearchGym achieves a +10.6% average relative lift over the web-enhanced ASearcher baseline across 9 benchmarks, with an absolute gain of +3.89% on GAIA and +17 points on xbench (Pass@4), while reducing average Search calls by 37.3% and eliminating API costs (Zhang et al., 21 Jan 2026).

No post-hoc RL adaptation is performed; improvements are robust across multiple seeds, indicating effective sim-to-real transfer.

5. Empirical Outcomes and Ablations

Quantitative results demonstrate that SearchGym-trained agents consistently outperform or match leading baselines, both in accuracy and sample efficiency. For example, on single- and multi-hop benchmarks, Qwen-2.5-7B-Base attains higher Pass@1 than ASearcher-web (e.g., 66.5 vs. 61.3 on HotpotQA, 74.4 vs. 67.7 on 2Wiki). On challenging synthetic and open-ended research tasks (GAIA, xbench), gains are larger, especially as curriculum depth and corpus coverage increase.

Ablation studies highlight:

  • Separate Access actions are necessary; collapsing to search-only yields a 5.12-percentage-point drop on deep-research benchmarks.
  • Curriculum depth is critical; omitting the advanced stage results in ~15-point declines.
  • No overfitting is observed as corpus coverage or maximum-hop exposure scales, implying factual and data-driven robustness.

Performance, efficiency, and robustness are summarized in the following table (selected results):

Method          HotpotQA   2Wiki   GAIA (Pass@4)   xbench (Pass@4)   Avg. Search   Avg. Access   Web Cost
ASearcher-web   61.3       67.7    38.83           32.00             5.92          0.07          >$500
Ours (Base)     66.5       74.4    42.72           49.00             3.71          0.90          $0

These results confirm that simulation-driven RL in verifiable, high-fidelity environments supports the development of performant, cost-effective search agents (Zhang et al., 21 Jan 2026).

6. Context and Significance

SearchGym addresses long-standing barriers in training search and reasoning agents:

  • Eliminates corrupted reward signals from web snapshot misalignment.
  • Grounds all tasks in provable answers with deterministic, reproducible evaluation.
  • Enables scalable, curriculum-guided RL with well-defined sample efficiency and action granularity metrics.

Compared to prior Gym-style frameworks targeting ML-driven hardware architecture search (e.g., ArchGym (Krishnan et al., 2023)), SearchGym is specialized for open-ended, factually anchored, web-scale reasoning. The methodology nonetheless maintains structural parallels: both use standardized Gym-style APIs, verifiable task generation, and modular agent integration.

A plausible implication is that principled simulation environments, coupled with structured curriculum learning, will remain foundational for advancing real-world, high-reliability search and planning agents. SearchGym’s design and demonstrated sim-to-real generalization provide an empirical reference for future work in verifiable agent training, benchmark construction, and systematic RL evaluation.
