SearchGym: Simulation for Fact-Based Search Agents
- SearchGym is a simulation environment defined by a verified knowledge graph and a synthetic document corpus, enabling reproducible, fact-grounded reasoning.
- It models search tasks as a deterministic MDP with clearly defined state, action, and reward functions to benchmark agent performance.
- Its curriculum learning and sim-to-real generalization methods significantly improve accuracy and sample efficiency on complex, multi-hop reasoning tasks.
SearchGym is a high-fidelity simulation environment purpose-built to enable robust, reproducible training of search agents for open-ended, knowledge-intensive reasoning tasks. Motivated by the systemic instability in reinforcement learning (RL) stemming from cost-prohibitive commercial API interactions and data misalignment in static web snapshots, SearchGym introduces a fully generative pipeline for constructing factually grounded, strictly solvable reasoning benchmarks. Complemented by a staged curriculum learning framework and demonstrated sim-to-real generalization, SearchGym establishes a new methodological paradigm for training and evaluating real-world search agents at scale (Zhang et al., 21 Jan 2026).
1. System Architecture and Generative Pipeline
The architectural foundation of SearchGym is a programmatically generated “synthetic world,” consisting of a verified knowledge graph and an aligned document corpus. The knowledge graph is constructed atop a schema that specifies entity types (e.g., Person, City, Country) and relation signatures with explicit cardinalities (1–1, 1–n, n–1). Approximately 3,600 entities are instantiated by sampling from attribute distributions (e.g., uniform birth years over a fixed range, Zipf-style degree distributions for relational edges).
Edges are selected probabilistically: a relation type is drawn from a weighted distribution over relation types, and candidate targets are uniformly sampled among type-compatible nodes, with structural constraints (acyclicity, cardinality) enforced.
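The edge-construction step above can be sketched as follows. This is a minimal illustration, not the paper's actual sampler: the node/relation schema, the in-degree cap standing in for cardinality, and the rejection of direct back-edges (a weak proxy for full acyclicity) are all assumptions of this sketch.

```python
import random

def sample_edges(nodes, relation_types, rel_weights, max_in_degree, rng=None):
    """Illustrative probabilistic edge construction: for each source node,
    draw a relation type from a weighted distribution, then uniformly pick
    a type-compatible target, enforcing an in-degree cap and rejecting
    direct back-edges."""
    rng = rng or random.Random(0)
    edges, edge_set = [], set()
    in_degree = {n["id"]: 0 for n in nodes}
    for src in nodes:
        rel = rng.choices(relation_types, weights=rel_weights)[0]
        cands = [t for t in nodes
                 if t["type"] == rel["target_type"] and t["id"] != src["id"]]
        rng.shuffle(cands)  # uniform choice among compatible targets
        for tgt in cands:
            if in_degree[tgt["id"]] >= max_in_degree:
                continue  # cardinality: cap incoming edges per target
            if (tgt["id"], rel["name"], src["id"]) in edge_set:
                continue  # reject 2-cycles (full acyclicity needs a graph check)
            e = (src["id"], rel["name"], tgt["id"])
            edges.append(e)
            edge_set.add(e)
            in_degree[tgt["id"]] += 1
            break
    return edges
```

A real implementation would additionally run a global cycle check and draw relation weights from the schema rather than a flat list.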
Each node is paired with a synthetic, Wikipedia-style document produced by a frozen LLM invoked on a prompt containing the node's facts and its 1-hop neighborhood. Each document is assigned a unique URL to enable atomic Search and Access actions.
Factual alignment is enforced by retrieval-based verification: for every edge, a set of 15 automatically generated queries is submitted to a retrieval engine over the document corpus. The resulting alignment score filters edges into a verified subgraph via thresholding, ensuring that all reasoning paths used for downstream RL are fully discoverable and free of retrieval artifacts. This generative pipeline decouples environment stochasticity from spurious document alignment errors, producing a verifiable world model (Zhang et al., 21 Jan 2026).
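The verification filter can be sketched as below. The scoring rule (fraction of queries whose top-k results surface the edge's source or target document), the threshold value `tau=0.8`, and the toy retriever interface are assumptions of this sketch; the excerpt does not specify the paper's exact score or cutoff.

```python
def alignment_score(edge, queries, retriever, top_k=5):
    """Fraction of queries for which the retriever surfaces the edge's
    source or target document among its top-k results (illustrative)."""
    src, _rel, tgt = edge
    hits = sum(1 for q in queries if {src, tgt} & set(retriever(q, top_k)))
    return hits / len(queries)

def verify_subgraph(edges, make_queries, retriever, tau=0.8):
    """Keep only edges whose alignment score clears tau; tau=0.8 is a
    placeholder for the paper's (unstated here) threshold."""
    return [e for e in edges
            if alignment_score(e, make_queries(e), retriever) >= tau]
```

Edges whose facts the retriever cannot rediscover in the corpus are dropped, which is what guarantees that every retained reasoning path is answerable by search alone.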
2. Environment Design: State, Actions, and Transitions
SearchGym environments are cast as deterministic Markov decision processes (MDPs).
- State: at each turn, the state comprises the question, the set of visited URLs, a buffer of retrieved snippets, and the agent's internal step counter.
- Action space: the atomic actions are
- Search: issues a free-form query, returning top-5 snippet summaries with URLs.
- Access: fetches the complete document at a given URL.
- Answer: terminates the episode with the agent's final answer.
State transitions are deterministic given the action: Search extends the snippet buffer, Access augments the set of seen documents, and Answer immediately ends the episode.
- Reward: a sparse, terminal-only reward is implemented as the normalized token-level F1 between the agent's answer and the ground-truth answer. No intermediate shaping rewards are used, focusing credit assignment on successful reasoning chains.
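The MDP described above can be sketched as a minimal environment. The class and method names, the substring-based toy retrieval, and the 50-character snippet truncation are assumptions of this sketch, not SearchGym's actual API; only the Search/Access/Answer action set and the terminal token-F1 reward come from the text.

```python
from collections import Counter

def token_f1(pred, gold):
    """Normalized token-level F1 between predicted and gold answers."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)

class SearchEnv:
    """Minimal deterministic MDP sketch over a toy corpus (url -> text)."""
    def __init__(self, corpus, question, gold):
        self.corpus = corpus
        self.question, self.gold = question, gold
        self.snippets, self.visited, self.t = [], set(), 0

    def step(self, action, arg):
        """Return (reward, done); reward is zero except at Answer."""
        self.t += 1
        if action == "search":       # top-5 snippet summaries with URLs
            hits = [(u, d[:50]) for u, d in self.corpus.items()
                    if any(w in d for w in arg.split())][:5]
            self.snippets.extend(hits)
            return 0.0, False
        if action == "access":       # fetch the complete document
            self.visited.add(arg)
            return 0.0, False
        # "answer": terminal, sparse token-level F1 reward
        return token_f1(arg, self.gold), True
```

Note how Search and Access carry zero reward, so all credit flows through the terminal Answer step, matching the sparse-reward design.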
By construction, every question–answer pair is sampled along a path through the verified knowledge graph, enforcing temporal consistency, linguistic clarity, and factual completeness. Every task is strictly solvable within the synthetic corpus (Zhang et al., 21 Jan 2026).
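Path-based QA sampling can be sketched as a random walk over the verified subgraph. The walk logic is a plausible illustration; the template-string question is a toy stand-in for the paper's LLM-generated question phrasing.

```python
import random

def sample_qa_path(edges, hops, rng=None):
    """Chain up to `hops` relations from a random start node; the question
    composes the relations and the answer is the terminal entity."""
    rng = rng or random.Random(0)
    adj = {}
    for s, r, t in edges:
        adj.setdefault(s, []).append((r, t))
    node = rng.choice(sorted(adj))
    start, rels = node, []
    for _ in range(hops):
        if node not in adj:
            break  # walk ended early; the question simply uses fewer hops
        rel, node = rng.choice(adj[node])
        rels.append(rel)
    question = f"Starting from {start}, follow {' then '.join(rels)}?"
    return question, node
```

Because the answer is read off the walk's endpoint, solvability within the corpus is guaranteed by construction, mirroring the property claimed above.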
3. SearchGym-RL: Curriculum and Policy Optimization
Policy optimization in SearchGym is conducted via Group Relative Policy Optimization (GRPO), embedded in a two-stage curriculum:
- Stage 1 (Foundational Skills): training is limited to Simple QA instances (1–6 hops) until validation Pass@1 exceeds a fixed threshold.
- Stage 2 (Advanced Reasoning): The curriculum linearly increases the share of Parallel and Combo QA (6–12 hops), targeting complex long-horizon synthetic tasks.
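The two-stage curriculum above can be sketched as a task-mixing schedule. The threshold value, the 50/50 Parallel/Combo split, and the linear ramp shape are assumptions of this sketch; the excerpt states only that Stage 1 gates on a Pass@1 threshold and Stage 2 linearly increases the hard-task share.

```python
def curriculum_mix(stage, progress, pass_at_1, threshold=0.5):
    """Return (sampling weights over task families, advance flag).
    threshold=0.5 and the even Parallel/Combo split are placeholders."""
    if stage == 1:
        # Stage 1: Simple QA only, until validation Pass@1 clears the gate.
        advance = pass_at_1 >= threshold
        return {"simple": 1.0, "parallel": 0.0, "combo": 0.0}, advance
    # Stage 2: linearly ramp the hard-task share with progress in [0, 1].
    hard = min(max(progress, 0.0), 1.0)
    return {"simple": 1.0 - hard,
            "parallel": 0.5 * hard,
            "combo": 0.5 * hard}, True
```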
The GRPO update samples a group of trajectories per prompt, computes group-standardized reward advantages, and applies the clipped-surrogate RL objective with standard clipping and KL-regularization hyperparameters.
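The two core GRPO computations can be sketched as follows. The group-standardized advantage is the standard GRPO formulation; the clip value 0.2 is a conventional PPO default, not the paper's reported setting, and the KL term is omitted for brevity.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-standardized advantages: each trajectory's reward relative to
    the group mean, scaled by the group standard deviation."""
    mu = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, adv, clip=0.2):
    """PPO-style clipped objective term for one step: take the pessimistic
    minimum of the raw and ratio-clipped terms."""
    clipped_ratio = max(min(ratio, 1 + clip), 1 - clip)
    return min(ratio * adv, clipped_ratio * adv)
```

With terminal-only F1 rewards, the group standardization is what turns a sparse scalar into a usable learning signal: trajectories better than their group get positive advantage, worse ones negative.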
Ablation experiments indicate that omitting advanced curriculum (Stage 2) results in large performance drops (e.g., Pass@4 on GAIA from 42.72 to 28.16), confirming its necessity for training long-horizon planners (Zhang et al., 21 Jan 2026).
4. Sim-to-Real Generalization Protocol
Agents trained solely within SearchGym's simulated environment are evaluated in two real-world settings without RL fine-tuning:
- Local Wikipedia: 2018 dump, retrieved using dense passage retrievers.
- Live Web: Interfacing with Google Search API.
Evaluation spans 10 established question answering and research-oriented benchmarks, including NQ, TriviaQA, PopQA (single-hop), HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle (multi-hop), GAIA, and xbench-DeepSearch (deep research), plus the synthetic SearchGymBench suite. Metrics reported are Pass@1 (standard QA), Pass@4 (deep research), and efficiency statistics (#Search, #Access, tokens/query). Notably, the Qwen-2.5-7B-Base agent trained in SearchGym achieves a +10.6% average relative lift over the web-enhanced ASearcher baseline on 9 benchmarks, with an absolute gain of +3.89% on GAIA and +17 points on xbench (Pass@4), while simultaneously reducing average Search calls by 37.3% and eliminating API costs (Zhang et al., 21 Jan 2026).
No post-hoc RL adaptation is performed; improvements are robust across multiple seeds, indicating effective sim-to-real transfer.
5. Empirical Outcomes and Ablations
Quantitative results demonstrate that SearchGym-trained agents consistently outperform or match leading baselines, both in accuracy and sample efficiency. For example, on single- and multi-hop benchmarks, Qwen-2.5-7B-Base attains higher Pass@1 than ASearcher-web (e.g., 66.5 vs. 61.3 on HotpotQA, 74.4 vs. 67.7 on 2Wiki). On challenging synthetic and open-ended research tasks (GAIA, xbench), gains are larger, especially as curriculum depth and corpus coverage increase.
Ablation studies highlight:
- The necessity of separate Access actions; collapsing into search-only yields a 5.12 percentage point drop on deep-research benchmarks.
- Curriculum depth is critical; omitting the advanced stage results in ~15-point declines.
- No overfitting is observed as corpus coverage or max-hop exposure scale, implying factual and data-driven robustness.
Performance, efficiency, and robustness are summarized in the following table (selected results):
| Method | HotpotQA | 2Wiki | GAIA (Pass@4) | xbench (Pass@4) | Avg. Search | Avg. Access | Web Cost |
|---|---|---|---|---|---|---|---|
| ASearcher-web | 61.3 | 67.7 | 38.83 | 32.00 | 5.92 | 0.07 | >$500 |
| Ours (Base) | 66.5 | 74.4 | 42.72 | 49.00 | 3.71 | 0.90 | $0 |
These results confirm that simulation-driven RL in verifiable, high-fidelity environments supports the development of performant, cost-effective search agents (Zhang et al., 21 Jan 2026).
6. Context and Significance
SearchGym addresses long-standing barriers in training search and reasoning agents:
- Eliminates corrupted reward signals from web snapshot misalignment.
- Grounds all tasks in provable answers with deterministic, reproducible evaluation.
- Enables scalable, curriculum-guided RL with well-defined sample efficiency and action granularity metrics.
Compared to prior Gym-style frameworks targeting ML-driven hardware architecture search (e.g., ArchGym (Krishnan et al., 2023)), SearchGym is specialized for open-ended, factually anchored web-scale reasoning. The methodology, however, maintains structural parallels: both utilize standardized Gym-style APIs, verifiable task generation, and modular agent integration.
A plausible implication is that principled simulation environments, coupled with structured curriculum learning, will remain foundational for advancing real-world, high-reliability search and planning agents. SearchGym’s design and demonstrated sim-to-real generalization provide an empirical reference for future work in verifiable agent training, benchmark construction, and systematic RL evaluation.