SearchGym: Simulation for Fact-Based Search Agents
- SearchGym is a simulation environment defined by a verified knowledge graph and a synthetic document corpus, enabling reproducible, fact-grounded reasoning.
- It models search tasks as a deterministic MDP with clearly defined state, action, and reward functions to benchmark agent performance.
- Its curriculum learning and sim-to-real generalization methods significantly improve accuracy and sample efficiency on complex, multi-hop reasoning tasks.
SearchGym is a high-fidelity simulation environment purpose-built to enable robust, reproducible training of search agents for open-ended, knowledge-intensive reasoning tasks. Motivated by the systemic instability in reinforcement learning (RL) stemming from cost-prohibitive commercial API interactions and data misalignment in static web snapshots, SearchGym introduces a fully generative pipeline for constructing factually grounded, strictly solvable reasoning benchmarks. Complemented by a staged curriculum learning framework and demonstrated sim-to-real generalization, SearchGym establishes a new methodological paradigm for training and evaluating real-world search agents at scale (Zhang et al., 21 Jan 2026).
1. System Architecture and Generative Pipeline
The architectural foundation of SearchGym is a programmatically generated “synthetic world,” consisting of a verified knowledge graph and an aligned document corpus. The knowledge graph is constructed atop a schema that specifies entity types (e.g., Person, City, Country) and relation signatures with explicit cardinalities (1–1, 1–n, n–1). Approximately 3,600 entities are instantiated by sampling from attribute distributions (e.g., uniform birth years over a fixed range, Zipf-style degree distributions for relational edges).
Edges are selected probabilistically: a relation type is drawn from a weighted distribution over relation types, and candidate targets are uniformly sampled among type-compatible nodes, with structural constraints (acyclicity, cardinality) enforced.
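The edge-construction step above can be sketched as follows. This is a minimal illustration, not the paper's actual sampler: the node/relation schema, the in-degree cap standing in for cardinality, and the rejection of direct back-edges (a weak proxy for full acyclicity) are all assumptions of this sketch.

```python
import random

def sample_edges(nodes, relation_types, rel_weights, max_in_degree, rng=None):
    """Illustrative probabilistic edge construction: for each source node,
    draw a relation type from a weighted distribution, then uniformly pick
    a type-compatible target, enforcing an in-degree cap and rejecting
    direct back-edges."""
    rng = rng or random.Random(0)
    edges, edge_set = [], set()
    in_degree = {n["id"]: 0 for n in nodes}
    for src in nodes:
        rel = rng.choices(relation_types, weights=rel_weights)[0]
        cands = [t for t in nodes
                 if t["type"] == rel["target_type"] and t["id"] != src["id"]]
        rng.shuffle(cands)  # uniform choice among compatible targets
        for tgt in cands:
            if in_degree[tgt["id"]] >= max_in_degree:
                continue  # cardinality: cap incoming edges per target
            if (tgt["id"], rel["name"], src["id"]) in edge_set:
                continue  # reject 2-cycles (full acyclicity needs a graph check)
            e = (src["id"], rel["name"], tgt["id"])
            edges.append(e)
            edge_set.add(e)
            in_degree[tgt["id"]] += 1
            break
    return edges
```

A real implementation would additionally run a global cycle check and draw relation weights from the schema rather than a flat list.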
Each node is paired with a synthetic, Wikipedia-style document produced by a frozen LLM invoked on a prompt containing the node's facts and its 1-hop neighborhood. Each document is assigned a unique URL to enable atomic Search and Access actions.
Factual alignment is enforced by retrieval-based verification: for every edge, a set of 15 automatically generated queries is submitted to a retrieval engine over the document corpus. The resulting alignment score filters edges into a verified subgraph via thresholding, ensuring that all reasoning paths used for downstream RL are fully discoverable and free of retrieval artifacts. This generative pipeline decouples environment stochasticity from spurious document alignment errors, producing a verifiable world model (Zhang et al., 21 Jan 2026).
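The verification filter can be sketched as below. The scoring rule (fraction of queries whose top-k results surface the edge's source or target document), the threshold value `tau=0.8`, and the toy retriever interface are assumptions of this sketch; the excerpt does not specify the paper's exact score or cutoff.

```python
def alignment_score(edge, queries, retriever, top_k=5):
    """Fraction of queries for which the retriever surfaces the edge's
    source or target document among its top-k results (illustrative)."""
    src, _rel, tgt = edge
    hits = sum(1 for q in queries if {src, tgt} & set(retriever(q, top_k)))
    return hits / len(queries)

def verify_subgraph(edges, make_queries, retriever, tau=0.8):
    """Keep only edges whose alignment score clears tau; tau=0.8 is a
    placeholder for the paper's (unstated here) threshold."""
    return [e for e in edges
            if alignment_score(e, make_queries(e), retriever) >= tau]
```

Edges whose facts the retriever cannot rediscover in the corpus are dropped, which is what guarantees that every retained reasoning path is answerable by search alone.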
2. Environment Design: State, Actions, and Transitions
SearchGym environments are cast as deterministic Markov decision processes (MDPs).
- State: at each turn, the state comprises the question, the set of visited URLs, a buffer of retrieved snippets, and the agent's internal step counter.
- Action space: the atomic actions are
- Search: issues a free-form query, returning top-5 snippet summaries with URLs.
- Access: fetches the complete document at a given URL.
- Answer: terminates the episode with the agent's final answer.
State transitions are deterministic given the action: Search extends the snippet buffer, Access augments the set of seen documents, and Answer immediately ends the episode.
- Reward: a sparse, terminal-only reward is implemented as the normalized token-level F1 between the agent's answer and the ground-truth answer. No intermediate shaping rewards are used, focusing credit assignment on successful reasoning chains.
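The MDP described above can be sketched as a minimal environment. The class and method names, the substring-based toy retrieval, and the 50-character snippet truncation are assumptions of this sketch, not SearchGym's actual API; only the Search/Access/Answer action set and the terminal token-F1 reward come from the text.

```python
from collections import Counter

def token_f1(pred, gold):
    """Normalized token-level F1 between predicted and gold answers."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)

class SearchEnv:
    """Minimal deterministic MDP sketch over a toy corpus (url -> text)."""
    def __init__(self, corpus, question, gold):
        self.corpus = corpus
        self.question, self.gold = question, gold
        self.snippets, self.visited, self.t = [], set(), 0

    def step(self, action, arg):
        """Return (reward, done); reward is zero except at Answer."""
        self.t += 1
        if action == "search":       # top-5 snippet summaries with URLs
            hits = [(u, d[:50]) for u, d in self.corpus.items()
                    if any(w in d for w in arg.split())][:5]
            self.snippets.extend(hits)
            return 0.0, False
        if action == "access":       # fetch the complete document
            self.visited.add(arg)
            return 0.0, False
        # "answer": terminal, sparse token-level F1 reward
        return token_f1(arg, self.gold), True
```

Note how Search and Access carry zero reward, so all credit flows through the terminal Answer step, matching the sparse-reward design.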
By construction, every question–answer pair is sampled along a path through the verified knowledge graph, enforcing temporal consistency, linguistic clarity, and factual completeness. Every task is strictly solvable within the synthetic corpus (Zhang et al., 21 Jan 2026).
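Path-based QA sampling can be sketched as a random walk over the verified subgraph. The walk logic is a plausible illustration; the template-string question is a toy stand-in for the paper's LLM-generated question phrasing.

```python
import random

def sample_qa_path(edges, hops, rng=None):
    """Chain up to `hops` relations from a random start node; the question
    composes the relations and the answer is the terminal entity."""
    rng = rng or random.Random(0)
    adj = {}
    for s, r, t in edges:
        adj.setdefault(s, []).append((r, t))
    node = rng.choice(sorted(adj))
    start, rels = node, []
    for _ in range(hops):
        if node not in adj:
            break  # walk ended early; the question simply uses fewer hops
        rel, node = rng.choice(adj[node])
        rels.append(rel)
    question = f"Starting from {start}, follow {' then '.join(rels)}?"
    return question, node
```

Because the answer is read off the walk's endpoint, solvability within the corpus is guaranteed by construction, mirroring the property claimed above.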
3. SearchGym-RL: Curriculum and Policy Optimization
Policy optimization in SearchGym is conducted via Group Relative Policy Optimization (GRPO), embedded in a two-stage curriculum:
- Stage 1 (Foundational Skills): training is limited to Simple QA instances (1–6 hops) until validation Pass@1 exceeds a fixed threshold.
- Stage 2 (Advanced Reasoning): The curriculum linearly increases the share of Parallel and Combo QA (6–12 hops), targeting complex long-horizon synthetic tasks.
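The two-stage curriculum above can be sketched as a task-mixing schedule. The threshold value, the 50/50 Parallel/Combo split, and the linear ramp shape are assumptions of this sketch; the excerpt states only that Stage 1 gates on a Pass@1 threshold and Stage 2 linearly increases the hard-task share.

```python
def curriculum_mix(stage, progress, pass_at_1, threshold=0.5):
    """Return (sampling weights over task families, advance flag).
    threshold=0.5 and the even Parallel/Combo split are placeholders."""
    if stage == 1:
        # Stage 1: Simple QA only, until validation Pass@1 clears the gate.
        advance = pass_at_1 >= threshold
        return {"simple": 1.0, "parallel": 0.0, "combo": 0.0}, advance
    # Stage 2: linearly ramp the hard-task share with progress in [0, 1].
    hard = min(max(progress, 0.0), 1.0)
    return {"simple": 1.0 - hard,
            "parallel": 0.5 * hard,
            "combo": 0.5 * hard}, True
```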
The GRPO update samples a group of trajectories per prompt, computes group-standardized reward advantages, and applies the clipped-surrogate RL objective with standard clipping and KL-regularization hyperparameters.
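The two core GRPO computations can be sketched as follows. The group-standardized advantage is the standard GRPO formulation; the clip value 0.2 is a conventional PPO default, not the paper's reported setting, and the KL term is omitted for brevity.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-standardized advantages: each trajectory's reward relative to
    the group mean, scaled by the group standard deviation."""
    mu = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, adv, clip=0.2):
    """PPO-style clipped objective term for one step: take the pessimistic
    minimum of the raw and ratio-clipped terms."""
    clipped_ratio = max(min(ratio, 1 + clip), 1 - clip)
    return min(ratio * adv, clipped_ratio * adv)
```

With terminal-only F1 rewards, the group standardization is what turns a sparse scalar into a usable learning signal: trajectories better than their group get positive advantage, worse ones negative.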
Ablation experiments indicate that omitting advanced curriculum (Stage 2) results in large performance drops (e.g., Pass@4 on GAIA from 42.72 to 28.16), confirming its necessity for training long-horizon planners (Zhang et al., 21 Jan 2026).
4. Sim-to-Real Generalization Protocol
Agents trained solely within SearchGym's simulated environment are evaluated in two real-world settings without RL fine-tuning:
- Local Wikipedia: 2018 dump, retrieved using dense passage retrievers.
- Live Web: Interfacing with Google Search API.
Evaluation spans 10 established question answering and research-oriented benchmarks, including NQ, TriviaQA, PopQA (single-hop), HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle (multi-hop), GAIA, and xbench-DeepSearch (deep research), plus the synthetic SearchGymBench suite. Metrics reported are Pass@1 (standard QA), Pass@4 (deep research), and efficiency statistics (#Search, #Access, tokens/query). Notably, the Qwen-2.5-7B-Base agent trained in SearchGym achieves a +10.6% average relative lift over the web-enhanced ASearcher baseline on 9 benchmarks, with an absolute gain of +3.89% on GAIA and +17 points on xbench (Pass@4), while simultaneously reducing average Search calls by 37.3% and eliminating API costs (Zhang et al., 21 Jan 2026).
No post-hoc RL adaptation is performed; improvements are robust across multiple seeds, indicating effective sim-to-real transfer.
5. Empirical Outcomes and Ablations
Quantitative results demonstrate that SearchGym-trained agents consistently outperform or match leading baselines, both in accuracy and sample efficiency. For example, on single- and multi-hop benchmarks, Qwen-2.5-7B-Base attains higher Pass@1 than ASearcher-web (e.g., 66.5 vs. 61.3 on HotpotQA, 74.4 vs. 67.7 on 2Wiki). On challenging synthetic and open-ended research tasks (GAIA, xbench), gains are larger, especially as curriculum depth and corpus coverage increase.
Ablation studies highlight:
- The necessity of separate Access actions; collapsing into search-only yields a 5.12 percentage point drop on deep-research benchmarks.
- Curriculum depth is critical; omitting the advanced stage results in ~15-point declines.
- No overfitting is observed as corpus coverage or max-hop exposure scale, implying factual and data-driven robustness.
Performance, efficiency, and robustness are summarized in the following table (selected results):
| Method | HotpotQA | 2Wiki | GAIA (Pass@4) | xbench (Pass@4) | Avg. Search | Avg. Access | Web Cost |
|---|---|---|---|---|---|---|---|
| ASearcher-web | 61.3 | 67.7 | 38.83 | 32.00 | 5.92 | 0.07 | >$500 |
| Ours (Base) | 66.5 | 74.4 | 42.72 | 49.00 | 3.71 | 0.90 | $0 |
These results confirm that simulation-driven RL in verifiable, high-fidelity environments supports the development of performant, cost-effective search agents (Zhang et al., 21 Jan 2026).
6. Context and Significance
SearchGym addresses long-standing barriers in training search and reasoning agents:
- Eliminates corrupted reward signals from web snapshot misalignment.
- Grounds all tasks in provable answers with deterministic, reproducible evaluation.
- Enables scalable, curriculum-guided RL with well-defined sample efficiency and action granularity metrics.
Compared to prior Gym-style frameworks targeting ML-driven hardware architecture search (e.g., ArchGym (Krishnan et al., 2023)), SearchGym is specialized for open-ended, factually anchored web-scale reasoning. The methodology, however, maintains structural parallels: both utilize standardized Gym-style APIs, verifiable task generation, and modular agent integration.
A plausible implication is that principled simulation environments, coupled with structured curriculum learning, will remain foundational for advancing real-world, high-reliability search and planning agents. SearchGym’s design and demonstrated sim-to-real generalization provide an empirical reference for future work in verifiable agent training, benchmark construction, and systematic RL evaluation.