AgentGym: Open-Source LLM Agent Suite
- AgentGym (R2E-Gym) is a suite of open-source frameworks that train and evaluate LLM-based agents on diverse, long-horizon real-world tasks.
- It features modular, procedural environment construction with unified APIs, enabling reproducible benchmarks and multi-task agent training.
- The framework leverages self-evolution, reinforcement learning, and hybrid test-time verification to achieve state-of-the-art performance in both SWE and generalist domains.
AgentGym, also referenced as R2E-Gym and AgentGym-RL, is a suite of open-source frameworks and large-scale datasets for training, evolving, and evaluating LLM-based agents on diverse, long-horizon, and real-world tasks. It combines interactive, multi-task gym-like environments, unified benchmark and trajectory repositories, and specialized training protocols ranging from supervised behavioral cloning to reinforcement learning and hybrid verification techniques. AgentGym targets both software engineering (SWE) agents for real-world code repair and generalist LLM agents for domains such as web navigation, embodied reasoning, scientific experimentation, and tool use. The primary contributions include scalable procedural environment construction (SYNGEN), multi-modal environment APIs, self-evolution algorithms for agent improvement, and hybrid test-time verifiers. These enable both reproduction of prior results and scaling open-weight agents to match or surpass established proprietary systems (Jain et al., 9 Apr 2025, Xi et al., 10 Sep 2025, Xi et al., 6 Jun 2024).
1. System Architecture and Environment Construction
AgentGym frameworks decouple environment simulation, agent interface, and learning/training components. Each environment is deployed as a standalone HTTP service exposing standardized routes for environment reset, state observation, available actions, and agent action execution. The architecture enables modular integration and cross-task compatibility: new environments integrate by subclassing a base environment client and implementing the required API methods. Observations and actions are serialized in a unified JSON format that is compatible with ReAct-style prompt engineering.
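A minimal sketch of a client for one such environment service is shown below, assuming illustrative route names (`/reset`, `/observation`, `/step`) and a hypothetical `BaseEnvClient` class; it is not the exact AgentGym API.

```python
import requests


class BaseEnvClient:
    """Illustrative base client for an AgentGym-style environment HTTP service.

    Route names and payload fields are assumptions for this sketch, not the
    published API; real environments expose analogous reset/observe/step routes.
    """

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def _post(self, route: str, payload: dict) -> dict:
        resp = requests.post(f"{self.base_url}/{route}", json=payload, timeout=60)
        resp.raise_for_status()
        return resp.json()

    def reset(self, task_id: int) -> dict:
        # Start a new episode for a given task; returns the initial observation.
        return self._post("reset", {"id": task_id})

    def observe(self, env_id: str) -> str:
        # Fetch the current JSON-serialized observation for an active episode.
        return self._post("observation", {"env_id": env_id})["observation"]

    def step(self, env_id: str, action: str) -> dict:
        # Execute one ReAct-style action string; returns observation, reward, done.
        return self._post("step", {"env_id": env_id, "action": action})


class WebShopClient(BaseEnvClient):
    """New environments integrate by subclassing the base client (shown here for
    a hypothetical WebShop wrapper) and overriding environment-specific methods."""
    pass
```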
SYNGEN for Procedural SWE Environments
R2E-Gym’s core environment builder, SYNGEN, systematically curates executable software engineering problems from openly available GitHub repositories. The construction pipeline involves the following stages (a sketch of the commit-filtering stage appears after this list):
- Repository & Commit Selection: Large commit histories are programmatically filtered via heuristics (maximum changed lines, files, and AST-detected code diffs) plus LLM-based commit quality judges.
- Build-Environment Curation: For each candidate commit, multiple dependency configurations are iteratively tested in Docker containers to maximize the executable yield.
- Test Extraction & Generation: Both human-authored and LLM-generated failing-to-passing (F2P) tests are curated, ensuring availability of high-fidelity regression and reproduction test suites.
- Back-Translation of Issues: Issue descriptions are synthesized via LLM prompting over commit diffs and test traces, so that environment curation no longer requires explicit human-written issue reports. This expands the dataset to >8,000 unique, procedurally verifiable SWE problems (Jain et al., 9 Apr 2025).
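As referenced above, the sketch below illustrates the heuristic commit-filtering stage; the thresholds, field names, and LLM-judge interface are assumptions for exposition, not the published SYNGEN configuration.

```python
from dataclasses import dataclass


@dataclass
class CommitCandidate:
    sha: str
    changed_lines: int
    changed_files: int
    non_test_code_edits: int   # code-level edits detected via AST diffing
    message: str


# Illustrative thresholds; the actual SYNGEN heuristics may differ.
MAX_CHANGED_LINES = 200
MAX_CHANGED_FILES = 5


def passes_heuristics(c: CommitCandidate) -> bool:
    """First-stage filter: cheap structural checks before invoking an LLM judge."""
    return (
        c.changed_lines <= MAX_CHANGED_LINES
        and c.changed_files <= MAX_CHANGED_FILES
        and c.non_test_code_edits > 0          # must actually modify source code
    )


def llm_commit_quality(judge, c: CommitCandidate) -> float:
    """Second-stage filter: ask an LLM judge (hypothetical interface) whether the
    commit looks like a focused, reproducible bug fix; returns a score in [0, 1]."""
    prompt = f"Rate 0-1 whether this commit is a focused, testable bug fix:\n{c.message}"
    return float(judge(prompt))


def select_commits(commits, judge, threshold: float = 0.7):
    shortlisted = [c for c in commits if passes_heuristics(c)]
    return [c for c in shortlisted if llm_commit_quality(judge, c) >= threshold]
```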
A similar modularization underpins the domain-general AgentGym suite, which incorporates 14 environments spanning web interaction (WebShop, WebArena), grid-world navigation (BabyAI), embodied tasks, scientific simulation (SciWorld), and digital games (TextCraft), each exposed via a uniform multi-turn agent API (Xi et al., 6 Jun 2024).
2. Instruction, Trajectory, and Benchmarking Databases
Both R2E-Gym and AgentGym include expansive databases:
- Instructions: Natural language instructions for each environment/task, typically expanded via LLM self-instruct methods for breadth and diversity.
- Trajectories: High-quality (expert LLM or human-annotated) trajectories in a unified format: sequences of agent observations, actions, and intermediate thoughts, explicitly paired with outcome rewards (an illustrative record follows this list).
- Benchmark Suite: AgentEval, a held-out set of challenging instructions and environments, standardized for cross-model evaluation; primary metrics are success rate or normalized reward.
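A minimal example of what one record in this unified trajectory format could look like is shown below; the field names and values are illustrative assumptions rather than the exact AgentGym schema.

```python
# Illustrative trajectory record in a ReAct-style unified format; field names
# are assumptions for this sketch, not the exact AgentGym schema.
trajectory = {
    "environment": "webshop",
    "instruction": "Find a pair of running shoes under $50 and add them to the cart.",
    "steps": [
        {
            "observation": "WebShop search page ...",
            "thought": "I should search for running shoes within the budget.",
            "action": "search[running shoes under $50]",
        },
        {
            "observation": "Results: item B07... $39.99 ...",
            "thought": "The first result fits the budget; open it.",
            "action": "click[item B07...]",
        },
    ],
    "reward": 1.0,   # outcome reward paired with the full trajectory
}
```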
Quantitative summaries of the environment and data scale include:
| Environment | Tasks | Instructions | Benchmark Items | Trajectories | Avg. Max Steps |
|---|---|---|---|---|---|
| WebShop | 1 | 6,910 | 200 | 1,000 | 10 |
| ALFWorld | 6 | 3,827 | 200 | 500 | 30 |
| SciWorld | 30 | 2,320 | 200 | 1,000 | 30 |
| BabyAI | 40 | 900 | 90 | 400 | 20 |
| TextCraft | 1 | 544 | 100 | 300 | 20 |
| BIRD | 1 | 3,200 | 200 | 2,000 | 1 |
3. Agent Training and Self-Evolution Protocols
AgentGym implements several agent learning paradigms, each tailored to the limitations of standard imitation or RL techniques.
Supervised Behavioral Cloning
Agents are initialized via supervised fine-tuning on collected expert trajectories, maximizing

$$\mathcal{J}_{\mathrm{BC}}(\theta) = \mathbb{E}_{(u,\tau)\sim\mathcal{D}_s}\Big[\sum_{t}\log \pi_\theta(h_t, a_t \mid u, o_1, h_1, a_1, \ldots, o_t)\Big],$$

where $(h_t, a_t)$ are the thought-action pairs in the agent's trajectory $\tau$ for instruction $u$, $o_t$ are environment observations, $\mathcal{D}_s$ is the seed set of expert trajectories, and $\pi_\theta$ is the agent's parameterized policy (Xi et al., 6 Jun 2024).
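A minimal sketch of this behavioral-cloning objective, assuming a Hugging Face-style causal LM and a token-level mask that restricts the loss to agent-emitted thought/action tokens, is given below; the masking convention is an assumption of the sketch.

```python
import torch.nn.functional as F


def behavioral_cloning_loss(model, input_ids, loss_mask):
    """Negative log-likelihood over thought/action tokens only.

    input_ids: (B, T) tokenized ReAct-format conversations (instruction,
               observations, thoughts, actions).
    loss_mask: (B, T) 1 for tokens emitted by the agent (thoughts/actions),
               0 for instruction/observation tokens supplied by the environment.
    """
    logits = model(input_ids).logits                      # (B, T, V)
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = loss_mask[:, 1:].float()

    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)

    # Average only over agent-emitted tokens.
    return (nll * shift_mask).sum() / shift_mask.sum().clamp(min=1.0)
```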
AgentEvol: Inference-Inspired Self-Evolution
AgentEvol is a scalable reward-weighted learning process modeled as probabilistic inference. Each iteration alternates:
- Exploration: Generating new trajectories under the current policy on unseen instructions, storing observation-action pairs with observed rewards.
- Learning: Updating the agent policy to maximize the reward-weighted log-likelihood over merged new and seed trajectory datasets, $\theta_{m+1} = \arg\max_{\theta}\,\mathbb{E}_{(u,\tau)\sim\mathcal{D}_m\cup\mathcal{D}_s}\big[\,r(\tau)\,\log\pi_\theta(\tau\mid u)\,\big]$, where $\mathcal{D}_m$ is the trajectory set collected at iteration $m$, $\mathcal{D}_s$ the seed (expert) set, and $r(\tau)$ the observed trajectory reward.
This process empirically improves agent generalization, raising success rates across diverse environments beyond behavioral cloning upper bounds (Xi et al., 6 Jun 2024).
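A minimal sketch of one explore/learn cycle under these definitions follows; the `env.rollout` and `sft_step` helpers are hypothetical placeholders standing in for trajectory sampling and reward-weighted supervised updates.

```python
def agentevol_iteration(policy, seed_trajs, instructions, env, sft_step, n_samples=4):
    """One explore/learn cycle of reward-weighted self-evolution (illustrative).

    policy     : current agent, rolled out by the environment wrapper
    seed_trajs : expert trajectories D_s, each a (trajectory, reward) pair
    sft_step   : hypothetical helper running one weighted supervised update
    """
    # Exploration: roll out the current policy on (possibly unseen) instructions.
    new_trajs = []
    for u in instructions:
        for _ in range(n_samples):
            traj, reward = env.rollout(policy, u)     # hypothetical rollout API
            new_trajs.append((traj, reward))

    # Learning: reward-weighted log-likelihood over merged new + seed data.
    merged = seed_trajs + new_trajs
    for traj, reward in merged:
        if reward > 0:                                # zero-reward samples contribute nothing
            sft_step(policy, traj, weight=reward)     # weight log-likelihood by reward

    return policy
```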
RL with ScalingInter-RL Curriculum
AgentGym-RL frames each task as a POMDP and applies on-policy RL algorithms (PPO, GRPO, REINFORCE++, RLOO) with a horizon-scaling schedule:
- Early training caps interaction length to encourage exploitation (short reasoning chains, basic competence).
- The horizon is incrementally increased per phase, forcing exploration of longer reasoning chains, and supporting complex strategy acquisition.
Formally, for phase $k$ the maximum interaction horizon is capped at $H_k$, with $H_1 < H_2 < \cdots < H_K$ increasing monotonically across phases, so the policy is optimized under progressively longer rollouts.
Empirical evaluation demonstrates substantial gains in stability and final success rates compared to fixed-horizon RL schedules (Xi et al., 10 Sep 2025).
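A minimal sketch of such a horizon-scaling schedule is shown below; the phase boundaries and horizon values are illustrative, and only the monotone increase across phases reflects the method described above.

```python
def scaling_inter_horizon(step: int, phase_steps=(2000, 4000, 6000),
                          horizons=(8, 16, 32, 64)) -> int:
    """Return the maximum number of agent-environment interaction turns allowed
    at a given training step. Phase boundaries and horizon values are illustrative;
    the schedule only needs to increase monotonically across phases."""
    for boundary, horizon in zip(phase_steps, horizons):
        if step < boundary:
            return horizon
    return horizons[-1]


# Example: early rollouts are truncated to 8 turns; late-phase rollouts may run
# for up to 64 turns before forced termination.
assert scaling_inter_horizon(500) == 8
assert scaling_inter_horizon(7000) == 64
```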
4. Test-Time Verification and Hybrid Scoring
R2E-Gym introduces a two-axis, hybrid test-time scaling methodology for ranking and selecting agent outputs (patches):
- Execution-Based Verifier (EB): Runs agent-generated patches against the automatically generated reproduction tests and regression tests, returning a “TestScore” for each candidate. Robustness is limited by low distinguishability (≤20% of tests separate correct vs. incorrect patches) and test-set toxicity (up to 10% of tests favor incorrect outputs).
- Execution-Free Verifier (EF): Uses a fine-tuned LLM to assign probabilities $p_{\mathrm{YES}}$ and $p_{\mathrm{NO}}$ to the tuple (issue, trajectory, patch); candidates are ranked by the normalized score $s_{\mathrm{EF}} = p_{\mathrm{YES}}/(p_{\mathrm{YES}} + p_{\mathrm{NO}})$. While better at ranking, this method is susceptible to heuristics and stylistic bias.
- Hybrid Verifier (H): Combines the above by filtering to the top-$n$ candidates under the EF score and computing EB test signals only for these most promising patches, which are then used to select the final output (see the sketch following the results table below).
This hybridization yields significant improvement: on SWE-Bench-Verified, the hybrid achieves Best@26 of 51.0%, outperforming the individual verifiers (≈43% each) and establishing state-of-the-art open-weight performance (Jain et al., 9 Apr 2025).
| Verifier | Best@1 | Best@16 | Best@26 |
|---|---|---|---|
| Execution-Free | 19.1% | 40.2% | 42.8% |
| Execution-Based | 20.3% | 41.5% | 43.7% |
| Hybrid (n=4) | 21.6% | 47.6% | 51.0% |
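A minimal sketch of the hybrid selection procedure referenced above is given below; the scoring interfaces are hypothetical, and the combination rule (rank by EF, then rerank the top-$n$ finalists with execution signals) is one plausible reading of the description rather than the exact published algorithm.

```python
def hybrid_select(candidates, ef_score, eb_testscore, n=4):
    """Select a final patch from agent-generated candidates (illustrative).

    candidates   : list of patch objects
    ef_score     : patch -> float, execution-free (LLM verifier) score
    eb_testscore : patch -> float, fraction of reproduction/regression tests passed
    n            : how many EF-top candidates receive (expensive) test execution
    """
    # Stage 1: cheap LLM-based ranking over all candidates.
    ranked = sorted(candidates, key=ef_score, reverse=True)

    # Stage 2: run tests only on the n most promising candidates and rerank
    # among them using execution signals, falling back to EF score for ties.
    finalists = ranked[:n]
    return max(finalists, key=lambda p: (eb_testscore(p), ef_score(p)))
```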
5. Experimental Validation and Comparative Results
Extensive benchmarks demonstrate the generality and efficacy of AgentGym-trained agents.
- On SWE-Bench-Verified, R2E-Gym's 32B model with hybrid verifier achieves Pass@1 of 34.4% and Best@26 of 51.0%, matching or exceeding the agentless o1 pipeline (48%) and Claude-3.6 pipelines (50.8%). Gains over prior open-weight SWE agents are 8–14 points absolute, confirming the contribution of scalable procedural data and hybrid verification (Jain et al., 9 Apr 2025).
- Across 27 AgentGym-RL tasks, RL-trained open-source 7B models attain an average composite success rate of ~58.6%, significantly outperforming much larger models such as Llama-3.1-70B (~47%) and matching closed-source baselines (e.g., Gemini 2.5-Pro, OpenAI o3) (Xi et al., 10 Sep 2025).
- ScalingInter-RL and AgentEvol protocols yield additional performance and sample-efficiency improvements, with self-evolving agents exceeding behavioral cloning upper bounds by up to 8 percentage points in some environments (e.g., BabyAI, WebShop) (Xi et al., 6 Jun 2024).
6. Open-Source Resources and Implementation
All AgentGym components—full codebases, environment Docker recipes, pretrained models (Qwen-2.5-Coder 7B/14B/32B), testing agents, execution-free verifiers, agent trajectories, and benchmarking tools—are released under permissive licenses (e.g., Apache 2.0) (Jain et al., 9 Apr 2025, Xi et al., 10 Sep 2025, Xi et al., 6 Jun 2024). Detailed instructions, reproducibility scripts, prompt templates, and extensibility utilities are included; new environments can be registered by implementing the base API. The primary repositories are linked from the corresponding publications.
7. Context, Limitations, and Prospects
AgentGym/R2E-Gym distinguishes itself from prior work through its scale, modularity, and support for both domain-specific (SWE) and domain-general agent training. It addresses prior bottlenecks in environment curation and test-time compute by:
- Standardizing procedural, back-translation-based task construction.
- Providing both reward-weighted self-evolution and curriculum RL.
- Enabling hybrid test-time verification leveraging LLMs and programmatic test execution.
Limitations persist: distinct verifiers exhibit bias or low distinguishability; coverage gaps may exist in the procedural task set; and long-horizon RL remains sample-inefficient for particular skills and strategic reasoning (Jain et al., 9 Apr 2025, Xi et al., 10 Sep 2025, Xi et al., 6 Jun 2024). Ongoing work explores finer-grained trajectory selection, environment/task diversification, and more robust hybridization of evaluation signals.
AgentGym provides a foundation for unified, reproducible, and extensible research on generally capable, self-improving agents, setting new performance baselines and methodological standards in agentic LLM research.