
AgentGym: Evolving LLM-Based Agents

Updated 30 October 2025
  • AgentGym is a modular research framework that standardizes training, evolution, and benchmarking of LLM-based agents across various environments and tasks.
  • It employs a unified HTTP-based architecture and supports self-evolution through a two-phase process combining behavioral cloning with reward-weighted updates.
  • The framework provides extensive, openly available datasets and benchmarks, establishing a reproducible basis for multi-domain agent research.

AgentGym is a research framework for training, evolving, and benchmarking LLM-based agents across a diverse spectrum of environments and tasks. Architected to move beyond the limitations of step-by-step imitation and single-environment specialization, AgentGym provides an open, extensible infrastructure and methodological protocol for developing generalist agents with the capacity for self-evolution through reward-guided learning. The framework standardizes real-time, concurrent, multi-environment agent exploration and supports systematic evaluation via its benchmark suite, providing openly released data, code, and model checkpoints for the broader research community (Xi et al., 6 Jun 2024).

1. Framework Architecture and Modularity

AgentGym implements a modular design grounded in HTTP-based environment service orchestration, enabling seamless isolation and parallelization of agent–environment interactions. Each of the 14 supported environments is encapsulated as an independent, reproducible service implementing a strict API with endpoints for instantiation (/createEnv), observation (/observation), action stepping (/step), and reset. An AgentController layer mediates agent requests, enforces standardized ReAct-style trajectory formatting, and provides multi-round, real-time API feedback for agent adaptation.

The environment interface is unified across all domains, allowing agents to be implemented, evaluated, and evolved on a single codebase regardless of task semantics. A client SDK further abstracts HTTP communication, presenting an environment-agnostic agent interaction workflow for developers. This modularity supports arbitrary expansion to new domains that conform to the API.
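To illustrate the interaction pattern, the following is a minimal Python sketch of a client talking to one environment service over HTTP. The host address, JSON field names, and response schema are assumptions for exposition; the actual AgentGym client SDK wraps these calls and may use different payloads.

```python
# Minimal sketch of a client for an AgentGym-style environment service.
# The host/port, JSON field names, and response schema are assumptions for
# exposition; the real client SDK wraps these calls and may differ.
import requests

BASE_URL = "http://localhost:8000"  # assumed address of one environment service

def create_env(task_id: int) -> str:
    """Instantiate a new environment session and return its id (assumed field name)."""
    resp = requests.post(f"{BASE_URL}/createEnv", json={"id": task_id})
    resp.raise_for_status()
    return resp.json()["env_id"]

def observe(env_id: str) -> str:
    """Fetch the current textual observation."""
    resp = requests.get(f"{BASE_URL}/observation", params={"env_id": env_id})
    resp.raise_for_status()
    return resp.json()["observation"]

def step(env_id: str, action: str) -> dict:
    """Send one ReAct-style action string; receive observation, reward, and done flag."""
    resp = requests.post(f"{BASE_URL}/step", json={"env_id": env_id, "action": action})
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    env_id = create_env(task_id=0)
    print(observe(env_id))
    print(step(env_id, "Thought: inspect the catalog.\nAction: search[red backpack]"))
```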

2. Environment Suite, Task Coverage, and Data Standardization

AgentGym currently supports 14 environments spanning 89 task types, including internet/web navigation (WebShop, WebArena), embodied/household simulations (ALFWorld, BabyAI), digital/text games (TextCraft, MAZE, Wordle), scientific/logic reasoning (SciWorld), tool use (Weather, Movie, Academia, Sheet, TODOList), and database/programming (BIRD). Each environment specifies a finite set of instruction-driven tasks with exposed step/reward mechanics.

The framework ships with a large instruction collection (20,509 diverse queries) constructed from original environment tasks, self-instruct expansion, and crowdsourcing. All agent–environment interactions are stored as uni-format trajectories following an explicit Thought–Action–Observation loop per step, as defined by the ReAct protocol. The AgentTraj learning set contains 6,130 expert or SOTA-agent trajectories for initial behavioral cloning, and AgentTraj-L (14,485 trajectories) provides a larger pool for upper-bound analysis. For systematic assessment, the AgentEval benchmark suite contains 1,160 curated evaluation tasks spanning all environments.

Environment Class    | Example Envs       | No. Tasks | Instructions | Trajectories
Web Navigation       | WebShop, WebArena  | 2         | ~7k          | >4k
Embodied/Household   | ALFWorld, BabyAI   | 46        | ~4.7k        | >3.2k
Digital/Text Games   | TextCraft, MAZE    | 3         | ~1.9k        | >1.6k
Scientific Reasoning | SciWorld           | 1         | ~1k          | >800
Tools, Programming   | Weather, BIRD      | 37        | ~3k          | >2k
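To make the uni-format trajectory concrete, the following is an illustrative Python sketch of one trajectory record. The field names are assumptions chosen to mirror the Thought–Action–Observation loop, not the exact keys used in the released AgentTraj/AgentEval data.

```python
# Illustrative sketch of a uni-format ReAct-style trajectory record. Field
# names are assumptions chosen to mirror the Thought-Action-Observation loop,
# not the exact keys used in the released AgentTraj/AgentEval data.
from dataclasses import dataclass, field
from typing import List

@dataclass
class StepRecord:
    thought: str      # the agent's reasoning at this step
    action: str       # the action string sent to the environment
    observation: str  # the environment's response

@dataclass
class Trajectory:
    environment: str                                    # e.g. "webshop"
    instruction: str                                    # the task instruction u
    steps: List[StepRecord] = field(default_factory=list)
    reward: float = 0.0                                 # task-level reward r(e, u, tau)

example = Trajectory(
    environment="webshop",
    instruction="Buy a red backpack under $50.",
    steps=[StepRecord(
        thought="I should search for red backpacks first.",
        action="search[red backpack]",
        observation="Results: 10 items ...",
    )],
    reward=1.0,
)
```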

3. Real-Time, Concurrent, and Uni-Format Agent Interaction

All environment communication and data logging in AgentGym are standardized. Environments run as isolated services that support parallel, concurrent agent exploration: multiple environments can be instantiated and stepped in real time, allowing an agent to interact with several distinct tasks simultaneously. The trajectory format is uniform (Thought/Action/Observation), which facilitates batch learning, replay, and multi-task evaluation.

Online and offline policy updates, data collection, and continuous evaluation are supported in both real and simulated time, enabling studies of synchronization effects, rapid environment switching, and cross-task transferability.
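As an illustration of concurrent exploration, the sketch below steps several hypothetical environment services in parallel with a thread pool. The service addresses, payload fields, and the stubbed policy are illustrative assumptions, not the framework's actual interface.

```python
# Sketch of concurrent rollouts against several environment services, each
# assumed to run on its own port. Endpoints, payload fields, and the stubbed
# policy are illustrative assumptions, not the framework's actual interface.
from concurrent.futures import ThreadPoolExecutor
import requests

SERVICES = {
    "webshop":   "http://localhost:8001",
    "alfworld":  "http://localhost:8002",
    "textcraft": "http://localhost:8003",
}

def propose_action(env_name: str, observation: str) -> str:
    """Stub policy; in practice this queries the LLM agent."""
    return "Thought: keep exploring.\nAction: look"

def rollout(name: str, base_url: str, max_steps: int = 10) -> dict:
    """Run one episode against a single environment service."""
    env_id = requests.post(f"{base_url}/createEnv", json={"id": 0}).json()["env_id"]
    total_reward = 0.0
    for _ in range(max_steps):
        obs = requests.get(f"{base_url}/observation",
                           params={"env_id": env_id}).json()["observation"]
        out = requests.post(f"{base_url}/step",
                            json={"env_id": env_id,
                                  "action": propose_action(name, obs)}).json()
        total_reward += out.get("reward", 0.0)
        if out.get("done"):
            break
    return {"environment": name, "reward": total_reward}

with ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
    results = list(pool.map(lambda kv: rollout(*kv), SERVICES.items()))
print(results)
```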

4. The AgentEvol Self-Evolution Algorithm

AgentEvol is the framework’s core method for enabling and analyzing self-evolving, generalist agents across environments and unseen tasks. The algorithm operates as an iterative two-phase process emphasizing both efficient bootstrapping and scalable reward-based self-improvement:

4.1. Behavioral Cloning Bootstrap

A base agent is first trained using classic behavioral cloning on the AgentTraj dataset, optimizing a maximum-likelihood objective over next-thought and next-action tokens at each step:

$$\mathcal{J}_{\mathrm{BC}}(\theta) = \mathbb{E}_{(e,u,\tau) \sim \mathcal{D}_s} \left[ \log \pi_\theta(\tau \mid e, u) \right]$$

where $\tau$ is the standardized trajectory sequence and $\pi_\theta$ is the agent's policy.
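A minimal PyTorch/transformers sketch of this objective is shown below: the trajectory tokens are supervised with a standard causal-LM loss while the prompt tokens are masked out. The backbone name and the simple prompt/trajectory split are illustrative assumptions, and token-boundary alignment is handled only approximately.

```python
# Minimal PyTorch/transformers sketch of the behavioral-cloning objective:
# maximize the log-likelihood of expert trajectory tokens given the task prompt.
# The backbone name is an illustrative assumption; any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed backbone for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def bc_loss(prompt: str, trajectory: str) -> torch.Tensor:
    """Negative log-likelihood of the trajectory tokens conditioned on the prompt.

    Prompt tokens are masked with label -100, so minimizing the returned value
    maximizes J_BC (up to averaging). Token-boundary alignment between prompt
    and trajectory is handled only approximately in this sketch.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + trajectory, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # supervise only thought/action tokens
    return model(input_ids=full_ids, labels=labels).loss

loss = bc_loss(
    prompt="Task: buy a red backpack under $50.\n",
    trajectory="Thought: search for red backpacks first.\nAction: search[red backpack]\n",
)
loss.backward()
```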

4.2. Self-Evolution via Reward-Weighted Learning

AgentEvol proceeds by iteratively alternating between exploration (trajectory collection by executing the current agent on randomly sampled new, possibly unseen tasks in each environment) and learning (reward-weighted supervised updates):

  • Trajectories $\tau^j$ for tasks $u^j$ in environment $e$ are collected into $\mathcal{D}_m^e$.
  • Each trajectory receives a reward $r(e, u, \tau)$ based on task-specific success criteria or dense reward functions.
  • All new data $\mathcal{D}_m$ is merged with the existing data $\mathcal{D}_s$, and the next iteration optimizes the reward-weighted log-likelihood (a minimal code sketch of this weighted objective follows the list):

$$\mathcal{J}_{\mathrm{Evol}}(\theta) = \mathbb{E}_{(e,u,\tau) \sim \mathcal{D}_m} \left[ r(e, u, \tau) \log \pi_\theta(\tau \mid e, u) \right]$$

This formulation casts policy improvement as probabilistic inference with a lower-bound objective on off-policy data, avoiding the instability and sample inefficiency of high-variance policy-gradient RL.
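The sketch below implements the reward weighting, reusing the masking pattern of the behavioral-cloning sketch above; batching, padding, normalization, and the merge with prior data are omitted, and the helper signature is an assumption.

```python
# Minimal sketch of the reward-weighted update: each trajectory's token-level
# negative log-likelihood is scaled by its scalar reward r(e, u, tau).
# Reuses the masking pattern of the bc_loss sketch above; batching, padding,
# and normalization are omitted for brevity.
import torch

def evol_loss(batch, model, tokenizer) -> torch.Tensor:
    """batch: iterable of (prompt, trajectory, reward) triples."""
    weighted = []
    for prompt, trajectory, reward in batch:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + trajectory, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100      # mask the prompt tokens
        nll = model(input_ids=full_ids, labels=labels).loss
        weighted.append(reward * nll)                # weight by r(e, u, tau)
    return torch.stack(weighted).mean()              # minimizing this maximizes J_Evol
```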

Full learning procedure pseudocode:

For iteration m = 1...M:
  1. Exploration: for each environment e, sample tasks u^j, execute the current agent π_θ^m, and collect trajectories (e, u^j, τ^j) with rewards r(e, u^j, τ^j)
  2. Aggregation: merge the new trajectories with all previously collected data
  3. Learning: update the parameters to θ^(m+1) by maximizing the reward-weighted trajectory likelihood
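The following schematic Python sketch ties the loop together. Every helper is a stub standing in for rollout, reward computation, and the reward-weighted update described above; none of them is AgentGym's actual API.

```python
# Schematic, runnable sketch of the AgentEvol outer loop. Every helper below is
# a stub standing in for rollout, reward computation, and the reward-weighted
# update described in Sections 4.1-4.2; none of them is AgentGym's actual API.
import random

def sample_tasks(env: str, k: int):
    """Stub: draw k task instructions for an environment."""
    return [f"{env}-task-{i}" for i in range(k)]

def collect_trajectory(policy, env: str, task: str) -> str:
    """Stub: execute the current agent and return a Thought/Action/Observation trace."""
    return f"trace of {policy} on {task}"

def evaluate_reward(env: str, task: str, trajectory: str) -> float:
    """Stub: task-specific success or dense reward."""
    return random.random()

def train_on(policy, data):
    """Stub: reward-weighted supervised update over all collected data."""
    return policy

def agent_evol(policy, environments, tasks_per_env=4, num_iterations=3, data=None):
    data = list(data or [])            # in practice initialized with the BC set D_s
    for m in range(num_iterations):
        # 1. Exploration: roll out the current policy on newly sampled tasks.
        for env in environments:
            for task in sample_tasks(env, tasks_per_env):
                traj = collect_trajectory(policy, env, task)
                reward = evaluate_reward(env, task, traj)
                data.append((env, task, traj, reward))
        # 2.-3. Aggregate with prior data, then run the reward-weighted update.
        policy = train_on(policy, data)
    return policy

agent_evol(policy="bc-initialized-agent", environments=["webshop", "alfworld"])
```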

The base agent is thus continually evolved to increase its general capacity, adapt to previously unseen instructions, and improve efficiency and success rate across a distribution of environments and tasks.

5. Experimental Protocols and Empirical Analysis

AgentGym benchmarking follows a rigorous protocol. Agents (both open-weight, such as Llama-2 Chat, and closed/commercial models, such as GPT-4-Turbo) are evaluated on the AgentEval benchmark across 11 environment types. Three key variants are compared:

  • BC-base: behavioral cloning on AgentTraj only.
  • BC-large: behavioral cloning on the larger AgentTraj-L.
  • AgentEvol: the result of self-evolution as described above.

Key findings:

  • AgentEvol outperforms both commercial and open baselines on most environments, and on several it even exceeds the behavioral-cloning upper bound set by BC-large.
  • Gains are especially pronounced on tasks with high distributional shift from the original instruction/trajectory set, demonstrating genuine generalization.
  • Case studies and ablations confirm the importance of curriculum-like evolution: diverse instruction sampling, dense reward feedback, and iterative learning.

Sample results (success rate, %) show AgentEvol surpassing GPT-4-Turbo, AgentLM-70B, and even behavioral cloning on the full AgentTraj-L data (BC-large) on most domains, e.g., WebShop, ALFWorld, and TextCraft.

6. Platform Release and Research Impact

The AgentGym suite is openly released and comprises:

  • Complete platform codebase with all 14 environment services and the AgentController interface.
  • The instruction and trajectory dataset, including AgentTraj, AgentTraj-L, and the AgentEval benchmark.
  • Pretrained checkpoints for the BC-base, BC-large, and AgentEvol models on multiple LLM backbones.
  • Algorithm implementations (behavioral cloning, AgentEvol), evaluation scripts, and explicit data formats.

The release provides a standardized foundation for benchmarking, comparison, and further development of generalist agent methodologies in the LLM agent research community, facilitating reproducibility and extensibility.

7. Scientific Contributions and Methodological Significance

AgentGym introduces the first unified framework for evolving LLM-based generalist agents that learn and transfer skills across a spectrum of environments through standardized interfaces, enabling robust multi-domain agent exploration and optimization at scale. The AgentEvol algorithm advances reward-based agent evolution through scalable, probabilistic, reward-weighted supervised learning on off-policy, multi-environment data under either binary or dense reward regimes.

This design establishes a new empirical and infrastructural baseline for generalist agent research and supports precise technical comparison across learning paradigms, model sizes, and environment/task structures (Xi et al., 6 Jun 2024).

References

Xi, Z., et al. (6 Jun 2024). AgentGym: Evolving Large Language Model-based Agents across Diverse Environments. arXiv:2406.04151.
