AgentGym-RL: Scalable Multi-turn LLM Training
- AgentGym-RL is a modular reinforcement learning framework that trains LLM agents for multi-turn, long-horizon tasks in diverse, realistic settings.
- It features a decoupled architecture integrating environment servers, LLM-based agent modules, and a distributed training engine supporting algorithms like PPO and REINFORCE++.
- The ScalingInter-RL approach progressively extends interaction horizons to balance exploration versus exploitation, enhancing decision-making stability and diversity.
AgentGym-RL is a modular, scalable reinforcement learning (RL) framework designed to train LLM agents for multi-turn, long-horizon decision making in diverse, realistic environments. Targeting agentic intelligence beyond supervised fine-tuning, it unifies environment orchestration, agent decision-making, and RL-based optimization within a flexible, extensible architecture. A core methodological innovation is the ScalingInter-RL approach, which progressively increases the agents’ allowed interaction horizon to balance exploitation and exploration—addressing stability and behavioral diversity as agents tackle complex, real-world tasks (Xi et al., 10 Sep 2025).
1. Modular and Decoupled Framework Architecture
AgentGym-RL’s architecture is organized around three principal modules:
- Environment Module: Each environment is packaged as an independent server (service) with an HTTP-based API, enabling parallel rollouts and reproducible, decoupled interactions (a client-side sketch follows this list). The provided suite spans five scenario categories: Web Navigation (e.g., WebArena tasks), Deep Search (search-engine integration with multi-hop QA), Digital Games (TextCraft, a simplified Minecraft-like text world), Embodied Tasks (e.g., grid-world navigation with BabyAI), and Scientific Tasks (SciWorld, mimicking laboratory experiment procedures). Each environment exposes a standardized API for reset, step, observation, and reward retrieval.
- Agent Module: The LLM-based agent receives observations from the environment, generates natural language or API actions, and maintains internal state for multi-turn reasoning, plan execution, reflection, or correction. AgentGym-RL supports agent behaviors ranging from direct action selection to long-horizon deliberation and recovery from failures.
- Training Module: A unified RL training engine supporting online and offline modes, multiple parallel processes/nodes, a diagnostic subsystem (for metrics such as policy entropy, KL divergence, and reward curves), and native compatibility with major RL algorithms (PPO, GRPO, REINFORCE++, RLOO). Training proceeds by collecting and buffering trajectories from parallel rollouts, estimating gradients, and updating the policy across distributed agents.
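To make the environment interface concrete, the following is a minimal client-side sketch of how an agent process might talk to an environment server over HTTP. The endpoint names, payload fields, and the `EnvClient` class are illustrative assumptions, not the actual AgentGym-RL API.

```python
import requests

class EnvClient:
    """Minimal HTTP client for an environment server (illustrative only;
    the real AgentGym-RL endpoints and payloads may differ)."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def reset(self, task_id: int = 0) -> dict:
        # Start a new episode and return the initial observation
        resp = requests.post(f"{self.base_url}/reset", json={"task_id": task_id})
        return resp.json()

    def step(self, action: str) -> dict:
        # Send a natural-language or API action; receive observation, reward, done
        resp = requests.post(f"{self.base_url}/step", json={"action": action})
        return resp.json()
```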
This architecture separates agent logic from environment code, simplifying extensibility and reproducibility. The server-client design allows rapid integration of new tasks, datasets, or agent variants.
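Building on the client sketch above, a multi-turn rollout can be expressed as a simple loop in which the LLM-based agent maps the accumulated interaction history to the next action. Here `llm_generate` is a stand-in for the agent module's language-model call, not an AgentGym-RL function, and `env` is any object exposing `reset`/`step` as sketched above.

```python
def collect_trajectory(llm_generate, env, task_id: int, max_turns: int):
    """Roll out one multi-turn episode against an environment client."""
    obs = env.reset(task_id)["observation"]
    history = [{"role": "user", "content": obs}]
    trajectory = []
    for _ in range(max_turns):
        action = llm_generate(history)               # natural-language or API action
        result = env.step(action)
        trajectory.append((obs, action, result["reward"]))
        obs = result["observation"]
        history.append({"role": "assistant", "content": action})
        history.append({"role": "user", "content": obs})
        if result.get("done"):
            break
    return trajectory
```

Rollouts collected this way, one per parallel environment client, are the kind of trajectories buffered by the training module described above.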
2. Supported Reinforcement Learning Algorithms
AgentGym-RL integrates several mainstream RL algorithms, adapted for LLM agents:
| Algorithm | Key Properties | Usage Context |
|---|---|---|
| PPO | Policy gradient with clipped surrogate objective | Primary algorithm for stability |
| GRPO | PPO variant with reward-group normalization | Handling action heterogeneity |
| REINFORCE++ | REINFORCE with PPO-style clipping and KL penalties | Default for high-variance tasks |
| RLOO | Average-reward (leave-one-out) baseline for variance reduction | Additional variant |
All methods optimize the canonical expected-return objective

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big],$$

with policy gradients

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]$$

and learning-rate–controlled updates $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$.
These algorithms are adapted to support LLM agents, which require sampling and optimization over natural language actions and multi-turn trajectories. This setup enables direct online RL optimization, bypassing reliance on supervised fine-tuning (SFT).
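As an illustration of how the clipped surrogate objective looks when applied to sampled action log-probabilities, the sketch below computes a PPO-style loss. It is a generic textbook formulation under assumed tensor shapes, not AgentGym-RL's exact loss code.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss over a batch of (token- or turn-level) actions.
    All tensors share the same shape; advantages are assumed precomputed."""
    ratio = torch.exp(logp_new - logp_old)                        # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximizing the clipped surrogate == minimizing its negation
    return -torch.min(unclipped, clipped).mean()
```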
3. ScalingInter-RL: Progressive Interaction-Scaling
ScalingInter-RL is a curriculum-based RL training scheme addressing the exploration–exploitation trade-off in long-horizon environments. The core principle is:
- Early Training: Constrain the agent to a short maximal interaction horizon (the number of allowed environment steps per episode), forcing efficient exploitation and skill acquisition.
- Progressive Expansion: As the agent achieves satisfactory performance, increment the horizon ($h_i \to h_{i+1}$) according to a predefined schedule, periodically after every $\Delta$ training steps.
- Late Training: Longer horizons facilitate diverse exploratory behaviors, planning over extended decision sequences, and the development of robust recovery strategies.
Formally, in training phase $i$, trajectories are limited to $h_i$ interaction steps per episode. The horizon increment schedule is monotonic ($h_1 \le h_2 \le \cdots$), with increments tuned to the task domain. This minimizes premature policy collapse, encourages gradual discovery of long-term dependencies, and improves both convergence and generalization.
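A minimal sketch of such a schedule is shown below; the initial horizon, cap, and increment values are illustrative placeholders rather than the paper's tuned settings.

```python
def horizon_for_step(step: int, h_init: int = 5, h_max: int = 30,
                     delta_steps: int = 100, delta_h: int = 5) -> int:
    """Monotonic interaction-horizon curriculum: start short, grow by delta_h
    every delta_steps training steps, and cap at h_max."""
    return min(h_init + (step // delta_steps) * delta_h, h_max)

# Example: horizon_for_step(0) == 5, horizon_for_step(250) == 15, horizon_for_step(10_000) == 30
```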
4. Empirical Performance Across Domains
AgentGym-RL and the ScalingInter-RL approach have been extensively evaluated on 27 tasks spanning five environment categories:
- Web Navigation: Using WebArena, RL-trained agents (AgentGym-RL-7B, ScalingInter-7B) achieve accuracies competitive with state-of-the-art commercial models such as GPT-4o and Gemini-2.5-Pro, outperforming proprietary models on certain subtasks (e.g., shopping, CMS).
- Deep Search: On QA datasets such as Natural Questions, TriviaQA, PopQA, and HotpotQA, ScalingInter-7B exceeds the scores of all major open-source baselines and is comparable to top-tier closed-source models.
- Digital Games (TextCraft): RL agents set new best results for intermediate crafting tree depths, with nonzero completions on difficult (depth-6) instances—exceeding or matching commercial models.
- Embodied Tasks (BabyAI): RL agents display significant navigation improvements, achieving accuracies on par with leading benchmarks.
- Scientific Tasks (SciWorld): RL agents demonstrate strong gains on scientific reasoning and procedural tasks, though some subdomains (e.g., Chem-Mix) remain challenging for all agent types.
Key findings include (i) RL-trained models surpassing similar-sized SFT baselines, (ii) narrow performance gaps with much larger proprietary LLMs, and (iii) scaling interaction horizon and compute (at test time and during post-training) sometimes yielding higher returns than scaling model size alone.
5. Pseudocode, Implementation, and Training Protocol
A high-level pseudocode outline is:
```python
initialize_policy_params(theta)
h = h_1  # start with a short interaction horizon
training_step = 0

while not converged:
    trajectories = []
    for env_client in parallel_environment_clients:
        # Interact up to the currently allowed horizon
        trajectories.append(collect_trajectory(policy=theta, env=env_client, max_length=h))

    # Estimate the policy gradient from the batch and update the policy
    grad_theta = estimate_policy_gradient(trajectories)
    theta = theta + alpha * grad_theta

    # Curriculum: expand the interaction horizon periodically
    training_step += 1
    if training_step % delta == 0:
        h += delta_h
```
Agents interact with environments via batched server–client APIs, collecting rollouts truncated at the current horizon $h_i$. Collected trajectories drive RL updates (typically policy gradient with PPO or a variant) via distributed, batched gradient ascent. Policy-entropy regularization, KL-divergence monitoring, and reward-curve logging are incorporated into the diagnostic subsystem.
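The diagnostic quantities mentioned above can be estimated directly from the sampled log-probabilities. The following is a small sketch using a common approximate-KL estimator, not the framework's actual diagnostic code.

```python
import torch

def policy_diagnostics(logp_new: torch.Tensor, logp_old: torch.Tensor) -> dict:
    """Monte Carlo estimates of KL divergence and policy entropy from
    log-probs of the sampled actions under the new and old policies."""
    log_ratio = logp_new - logp_old
    # "k3" estimator for KL(pi_old || pi_new): E[exp(r) - 1 - r], with r = log ratio,
    # averaged over actions sampled from the old policy
    approx_kl = (torch.exp(log_ratio) - 1.0 - log_ratio).mean()
    entropy = -logp_new.mean()   # sampled-action entropy estimate
    return {"approx_kl": approx_kl.item(), "entropy": entropy.item()}
```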
6. Practical Applications and Extensibility
AgentGym-RL is expressly designed for:
- Training LLM agents in highly diverse, realistic environments without SFT dependencies
- Multi-turn, long-horizon decision making, including web navigation, embodied reasoning, game playing, and procedural scientific discovery
- Fair benchmarking of RL algorithms and agent architectures in an environment-agnostic, reproducible experimental setting
- Rapid integration of new environments owing to the modular server–client architecture and unified API
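As an illustration of how a new environment could be wrapped behind the server–client interface, the sketch below exposes toy reset/step endpoints with FastAPI. The endpoint names, payload schema, and reward logic are invented for this example and do not reflect the actual AgentGym-RL environment protocol.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
state = {"steps": 0}  # toy single-episode state, for illustration only

class ResetRequest(BaseModel):
    task_id: int = 0

class StepRequest(BaseModel):
    action: str

@app.post("/reset")
def reset(req: ResetRequest):
    # Start a new episode and return the initial observation
    state["steps"] = 0
    return {"observation": "You are in a text world. Goal: craft a stick."}

@app.post("/step")
def step(req: StepRequest):
    # Apply the agent's action; return observation, reward, and termination flag
    state["steps"] += 1
    success = "craft stick" in req.action.lower()
    done = success or state["steps"] >= 10
    return {"observation": f"You tried: {req.action}", "reward": 1.0 if success else 0.0, "done": done}
```

Served with, e.g., `uvicorn toy_env:app`, such a server can be driven by a client like the `EnvClient` sketch in Section 1.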
Notably, under the ScalingInter-RL regime, even open-source 7B-scale models can approach or surpass proprietary models on complex tasks—a result with implications for efficient model scaling and compute allocation.
7. Open-Source Release and Community Impact
The complete AgentGym-RL framework—including full code, diagnostic tools, and curated datasets—is to be open-sourced to support community research and further development in agentic intelligence and LLM-driven reinforcement learning. Extensive experimental evidence demonstrates its effectiveness in stabilizing RL optimization, supporting behavioral diversity, and closing the gap with closed-source foundation models (Xi et al., 10 Sep 2025).
A plausible implication is that AgentGym-RL, due to its extensible architecture and systematic horizon curriculum, will become a standard benchmarking tool for future research on LLM-based multi-turn RL agents, agentic planning, and interactive decision making at scale.