
AgentGym-RL: Scalable Multi-turn LLM Training

Updated 11 September 2025
  • AgentGym-RL is a modular reinforcement learning framework that trains LLM agents for multi-turn, long-horizon tasks in diverse, realistic settings.
  • It features a decoupled architecture integrating environment servers, LLM-based agent modules, and a distributed training engine supporting algorithms like PPO and REINFORCE++.
  • The ScalingInter-RL approach progressively extends interaction horizons to balance exploration versus exploitation, enhancing decision-making stability and diversity.

AgentGym-RL is a modular, scalable reinforcement learning (RL) framework designed to train LLM agents for multi-turn, long-horizon decision making in diverse, realistic environments. Targeting agentic intelligence beyond supervised fine-tuning, it unifies environment orchestration, agent decision-making, and RL-based optimization within a flexible, extensible architecture. A core methodological innovation is the ScalingInter-RL approach, which progressively increases the agents’ allowed interaction horizon to balance exploitation and exploration—addressing stability and behavioral diversity as agents tackle complex, real-world tasks (Xi et al., 10 Sep 2025).

1. Modular and Decoupled Framework Architecture

AgentGym-RL’s architecture is organized around three principal modules:

  • Environment Module: Each environment is packaged as an independent server (service) with an HTTP-based API, enabling parallel rollouts and reproducible, decoupled interactions. The provided suite spans five scenario categories: Web Navigation (e.g., WebArena tasks), Deep Search (search engine integration with multi-hop QA), Digital Games (TextCraft, a simplified Minecraft-like text world), Embodied Tasks (e.g., grid-world navigation with BabyAI), and Scientific Tasks (SciWorld, mimicking laboratory experiment procedures). Each environment provides a standardized API for reset, step, observation, and reward retrieval; a minimal client-side sketch of this interface is shown below.
  • Agent Module: The LLM-based agent receives observations from the environment, generates natural language or API actions, and maintains internal state for multi-turn reasoning, plan execution, reflection, or correction. AgentGym-RL supports agent behaviors ranging from direct action selection to long-horizon deliberation and recovery from failures.
  • Training Module: A unified RL training engine supporting online and offline modes, multiple parallel processes/nodes, a diagnostic subsystem (for metrics such as policy entropy, KL divergence, and reward curves), and native compatibility with major RL algorithms (PPO, GRPO, REINFORCE++, RLOO). Training proceeds by collecting and buffering trajectories from parallel rollouts, estimating gradients, and updating the policy across distributed agents.

This architecture separates agent logic from environment code, simplifying extensibility and reproducibility. The server-client design allows rapid integration of new tasks, datasets, or agent variants.
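As a concrete illustration of the server-client split, the sketch below shows how an agent-side rollout worker might call a standardized environment server over HTTP. The endpoint names (/reset, /step), the payload fields, and the use of the requests library are assumptions for illustration; the actual AgentGym-RL API may differ.

import requests

class EnvClient:
    """Minimal HTTP client for a hypothetical AgentGym-RL environment server."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def reset(self, task_id: str) -> dict:
        # Start a new episode and return the initial observation.
        resp = requests.post(f"{self.base_url}/reset", json={"task_id": task_id})
        resp.raise_for_status()
        return resp.json()  # e.g., {"env_id": "...", "observation": "..."}

    def step(self, env_id: str, action: str) -> dict:
        # Send a natural-language or API action; receive observation, reward, and done flag.
        resp = requests.post(f"{self.base_url}/step",
                             json={"env_id": env_id, "action": action})
        resp.raise_for_status()
        return resp.json()  # e.g., {"observation": "...", "reward": 0.0, "done": False}

# Usage (assuming a locally hosted environment server):
# client = EnvClient("http://localhost:8000")
# state = client.reset(task_id="webarena_shopping_012")

Because every environment exposes the same interface, the same rollout code can drive web, search, game, embodied, and scientific tasks without modification.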

2. Supported Reinforcement Learning Algorithms

AgentGym-RL integrates several mainstream RL algorithms, adapted for LLM agents:

| Algorithm | Key Properties | Usage Context |
| PPO | Policy gradient with clipped surrogate objective | Primary algorithm for stability |
| GRPO | PPO variant, reward-group normalization | Handling action heterogeneity |
| REINFORCE++ | REINFORCE with PPO-style clipping and KL penalties | Default for high-variance tasks |
| RLOO | Uses average-reward baseline for variance reduction | Additional variant |

All methods optimize the canonical objective

J(\theta) = \mathbb{E}[r(\tau)]

with policy gradients

\nabla_\theta J(\theta) = \mathbb{E}\bigg[r(\tau) \sum_k \nabla_\theta \log \pi_\theta(a_k \mid s_k)\bigg]

and learning rate–controlled updates.

These algorithms are adapted to support LLM agents, which require sampling and optimization over natural language actions and multi-turn trajectories. This setup enables direct online RL optimization, bypassing reliance on supervised fine-tuning (SFT).
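To make the objective concrete, the following sketch computes a plain REINFORCE-style surrogate loss over a batch of multi-turn trajectories in PyTorch, treating the summed log-probabilities of each turn's generated action as the policy term. The function and tensor names are illustrative rather than the framework's actual API; PPO, GRPO, REINFORCE++, and RLOO would additionally apply clipping, KL penalties, group normalization, or leave-one-out baselines on top of this basic gradient estimator.

import torch

def reinforce_loss(action_logprobs, trajectory_rewards):
    """REINFORCE-style surrogate loss for a batch of multi-turn trajectories.

    action_logprobs: list of 1-D tensors, one per trajectory, holding
                     log pi_theta(a_k | s_k) for each turn k.
    trajectory_rewards: 1-D tensor of scalar returns r(tau), one per trajectory.
    """
    losses = []
    for logps, ret in zip(action_logprobs, trajectory_rewards):
        # Minimizing -r(tau) * sum_k log pi(a_k | s_k) ascends the policy gradient above.
        losses.append(-ret * logps.sum())
    return torch.stack(losses).mean()

# Example with dummy data: two trajectories with different numbers of turns.
logps = [torch.randn(3, requires_grad=True), torch.randn(5, requires_grad=True)]
returns = torch.tensor([1.0, 0.0])
loss = reinforce_loss(logps, returns)
loss.backward()  # gradients flow into the (dummy) log-probabilities

Minimizing this loss with a learning-rate-controlled optimizer step is equivalent to ascending the policy-gradient estimate given above.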

3. ScalingInter-RL: Progressive Interaction-Scaling

ScalingInter-RL is a curriculum-based RL training scheme addressing the exploration–exploitation trade-off in long-horizon environments. The core principle is:

  • Early Training: Constrain the agent to a short maximal interaction horizon $h_1$ (the number of allowed environment steps per episode), forcing efficient exploitation and skill acquisition.
  • Progressive Expansion: As the agent achieves satisfactory performance, increment the horizon from $h_t$ to $h_{t+1} = h_t + \delta_h$ according to a predefined schedule, applied periodically after every $\Delta$ training steps.
  • Late Training: Longer horizons facilitate diverse exploratory behaviors, planning over extended decision sequences, and the development of robust recovery strategies.

Formally, in training phase $t$, trajectories are limited to $K_t \leq h_t$ interaction steps per episode. The horizon increment schedule is monotonic ($h_{t+1} > h_t$), with increments $\delta_h$ tuned to the task domain. This minimizes premature policy collapse, encourages gradual discovery of long-term dependencies, and improves both convergence and generalization.
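A minimal sketch of such a monotonic horizon schedule follows, assuming a fixed increment delta_h applied every delta_steps training steps and an optional cap h_max; all parameter values here are illustrative, not those used in the paper.

def scheduled_horizon(step: int, h_1: int = 5, delta_h: int = 5,
                      delta_steps: int = 100, h_max: int = 50) -> int:
    """Return the maximal interaction horizon h_t for a given training step.

    The horizon starts at h_1 and grows by delta_h after every delta_steps
    steps, never exceeding h_max, so the schedule is monotonic as required.
    """
    return min(h_1 + delta_h * (step // delta_steps), h_max)

# e.g., scheduled_horizon(0) == 5, scheduled_horizon(250) == 15, scheduled_horizon(10_000) == 50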

4. Empirical Performance Across Domains

AgentGym-RL and the ScalingInter-RL approach have been extensively evaluated on 27 tasks spanning five environment categories:

  • Web Navigation: Using WebArena, RL-trained agents (AgentGym-RL-7B, ScalingInter-7B) achieve accuracies competitive with state-of-the-art commercial models such as GPT-4o and Gemini-2.5-Pro, outperforming proprietary models on certain subtasks (e.g., shopping, CMS).
  • Deep Search: On QA datasets such as Natural Questions, TriviaQA, PopQA, and HotpotQA, ScalingInter-7B exceeds the scores of all major open-source baselines and is comparable to top-tier closed-source models.
  • Digital Games (TextCraft): RL agents set new best results for intermediate crafting tree depths, with nonzero completions on difficult (depth-6) instances—exceeding or matching commercial models.
  • Embodied Tasks (BabyAI): RL agents display significant navigation improvements, achieving accuracies on par with leading benchmarks.
  • Scientific Tasks (SciWorld): RL agents demonstrate strong gains on scientific reasoning and procedural tasks, though some subdomains (e.g., Chem-Mix) remain challenging for all agent types.

Key findings include (i) RL-trained models surpassing similarly sized SFT baselines, (ii) narrow performance gaps with much larger proprietary LLMs, and (iii) scaling the interaction horizon and compute (at test time and post-training) sometimes yielding higher returns than scaling model size alone.

5. Pseudocode, Implementation, and Training Protocol

A high-level pseudocode outline is:

initialize_policy_params(theta)
h = h_1                      # start with a short interaction horizon
training_step = 0

while not converged:
    trajectories = []
    for client in parallel_environment_clients:
        # Interact up to the currently allowed horizon
        trajectories.append(collect_trajectory(policy=theta, env=client, max_length=h))
    # Estimate the policy gradient from the batch and update the policy
    grad_theta = estimate_policy_gradient(trajectories)
    theta = theta + alpha * grad_theta
    training_step += 1
    # Curriculum: extend the horizon every delta training steps
    if training_step % delta == 0:
        h += delta_h

Agents interact with environments via batched server-client APIs, collecting rollouts truncated according to the current $h_t$. Collected trajectories are used for RL updates (typically policy gradient with PPO or a variant) via distributed, batched gradient ascent. Policy entropy regularization, KL divergence monitoring, and reward curve logging are incorporated into the diagnostic subsystem.
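The entropy and KL diagnostics mentioned above can be estimated directly from the token distributions of the current and a reference policy; the sketch below assumes (batch, vocab) logit tensors at matched decoding positions and is a simplified stand-in for the framework's actual diagnostic subsystem.

import torch
import torch.nn.functional as F

def policy_diagnostics(logits, ref_logits):
    """Average policy entropy and KL(pi_theta || pi_ref) from token logits.

    logits, ref_logits: tensors of shape (batch, vocab) for the current and
    reference (e.g., pre-update) policies at the same decoding positions.
    """
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    p = logp.exp()
    entropy = -(p * logp).sum(dim=-1).mean()          # average policy entropy
    kl = (p * (logp - ref_logp)).sum(dim=-1).mean()   # average forward KL
    return entropy.item(), kl.item()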

6. Practical Applications and Extensibility

AgentGym-RL is expressly designed for:

  • Training LLM agents in highly diverse, realistic environments without SFT dependencies
  • Multi-turn, long-horizon decision making, including web navigation, embodied reasoning, game playing, and procedural scientific discovery
  • Fair benchmarking of RL algorithms and agent architectures in an environment-agnostic, reproducible experimental setting
  • Rapid integration of new environments owing to the modular server–client architecture and unified API, as sketched at the end of this section

Notably, under the ScalingInter-RL regime, even open-source 7B-scale models can approach or surpass proprietary models on complex tasks—a result with implications for efficient model scaling and compute allocation.
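To illustrate how a new task can plug into the server-client design, the sketch below wraps a toy environment behind the same reset/step endpoints assumed in the client example of Section 1. The use of Flask and the specific routes, payload fields, and reward logic are assumptions for illustration, not AgentGym-RL's actual server implementation.

from flask import Flask, request, jsonify

app = Flask(__name__)
episodes = {}  # env_id -> remaining steps (toy episode state)

@app.post("/reset")
def reset():
    task_id = request.json["task_id"]
    env_id = f"{task_id}-{len(episodes)}"
    episodes[env_id] = 5  # toy budget: 5 steps per episode
    return jsonify({"env_id": env_id, "observation": f"Task {task_id} started."})

@app.post("/step")
def step():
    env_id = request.json["env_id"]
    episodes[env_id] -= 1
    done = episodes[env_id] <= 0
    # A real environment would execute the action and compute a task-specific reward.
    return jsonify({"observation": "ok", "reward": 1.0 if done else 0.0, "done": done})

if __name__ == "__main__":
    app.run(port=8000)

A new environment only needs to expose these two endpoints (plus its own observation and reward conventions) to be driven by the existing rollout and training pipeline.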

7. Open-Source Release and Community Impact

The complete AgentGym-RL framework—including full code, diagnostic tools, and curated datasets—is to be open-sourced to support community research and further development in agentic intelligence and LLM-driven reinforcement learning. Extensive experimental evidence demonstrates its effectiveness in stabilizing RL optimization, supporting behavioral diversity, and closing the gap with closed-source foundation models (Xi et al., 10 Sep 2025).

A plausible implication is that AgentGym-RL, due to its extensible architecture and systematic horizon curriculum, will become a standard benchmarking tool for future research on LLM-based multi-turn RL agents, agentic planning, and interactive decision making at scale.

References (1)