ScalingInter-RL: Progressive Curriculum in RL

Updated 11 September 2025
  • ScalingInter-RL is a curriculum-inspired reinforcement learning strategy that incrementally increases the allowed trajectory horizon to enhance planning and exploration in LLM agents.
  • The method starts with short-turn exploitation for stable credit assignment and later transitions to long-turn exploration, mitigating gradient variance and training instability.
  • Evaluations in AgentGym-RL demonstrate that ScalingInter-RL improves performance on tasks such as web navigation and TextCraft by fostering higher-order decision-making.

ScalingInter-RL refers to a curriculum-inspired reinforcement learning strategy developed to address the primary challenges of training LLM agents for long-horizon, multi-turn, and complex decision-making environments. The methodology is designed to progressively increase the allowed trajectory horizon throughout training, such that an agent initially optimizes for exploitation in short-turn settings (which are more stable and have lower credit-assignment variance), and only later is encouraged to explore strategies requiring extended planning, reflection, and error recovery. This approach has been evaluated within the AgentGym-RL framework, which supports modular reinforcement learning for LLM-based agents across a wide variety of settings and RL algorithms (Xi et al., 10 Sep 2025).

1. Progressive Horizon Curriculum for Long-Horizon RL

ScalingInter-RL operates by restricting the maximum allowed number of interactions (turns) between agent and environment during early training, then periodically relaxing this constraint in a monotonic schedule. At phase $t$, each trajectory $\tau^t$ is limited to $K_t \leq h_t$ turns, where $h_t$ is the current maximum horizon. Formally, for policy parameters $\theta$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau \mid h_t)}\big[\, r(\tau) \,\big], \qquad K_t \leq h_t$$

Every $\Delta$ training steps, $h_{t+1} = h_t + \delta_h$, thus incrementally challenging the agent to master longer, more complex interaction chains over time. Early phases drive the agent toward short-term exploitation; later phases introduce increasingly complex exploration.
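
Unrolled over raw training steps, this schedule is simply a step function. In closed form (the numeric values here are illustrative assumptions, not settings reported in the paper: $h_0 = 5$, $\delta_h = 5$, $\Delta = 500$ would give caps of $5, 10, 15, \dots$ turns across successive phases):

$$h(\text{step}) = h_0 + \delta_h \left\lfloor \frac{\text{step}}{\Delta} \right\rfloor$$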

This curriculum directly mitigates high-variance gradient estimations and the credit-assignment complexity endemic to long-horizon RL—problems that otherwise lead to unstable training or collapse when LLM agents learn directly on long sequences.

2. Balancing Exploration and Exploitation

The essential feature of ScalingInter-RL is the systematically staged transition between exploitation and exploration:

  • Exploitation: In early phases (small $h_t$), agents focus on maximizing immediate rewards over short horizons. The policy is updated primarily from truncated, stable trajectories, supporting reliable credit assignment. This provides a foundation of robust behaviors and minimizes the risk of divergence early in training.
  • Exploration: As $h_t$ increases, the agent is required to optimize over longer horizons. This exposes it to richer outcome spaces and more complex dependencies, fostering the ability to plan, reflect, and recover from mistakes. Exploration is thus enabled atop the behavioral robustness gained earlier, resulting in agents less prone to collapse or reward plateaus in complex, multi-stage settings.

This balance is crucial for LLM agents, which are susceptible to instability and catastrophic forgetting in tasks with extended interaction sequences.
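
As a concrete sketch of how truncation shapes the exploitation phase, the helper below caps a single rollout at the current horizon $h_t$: with a small cap, updates are computed only from short, stable trajectories, and raising the cap later admits the longer interaction chains needed for exploration. The `env` and `policy` interfaces are assumptions for illustration, not the AgentGym-RL API.

```python
def collect_truncated_rollout(env, policy, h_t):
    """Roll out one episode, hard-capped at the current horizon h_t.

    `env` (reset/step) and `policy` (act) are hypothetical interfaces used
    purely for illustration; they are not the AgentGym-RL API.
    """
    obs = env.reset()
    trajectory = []
    for _ in range(h_t):                        # enforce K_t <= h_t
        action = policy.act(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:                                # task ended before hitting the cap
            break
    return trajectory
```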

3. Formal Algorithmic Structure

The ScalingInter-RL training loop includes explicit control of the interaction-horizon schedule. At each curriculum phase:

  1. Initialize policy parameters $\theta$ and horizon $h_0$.
  2. Collect rollout trajectories $\{\tau^t\}$ under $\pi_\theta$ such that $K_t \leq h_t$.
  3. Compute rewards $r(\tau^t)$ and gradients $\nabla_\theta J(\theta)$ as per standard policy gradient algorithms.
  4. Update policy: $\theta \gets \theta + \alpha \nabla_\theta J(\theta)$.
  5. Increase horizon: every $\Delta$ steps, set $h \gets h + \delta_h$.

This can be seen as imposing a curriculum on trajectory length, separate from parameter or data scaling, that controls the temporal credit-assignment regime the agent faces (Xi et al., 10 Sep 2025).
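
As a concrete reading of steps 1-5, a minimal single-trajectory sketch using a plain REINFORCE update is given below. The `env`/`policy` interfaces, the hyperparameter values, and the choice of REINFORCE itself are assumptions for illustration; AgentGym-RL supports other policy-gradient algorithms.

```python
import torch

def train_scaling_inter_rl(env, policy, optimizer, *,
                           h0=5, delta_h=5, schedule_every=200, total_steps=1000):
    """REINFORCE-style sketch of the ScalingInter-RL loop (steps 1-5 above).

    The env/policy interfaces and all hyperparameter values are illustrative
    assumptions, not the paper's implementation.
    """
    h_t = h0                                      # step 1: initial horizon
    for step in range(total_steps):
        # Step 2: collect one rollout truncated at the current horizon h_t.
        obs, log_probs, rewards, done = env.reset(), [], [], False
        for _ in range(h_t):
            action, log_prob = policy.act(obs)    # assumed to return the sampled action
            obs, reward, done = env.step(action)  # and its log-probability (a tensor)
            log_probs.append(log_prob)
            rewards.append(reward)
            if done:
                break
        # Steps 3-4: policy-gradient update from the (possibly truncated) trajectory.
        episode_return = sum(rewards)             # undiscounted return r(tau)
        loss = -episode_return * torch.stack(log_probs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Step 5: relax the horizon every Delta training steps.
        if (step + 1) % schedule_every == 0:
            h_t += delta_h
    return policy
```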

4. Empirical Evaluation and Key Insights

Experiments within AgentGym-RL demonstrate the advantages of ScalingInter-RL on a diverse suite of real-world tasks:

  • Web Navigation: Achieved >10% relative improvement over baseline RL agents, with final performance near top commercial LLM-based agents.
  • TextCraft: Outperformed the base model by 30 points due to staged mastery of deeper, hierarchical crafting procedures.
  • Sequential Action Environments (BabyAI, SciWorld): Demonstrated greater robustness and stability, notably preventing the collapse observed when training is immediately exposed to full-length horizons.

Principal findings include:

  • Early, tight horizon constraints help agents develop “safe” initial policies, avoiding instability known to plague long-horizon RL.
  • Later, strategic expansion of trajectory length supports higher-order behaviors such as planning and backtracking, which cannot be learned effectively if the curriculum is absent.
  • The overall learning dynamics suggest that investing compute in longer-trajectory training (i.e., expanding the interaction horizon during post-training) may yield larger gains than simply scaling model parameters.

5. Theoretical and Practical Implications

ScalingInter-RL establishes a methodology for RL agent training in settings where the task complexity and required planning horizon grow with the number of sequential interactions. This is particularly important for LLM agents interacting with digital environments (web navigation, games, embodied tasks) where naively optimizing for long-horizon rewards causes training instability or myopic policy collapse.

The approach also provides a pathway for RL researchers to address settings where the complexity of value functions grows superlinearly in horizon length, and standard solutions relying on massive model scaling or vast data accumulation are computationally infeasible.

The curriculum framework of ScalingInter-RL directly complements developments in environment design, scalable RL infrastructure, and sample-efficient policy optimization, and is applicable to any policy gradient–based learning algorithm supported in AgentGym-RL. The decoupling of horizon schedule from agent architecture supports modular extensibility across multiple application scenarios.

6. Future Directions and Broader Impact

AgentGym-RL with ScalingInter-RL will be open-sourced, including code and benchmark datasets, to support further community research (Xi et al., 10 Sep 2025). Promising future directions suggested include:

  • Making the horizon-increment schedule adaptive, potentially leveraging environment feedback or agent progress signals.
  • Extending curriculum-based horizon scaling to multi-agent RL and compositional task settings.
  • Combining this progressive horizon strategy with complementary exploration incentives, safety constraints, or meta-learning techniques to push the limits of LLM agent competence in multi-turn real-world tasks.

ScalingInter-RL thus represents a foundational advance for training LLM agents under long-horizon, multi-step RL, offering a concrete, empirically validated route to stable and efficient optimization in environments previously dominated by instability and reward collapse.

References (1)