ScalingInter-RL: Progressive Curriculum in RL

Updated 11 September 2025
  • ScalingInter-RL is a curriculum-inspired reinforcement learning strategy that incrementally increases the allowed trajectory horizon to enhance planning and exploration in LLM agents.
  • The method starts with short-turn exploitation for stable credit assignment and later transitions to long-turn exploration, mitigating gradient variance and training instability.
  • Evaluations in AgentGym-RL demonstrate that ScalingInter-RL improves performance on tasks such as web navigation and TextCraft by fostering higher-order decision-making.

ScalingInter-RL refers to a curriculum-inspired reinforcement learning strategy developed to address the primary challenges of training LLM agents for long-horizon, multi-turn, and complex decision-making environments. The methodology is designed to progressively increase the allowed trajectory horizon throughout training, such that an agent initially optimizes for exploitation in short-turn settings (which are more stable and have lower credit-assignment variance), and only later is encouraged to explore strategies requiring extended planning, reflection, and error recovery. This approach has been evaluated within the AgentGym-RL framework, which supports modular reinforcement learning for LLM-based agents across a wide variety of settings and RL algorithms (Xi et al., 10 Sep 2025).

1. Progressive Horizon Curriculum for Long-Horizon RL

ScalingInter-RL operates by restricting the maximum allowed number of interactions (turns) between agent and environment during early training, then periodically relaxing this constraint in a monotonic schedule. At phase $t$, each trajectory $\tau^t$ is limited to $K_t \leq h_t$ turns, where $h_t$ is the current maximum horizon. Formally, for policy parameters $\theta$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau \mid h_t)}\big[\, r(\tau) \,\big], \qquad K_t \leq h_t$$

Every $\Delta$ training steps, $h_{t+1} = h_t + \delta_h$, thus incrementally challenging the agent to master longer, more complex interaction chains over time. Early phases drive the agent toward short-term exploitation; later phases introduce increasingly complex exploration.
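
Unrolled over raw training steps, this schedule is simply a step function. In closed form (the numeric values here are illustrative assumptions, not settings reported in the paper: $h_0 = 5$, $\delta_h = 5$, $\Delta = 500$ would give caps of $5, 10, 15, \dots$ turns across successive phases):

$$h(\text{step}) = h_0 + \delta_h \left\lfloor \frac{\text{step}}{\Delta} \right\rfloor$$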

This curriculum directly mitigates high-variance gradient estimations and the credit-assignment complexity endemic to long-horizon RL—problems that otherwise lead to unstable training or collapse when LLM agents learn directly on long sequences.

2. Balancing Exploration and Exploitation

The essential feature of ScalingInter-RL is the systematically staged transition between exploitation and exploration:

  • Exploitation: In early phases (small $h_t$), agents focus on maximizing immediate rewards over short horizons. The policy is updated primarily from truncated, stable trajectories, supporting reliable credit assignment. This provides a foundation of robust behaviors and minimizes the risk of divergence early in training.
  • Exploration: As $h_t$ increases, the agent is required to optimize over longer horizons. This exposes it to richer outcome spaces and more complex dependencies, fostering the ability to plan, reflect, and recover from mistakes. Exploration is thus enabled atop the behavioral robustness gained earlier, resulting in agents less prone to collapse or reward plateaus in complex, multi-stage settings.

This balance is crucial for LLM agents, which are susceptible to instability and catastrophic forgetting in tasks with extended interaction sequences.
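
As a concrete sketch of how truncation shapes the exploitation phase, the helper below caps a single rollout at the current horizon $h_t$: with a small cap, updates are computed only from short, stable trajectories, and raising the cap later admits the longer interaction chains needed for exploration. The `env` and `policy` interfaces are assumptions for illustration, not the AgentGym-RL API.

```python
def collect_truncated_rollout(env, policy, h_t):
    """Roll out one episode, hard-capped at the current horizon h_t.

    `env` (reset/step) and `policy` (act) are hypothetical interfaces used
    purely for illustration; they are not the AgentGym-RL API.
    """
    obs = env.reset()
    trajectory = []
    for _ in range(h_t):                        # enforce K_t <= h_t
        action = policy.act(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:                                # task ended before hitting the cap
            break
    return trajectory
```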

3. Formal Algorithmic Structure

The ScalingInter-RL training loop includes explicit control of the interaction-horizon schedule. At each curriculum phase:

  1. Initialize policy parameters $\theta$ and horizon $h_0$.
  2. Collect rollout trajectories $\{\tau^t\}$ under $\pi_\theta$ such that $K_t \leq h_t$.
  3. Compute rewards $r(\tau^t)$ and gradients $\nabla_\theta J(\theta)$ as per standard policy gradient algorithms.
  4. Update policy: $\theta \gets \theta + \alpha \nabla_\theta J(\theta)$.
  5. Increase horizon: every $\Delta$ steps, set $h \gets h + \delta_h$.

This can be seen as imposing a curriculum on trajectory length, separate from parameter or data scaling, that controls the temporal credit-assignment regime the agent faces (Xi et al., 10 Sep 2025).
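
As a concrete reading of steps 1-5, a minimal single-trajectory sketch using a plain REINFORCE update is given below. The `env`/`policy` interfaces, the hyperparameter values, and the choice of REINFORCE itself are assumptions for illustration; AgentGym-RL supports other policy-gradient algorithms.

```python
import torch

def train_scaling_inter_rl(env, policy, optimizer, *,
                           h0=5, delta_h=5, schedule_every=200, total_steps=1000):
    """REINFORCE-style sketch of the ScalingInter-RL loop (steps 1-5 above).

    The env/policy interfaces and all hyperparameter values are illustrative
    assumptions, not the paper's implementation.
    """
    h_t = h0                                      # step 1: initial horizon
    for step in range(total_steps):
        # Step 2: collect one rollout truncated at the current horizon h_t.
        obs, log_probs, rewards, done = env.reset(), [], [], False
        for _ in range(h_t):
            action, log_prob = policy.act(obs)    # assumed to return the sampled action
            obs, reward, done = env.step(action)  # and its log-probability (a tensor)
            log_probs.append(log_prob)
            rewards.append(reward)
            if done:
                break
        # Steps 3-4: policy-gradient update from the (possibly truncated) trajectory.
        episode_return = sum(rewards)             # undiscounted return r(tau)
        loss = -episode_return * torch.stack(log_probs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Step 5: relax the horizon every Delta training steps.
        if (step + 1) % schedule_every == 0:
            h_t += delta_h
    return policy
```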

4. Empirical Evaluation and Key Insights

Experiments within AgentGym-RL demonstrate the advantages of ScalingInter-RL on a diverse suite of real-world tasks:

  • Web Navigation: Achieved >10% relative improvement over baseline RL agents, with final performance near top commercial LLM-based agents.
  • TextCraft: Outperformed the base model by 30 points due to staged mastery of deeper, hierarchical crafting procedures.
  • Sequential Action Environments (BabyAI, SciWorld): Demonstrated greater robustness and stability, notably preventing the collapse observed when training is immediately exposed to full-length horizons.

Principal findings include:

  • Early, tight horizon constraints help agents develop “safe” initial policies, avoiding instability known to plague long-horizon RL.
  • Later, strategic expansion of trajectory length supports higher-order behaviors such as planning and backtracking, which cannot be learned effectively if the curriculum is absent.
  • The overall learning dynamics suggest that investing compute in longer-trajectory training (i.e., expanding the interaction horizon during post-training) may yield larger gains than simply scaling model parameters.

5. Theoretical and Practical Implications

ScalingInter-RL establishes a methodology for RL agent training in settings where the task complexity and required planning horizon grow with the number of sequential interactions. This is particularly important for LLM agents interacting with digital environments (web navigation, games, embodied tasks) where naively optimizing for long-horizon rewards causes training instability or myopic policy collapse.

The approach also provides a pathway for RL researchers to address settings where the complexity of value functions grows superlinearly in horizon length, and standard solutions relying on massive model scaling or vast data accumulation are computationally infeasible.

The curriculum framework of ScalingInter-RL directly complements developments in environment design, scalable RL infrastructure, and sample-efficient policy optimization, and is applicable to any policy gradient–based learning algorithm supported in AgentGym-RL. The decoupling of horizon schedule from agent architecture supports modular extensibility across multiple application scenarios.

6. Future Directions and Broader Impact

AgentGym-RL with ScalingInter-RL will be open-sourced, including code and benchmark datasets, to support further community research (Xi et al., 10 Sep 2025). Promising future directions suggested include:

  • Making the horizon-increment schedule adaptive, potentially leveraging environment feedback or agent progress signals.
  • Extending curriculum-based horizon scaling to multi-agent RL and compositional task settings.
  • Combining this progressive horizon strategy with complementary exploration incentives, safety constraints, or meta-learning techniques to push the limits of LLM agent competence in multi-turn real-world tasks.

ScalingInter-RL thus represents a foundational advance for training LLM agents under long-horizon, multi-step RL, offering a concrete, empirically validated route to stable and efficient optimization in environments previously dominated by instability and reward collapse.

References (1)