- The paper introduces EPO, a novel RL framework that integrates entropy regularization, smoothing, and adaptive weighting to mitigate cascade failure.

- The methodology computes entropy across trajectory turns, ensuring controlled exploration and stable policy convergence in sparse reward environments.

- Experimental results show up to 152% improvement on benchmarks like ScienceWorld, demonstrating enhanced training stability and generalization.

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
Introduction
This paper introduces Entropy-regularized Policy Optimization (EPO), a framework designed to address the unique challenges faced by training LLM agents in multi-turn environments with sparse rewards. Traditional reinforcement learning (RL) methods such as Proximal Policy Optimization (PPO) suffer from the exploration-exploitation cascade failure in these settings, where early uncontrolled entropy growth leads to unstable policy foundations and late-stage uncertainty propagation hinders coherent strategy formation.
Figure 1 illustrates this cascade failure, highlighting the excessive early exploration and subsequent uncertainty propagation, and contrasting it with the stable entropy levels maintained by EPO.
Figure 1: Exploration-exploitation cascade failure in multi-turn agent training.
Methodology
The proposed EPO framework comprises three synergistic components:
- Entropy Regularization: EPO computes entropy across all trajectory turns, adopting a temporal structure that captures how early decisions affect subsequent turns. The entropy regularization term in the policy loss encourages exploration while mitigating premature convergence.
 
- Entropy Smoothing Regularizer: This mechanism maintains policy entropy within dynamically adjusted historical bounds to prevent severe oscillations between overconfidence and over-exploration. The regularizer penalizes deviations outside the acceptable entropy range, enforcing stable training dynamics; a minimal sketch of both the entropy term and this penalty follows the list.
 
- Adaptive Phase-based Weighting: EPO employs a dynamic weighting scheme that balances exploration and exploitation across training phases through a schedule for the smoothing coefficient. This adaptive approach ensures conservative early exploration, transitions through balanced exploration-exploitation, and strengthens stabilization as training progresses.
 
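To make the first two components concrete, the sketch below shows one plausible way to compute turn-level entropy and a smoothing penalty against recent history. It is a minimal illustration under our own assumptions: the function names, the tolerance `delta`, the history window, and the coefficients `alpha` and `beta_k` are placeholders, not the paper's notation or hyperparameters.

```python
# Illustrative sketch of EPO's entropy regularization and entropy smoothing
# terms. The exact functional forms and constants here are assumptions for
# illustration, not the paper's verbatim formulation.
from collections import deque
import numpy as np


def turn_level_entropy(turn_token_probs):
    """Mean token entropy per turn, averaged over the multi-turn trajectory.

    turn_token_probs: list over turns; each turn is an array of shape
    (num_tokens, vocab_size) holding the policy's token distributions.
    """
    per_turn = []
    for probs in turn_token_probs:
        token_entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
        per_turn.append(token_entropy.mean())
    return float(np.mean(per_turn)), per_turn


def smoothing_penalty(current_entropy, history, delta=0.1):
    """Penalize entropy that drifts outside a band around its recent history."""
    if not history:
        return 0.0
    center = float(np.mean(history))
    deviation = abs(current_entropy - center)
    return max(0.0, deviation - delta) ** 2  # zero inside the band


def epo_regularized_loss(pg_loss, current_entropy, penalty, alpha=0.01, beta_k=0.1):
    """Policy-gradient loss minus an entropy bonus, plus the smoothing penalty."""
    return pg_loss - alpha * current_entropy + beta_k * penalty


# Toy usage: two turns, each with five token distributions over a small vocab.
history = deque(maxlen=50)
rng = np.random.default_rng(0)
turns = [rng.dirichlet(np.ones(8), size=5) for _ in range(2)]
H, _ = turn_level_entropy(turns)
loss = epo_regularized_loss(pg_loss=1.0, current_entropy=H,
                            penalty=smoothing_penalty(H, history))
history.append(H)
print(f"trajectory entropy={H:.3f}, regularized loss={loss:.3f}")
```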
Algorithm 1 details the implementation steps for EPO, including entropy computations and dynamic coefficient updates.
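The outline below sketches an EPO-style training step in the spirit of Algorithm 1, with a hypothetical three-phase schedule for the smoothing coefficient β_k. The phase boundaries, coefficient values, and dummy rollout quantities are illustrative assumptions, not the paper's reported hyperparameters or implementation.

```python
# Hypothetical outline of an EPO-style training step: a phase-based schedule
# for beta_k (conservative early, balanced in the middle, strongly stabilizing
# late) combined with the entropy-regularized loss. Values are illustrative.
import random
from collections import deque


def beta_schedule(step, total_steps, beta_early=0.01, beta_mid=0.1, beta_late=0.5):
    """Small beta_k early, moderate in the middle, larger late in training."""
    progress = step / max(total_steps, 1)
    if progress < 0.3:
        return beta_early
    if progress < 0.7:
        return beta_mid
    return beta_late


random.seed(0)
entropy_history = deque(maxlen=50)
total_steps, alpha, delta = 200, 0.01, 0.1

for step in range(total_steps):
    # Stand-ins for quantities produced by a real rollout and PPO update:
    pg_loss = random.uniform(0.5, 1.5)        # clipped policy-gradient loss
    traj_entropy = random.uniform(1.0, 2.0)   # entropy averaged over turns

    # Smoothing penalty against a band around recent entropy history.
    center = sum(entropy_history) / len(entropy_history) if entropy_history else traj_entropy
    penalty = max(0.0, abs(traj_entropy - center) - delta) ** 2

    beta_k = beta_schedule(step, total_steps)
    loss = pg_loss - alpha * traj_entropy + beta_k * penalty
    entropy_history.append(traj_entropy)

print(f"final beta_k = {beta_schedule(total_steps - 1, total_steps)}, last loss = {loss:.3f}")
```

The intent of the schedule mirrors the description above: a small β_k early keeps the regularization conservative while the policy explores, and a larger β_k late enforces stabilization as training converges.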
Experiments
The experimental evaluation demonstrates EPO's significant improvements over traditional RL approaches on benchmarks such as ScienceWorld and ALFWorld.
Table 1 shows that PPO combined with EPO achieves up to a 152% improvement in average success rate on ScienceWorld IID tasks, highlighting the method's optimization effectiveness and generalization robustness.
Figure 2 presents the training dynamics and performance across evaluation scenarios, illustrating EPO's enhanced training stability and substantial performance gains over baseline methods.
Figure 2: Training dynamics and generalization performance analysis.
Ablation Studies
A series of ablation studies validate the necessity of each component within the EPO framework:
- Entropy Smoothing Regularizer: Experiments reveal its critical role in stable convergence, particularly in sparse reward settings where uncontrolled entropy results in cascade failure.
 
- Adaptive Weighting Scheme: The adaptive coefficient β_k accelerates early training progress by dynamically modulating regularization intensity.
 
Figure 3 illustrates the impact of these components on training progression and success rates in sparse-reward environments.
Figure 3: Ablation studies on entropy regularization components.
Model Studies
Additional experiments comparing EPO variants provide deeper insights into effective entropy management strategies in multi-turn tasks:
- Decaying Entropy Coefficient: A decaying schedule consistently underperforms due to prematurely suppressed early-turn exploration.
 
- Entropy-based Advantage: EPO outperforms methods that incorporate entropy indirectly, achieving near-perfect success rates due to direct integration into the policy loss.
 
Figure 4 offers a visual comparison of these strategies; a brief code sketch contrasting them is given below the caption.
Figure 4: Model studies on ScienceWorld.
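For intuition, the sketch below contrasts generic stand-ins for these strategies: a decaying entropy coefficient, entropy folded indirectly into the advantage, and direct integration of the entropy term into the policy loss as EPO does. The specific functional forms and constants are illustrative assumptions, not the paper's baseline definitions.

```python
# Generic stand-ins for the entropy-management variants compared above;
# these are illustrative forms, not the paper's exact baseline definitions.
def decaying_entropy_coeff(step, coeff0=0.05, decay=0.995):
    """Decaying schedule: the entropy bonus shrinks as training proceeds."""
    return coeff0 * (decay ** step)


def entropy_in_advantage(advantage, entropy, c=0.01):
    """Entropy folded indirectly into the advantage (intrinsic-bonus style)."""
    return advantage + c * entropy


def entropy_in_loss(pg_loss, entropy, alpha=0.01):
    """Direct integration of the entropy term into the policy loss, as in EPO."""
    return pg_loss - alpha * entropy


# Toy comparison with placeholder values.
adv, H, pg = 1.0, 1.5, 0.8
for step in (0, 500, 1000):
    print(f"step {step:4d}: decaying coeff = {decaying_entropy_coeff(step):.4f}")
print("entropy in advantage:", entropy_in_advantage(adv, H))
print("entropy in loss:", entropy_in_loss(pg, H))
```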
Conclusion
EPO establishes a new paradigm for training LLM agents in multi-turn environments by effectively addressing the exploration-exploitation cascade failure. The framework not only transforms previously untrainable scenarios into smoothly converged optimization problems but also suggests that multi-turn settings demand fundamentally different entropy control than traditional RL approaches. Future work includes integrating memory systems for enhanced trajectory learning and extending EPO to vision-LLMs in visual environments.