- The paper proposes T2PO, a novel algorithm that leverages token- and turn-level uncertainty to mitigate training collapse in multi-turn RL environments.
- It introduces key mechanisms—Token-Level Thinking Intervention (TTI) and Turn-Level Dynamical Sampling (TDS)—to curb redundant reasoning and improve credit assignment.
- Empirical results show significant gains on benchmarks like WebShop and ALFWorld, with improved success rates and reduced token consumption compared to SOTA methods.
Uncertainty-Guided Exploration for Multi-Turn Agentic RL: The T2PO Algorithm
Introduction
T2PO introduces a novel uncertainty-guided control paradigm for LLM agents in multi-turn reinforcement learning (RL) environments. This work addresses pervasive instability and training collapse found in prior RL frameworks—chiefly arising from inefficient exploration in long-horizon tasks where agents repeatedly generate low-information actions and redundant reasoning traces. The proposed method, Token- and Turn-level Policy Optimization (T2PO), rectifies this by integrating adaptive, intrinsic uncertainty signals at both fine-grained token and coarser turn granularities, driving both improved sample efficiency and training stability.
Motivation and Analysis of Instability
Prevailing multi-turn RL pipelines suffer from ambiguous credit assignment, high rollout costs, and, critically, a propensity for what is termed hesitation: agents overthink at the token level (generating superfluous reasoning text) and repeat unproductive interactions at the turn level. Existing techniques (fine-grained credit assignment, reward shaping, trajectory filtering) provide only partial remediation. They operate at trajectory scale or through indirect interventions, which leaves policy gradients with persistently high variance and sensitive to drift in the rollout distribution. The identified root cause is inefficient exploration: agents keep generating output whose information content has saturated, accumulating sampling noise without advancing the task or resolving epistemic uncertainty.
Theoretical Framework and Self-Calibrated Uncertainty Signal
T2PO is grounded in a self-calibrated uncertainty signal. Prior work primarily used either entropy or confidence as a proxy for uncertainty, and both have degeneracies: entropy lacks discriminability at the extremes of the distribution, while confidence is invariant to how probability mass is distributed outside the top token. T2PO overcomes these limitations by fusing normalized entropy ($\tilde{H}_t$) and normalized confidence ($\tilde{C}_t$) into a single scalar signal:
$$M_t = \alpha \tilde{H}_t + (1 - \alpha)(1 - \tilde{C}_t)$$
where $\alpha$ balances the two contributions, enabling robust discrimination of model uncertainty throughout token generation.
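As a concrete reading of the fused signal, the following minimal Python sketch computes it from a single next-token distribution. It assumes that entropy is normalized by the log of the vocabulary size and that normalized confidence is simply the top-token probability; the paper's exact normalizations are not stated in this summary, and the function name and the default `alpha=0.5` are hypothetical.

```python
import math


def uncertainty_signal(probs, alpha=0.5):
    """Fuse normalized entropy and normalized confidence into one scalar.

    probs: next-token probabilities (non-negative, summing to 1).
    alpha: weight on the entropy term (hypothetical default).
    """
    vocab_size = len(probs)
    # Shannon entropy divided by log(V), so the term lies in [0, 1]
    # (assumed normalization).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    h_norm = entropy / math.log(vocab_size) if vocab_size > 1 else 0.0
    # Confidence taken as the top-token probability (assumed normalization).
    c_norm = max(probs)
    return alpha * h_norm + (1.0 - alpha) * (1.0 - c_norm)
```

Under these assumptions a one-hot distribution yields 0, and two distributions with the same top-token probability but different tails yield different values, which is the case where confidence alone cannot discriminate.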
Algorithmic Components
Token-Level Thinking Intervention (TTI)
TTI leverages the temporal evolution of the $M_t$ signal to adaptively halt the LLM's internal reasoning once marginal information gain falls below a threshold. After a preset prefix, the running average of the change in $M_t$ is computed over a window; when this average falls below a small threshold, intervention is triggered, and the reasoning phase is deterministically terminated by forcing the emission of a reasoning terminator token. This avoids truncating essential task-specific reasoning and suppresses the proliferation of redundant discourse tokens, reducing sampling noise and assigning credit to concise, high-information responses.
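The stopping rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the change in the signal is taken as an absolute difference between consecutive tokens, and the names `prefix_len`, `window`, and `eps` along with their default values are hypothetical.

```python
def intervention_index(signal, prefix_len=8, window=4, eps=0.01):
    """Return the first token index at which the rule would fire, or None.

    signal: per-token uncertainty values for the reasoning phase so far.
    prefix_len: number of initial tokens during which no check is made.
    window: number of consecutive changes averaged together.
    eps: threshold on the windowed mean absolute change.
    """
    start = max(prefix_len, window)  # need a full window of differences
    for t in range(start, len(signal)):
        deltas = [
            abs(signal[i] - signal[i - 1])
            for i in range(t - window + 1, t + 1)
        ]
        if sum(deltas) / window < eps:
            return t  # caller would force the reasoning terminator here
    return None
```

A decoding loop would call this after each generated token and, on a non-`None` result, append the terminator token instead of sampling further reasoning.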
Turn-Level Dynamical Sampling (TDS)
TDS applies adaptive resampling at the level of interaction turns. For each turn, the aggregated $M_t$ values form an observation signal; the absolute change in this signal relative to the previous turn reflects the shift in underlying beliefs. When this change is below a threshold, a regeneration event is triggered, resampling the entire turn. This prevents the agent from remaining stuck in semantically similar, non-progressive turn-level behaviors, and ensures meaningful diversity across interaction steps.
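A minimal Python sketch of this turn-level trigger follows. Several details are assumptions made for illustration: token-level values are aggregated by their mean, regeneration is capped by a hypothetical `max_resamples` so the loop always terminates, and `generate_turn` is a stand-in callable that returns a turn's text together with its per-token uncertainty values.

```python
def turn_signal(token_signals):
    """Aggregate per-token uncertainty into one value per turn (mean, assumed)."""
    return sum(token_signals) / len(token_signals)


def sample_turn(generate_turn, prev_signal, delta=0.02, max_resamples=2):
    """Generate a turn, resampling while its signal barely moves.

    generate_turn: hypothetical callable returning (text, token_signals).
    prev_signal: aggregated signal of the previous turn, or None on turn one.
    delta: threshold on the absolute change between consecutive turns.
    Returns (text, aggregated signal, number of resamples performed).
    """
    text, signals = generate_turn()
    current = turn_signal(signals)
    resamples = 0
    while (
        prev_signal is not None
        and abs(current - prev_signal) < delta
        and resamples < max_resamples
    ):
        # Change is below the threshold: discard and regenerate the whole turn.
        text, signals = generate_turn()
        current = turn_signal(signals)
        resamples += 1
    return text, current, resamples
```

The cap on regeneration is a design choice of this sketch: without it, a policy that has truly converged on one action would resample indefinitely.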
Integration and Policy Update Scheme
T2PO integrates these two intervention mechanisms within a scalable group-based policy optimization framework akin to GiGPO, but with explicit uncertainty-based reasoning control. Policy advantage estimation fuses global (trajectory-scale) and local (turn-relative) return signals, regularizing RL updates and enhancing both sample efficiency and credit assignment fidelity. Additionally, a compact memory context window is maintained to constrain horizon length during optimization, and a strict format penalty is imposed to enforce structured output, both of which further stabilize training.
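To make the idea of fusing global and local return signals concrete, here is a deliberately simplified Python sketch. It is not the paper's estimator: it assumes both signals are standardized within a group of rollouts, that turns are compared across rollouts by turn index (GiGPO-style methods group by shared state instead), and that the two terms are summed with a hypothetical weight `omega`.

```python
import statistics


def group_normalize(values):
    """Standardize values within a group; all zeros if they are identical."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def fused_advantages(trajectory_returns, turn_returns, omega=1.0):
    """Combine trajectory-level and turn-level advantages per turn.

    trajectory_returns[i]: total return of rollout i in the group.
    turn_returns[i][k]: return credited to turn k of rollout i.
    omega: weight on the turn-level term (hypothetical).
    """
    global_adv = group_normalize(trajectory_returns)
    max_turns = max(len(turns) for turns in turn_returns)
    fused = [[0.0] * len(turns) for turns in turn_returns]
    for k in range(max_turns):
        # Only rollouts that reached turn k take part in its local group.
        members = [i for i, turns in enumerate(turn_returns) if len(turns) > k]
        local_adv = group_normalize([turn_returns[i][k] for i in members])
        for i, adv in zip(members, local_adv):
            fused[i][k] = global_adv[i] + omega * adv
    return fused
```

Each turn thus inherits its rollout's trajectory-level advantage and is further adjusted by how it compares with the same turn in the other rollouts.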
Empirical Results
T2PO achieves consistently superior results across diverse, high-complexity benchmarks, including WebShop, ALFWorld, and search-augmented QA datasets. Specifically:
- WebShop: 81.6–82.4% success rates on Qwen3-4B/8B backbones, with token consumption and required interaction turns reduced by approximately 20–25% versus SOTA.
- ALFWorld: Gains of 8–12 success rate points over the strongest previous RL baselines; T2PO demonstrates robustness to hyperparameter misspecification and initialization seed variation, avoiding training collapse entirely.
- Multi-hop Search QA: T2PO achieves pronounced improvements, notably doubling accuracy on datasets such as MuSiQue.
Ablation studies confirm that both TTI and TDS are critical; removing either produces marked drops in success rate, lower sample efficiency, and a re-emergence of repeated failures and overlong trajectories. Comparison with alternative thinking control approaches (hard budgets, void turn filtering, length-based reward shaping, SFT cold starts) shows that T2PO's adaptive, intrinsic-signal-driven interventions outperform these static or heuristic constraints.
Practical and Theoretical Implications
This work formalizes the link between exploration inefficiency and training collapse in multi-turn LLM RL, and establishes that structured, uncertainty-guided control at both the micro (token) and meso (turn) level is necessary for stable, scalable RL in reasoning-driven language agent systems. Practically, T2PO's mechanisms are plug-and-play: they can be incorporated into any group-based policy optimization pipeline and generalized to other architectures, inference engines, and asynchronous scaling frameworks. The approach is orthogonal to other stabilization methods such as RFT and off-policy correction, and composes with them for further gains.
Future Directions
The proposed uncertainty signal design, which combines entropy and confidence, opens avenues for further investigation into the geometry of model predictive distributions during RL, potential meta-learning extensions for dynamic adjustment of thresholds, and broader applications to multi-modal agentic RL. Extending T2PO to hierarchical, structured task decompositions and tightly integrating it with external symbolic or planning systems are promising directions for scaling autonomous decision-making.
Conclusion
T2PO provides an effective, theoretically justified, and empirically validated solution to instability and inefficiency in multi-turn agentic RL. By explicitly regulating exploration based on self-calibrated uncertainty at both the token and turn levels, T2PO suppresses low-information actions, mitigates training collapse, and delivers state-of-the-art stable performance across multi-turn, long-horizon reasoning environments. This advance paves the way for reliably training scalable, interpretable, and sample-efficient LLM-based agents for real-world multi-step decision-making tasks.
For details, see "T2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning" (2605.02178).