T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Published 4 May 2026 in cs.AI | (2605.02178v1)

Abstract: Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T$^2$PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T$^2$PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T$^2$PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T$^2$PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and performance improvements with better exploration efficiency. Code is available at: https://github.com/WillDreamer/T2PO.


Authors (10)

Summary

The paper proposes T$^2$PO, a novel algorithm that leverages token- and turn-level uncertainty to mitigate training collapse in multi-turn RL environments.
It introduces key mechanisms—Token-Level Thinking Intervention (TTI) and Turn-Level Dynamical Sampling (TDS)—to curb redundant reasoning and improve credit assignment.
Empirical results show significant gains on benchmarks like WebShop and ALFWorld, with improved success rates and reduced token consumption compared to SOTA methods.

Uncertainty-Guided Exploration for Multi-Turn Agentic RL: The T$^2$PO Algorithm

Introduction

T$^2$PO introduces an uncertainty-guided control paradigm for LLM agents in multi-turn reinforcement learning (RL) environments. The work addresses the instability and training collapse that pervade prior RL frameworks, which the authors attribute chiefly to inefficient exploration in long-horizon tasks where agents repeatedly generate low-information actions and redundant reasoning traces. The proposed method, Token- and Turn-level Policy Optimization (T$^2$PO), counters this by applying intrinsic uncertainty signals at two granularities, the individual token and the interaction turn, with the aim of improving both sample efficiency and training stability.

Motivation and Analysis of Instability

Prevailing multi-turn RL pipelines suffer from ambiguous credit assignment, high rollout costs, and, critically, a tendency the paper terms hesitation: agents overthink at the token level (generating superfluous reasoning text) and engage in repetitive, unproductive interactions at the turn level. Existing techniques such as fine-grained credit assignment, reward shaping, and trajectory filtering provide only partial remediation. They operate at trajectory scale or through indirect interventions, which leaves policy-gradient variance high and training sensitive to drift in the rollout distribution. The identified root cause is inefficient exploration: the agent keeps producing output whose information content has saturated, accumulating sampling noise without advancing the task or resolving epistemic uncertainty.

Theoretical Framework and Self-Calibrated Uncertainty Signal

T$^2$PO is grounded in a self-calibrated uncertainty signal. Prior work primarily used either entropy or confidence as a proxy for uncertainty, and each has a degeneracy: entropy lacks discriminability at the extremes of the distribution, while confidence is invariant to how probability mass is distributed outside the top token. T$^2$PO addresses both limitations by fusing normalized entropy ($\tilde{H}_t$) and normalized confidence ($\tilde{C}_t$) into a single scalar signal:

$M_t = \alpha \tilde{H}_t + (1 - \alpha)(1 - \tilde{C}_t)$

where $\alpha$ weights the entropy term against the confidence term, so that $M_t$ stays discriminative of model uncertainty throughout token generation.
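The formula above can be illustrated with a minimal Python sketch. The summary does not state how the two terms are normalized, so the choices here are assumptions: entropy is divided by $\log V$ for a vocabulary of size $V$, and confidence is the top-1 probability rescaled from $[1/V, 1]$ to $[0, 1]$. The function names `softmax` and `uncertainty_signal` and the default `alpha=0.5` are hypothetical and are not taken from the T$^2$PO code.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uncertainty_signal(probs, alpha=0.5):
    """M_t = alpha * H_norm + (1 - alpha) * (1 - C_norm).

    Assumed normalizations (not specified in the source):
      H_norm = entropy / log(V)
      C_norm = (p_max - 1/V) / (1 - 1/V)
    Requires a distribution over at least two tokens.
    """
    v = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    h_norm = entropy / math.log(v)
    c_norm = (max(probs) - 1.0 / v) / (1.0 - 1.0 / v)
    return alpha * h_norm + (1.0 - alpha) * (1.0 - c_norm)


# Two distributions with the same top-1 probability but different tails:
# confidence alone cannot tell them apart, while the fused signal can.
peaked_tail = [0.5, 0.5, 0.0, 0.0]
spread_tail = [0.5, 0.2, 0.2, 0.1]
print(uncertainty_signal(peaked_tail), uncertainty_signal(spread_tail))
```

Under these assumptions a uniform distribution maps to the maximum of the signal and a one-hot distribution to the minimum, which is the behaviour one would want from a combined uncertainty score.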

Algorithmic Components

Token-Level Thinking Intervention (TTI)

TTI uses the temporal evolution of the $M_t$ signal to halt the LLM's internal reasoning once marginal information gain falls below a threshold. After a preset prefix, the running average of the change in $M_t$ is computed over a window; when this average falls below a small threshold, intervention is triggered and the reasoning phase is terminated deterministically by forcing the emission of a reasoning terminator token. Because the trigger depends on the uncertainty dynamics rather than a fixed length budget, it avoids truncating essential task-specific reasoning while suppressing the proliferation of redundant discourse tokens, which reduces sampling noise and directs credit toward concise, high-information responses.
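The trigger rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the "change" is taken to be the absolute difference between consecutive signal values, and the function name `tti_trigger_step` and the parameters `prefix_len`, `window`, and `eps` are hypothetical, with arbitrary default values.

```python
def tti_trigger_step(signal, prefix_len=4, window=3, eps=0.01):
    """Return the first token index at which the intervention would fire.

    signal     : per-token uncertainty values M_0, M_1, ...
    prefix_len : tokens that are always allowed before any check (assumed)
    window     : number of consecutive changes averaged (assumed)
    eps        : threshold on the mean absolute change (assumed)
    Returns None if the signal never flattens out.
    """
    start = max(prefix_len, window)  # need `window` earlier differences
    for t in range(start, len(signal)):
        deltas = [abs(signal[i] - signal[i - 1])
                  for i in range(t - window + 1, t + 1)]
        if sum(deltas) / window < eps:
            return t  # a decoder would force the reasoning terminator here
    return None


# Uncertainty drops early, then plateaus: the rule fires on the plateau.
trace = [0.9, 0.7, 0.5, 0.4, 0.35, 0.35, 0.35, 0.35, 0.35]
print(tti_trigger_step(trace))
```

In a real decoding loop the check would run incrementally as each token is produced; the batch form here only makes the rule easy to inspect.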

Turn-Level Dynamical Sampling (TDS)

TDS applies adaptive resampling at the level of the interaction turn. For each turn, the aggregated $M_t$ values form an observation signal; the absolute change in this signal relative to the previous turn reflects the shift in the agent's underlying beliefs. When this change is below a threshold, a regeneration event is triggered and the entire turn is resampled. This prevents the agent from remaining stuck in semantically similar, non-progressive turn-level behaviors and maintains meaningful diversity across interaction steps.
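A minimal sketch of the turn-level rule is given below. Several details are assumptions because the summary does not specify them: token signals are aggregated by their mean, resampling is capped at `max_resamples` attempts, and `generate_turn` is a hypothetical callable standing in for the policy rollout that returns an action together with its per-token uncertainty values.

```python
def turn_signal(token_signals):
    """Aggregate per-token uncertainty into one value per turn (mean, assumed)."""
    return sum(token_signals) / len(token_signals)


def should_resample(prev_signal, curr_signal, threshold=0.02):
    """True when the turn-to-turn change in the signal is negligible."""
    return abs(curr_signal - prev_signal) < threshold


def sample_turn(generate_turn, prev_signal, threshold=0.02, max_resamples=2):
    """Draw a turn, regenerating it while it shows negligible progress.

    generate_turn : hypothetical callable returning (action, token_signals)
    Returns (action, turn_level_signal, number_of_resamples).
    """
    action, tokens = generate_turn()
    signal = turn_signal(tokens)
    resamples = 0
    while resamples < max_resamples and should_resample(prev_signal, signal, threshold):
        action, tokens = generate_turn()  # regenerate the entire turn
        signal = turn_signal(tokens)
        resamples += 1
    return action, signal, resamples


# Toy rollout: two near-identical turns followed by one that shifts the signal.
candidates = iter([
    ("look", [0.50, 0.50]),
    ("look", [0.51, 0.49]),
    ("open drawer", [0.20, 0.30]),
])
print(sample_turn(lambda: next(candidates), prev_signal=0.5))
```

The cap on resampling is a design choice of this sketch; it bounds rollout cost when every candidate turn looks similar to the previous one.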

Integration and Policy Update Scheme

T$^2$PO integrates these two intervention mechanisms within a scalable group-based policy optimization framework akin to GiGPO, adding explicit uncertainty-based reasoning control. Advantage estimation fuses global (trajectory-scale) and local (turn-relative) return signals, which regularizes RL updates and improves both sample efficiency and credit assignment fidelity. In addition, a compact memory context window is maintained to constrain horizon length during optimization, and a strict format penalty is imposed to enforce structured output; both further stabilize training.
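The summary does not give the exact fusion rule, so the following is only one plausible reading: each signal is normalized within the rollout group and the two are combined as a weighted sum. The function names, the weight `omega`, the per-turn-index grouping, and the requirement that all trajectories in a group have the same number of turns are all simplifying assumptions of this sketch.

```python
import statistics


def group_normalize(values, eps=1e-8):
    """Subtract the group mean and divide by the group standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def fused_advantages(traj_returns, turn_returns, omega=1.0):
    """Combine trajectory-level and turn-level group advantages.

    traj_returns : one scalar return per trajectory in the group
    turn_returns : turn_returns[i][k] is the return of turn k in trajectory i
                   (equal lengths assumed for simplicity)
    omega        : weight on the turn-level term (hypothetical)
    Returns adv[i][k], the advantage assigned to turn k of trajectory i.
    """
    global_adv = group_normalize(traj_returns)
    n_turns = len(turn_returns[0])
    # Normalize each turn index across the group to get a turn-relative term.
    local_adv = [group_normalize([tr[k] for tr in turn_returns])
                 for k in range(n_turns)]
    return [[global_adv[i] + omega * local_adv[k][i] for k in range(n_turns)]
            for i in range(len(traj_returns))]


# Two rollouts of the same task; only the first succeeds, on its second turn.
print(fused_advantages([1.0, 0.0], [[0.0, 1.0], [0.0, 0.0]], omega=0.5))
```

In this toy case the first turn receives only the trajectory-level advantage, since both rollouts behaved identically there, while the second turn receives an additional turn-level contribution.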

Empirical Results

T$^2$PO achieves consistently superior results across diverse, high-complexity benchmarks, including WebShop, ALFWorld, and search-augmented QA datasets. Specifically:

WebShop: 81.6–82.4% success rates on Qwen3-4B/8B backbones, with token consumption and required interaction turns reduced by approximately 20–25% versus SOTA.
ALFWorld: Gains of 8–12 success-rate points over the strongest previous RL baselines; T$^2$PO demonstrates robustness to hyperparameter misspecification and initialization seed variation, avoiding training collapse entirely.
Multi-hop Search QA: T$^2$PO achieves pronounced improvements, notably doubling accuracy on datasets such as MuSiQue.

Ablation studies confirm that both TTI and TDS are critical; removing either produces marked drops in success rate, lower sample efficiency, and a re-emergence of repeated failures and overlong trajectories. Comparison with alternative thinking control approaches (hard budgets, void turn filtering, length-based reward shaping, SFT cold starts) shows that T$^2$PO's adaptive, intrinsic-signal-driven interventions substantially outperform these static or heuristic constraints.

Practical and Theoretical Implications

This work formalizes the link between exploration inefficiency and training collapse in multi-turn LLM RL, and establishes that structured, uncertainty-guided control at both the micro (token) and meso (turn) level is necessary for stable, scalable RL in reasoning-driven language agent systems. Practically, T$^2$PO's mechanisms are plug-and-play: they can be incorporated into any group-based policy optimization pipeline and generalized to other architectures, inference engines, and asynchronous scaling frameworks. The approach is orthogonal to other stabilization methods such as RFT and off-policy correction, and composes with them for further gains.

Future Directions

The proposed uncertainty signal design, which combines entropy and confidence, opens avenues for further investigation into the geometry of model predictive distributions during RL, potential meta-learning extensions for dynamic adjustment of thresholds, and broader applications to multi-modal agentic RL. Extending T$^2$PO to hierarchical, structured task decompositions and integrating it tightly with external symbolic or planning systems are promising directions for scaling autonomous decision-making.

Conclusion

T$^2$PO provides an effective, theoretically justified, and empirically validated solution to instability and inefficiency in multi-turn agentic RL. By explicitly regulating exploration based on self-calibrated uncertainty at both the token and turn levels, T$^2$PO suppresses low-information actions, mitigates training collapse, and delivers state-of-the-art stable performance across multi-turn, long-horizon reasoning environments. This advance paves the way for reliably training scalable, interpretable, and sample-efficient LLM-based agents for real-world multi-step decision-making tasks.