- The paper introduces ARPO, which leverages entropy spikes after tool usage to trigger adaptive partial rollouts that enhance multi-turn reasoning.
- It employs a group-relative policy optimization method to assign shared and individual advantages, leading to stable and improved reward outcomes.
- Empirical evaluations on 13 benchmarks demonstrate ARPO’s 4–6% accuracy improvements and reduced tool-call budgets compared to traditional trajectory-level RL methods.
Agentic Reinforced Policy Optimization: Entropy-Guided RL for Multi-Turn LLM Agents
The paper introduces Agentic Reinforced Policy Optimization (ARPO), a reinforcement learning (RL) algorithm designed to address the limitations of trajectory-level RL in training LLM agents for multi-turn, tool-augmented reasoning tasks. Existing RL approaches for LLMs, such as GRPO and DAPO, typically operate at the trajectory level, providing reward signals only at the end of complete tool-use trajectories. This paradigm is suboptimal for agentic settings where LLMs interact with external tools (e.g., search engines, code interpreters, browsers) in multi-step, dynamic environments. The authors identify a key empirical observation: LLMs exhibit a sharp increase in token entropy immediately after tool usage, indicating heightened uncertainty and under-explored behavioral modes at these steps.
Figure 1: High token entropy is observed in LLMs following tool usage (left), and ARPO achieves superior performance on deep search tasks with only 1k RL samples and reduced tool-use budgets (right).
ARPO Algorithmic Framework
ARPO is constructed around two core innovations: entropy-based adaptive rollout and advantage attribution estimation.
Entropy-Based Adaptive Rollout
Unlike standard trajectory-level sampling, ARPO dynamically interleaves global trajectory sampling with step-level partial rollouts, triggered by entropy spikes after tool calls. The mechanism operates as follows:
- Initialization: For each input, N global trajectories are sampled, and the initial entropy of the first k tokens is computed.
- Entropy Monitoring: After each tool call, the model generates k tokens and computes the step-level entropy. The normalized entropy change ΔH_t is used as the branching criterion.
- Adaptive Beaming: If P_t = α + β · ΔH_t exceeds a threshold τ, Z additional partial rollouts are branched from the current state, focusing exploration on high-uncertainty regions.
- Termination: The process continues until the partial sampling budget is exhausted or all paths terminate.
This approach expands the behavioral search space at critical tool-use junctures while maintaining computational efficiency, reducing rollout complexity from O(n²) to between O(n log n) and O(n²), depending on how often branching is triggered; the branching rule is sketched in code below.
Figure 2: Schematic overview of the ARPO algorithm, highlighting the integration of global and entropy-triggered partial rollouts.
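To make the branching rule concrete, here is a minimal Python sketch of the entropy computation and branching decision described above. The function names (`token_entropy`, `should_branch`), the specific normalization of ΔH_t, and the default values of α, β, and τ are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy over a [k, vocab_size] block of logits."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1).mean()

def should_branch(step_logits: torch.Tensor,
                  initial_entropy: float,
                  alpha: float = 0.2,
                  beta: float = 0.4,
                  tau: float = 0.5) -> tuple[bool, float]:
    """Entropy-triggered branching decision after a tool call.

    step_logits:     logits of the first k tokens generated after the tool response.
    initial_entropy: entropy of the first k tokens of the global trajectory.
    The relative normalization of the entropy change below is an assumption;
    the paper only states that a normalized change ΔH_t is used.
    """
    h_step = token_entropy(step_logits)
    delta_h = (h_step - initial_entropy) / (initial_entropy + 1e-8)  # ΔH_t
    p_t = alpha + beta * delta_h                                     # P_t = α + β · ΔH_t
    return bool(p_t.item() > tau), float(delta_h.item())
```

When the rule fires, the sampler would fork Z partial rollouts from the current prefix and charge them against the remaining partial sampling budget, consistent with the termination condition above.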
Advantage Attribution Estimation
ARPO's partial rollouts produce trajectories with both shared and unique token segments. The algorithm employs a group-relative policy optimization (GRPO) objective, assigning:
- Shared tokens: The average advantage across all trajectories sharing the prefix.
- Individual tokens: Distinct advantages based on the normalized reward of each branch.
Empirical results show that this soft advantage estimation (GRPO-style group normalization) yields higher and more stable training rewards than hard advantage assignment; a minimal sketch of the attribution scheme follows below.
Figure 3: Left: the entropy-based adaptive beaming principle. Right: ARPO's advantage assignment, which differentiates shared and individual token segments in inter-group samples.
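The sketch below illustrates the attribution scheme under simplifying assumptions: a single shared prefix per branch group, scalar terminal rewards, and z-score normalization over the full rollout group as in GRPO. The helper names (`grpo_advantages`, `assign_token_advantages`) are hypothetical, not the authors' code.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: z-score normalization within the rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def assign_token_advantages(group_rewards, branch_ids, prefix_len, total_lens):
    """Soft advantage attribution for branched trajectories sharing a prefix.

    group_rewards: rewards of all rollouts sampled for the prompt (global + partial).
    branch_ids:    indices into group_rewards of the trajectories sharing the prefix.
    prefix_len:    number of tokens in the shared prefix.
    total_lens:    total token count of each branched trajectory (same order as branch_ids).
    Returns one per-token advantage vector per branched trajectory.
    """
    adv = grpo_advantages(np.asarray(group_rewards, dtype=np.float32))
    shared_adv = adv[branch_ids].mean()  # shared tokens: mean advantage over the branch set
    per_token = []
    for i, t_len in zip(branch_ids, total_lens):
        tokens = np.full(t_len, adv[i], dtype=np.float32)  # individual segment: own advantage
        tokens[:prefix_len] = shared_adv                   # shared segment: averaged advantage
        per_token.append(tokens)
    return per_token
```

In this view, the shared prefix is credited with how well its descendants perform on average, while each branch's unique continuation keeps its own group-normalized advantage.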
Theoretical Foundation
The authors generalize the policy gradient theorem to macro-action segments, showing that Transformer-based policies can be optimized over arbitrary macro-action splits (e.g., tool-use boundaries). This Generalized Policy Gradient (GPG) theorem provides a formal basis for ARPO's partial rollout and update strategy, subsuming standard token-level policy gradients as a special case.
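Schematically, and with notation introduced here for illustration (the paper's exact statement may differ in detail), the result can be written for any partition of a trajectory τ into M macro-action segments a_1, ..., a_M:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{i=1}^{M} \nabla_\theta \log \pi_\theta\big(a_i \mid s_i\big)\, A\big(s_i, a_i\big)
    \right],
\qquad
\pi_\theta\big(a_i \mid s_i\big)
  = \prod_{j=1}^{|a_i|} \pi_\theta\big(a_{i,j} \mid s_i, a_{i,<j}\big),
```

where s_i is the context preceding segment a_i (including any tool responses) and A is an advantage estimator. Choosing every segment to be a single token recovers the standard token-level policy gradient, while splitting at tool-use boundaries matches ARPO's partial-rollout updates.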
Empirical Evaluation
ARPO is evaluated on 13 benchmarks spanning mathematical reasoning, knowledge-intensive QA, and deep search tasks. The experimental protocol includes:
- Cold-start SFT: Supervised fine-tuning on open-source datasets.
- RL phase: ARPO and baselines (GRPO, DAPO, REINFORCE++) are trained with identical budgets and tool-use constraints.
Key findings:
- ARPO consistently outperforms trajectory-level RL algorithms across all domains, with average accuracy improvements of 4–6% and robust gains on both Qwen and Llama backbones.
- Tool-use efficiency: ARPO achieves comparable or superior accuracy with only half the tool-call budget of trajectory-level methods, a critical consideration for real-world deployment.
Figure 4: ARPO-aligned Qwen3-8B and Qwen3-14B models show consistent improvements from Pass@1 to Pass@5, indicating enhanced sampling diversity and tool-use exploration.
- Scaling analysis: Performance peaks at moderate entropy thresholds and balanced global/partial sampling ratios, confirming the necessity of tuning these hyperparameters for optimal exploration-exploitation trade-offs (an illustrative grouping of the rollout hyperparameters appears after this list).
Figure 5: Scaling analysis of ARPO hyperparameters (entropy threshold, initial sampling size, global rollout size) on Qwen2.5-7B, demonstrating the algorithm's scalability and sensitivity.
- Ablation on browser agents: Stronger browser agents yield higher accuracy in deep search, underscoring the importance of external tool capability in agentic RL.
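For orientation, the rollout hyperparameters examined above can be grouped as in the following sketch; the field names and default values are placeholders, not the paper's tuned settings.

```python
from dataclasses import dataclass

@dataclass
class ARPORolloutConfig:
    """Illustrative grouping of ARPO's rollout hyperparameters (placeholder values)."""
    global_rollouts: int = 8    # N: global trajectories sampled per prompt
    entropy_window: int = 32    # k: tokens used for initial/step entropy estimates
    branch_factor: int = 2      # Z: partial rollouts branched at a high-entropy step
    alpha: float = 0.2          # base term in P_t = alpha + beta * ΔH_t
    beta: float = 0.4           # entropy sensitivity
    tau: float = 0.5            # branching threshold on P_t
    partial_budget: int = 16    # total partial rollouts allowed per prompt
```

Tuning τ and the global/partial split directly controls the exploration-exploitation trade-off noted in the scaling analysis.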
Qualitative Case Studies
Case studies on mathematical, knowledge, and deep search tasks demonstrate ARPO's ability to:
- Integrate multi-tool reasoning (e.g., search + code execution).
- Efficiently branch and explore alternative reasoning paths at high-uncertainty steps.
- Correctly attribute credit to critical tool-use decisions, leading to more reliable and interpretable agent behavior.
Implications and Future Directions
ARPO advances the state of RL for LLM-based agents by explicitly modeling and exploiting the uncertainty introduced by tool interactions. The entropy-guided partial rollout mechanism enables more sample-efficient and fine-grained alignment of agentic behaviors, particularly in long-horizon, multi-turn environments. The theoretical generalization to macro-action policy gradients opens avenues for further research on hierarchical RL and structured credit assignment in LLMs.
Practically, ARPO's tool-use efficiency and scalability make it suitable for real-world deployment in domains where tool calls are expensive or rate-limited (e.g., web search, code execution, API-based environments). The approach is compatible with a wide range of LLM backbones and can be integrated with advanced tool orchestration frameworks.
Future work may explore:
- Automated entropy threshold tuning and adaptive budget allocation.
- Extension to multimodal and embodied agent settings.
- Integration with preference-based RL and human-in-the-loop feedback for richer alignment objectives.
- Theoretical analysis of exploration-exploitation dynamics in high-entropy branching regimes.
Conclusion
Agentic Reinforced Policy Optimization (ARPO) provides a principled and empirically validated framework for training multi-turn, tool-augmented LLM agents. By leveraging entropy-based adaptive rollouts and structured advantage attribution, ARPO achieves superior performance and tool-use efficiency compared to trajectory-level RL baselines. The algorithm's scalability, theoretical grounding, and practical impact position it as a strong candidate for the next generation of agentic LLM training paradigms.