Tool Policy Optimization (ToolPO)

Updated 28 October 2025
  • Tool Policy Optimization (ToolPO) is a framework for optimizing agent decision-making in multi-turn environments by leveraging reinforcement learning for tool invocations.
  • It employs a two-level reward mechanism and group-relative advantage attribution to ensure precise tool-use behavior and stabilize policy updates.
  • The approach integrates simulated API interactions and autonomous memory folding to generalize across labeled-tool and open-set tool scenarios, achieving higher success rates than baseline agents on standard benchmarks.

Tool Policy Optimization (ToolPO) addresses the problem of learning optimal decision-making policies in environments where agents interact with external tools or APIs during multi-turn, possibly long-horizon reasoning and acting. Recent developments have focused on integrating reinforcement learning frameworks with fine-grained credit attribution, reward schemes tailored for tool-invoking actions, and stabilization mechanisms that enable generalization across both labeled-tool and open-set tool scenarios (Li et al., 24 Oct 2025). Below, the key architectural and algorithmic components of ToolPO as used in DeepAgent are detailed along with contextual insights.

1. Reward Structuring and Attribution

ToolPO utilizes a two-level reward mechanism to align agent policy updates with both global task accomplishment and precise tool-use behavior. For each trajectory $\tau$:

  • Global Success Reward: $R_{\mathrm{succ}}(\tau)$ evaluates overall task completion (e.g., a correct final answer).
  • Action-Level Reward: $R_{\mathrm{action}}(\tau) = \lambda_1 \sum_{t=1}^{T} C(a^{\mathrm{call}}_t) + \lambda_2 S_{\mathrm{pref}}(\tau)$, where $C(a^{\mathrm{call}}_t) = 1$ iff the tool call at time $t$ is correct; $S_{\mathrm{pref}}(\tau)$ measures efficiencies such as optimal memory folding; and $\lambda_1$, $\lambda_2$ are weighting hyperparameters.

This reward design provides learning signal not only for the final task outcome but also for intermediate tool invocations and agentic memory behaviors.
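
To make the two-level reward concrete, the following minimal Python sketch computes $R_{\mathrm{action}}$ for one trajectory. It is an illustration under assumed data structures: the `ToolCall` record, the way correctness is judged, and the default weights are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ToolCall:
    """Illustrative record of one tool invocation in a trajectory (hypothetical)."""
    name: str
    arguments: dict
    correct: bool  # assumed to be judged against a reference call or simulated API

def action_level_reward(tool_calls: List[ToolCall],
                        preference_score: float,
                        lambda1: float = 1.0,
                        lambda2: float = 0.5) -> float:
    """R_action = lambda1 * sum_t C(a_t^call) + lambda2 * S_pref(tau)."""
    correctness_sum = sum(1.0 for call in tool_calls if call.correct)
    return lambda1 * correctness_sum + lambda2 * preference_score

# Example: two correct calls out of three, plus a modest preference score.
calls = [ToolCall("search_movie", {"query": "Dune"}, True),
         ToolCall("get_cast", {"movie_id": 438631}, True),
         ToolCall("get_cast", {"movie_id": -1}, False)]
print(action_level_reward(calls, preference_score=0.8))  # 1.0 * 2 + 0.5 * 0.8 = 2.4
```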

2. Group-Relative Advantage Attribution

To prevent variance escalation and error accumulation typical of sparse-reward long-horizon sequences, ToolPO adopts group-relative advantage computation. For a batch of $K$ sampled trajectories $\{\tau_k\}_{k=1}^K$:

$$A_{\mathrm{succ}}(\tau_k) = R_{\mathrm{succ}}(\tau_k) - \frac{1}{K} \sum_{j=1}^K R_{\mathrm{succ}}(\tau_j)$$

$$A_{\mathrm{action}}(\tau_k) = R_{\mathrm{action}}(\tau_k) - \frac{1}{K} \sum_{j=1}^K R_{\mathrm{action}}(\tau_j)$$

A binary mask $M(y_i)$ is used at the token level to isolate credit assignment for tool-call and memory-fold operations:

$$A(y_i) = A_{\mathrm{succ}}(\tau_k) + M(y_i) \cdot A_{\mathrm{action}}(\tau_k)$$

This fine-grained mechanism propagates the success-level advantage across the whole trajectory while restricting the action-level advantage to tokens responsible for tool interaction or memory folding, sharpening policy-update fidelity (Li et al., 24 Oct 2025).
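
As a minimal sketch, the group-relative advantages and the masked token-level combination can be computed as follows (the reward values and the mask are dummy inputs for illustration):

```python
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Subtract the group mean so each trajectory is scored against its peers."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def token_advantages(a_succ: float, a_action: float,
                     tool_or_fold_mask: List[bool]) -> List[float]:
    """A(y_i) = A_succ(tau_k) + M(y_i) * A_action(tau_k) for each token of one trajectory."""
    return [a_succ + (a_action if masked else 0.0) for masked in tool_or_fold_mask]

# Example with K = 3 sampled trajectories for one prompt.
A_succ = group_relative_advantages([1.0, 0.0, 1.0])    # success rewards
A_action = group_relative_advantages([2.4, 1.0, 2.0])  # action-level rewards
mask = [False, True, True, False]  # tokens of trajectory 0 that emit a tool call or fold
print(token_advantages(A_succ[0], A_action[0], mask))
```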

3. Policy Optimization Objective

ToolPO applies a clipped surrogate objective from the PPO family for policy updates:

$$\mathcal{L}_{\mathrm{ToolPO}}(\theta) = \mathbb{E}_\tau \left[ \sum_{i=1}^{|\tau|} \min\big(\rho_i(\theta)\, A(y_i),\ \operatorname{clip}(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A(y_i)\big) \right]$$

where

$$\rho_i(\theta) = \frac{\pi_\theta(y_i \mid y_{<i}, s)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid y_{<i}, s)}$$

This clipped update maintains policy stability while still increasing the probability of tool-use and memory-compression decisions that receive positive outcome and process rewards.
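
For illustration, a short PyTorch-style sketch of the clipped, token-level surrogate follows; log-probabilities and mask-combined advantages are assumed to be precomputed, and the leading minus sign turns the maximization objective into a loss for gradient descent:

```python
import torch

def toolpo_surrogate_loss(logp_new: torch.Tensor,    # (T,) log pi_theta(y_i | y_<i, s)
                          logp_old: torch.Tensor,    # (T,) log pi_theta_old(...), no grad
                          advantages: torch.Tensor,  # (T,) A(y_i), already mask-combined
                          eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate summed over the tokens of one trajectory."""
    ratio = torch.exp(logp_new - logp_old)                        # rho_i(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.sum(torch.min(unclipped, clipped))

# Dummy usage for a 4-token trajectory (values are arbitrary).
logp_new = torch.tensor([-1.2, -0.8, -2.0, -0.5], requires_grad=True)
logp_old = torch.tensor([-1.0, -0.9, -2.1, -0.6])
loss = toolpo_surrogate_loss(logp_new, logp_old,
                             torch.tensor([0.33, 0.93, 0.93, 0.33]))
loss.backward()
```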

4. LLM-Simulated API Interactions

A distinctive feature of ToolPO is training in an environment with LLM-generated API simulators. The agent interacts with “simulated APIs” that mirror real-world tool responses (such as those from RapidAPI), enabling rapid, reproducible, and consistent reward delivery over diverse toolsets. This mitigates the inefficiency and instability typically encountered during live API-based RL, facilitating scalable training across thousands of APIs (Li et al., 24 Oct 2025).
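
One way to picture this setup is a drop-in simulator that answers tool calls with LLM-drafted responses instead of live HTTP requests. The sketch below is a hypothetical interface, not the paper's implementation; `llm_simulator` stands for any prompt-to-text callable:

```python
class SimulatedToolEnvironment:
    """Hypothetical stand-in for live APIs: an LLM drafts plausible tool responses."""

    def __init__(self, llm_simulator, tool_schemas: dict):
        self.llm = llm_simulator          # any callable mapping a prompt string to text
        self.tool_schemas = tool_schemas  # tool name -> parameter schema / documentation

    def call(self, tool_name: str, arguments: dict) -> str:
        """Return a simulated JSON response for one tool invocation."""
        schema = self.tool_schemas.get(tool_name)
        if schema is None:
            return f'{{"error": "unknown tool {tool_name}"}}'
        prompt = (
            f"You are simulating the API '{tool_name}' with schema {schema}. "
            f"Return a realistic JSON response for arguments {arguments}."
        )
        return self.llm(prompt)

# Because the call signature mirrors a real tool layer, a policy trained against
# the simulator can later be pointed at live APIs without code changes.
```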

5. Long-Horizon Reasoning and Memory Folding

ToolPO supports autonomous “memory folding” to compress lengthy interaction history into episodic, working, and tool memories. The agent decides when to invoke a folding action, which is rewarded when it shortens the trajectory and reduces error propagation. Combined with token-level credit assignment focused on tool-call and memory-fold tokens, ToolPO ensures granular learning-signal propagation across long-horizon, multi-tool trajectories.
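
As an illustrative sketch only, the folded state can be represented with the three memory types named above; the concrete fields and the token-budget trigger are assumptions, since in DeepAgent the agent itself decides when to fold:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class FoldedMemory:
    """Compressed view of the interaction history after a memory-fold action."""
    episodic: List[str] = field(default_factory=list)  # key events and subtask outcomes
    working: str = ""                                   # current goal, constraints, plan
    tools: List[str] = field(default_factory=list)      # tools tried and their results

def maybe_fold(history_tokens: int,
               memory: FoldedMemory,
               summarize: Callable[[], Tuple[List[str], str, List[str]]],
               budget: int = 4000) -> bool:
    """Hypothetical heuristic trigger: fold once the raw history exceeds a token budget."""
    if history_tokens <= budget:
        return False
    memory.episodic, memory.working, memory.tools = summarize()
    return True
```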

6. Experimental Validation

DeepAgent with ToolPO demonstrates measurable improvements across standard tool-use and downstream reasoning benchmarks:

| Benchmark | Baseline Success Rate | DeepAgent-ToolPO Success Rate |
|-----------|-----------------------|-------------------------------|
| TMDB      | ~55%                  | ~89%                          |
| Spotify   | ~52.6%                | ~75.4%                        |

Further, gains extend to open-set tool retrieval and labeled-tool scenarios (ToolBench, API-Bank, ToolHop), as well as downstream applications (ALFWorld, WebShop, GAIA, HLE), with ablation studies confirming the necessity of each architectural piece (memory folding, LLM-simulated APIs, and advantage attribution).

7. Training Loop via Pseudocode

A concise pseudocode representation encapsulates the core ToolPO loop:

for each training iteration:
    for each prompt:
        sample K trajectories {tau_1, ..., tau_K} using policy pi_theta
        compute R_succ and R_action for each trajectory
        compute group-relative advantages A_succ and A_action
        for each token y_i in each trajectory tau_k:
            if y_i belongs to a tool-call or memory-fold span:   # M(y_i) = 1
                A(y_i) = A_succ(tau_k) + A_action(tau_k)
            else:                                                # M(y_i) = 0
                A(y_i) = A_succ(tau_k)
        update theta by gradient ascent on:
            L_ToolPO(theta) = sum_i min(rho_i(theta) * A(y_i),
                                        clip(rho_i(theta), 1 - eps, 1 + eps) * A(y_i))
All components above strictly reflect details in (Li et al., 24 Oct 2025).

8. Implications and Significance

ToolPO demonstrates that integrating outcome-driven RL objectives directly with token-level, process-sensitive reward signals enables robust, generalizable, and scalable tool use in agentic systems. Autonomous memory management further reduces context explosion and error propagation, and simulated tool APIs provide a tractable training environment that supports extensible toolset generalization. This formulation advances the state of the art for practical multi-turn tool-integrated policy optimization, with empirical effectiveness validated on a suite of industry-standard benchmarks.
