Tool Policy Optimization (ToolPO)
- Tool Policy Optimization (ToolPO) is a framework for optimizing agent decision-making in multi-turn environments by leveraging reinforcement learning for tool invocations.
- It employs a two-level reward mechanism and group-relative advantage attribution to ensure precise tool-use behavior and stabilize policy updates.
- The approach integrates LLM-simulated API interactions and autonomous memory folding to generalize across labeled-tool and open-set tool scenarios, yielding higher success rates on standard tool-use benchmarks.
Tool Policy Optimization (ToolPO) formulates the problem of learning optimal decision-making policies in environments where agents interact with external tools or APIs during multi-turn, possibly long-horizon reasoning and acting. Recent developments have focused on integrating reinforcement learning frameworks with fine-grained credit attribution, reward schemes tailored for tool-invoking actions, and stabilization mechanisms that enable generalization across both labeled-tool and open-set tool scenarios (Li et al., 24 Oct 2025). Below, key architectural and algorithmic components of ToolPO as used in DeepAgent are detailed along with contextual insights.
1. Reward Structuring and Attribution
ToolPO utilizes a two-level reward mechanism to align agent policy updates with both global task accomplishment and precise tool-use behavior. For each sampled trajectory $\tau$:
- Global Success Reward: $R_{\text{succ}} \in \{0, 1\}$ evaluates overall task completion (e.g., correct final answer).
- Action-Level Reward: $R_{\text{action}} = \lambda_1 \sum_t r_t^{\text{tool}} + \lambda_2\, r^{\text{fold}}$, where $r_t^{\text{tool}} = 1$ iff the tool call at time $t$ is correct; $r^{\text{fold}}$ measures efficiencies such as optimal memory folding; and $\lambda_1$, $\lambda_2$ are weighting hyperparameters.
This reward design provides signal not just for the ultimate task outcome but specifically for intermediate tool invocation and agentic memory behaviors.
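To make the two-level structure concrete, the sketch below computes both reward components for a single trajectory. The helper names (`Step`, `success_reward`, `action_reward`) and the weights `lambda_tool`, `lambda_fold` are illustrative assumptions, not the paper's notation.

```python
# Minimal sketch of the two-level reward (names and weights are assumptions).
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    is_tool_call: bool       # did the agent invoke a tool at this step?
    tool_call_correct: bool  # was the invocation well-formed / correct?
    is_memory_fold: bool     # did the agent fold its memory at this step?

def success_reward(final_answer_correct: bool) -> float:
    """Global reward: 1 if the task was completed correctly, else 0."""
    return 1.0 if final_answer_correct else 0.0

def action_reward(steps: List[Step],
                  lambda_tool: float = 0.5,
                  lambda_fold: float = 0.1) -> float:
    """Intermediate reward for correct tool calls and efficient memory folding."""
    tool_calls = [s for s in steps if s.is_tool_call]
    r_tool = (sum(s.tool_call_correct for s in tool_calls) / len(tool_calls)
              if tool_calls else 0.0)
    r_fold = 1.0 if any(s.is_memory_fold for s in steps) else 0.0
    return lambda_tool * r_tool + lambda_fold * r_fold

# Example: one correct tool call followed by a memory fold.
traj = [Step(True, True, False), Step(False, False, True)]
R_succ = success_reward(final_answer_correct=True)  # -> 1.0
R_action = action_reward(traj)                       # -> 0.6
```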
2. Group-Relative Advantage Attribution
To prevent variance escalation and error accumulation typical of sparse-reward long-horizon sequences, ToolPO adopts group-relative advantage computation. For a batch of $K$ trajectories $\{\tau_1, \ldots, \tau_K\}$ sampled from the same prompt, each reward is normalized within the group:

$$A_k = \frac{R_k - \operatorname{mean}(\{R_j\}_{j=1}^{K})}{\operatorname{std}(\{R_j\}_{j=1}^{K})},$$

computed separately for $R_{\text{succ}}$ and $R_{\text{action}}$ to obtain $A_{\text{succ}}$ and $A_{\text{action}}$.
A binary mask $m_i \in \{0,1\}$ is used at the token level to isolate credit assignment for tool-call and memory-fold operations:

$$A(y_i) = A_{\text{succ}} + m_i \cdot A_{\text{action}},$$

where $m_i = 1$ only for tokens $y_i$ that belong to a tool-call or memory-fold segment.
This fine-grained mechanism attributes the action-level signal only to tokens responsible for tool interaction or memory folding, sharpening policy update fidelity (Li et al., 24 Oct 2025).
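A minimal sketch of the group-relative normalization and token-level masking, assuming standard mean/std normalization over the $K$ trajectories sampled for a prompt; the array shapes and names are illustrative.

```python
import numpy as np

def group_relative_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize trajectory rewards within the sampled group (GRPO-style)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def token_advantages(a_succ: float, a_action: float, mask: np.ndarray) -> np.ndarray:
    """Broadcast advantages to tokens; mask = 1 marks tool-call / memory-fold tokens."""
    return a_succ + mask * a_action

# Example with K = 4 trajectories and a 6-token trajectory.
R_succ = np.array([1.0, 0.0, 1.0, 0.0])
R_action = np.array([0.6, 0.2, 0.5, 0.1])
A_succ = group_relative_advantage(R_succ)
A_action = group_relative_advantage(R_action)

mask = np.array([0, 0, 1, 1, 0, 0])  # tokens 2-3 belong to a tool call
A_tokens = token_advantages(A_succ[0], A_action[0], mask)
```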
3. Policy Optimization Objective
ToolPO applies a clipped surrogate objective, derived from the PPO family, for policy updates:

$$L_{\text{ToolPO}}(\theta) = \sum_i \min\Bigl(\rho_i(\theta)\, A(y_i),\; \operatorname{clip}\bigl(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\, A(y_i)\Bigr),$$

where $\rho_i(\theta) = \pi_\theta(y_i \mid y_{<i}) / \pi_{\theta_{\text{old}}}(y_i \mid y_{<i})$ is the token-level importance ratio and $\epsilon$ is the clipping threshold.
This update rule maintains policy stability while steering exploration toward tool-use and memory-compression decisions that are aligned with both the outcome reward and the process-level action reward.
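The clipped objective can be expressed in a few lines of PyTorch-style code, assuming per-token log-probabilities from the current and behavior policies are already gathered for the generated tokens; the function name and defaults are illustrative.

```python
import torch

def toolpo_loss(logp_new: torch.Tensor,
                logp_old: torch.Tensor,
                advantages: torch.Tensor,
                clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped token-level surrogate: rho_i = pi_theta(y_i) / pi_theta_old(y_i)."""
    rho = torch.exp(logp_new - logp_old)  # importance ratio per token
    unclipped = rho * advantages
    clipped = torch.clamp(rho, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximize the surrogate, i.e., minimize its negative mean over tokens.
    return -torch.min(unclipped, clipped).mean()
```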
4. LLM-Simulated API Interactions
A distinctive feature of ToolPO is training in an environment with LLM-generated API simulators. The agent interacts with “simulated APIs” that mirror real-world tool responses (such as those from RapidAPI), enabling rapid, reproducible, and consistent reward delivery over diverse toolsets. This mitigates the inefficiency and instability typically encountered during live API-based RL, facilitating scalable training across thousands of APIs (Li et al., 24 Oct 2025).
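As a rough illustration of the idea (not DeepAgent's actual simulator), a simulated API can be a thin wrapper that prompts an LLM with the tool's schema and the requested arguments, then parses the reply; `call_llm` and the prompt format below are placeholders.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for the simulator LLM; assumed to return a JSON string."""
    raise NotImplementedError

class SimulatedAPI:
    """Stands in for a live endpoint (e.g., a RapidAPI tool) during RL training."""

    def __init__(self, name: str, schema: dict):
        self.name = name
        self.schema = schema  # tool documentation shown to the simulator

    def __call__(self, **kwargs) -> dict:
        prompt = (
            f"You are simulating the API '{self.name}'.\n"
            f"Schema: {json.dumps(self.schema)}\n"
            f"Request arguments: {json.dumps(kwargs)}\n"
            "Return a plausible JSON response."
        )
        return json.loads(call_llm(prompt))
```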
5. Long-Horizon Reasoning and Memory Folding
ToolPO supports autonomous “memory folding” to compress lengthy interaction history into episodic, working, and tool memories. The agent decides when to invoke a folding action, which is rewarded for improving trajectory length and reducing error propagation. Combined with token-level credit assignment focused on tool-call and memory-fold tokens, ToolPO ensures granular learning signal propagation across long-horizon, multi-tool trajectories.
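The sketch below shows one plausible shape for the folded memory and the decision of when to fold; the field names, threshold, and `summarize` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FoldedMemory:
    episodic: str = ""    # summary of past sub-tasks and their outcomes
    working: str = ""     # current goal and immediate next steps
    tool_memory: List[str] = field(default_factory=list)  # tools tried and results

def summarize(history: List[str]) -> FoldedMemory:
    """Placeholder for LLM-driven compression of the raw interaction history."""
    raise NotImplementedError

def maybe_fold(history: List[str], max_turns: int = 20) -> List[str]:
    """Compress the history into structured memories once it grows too long."""
    if len(history) <= max_turns:
        return history
    folded = summarize(history)
    return ([f"[episodic] {folded.episodic}", f"[working] {folded.working}"]
            + [f"[tool] {t}" for t in folded.tool_memory])
```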
6. Experimental Validation
DeepAgent with ToolPO demonstrates measurable improvements across standard tool-use and downstream reasoning benchmarks:
| Benchmark | Baseline Success Rate | DeepAgent-ToolPO Success Rate |
|---|---|---|
| TMDB | ~55% | ~89% |
| Spotify | ~52.6% | ~75.4% |
Further, gains extend to open-set tool retrieval and labeled-tool scenarios (ToolBench, API-Bank, ToolHop), as well as downstream applications (ALFWorld, WebShop, GAIA, HLE), with ablation studies confirming the necessity of each architectural piece (memory folding, LLM-simulated APIs, and advantage attribution).
7. Training Loop via Pseudocode
A concise pseudocode representation encapsulates the core ToolPO loop:
```
for each training iteration:
    for each prompt:
        sample K trajectories {tau_1, ..., tau_K} using policy pi_theta
        compute R_succ and R_action for each trajectory
        compute group-relative A_succ and A_action
        for each token y_i in each trajectory:
            if y_i is part of a tool call or memory fold:
                A(y_i) = A_succ + A_action
            else:
                A(y_i) = A_succ
        update theta via gradient ascent using:
            L_ToolPO(theta) = sum_i min(rho_i(theta) * A(y_i),
                                        clip(rho_i(theta), 1 - eps, 1 + eps) * A(y_i))
```
8. Implications and Significance
ToolPO demonstrates that integrating outcome-driven RL objectives directly with token-level, process-sensitive reward signals enables robust, generalizable, and scalable tool use in agentic systems. Autonomous memory management further reduces context explosion and error propagation, and simulated tool APIs provide a tractable training environment that supports extensible toolset generalization. This formulation advances the state of the art for practical multi-turn tool-integrated policy optimization, with empirical effectiveness validated on a suite of industry-standard benchmarks.