ToolPO: RL for LLM Tool Use
- ToolPO is an end-to-end reinforcement learning strategy designed for training large language model agents to perform general-purpose tool use with dynamic API toolsets.
- It employs simulated APIs, token-level advantage attribution, and an autonomous memory folding mechanism to address instability and enable efficient long-horizon reasoning.
- Experimental validation shows ToolPO’s superior performance and sample efficiency across diverse tool-use benchmarks within the DeepAgent framework.
ToolPO is an end-to-end reinforcement learning (RL) strategy for training large reasoning agents, introduced as the central optimization algorithm within the DeepAgent framework. The goal of ToolPO is to enable general-purpose tool use in LLMs across arbitrarily large toolsets, with stable, efficient learning over long-horizon, multi-step reasoning tasks where robust tool invocation and strategic memory management are critical (Li et al., 24 Oct 2025).
1. Problem Definition and Motivation
ToolPO addresses two primary challenges in training LLM agents for real-world tool use:
- Training Instability and Cost in Real-World Tool Environments: Direct use of vast numbers of real APIs during RL training introduces expensive, slow, and sometimes unstable interactions. This undermines training stability and limits scale.
- Sparse Reward and Credit Assignment for Tool Invocation: In complex, long-horizon tasks, reward signals from final outcomes alone are insufficient for learning effective intermediate tool-use actions, especially when multiple decisions (tool calls and memory operations) jointly determine success.
The ToolPO framework integrates LLM-simulated APIs—synthetic, high-fidelity models emulating real API responses—and introduces a fine-grained token-level advantage attribution mechanism. This detailed reward decomposition allows the policy to efficiently credit not only overall task success but also the effectiveness of individual API calls and auxiliary actions such as context memory folding.
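A minimal sketch of how these two ingredients combine in a single rollout is shown below; `policy`, `env`, `score_action`, and `score_task` are hypothetical stand-ins for the agent, the LLM-simulated tool environment, and the two reward channels, not interfaces taken from the paper.

```python
def collect_rollout(policy, task, env, score_action, score_task):
    """One ToolPO-style rollout: simulated tool calls plus two reward channels.

    All interfaces here are hypothetical stand-ins used for illustration:
    `env` wraps LLM-simulated APIs, `score_action` grades each tool call or
    memory fold locally, and `score_task` grades the final outcome globally.
    """
    trajectory, action_rewards = [], []
    state, done = env.reset(task), False
    while not done:
        action = policy.generate(state)            # tool call, memory fold, or final answer
        observation, done = env.step(action)       # response from an LLM-simulated API
        action_rewards.append(score_action(action, observation))   # local, action-level reward
        trajectory.append((state, action, observation))
        state = env.update(state, action, observation)
    task_reward = score_task(trajectory, task)     # global, outcome-level reward
    return trajectory, task_reward, action_rewards
```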
2. LLM-Simulated APIs and Stable RL Environment
ToolPO employs an LLM-based tool simulator in place of live, external APIs during RL training. This approach offers several significant advantages:
- Stability: Simulated endpoints provide low-variance, deterministic responses, eliminating the noise and unpredictable latency of real services.
- Efficiency: Batched, in-memory API emulation dramatically accelerates training throughput, favoring large-scale sampling and group-based RL updates.
- Safety: The environment is sandboxed, preventing the policy from issuing dangerous or costly real-world requests.
This approach allows DeepAgent to focus the learning signal on sequence-level decision making and tool invocation strategies without being confounded by extrinsic service failures.
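A minimal sketch of such a simulator, assuming an OpenAI-compatible chat client; the class name, prompt wording, and model choice are illustrative rather than details from the paper:

```python
import json
from openai import OpenAI  # any OpenAI-compatible client; illustrative choice


class SimulatedToolEnv:
    """Emulates API endpoints with an LLM instead of calling live services."""

    def __init__(self, tool_schemas, model="gpt-4o-mini"):
        self.client = OpenAI()
        self.tool_schemas = tool_schemas  # tool name -> JSON schema / documentation
        self.model = model

    def call(self, tool_name, arguments):
        """Return a plausible, schema-consistent response for one tool call."""
        prompt = (
            f"You are simulating the API `{tool_name}` with schema:\n"
            f"{json.dumps(self.tool_schemas[tool_name])}\n"
            f"Arguments: {json.dumps(arguments)}\n"
            "Reply with a realistic JSON response only."
        )
        reply = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # low-variance, near-deterministic responses
        )
        return reply.choices[0].message.content
```

Because the simulator is just another LLM call, batches of tool responses can be generated in-process, which is what makes large-scale group sampling practical during RL training.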
3. Tool-Call Advantage Attribution Mechanism
A central innovation in ToolPO is decomposing the RL reward into global (task) and local (tool/action) advantages, enabling token-level credit assignment:
- Global Task Success Reward ($r^{\text{task}}$): Assessed at the end of each trajectory, capturing the overall correctness of the agent's final answer or goal completion.
- Local Action-Level Reward ($r^{\text{tool}}$): Assessed at each point where a tool call or memory fold (compression) action has been made, scoring the action for correctness, utility, and, optionally, efficiency.
The group-relative advantage is computed per reward channel over a group of $G$ trajectories $\{\tau_1, \dots, \tau_G\}$ sampled from the policy:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})},$$

applied to both the task-level rewards (yielding $\hat{A}_i^{\text{task}}$) and the action-level rewards (yielding $\hat{A}_i^{\text{tool}}$).

Per-token advantages are computed as

$$\hat{A}_{i,t} = \hat{A}_i^{\text{task}} + m_{i,t}\,\hat{A}_i^{\text{tool}},$$

where $m_{i,t} \in \{0,1\}$ is a mask that selects only those tokens corresponding to tool-call or memory-fold decisions.

The surrogate policy gradient update adopts a clipped objective (in the spirit of PPO):

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|}\min\Big(\rho_{i,t}\,\hat{A}_{i,t},\ \operatorname{clip}\big(\rho_{i,t},\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\Big)\right],$$

with $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$ as the token-level policy ratio.
This targeted attribution ensures dense, discriminative feedback, directly rewarding tokens that produce the desired tool invocations and penalizing inefficient or incorrect actions.
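A minimal PyTorch sketch of this attribution scheme, following the formulas above; the function name, the group-normalization details, and the clipping value $\epsilon = 0.2$ are illustrative assumptions (padding-token masking is omitted for brevity):

```python
import torch


def toolpo_loss(logp_new, logp_old, task_rewards, tool_rewards, action_mask, eps=0.2):
    """Clipped surrogate loss with token-level tool-call advantage attribution.

    logp_new, logp_old: [G, T] token log-probs under current / rollout policy
    task_rewards, tool_rewards: [G] trajectory-level reward channels
    action_mask: [G, T] equals 1 on tokens that emit tool-call / memory-fold decisions
    """
    def group_norm(r):
        # Group-relative advantage: (r - mean) / std over the G sampled trajectories.
        return (r - r.mean()) / (r.std() + 1e-8)

    adv_task = group_norm(task_rewards).unsqueeze(1)   # [G, 1], broadcast over tokens
    adv_tool = group_norm(tool_rewards).unsqueeze(1)   # [G, 1]
    adv = adv_task + action_mask * adv_tool            # [G, T] per-token advantage

    ratio = torch.exp(logp_new - logp_old)             # token-level policy ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()       # negate to maximize the surrogate
```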
4. Autonomous Memory Folding for Long-Horizon Reasoning
To address the “context explosion” induced by prolonged tool-use dialogue and multi-step plans, DeepAgent implements an autonomous memory folding mechanism:
- Folding Trigger: At logical boundaries in the task, such as after a sub-goal is completed or a stretch of unproductive exploration is abandoned, the agent issues a special "fold" action.
- Compression via LLM Summarization: An auxiliary LLM condenses the full prior transcript into structured "folded" memories:
- Episodic memory: Summarizes the global task trajectory and key decision points.
- Working memory: Encodes short-term, immediate context, such as the current focus or actionable next steps.
- Tool memory: Catalogs tool calls, arguments, and observed responses.
- Structured Output: The memory is serialized in JSON for easy parsing and reuse.
This mechanism reduces sequence length, prevents error accumulation, and enables efficient, robust context management over arbitrarily long horizons.
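A minimal sketch of the fold step is given below; the field names mirror the three memory types described above, but the prompt wording and the `summarize` interface are illustrative assumptions rather than the paper's exact schema.

```python
import json

FOLD_PROMPT = (
    "Condense the interaction history below into JSON with keys "
    "'episodic_memory', 'working_memory', and 'tool_memory'.\n\n{history}"
)


def fold_memory(summarize, history: str) -> dict:
    """Replace the full transcript with a compact, structured memory.

    `summarize` is any callable that sends a prompt to an auxiliary LLM and
    returns its text completion (hypothetical interface); the LLM is assumed
    to reply with valid JSON.
    """
    folded = json.loads(summarize(FOLD_PROMPT.format(history=history)))
    return {
        "episodic_memory": folded.get("episodic_memory", ""),  # global trajectory, key decisions
        "working_memory": folded.get("working_memory", ""),    # current focus, next steps
        "tool_memory": folded.get("tool_memory", ""),           # tool calls, arguments, responses
    }
```

The folded dictionary then replaces the raw transcript in the agent's context, so subsequent reasoning steps operate on a short, structured state rather than the entire interaction history.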
5. Experimental Validation and Impact
ToolPO, as deployed within DeepAgent, is empirically validated on a suite of public tool-use benchmarks spanning both labeled-tool (known APIs) and open-set retrieval scenarios (dynamic or unknown APIs), including ToolBench, API-Bank, TMDB, Spotify, ToolHop, ALFWorld, WebShop, GAIA, and HLE (Li et al., 24 Oct 2025). The framework consistently exceeds the performance of previous agent baselines in both task completion and intermediate tool-use accuracy.
The use of synthetic APIs for training is shown to yield stable and efficient convergence. Fine-grained advantage attribution enables faster and more robust credit assignment, while memory folding supports long-form autonomous reasoning without excessive context growth.
6. Significance and Prospective Advances
ToolPO provides a rigorous methodology for training generalist LLM agents to make fine-grained tool-use decisions in complex environments with large, dynamic toolsets. By unifying LLM-simulated APIs, token-level advantage decomposition, and autonomous memory management, it establishes a highly sample-efficient, stable RL regime for language agent development.
Potential future directions alluded to in the original work include expanding to even more diverse or rapidly-evolving API sets, refining advantage attribution to cover additional agent skills, and scaling memory folding strategies for more complex, multi-modal agent architectures. The approach marks substantive progress towards universally capable, self-improving reasoning agents in real-world settings.