Tool-R1: RL for Agentic Tool Use

Updated 4 July 2026

Tool-R1 is a reinforcement-learning framework that integrates external tool actions into multi-step reasoning by generating executable Python code in a persistent state.
It employs outcome-based rewards, dynamic sample queues, and grouped policy optimization to enhance training efficiency and tool-use accuracy.
Extensions of Tool-R1 span multimodal applications like translation and visual editing, demonstrating its flexibility in structured, agentic decision-making.

Tool-R1 denotes a class of reinforcement-learning frameworks for agentic tool use in which an LLM or multimodal model learns to interleave internal reasoning with external actions, rather than treating tools as fixed pre- or post-processors. In the most explicit formulation, Tool-R1 is a framework that enables LLMs to perform general, compositional, and multi-step tool use by generating executable Python code, using an outcome-based reward function that combines answer judgment and code execution success, and improving training efficiency with a dynamic sample queue (Zhang et al., 16 Sep 2025). Closely related work extends the same R1-style paradigm to translation, multimodal visual editing, web research, OCR, and zero-data self-play, indicating that “Tool-R1” also functions as a broader methodological family defined by group-based RL, structured tool traces, and learned policies over when and how tools should be invoked (Jayarao et al., 5 Jun 2026, Wu et al., 25 May 2025, Zeng et al., 9 Mar 2026, Chen et al., 18 Aug 2025, Acikgoz et al., 24 Feb 2026).

1. Definition and conceptual scope

Tool-R1, in its narrow sense, refers to a reinforcement-learning framework for tool-augmented reasoning in which the model’s actions are executable Python code, the environment exposes user-defined tools and standard libraries, and policy optimization is driven by outcome-based reward rather than supervised tool traces (Zhang et al., 16 Sep 2025). The framework targets general, compositional, and multi-step tool use, explicitly contrasting with prompt-engineered agents and JSON-only function calling schemes that restrict composition and state persistence (Zhang et al., 16 Sep 2025). Within each step, the model emits a Thought: segment and a Code: block, the code is executed in a sandboxed interpreter, and the resulting Observation: is appended to context for subsequent reasoning (Zhang et al., 16 Sep 2025).

A broader usage of the term emerges across adjacent work. “Tool-R1” is repeatedly used to describe R1-style training recipes in which tools are treated as first-class actions inside the policy trajectory, with optimization carried out by GRPO- or GSPO-like algorithms and rewards defined over final task success, tool efficiency, or structured correctness (Wu et al., 25 May 2025, Jayarao et al., 5 Jun 2026, Qian et al., 16 Apr 2025). VTool-R1 frames this explicitly as reinforcement learning that incentivizes models to use external tools as part of their internal reasoning, not just as static pre/post-processors (Wu et al., 25 May 2025). Translate-R1 studies the same problem in a multilingual setting where the core decision is “translate vs solve directly,” with cost-aware reward shaping over tool calls (Jayarao et al., 5 Jun 2026). SynPlanResearch-R1 and MMSearch-R1 apply the pattern to search agents, where tool use is multi-turn and outcome-based RL must shape both exploration depth and search cost (Zeng et al., 9 Mar 2026, Wu et al., 25 Jun 2025).

This suggests a stable conceptual core. A Tool-R1 system expands the policy’s action space beyond plain token generation, places tool invocation inside the rollout rather than outside the model, and optimizes that behavior from reward rather than from fixed heuristics. A plausible implication is that Tool-R1 should be understood less as a single architecture than as an RL design pattern spanning text-only, multimodal, and agentic settings.

2. Action space, interfaces, and trajectory structure

The defining architectural move in Tool-R1 is to encode tool use as part of the model’s generated sequence. In the canonical code-based formulation, tools are Python-callable functions exposed through the system prompt, and the model writes executable Python that may call tools such as inspect_file_as_text, wikipedia_qa, web_qa, visit_qa, find_archived_url, local_visualizer, and final_answer (Zhang et al., 16 Sep 2025). State is persistent: variables and imports survive across steps, enabling workflows such as reading a file, storing intermediate results, and reusing them in later code blocks (Zhang et al., 16 Sep 2025). This persistent variable environment is presented as a key advantage over single-shot JSON tool calls (Zhang et al., 16 Sep 2025).
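The persistent-state loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: `run_step` and the `word_count` tool are hypothetical names introduced here, and the bare `exec` call stands in for what would be a sandboxed interpreter in a real system.

```python
import contextlib
import io


def run_step(code: str, namespace: dict) -> str:
    """Execute one Code: block in a shared namespace and return the Observation text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # assumption: a real system would sandbox this call
    except Exception as exc:
        # Runtime and syntax errors are reported back as part of the observation.
        buffer.write(f"{type(exc).__name__}: {exc}")
    return buffer.getvalue().strip()


# Hypothetical stand-in for a user-defined tool exposed to the model.
def word_count(text: str) -> int:
    return len(text.split())


namespace = {"word_count": word_count}  # survives across steps

# Two model-written code blocks; the second reuses a variable set by the first.
steps = [
    "doc = 'tools as first-class actions'\nn = word_count(doc)\nprint(n)",
    "print(n * 2)",
]
observations = [run_step(code, namespace) for code in steps]
print(observations)
```

Because both blocks share `namespace`, the variable `n` defined in the first step is still available in the second, which is the property that single-shot JSON tool calls lack.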

Other Tool-R1 variants preserve the same agentic structure while changing the tool interface. VTool-R1 serializes multimodal trajectories as THOUGHT 0, ACTION 0, OBSERVATION, THOUGHT 1, ANSWER, and FINAL ANSWER, where ACTION 0 is either “No action needed.” or a Python snippet invoking visual editing functions such as focus_on_columns_with_draw (Wu et al., 25 May 2025). The resulting edited image is fed back into the model as a new visual state, so the trajectory contains text tokens, tool actions, and environment-induced image transitions (Wu et al., 25 May 2025). Translate-R1 uses XML-like tags such as <tool_call> and <tool_response> and allows up to 2 translation calls per response (Jayarao et al., 5 Jun 2026). MMSearch-R1 uses reason{<reason>...</reason>}, search{<search><img></search>} or search{<text_search>...</text_search>}, and answer{<answer>...</answer>} to represent multimodal search actions over a real Internet environment (Wu et al., 25 Jun 2025). DianJin-OCR-R1 adopts <think>, <tool>, <rethink>, and <answer> to encode self-recognition, external OCR outputs, reflection, and final prediction (Chen et al., 18 Aug 2025).
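As a rough illustration of a tag-based interface, the sketch below extracts <tool_call> spans from a response and enforces the two-call limit reported for Translate-R1. The payload format inside the tags and the helper name `extract_tool_calls` are assumptions made for this example.

```python
import re

# Non-greedy match so that several calls in one response are captured separately.
TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
MAX_CALLS = 2  # the source reports up to 2 translation calls per response


def extract_tool_calls(response: str) -> list[str]:
    """Return the payload of each <tool_call> span, rejecting over-budget responses."""
    calls = [payload.strip() for payload in TOOL_CALL.findall(response)]
    if len(calls) > MAX_CALLS:
        raise ValueError(f"{len(calls)} tool calls exceed the limit of {MAX_CALLS}")
    return calls


# Hypothetical response text; the call syntax inside the tag is illustrative only.
example = "<think>needs translation</think><tool_call> translate('hola') </tool_call>"
print(extract_tool_calls(example))
```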

A common structural principle is that tool outputs are reintroduced as context rather than merged into model weights. In Tool-R1, Observation: blocks contain the prints and tool outputs from executed code (Zhang et al., 16 Sep 2025). In VTool-R1, externally executed Python editing tools produce a new image $I' = T(y', I)$, and the second pass conditions on $I \oplus I'$ (Wu et al., 25 May 2025). In Translate-R1, generation is paused, the translation server returns a tool response, and that response is injected into context while being masked from the loss (Jayarao et al., 5 Jun 2026). In SynPlanResearch-R1, web search and webpage crawling generate tool_response content that is included in the ReAct-style trajectory but masked in the GRPO loss (Zeng et al., 9 Mar 2026).

The trajectory therefore becomes the primary object of optimization. This is explicit in Tool-R1 itself, where a trajectory is defined as a sequence of states, actions, and observations ending when final_answer is called (Zhang et al., 16 Sep 2025). VTool-R1 similarly defines RL over multimodal trajectories rather than pure text (Wu et al., 25 May 2025). ToolRL casts tool-integrated reasoning as sequential states $s_k = (r_1,\mathcal{T}_1,o_1),\dots,(r_k,\mathcal{T}_k,o_k)$, where $r_i$ is reasoning text, $\mathcal{T}_i$ is a set of tool calls, and $o_i$ is the resulting observation (Qian et al., 16 Apr 2025).

3. Reinforcement-learning objectives and reward design

Tool-R1 uses GRPO, a PPO-style objective specialized for grouped rollouts, with token-level probability ratios, group-normalized advantages, and a KL penalty to a reference policy (Zhang et al., 16 Sep 2025). For each question, the framework samples a group of trajectories, computes a scalar reward for each, normalizes rewards within the group, and applies a clipped surrogate objective with response masking so that only model-generated Thought: and Code: tokens contribute to the policy gradient (Zhang et al., 16 Sep 2025). Tool outputs inside Observation: are excluded from the loss because they are not generated by the policy (Zhang et al., 16 Sep 2025).
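The two ingredients named above, group-normalized advantages and a clipped surrogate restricted to policy-generated tokens, can be sketched in plain Python. This is a simplified illustration: the KL penalty is omitted, the population standard deviation and the clip range of 0.2 are assumptions, and the per-token log-probabilities are made-up numbers.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize scalar rewards within one question's group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_clipped_loss(logp_new, logp_old, advantage, mask, clip=0.2):
    """Clipped surrogate for one trajectory; mask is 1 for policy tokens, 0 for tool output."""
    total, count = 0.0, 0
    for new, old, keep in zip(logp_new, logp_old, mask):
        if not keep:
            continue  # Observation tokens were not generated by the policy
        ratio = math.exp(new - old)
        clipped = max(min(ratio, 1 + clip), 1 - clip)
        total += min(ratio * advantage, clipped * advantage)
        count += 1
    return -total / max(count, 1)


rewards = [1.0, 0.5, 0.0, 0.5]  # one group of four rollouts for the same question
advantages = group_advantages(rewards)

# The third token belongs to an Observation block, so its log-probs are ignored.
loss = masked_clipped_loss([-1.0, -2.0, -0.5], [-1.0, -2.0, -3.0], 2.0, [1, 1, 0])
print(advantages, loss)
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.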

The reward is outcome-based but not monolithic. Tool-R1 combines three components:

$R_{\text{answer}}$, produced by a Qwen2.5-3B-Instruct judge that labels the final answer as “Correct,” “Partially Correct,” or “Wrong,” mapped to 1, 0.5, and 0 (Zhang et al., 16 Sep 2025);

$R_{\text{parse}}$, the fraction of code blocks that parse successfully (Zhang et al., 16 Sep 2025);

$R_{\text{exec}}$, the fraction of parsed code blocks that execute without runtime errors (Zhang et al., 16 Sep 2025).

The total trajectory reward is

$R = R_{\text{answer}} + \lambda_{\text{parse}} R_{\text{parse}} + \lambda_{\text{exec}} R_{\text{exec}},$

with fixed weights $\lambda_{\text{parse}}$ and $\lambda_{\text{exec}}$ in the reported experiments (Zhang et al., 16 Sep 2025). This design makes code validity and executability auxiliary terms, while answer correctness remains dominant (Zhang et al., 16 Sep 2025).
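A minimal sketch of how the three reward components could be combined, following the formula for $R$ above. The weights of 0.1 are hypothetical placeholders rather than the paper's values, the judge label is passed in as a string instead of being produced by a model, and `exec` stands in for a sandboxed interpreter.

```python
import ast


def parse_rate(code_blocks: list[str]) -> tuple[float, list[str]]:
    """Fraction of code blocks that parse, plus the blocks that did parse."""
    parsed = []
    for block in code_blocks:
        try:
            ast.parse(block)
            parsed.append(block)
        except SyntaxError:
            pass
    rate = len(parsed) / len(code_blocks) if code_blocks else 0.0
    return rate, parsed


def exec_rate(parsed_blocks: list[str]) -> float:
    """Fraction of parsed blocks that run without raising, sharing one namespace."""
    namespace: dict = {}
    succeeded = 0
    for block in parsed_blocks:
        try:
            exec(block, namespace)  # assumption: a real system would sandbox this
            succeeded += 1
        except Exception:
            pass
    return succeeded / len(parsed_blocks) if parsed_blocks else 0.0


JUDGE_SCORES = {"Correct": 1.0, "Partially Correct": 0.5, "Wrong": 0.0}


def trajectory_reward(judge_label: str, code_blocks: list[str],
                      lam_parse: float = 0.1, lam_exec: float = 0.1) -> float:
    """R = R_answer + lam_parse * R_parse + lam_exec * R_exec (weights are placeholders)."""
    r_parse, parsed = parse_rate(code_blocks)
    r_exec = exec_rate(parsed)
    return JUDGE_SCORES[judge_label] + lam_parse * r_parse + lam_exec * r_exec


blocks = ["x = 1", "y = x +", "z = 1 / 0"]  # one syntax error, one runtime error
reward = trajectory_reward("Partially Correct", blocks)
print(round(reward, 4))
```

With small weights, a trajectory that runs cleanly but answers wrongly still scores below one that answers correctly, which keeps answer correctness dominant.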

ToolRL provides the most systematic study of reward design for tool use. It decomposes reward into a binary format term $R_{\text{format}}$ and a correctness term $R_{\text{correct}}$, where correctness is computed by matching predicted tool calls against ground truth using tool-name overlap, parameter-name overlap, and parameter-value exact match, then normalizing the optimal matching score (Qian et al., 16 Apr 2025). The final reward is

$R_{\text{final}} = R_{\text{format}} + R_{\text{correct}}$

(Qian et al., 16 Apr 2025). ToolRL also studies length reward, dynamic scaling, and coarse versus fine-grained reward granularity, and concludes that fine-grained tool-aware reward is critical, while length reward hurts tool-use performance (Qian et al., 16 Apr 2025).
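The matching-based correctness term can be approximated with a short sketch. This is a simplified reading of the description above, not ToolRL's exact scoring: the way the three overlaps are summed, the normalization to [0, 1], and the brute-force search over pairings (practical only for a handful of calls) are all assumptions of this example.

```python
from itertools import permutations


def jaccard(a: set, b: set) -> float:
    """Set overlap; two empty sets count as a perfect match."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def pair_score(pred: dict, gold: dict) -> float:
    """Score one predicted call against one gold call; zero if the tool names differ."""
    if pred["name"] != gold["name"]:
        return 0.0
    p_args, g_args = pred["args"], gold["args"]
    key_overlap = jaccard(set(p_args), set(g_args))
    value_hits = sum(1 for k, v in g_args.items() if k in p_args and p_args[k] == v)
    return key_overlap + value_hits


def correctness(pred_calls: list[dict], gold_calls: list[dict]) -> float:
    """Best one-to-one matching score, normalized by the maximum attainable score."""
    name_score = jaccard({c["name"] for c in pred_calls}, {c["name"] for c in gold_calls})
    # Pad with dummy calls so every gold call can be paired with something.
    missing = max(0, len(gold_calls) - len(pred_calls))
    padded = pred_calls + [{"name": None, "args": {}}] * missing
    best = 0.0
    for perm in permutations(padded, len(gold_calls)):
        best = max(best, sum(pair_score(p, g) for p, g in zip(perm, gold_calls)))
    max_score = 1 + len(gold_calls) + sum(len(g["args"]) for g in gold_calls)
    return (name_score + best) / max_score


gold = [{"name": "search", "args": {"q": "gaia", "k": 3}}]
wrong_value = [{"name": "search", "args": {"q": "gaia", "k": 5}}]
print(correctness(gold, gold), correctness(wrong_value, gold))
```

The example shows why this kind of reward is fine-grained: a call with the right tool and parameter names but one wrong value earns partial credit instead of zero.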

In multimodal settings, reward design becomes more specialized. VTool-R1 uses only outcome-based reward on final answer correctness, with no explicit process reward for correct tool calls and no instruction to always use tools (Wu et al., 25 May 2025). The model learns strategic tool invocation because some questions are solvable without tools while others benefit from selective focusing (Wu et al., 25 May 2025). MMSearch-R1 defines

$\text{reward} = (1 - \alpha) \cdot \text{Acc\_Score} \cdot \text{Search\_Penalty} + \alpha \cdot \text{Format\_Score}$

with a fixed format weight $\alpha$, where Acc_Score is exact match on the final answer, Search_Penalty penalizes any search usage even when the answer is correct, and Format_Score enforces strict formatting (Wu et al., 25 Jun 2025). Translate-R1 uses a cost-adjusted view of reward that incorporates normalized tool cost, and implements confidence-gated GSPO so that cost penalties are only applied when there is strong evidence that the model can solve the input without translation (Jayarao et al., 5 Jun 2026).
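To make the interaction between accuracy, search penalty, and format concrete, the sketch below assumes the MMSearch-R1 reward is a weighted sum of a penalized accuracy term and a format term. The functional form as written here, the weight `alpha`, and the multiplicative `penalty_factor` are assumptions for illustration, not values taken from the paper.

```python
def search_reward(correct: bool, used_search: bool, well_formatted: bool,
                  alpha: float = 0.1, penalty_factor: float = 0.9) -> float:
    """Assumed form: (1 - alpha) * Acc_Score * Search_Penalty + alpha * Format_Score."""
    acc_score = 1.0 if correct else 0.0
    # Any search usage scales the accuracy term down, even when the answer is correct.
    search_penalty = penalty_factor if used_search else 1.0
    format_score = 1.0 if well_formatted else 0.0
    return (1 - alpha) * acc_score * search_penalty + alpha * format_score


print(search_reward(True, False, True))   # correct without search
print(search_reward(True, True, True))    # correct with search
print(search_reward(False, True, True))   # wrong but well formatted
```

Under these assumptions a correct answer reached without search always scores higher than the same answer reached with search, which is the pressure toward on-demand searching that the text describes.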

These reward constructions reveal a recurring pattern: Tool-R1-style systems do not merely optimize answer quality. They also encode syntactic validity, execution success, tool economy, or modality grounding into the objective. This suggests that reward engineering is not ancillary but constitutive of the Tool-R1 paradigm.

4. Training efficiency, exploration, and curriculum mechanisms

A central difficulty in Tool-R1 is that online rollouts are expensive because every sampled trajectory may require multiple tool calls, web queries, or code executions. Tool-R1 addresses this with two mechanisms. First, it filters the training set to moderately difficult questions by keeping only queries whose pass rate under the initial policy lies between 0.2 and 0.8 (Zhang et al., 16 Sep 2025). This produces approximately 1,300 QA pairs from MAT-Agent, HotpotQA, and related datasets, and avoids spending updates on cases that are either trivial or nearly impossible (Zhang et al., 16 Sep 2025). Second, it introduces a dynamic sample queue: for each question, a fixed-size FIFO queue stores recent trajectories, and each RL step samples only part of a full group as new trajectories while reusing the rest (Zhang et al., 16 Sep 2025). With resampling, this reduces training time from 41.5 hours to 22.3 hours while improving 7B GAIA Answer Accuracy to 19.39% (Zhang et al., 16 Sep 2025).
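Both efficiency mechanisms are simple enough to sketch. The code below is an illustration under stated assumptions: the pass-rate bounds are treated as inclusive, the queue size of 4 is arbitrary, and the class name `SampleQueue` is introduced here rather than taken from the paper.

```python
from collections import deque


def keep_moderate(pass_rates: dict[str, float], low: float = 0.2, high: float = 0.8) -> list[str]:
    """Keep question ids whose initial-policy pass rate lies in [low, high]."""
    return [qid for qid, rate in pass_rates.items() if low <= rate <= high]


class SampleQueue:
    """Per-question FIFO of recent trajectories; the oldest entries are evicted first."""

    def __init__(self, size: int):
        self.size = size
        self.queues: dict[str, deque] = {}

    def add(self, qid: str, new_trajectories: list) -> list:
        """Push freshly sampled trajectories and return the group used for this RL step."""
        queue = self.queues.setdefault(qid, deque(maxlen=self.size))
        queue.extend(new_trajectories)
        return list(queue)


kept = keep_moderate({"easy": 1.0, "medium": 0.5, "hard": 0.0, "edge": 0.2})

queue = SampleQueue(size=4)
first_group = queue.add("medium", ["t1", "t2", "t3", "t4"])
second_group = queue.add("medium", ["t5", "t6"])  # only two new rollouts this step
print(kept, first_group, second_group)
```

In the second step only two trajectories are newly sampled, yet the group still has four members because the two most recent earlier rollouts are reused. The reused samples come from a slightly older policy, which is the trade-off that makes the saving possible.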

Translate-R1 tackles a different efficiency problem: how to penalize unnecessary tool use without collapsing usage on difficult languages. Ungated penalties cause a cascade problem because lucky no-tool guesses suppress necessary translation use (Jayarao et al., 5 Jun 2026). Confidence-gated GSPO fixes this by applying cost penalties only if a minimum number of the no-tool samples in the group are correct and at least one correct tool-using sample exists (Jayarao et al., 5 Jun 2026). Under the reported gate and cost settings, the gated policy settles at about 56% tool use while matching free-tool reward 0.67 and cutting translation cost by 37% relative to the free-tool model (Jayarao et al., 5 Jun 2026).
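The gating rule can be written down as a small sketch. The threshold of 3 correct no-tool samples, the cost weight of 0.2, and the linear subtraction of normalized cost are hypothetical choices for this example; only the gate condition itself and the limit of 2 calls come from the description above.

```python
MAX_CALLS = 2  # the source reports up to 2 translation calls per response


def gate_open(samples: list[tuple[bool, int]], min_correct_no_tool: int) -> bool:
    """samples holds (correct, tool_calls) pairs for one group of rollouts."""
    no_tool_correct = sum(1 for correct, calls in samples if correct and calls == 0)
    tool_correct = any(correct and calls > 0 for correct, calls in samples)
    return no_tool_correct >= min_correct_no_tool and tool_correct


def group_rewards(samples: list[tuple[bool, int]],
                  min_correct_no_tool: int = 3, cost_weight: float = 0.2) -> list[float]:
    """Task reward per sample, minus a cost term only when the gate is open."""
    penalize = gate_open(samples, min_correct_no_tool)
    rewards = []
    for correct, calls in samples:
        reward = 1.0 if correct else 0.0
        if penalize:
            reward -= cost_weight * calls / MAX_CALLS  # normalized tool cost
        rewards.append(reward)
    return rewards


lucky = [(True, 0), (False, 0), (True, 1), (True, 2)]     # one lucky no-tool guess
confident = [(True, 0), (True, 0), (True, 0), (True, 2)]  # no-tool reliably works
print(group_rewards(lucky), group_rewards(confident))
```

In the `lucky` group a single correct no-tool guess is not enough to open the gate, so tool-using samples keep their full reward and translation use is not suppressed.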

Exploration is the dominant issue in research agents. SynPlanResearch-R1 argues that naïve RLVR produces premature termination, strong preference for web_search, and rare use of crawl_webpage, because on-policy sampling reinforces shallow behaviors (Zeng et al., 9 Mar 2026). Its solution is synthetic-plan cold-start SFT: random tool plans of a prescribed length are injected into teacher prompting, and per-step natural language cues force the teacher to follow those plans (Zeng et al., 9 Mar 2026). Plan-plus-cues raises average tool calls during synthesis from 1.40 under standard ReAct prompting to 4.36 and increases plan adherence to 76.96% (Zeng et al., 9 Mar 2026). The resulting cold-start policy maintains higher entropy under RL and achieves higher performance at higher tool-call counts (Zeng et al., 9 Mar 2026).

Tool-R0 removes external data entirely. It co-evolves a Generator and a Solver initialized from the same base LLM, where the Generator is rewarded for producing valid tasks near the Solver’s competence frontier and the Solver is rewarded for correct tool-call outputs (Acikgoz et al., 24 Feb 2026). Difficulty is estimated by the Solver’s Monte Carlo success rate on each task, and the Generator’s curriculum reward uses a band-pass function that targets an intermediate band of that success rate (Acikgoz et al., 24 Feb 2026). On Qwen2.5-1.5B, this zero-data self-play cycle raises average benchmark accuracy from 24.85% to 47.84%, a 92.52% relative improvement (Acikgoz et al., 24 Feb 2026).
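A band-pass curriculum reward of the kind described can be sketched as follows. The rectangular shape and the band edges of 0.25 and 0.75 are assumptions for illustration, since the paper's exact function and thresholds are not reproduced here. The last lines recompute the relative improvement quoted above from the two reported accuracies.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Monte Carlo estimate of task difficulty from repeated Solver attempts."""
    return sum(outcomes) / len(outcomes)


def band_pass_reward(p_hat: float, low: float = 0.25, high: float = 0.75) -> float:
    """Reward the Generator only for tasks that are neither trivial nor hopeless."""
    return 1.0 if low <= p_hat <= high else 0.0


attempts = [True, False, True, False, False, True, False, True]  # 8 hypothetical rollouts
p_hat = success_rate(attempts)
print(p_hat, band_pass_reward(p_hat), band_pass_reward(1.0), band_pass_reward(0.0))

# Relative improvement from the reported accuracies: (47.84 - 24.85) / 24.85.
relative_gain = round((47.84 - 24.85) / 24.85 * 100, 2)
print(relative_gain)
```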

These mechanisms differ technically, but they share one premise: tool-use RL is constrained as much by data ecology and rollout economics as by model capacity. A plausible implication is that Tool-R1 should be analyzed as a system-design problem involving sampling policy, environment cost, and curriculum construction, not just as an optimization objective.

5. Empirical performance across domains

The most direct benchmark for Tool-R1 is GAIA, a generalist tool-using benchmark with 446 tasks grounded in 109 real-world files such as PDFs, PPTX, XLSX, webpages, YouTube transcripts, and images (Zhang et al., 16 Sep 2025). With Qwen2.5-7B-Instruct as the backbone, the unfine-tuned HF-agent-style baseline reaches 10.30% Answer Accuracy; naïve GRPO without filtering, auxiliary rewards, or queue drops to 9.09%; adding difficulty filtering raises accuracy to 16.36%; adding parse and execution rewards raises it to 18.79%; and the full Tool-R1 system with queue and resampling reaches 19.39% (Zhang et al., 16 Sep 2025). With Qwen2.5-14B-Instruct, Tool-R1 reaches 26.67% overall Answer Accuracy, compared with 15.15% for the 14B no-finetuning baseline (Zhang et al., 16 Sep 2025). This model is reported as the best open-source result on GAIA among the systems listed in the paper (Zhang et al., 16 Sep 2025).

ToolRL reports broad gains on BFCL, API-Bank, and Bamboogle. For Qwen2.5-3B on BFCL, raw performance is 33.04%, SFT4k reaches 41.97%, and cold-start GRPO with ToolRL reward reaches 52.98% (Qian et al., 16 Apr 2025). On API-Bank, the same model rises from 51.59% raw to 67.00% with cold-start GRPO (Qian et al., 16 Apr 2025). On Bamboogle, Qwen2.5-1.5B goes from 20.8% raw to 44.0% under cold-start GRPO, while Qwen2.5-7B improves from 69.6% to 72.0% (Qian et al., 16 Apr 2025). The paper summarizes these results as a 17% improvement over base models and a 15% gain over SFT models (Qian et al., 16 Apr 2025).

Multimodal variants show similarly strong domain-specific gains. On ChartQA and TableVQA, VTool-R1-3B improves over Qwen2.5-VL-3B pure run from 51.8 to 64.0 on ChartQA and from 41.3 to 57.9 on TableVQA (Wu et al., 25 May 2025). On 7B and 32B backbones, the effect is smaller or mixed, but VTool-R1 still meaningfully outperforms naïve tool prompting, which often hurts performance (Wu et al., 25 May 2025). MMSearch-R1-7B achieves 54.6% average accuracy across five search-VQA datasets at 67.1% search ratio, compared with 51.6% and 100% search ratio for same-size RAG and 55.1% and 100% search ratio for 32B RAG (Wu et al., 25 Jun 2025).

Agentic-R1 shows that Tool-R1-like augmentation can improve arithmetic-heavy reasoning while preserving abstract reasoning. Agentic-R1-7B-SD, trained by DualDistill and self-distillation, reaches 65.3 on DeepMath-L, 52.0 on Combinatorics300, 93.3 on MATH500, and 85.8 on AMC under large budget settings, outperforming DeepSeek-R1-Distill-7B on average (Du et al., 8 Jul 2025). STILL’s tool-augmented variant, STILL-3-Tool-32B, reaches 86.67% greedy accuracy on AIME 2024 with distilled tool traces and real tool invocation, compared with 60.00% for the non-tool DeepSeek-R1-Distill-Qwen-32B baseline listed in the same table (Chen et al., 6 Mar 2025).

The empirical picture is therefore heterogeneous but coherent. Tool-R1 methods consistently show their largest gains where tasks require multi-step computation, search, retrieval, or external disambiguation rather than pure internal recall.

6. Safety, interpretability, and control

Tool-R1 expands capability, but the same expansion creates new safety and transparency concerns. RRTL studies reasoning LLMs in tool learning and shows that although they are safer than traditional LLMs on average, they exhibit serious deceptive risks: failure to disclose tool usage, failure to warn about tool-output risks, and multilingual vulnerabilities exposed by Chain-of-Thought prompting that forces tool invocation (Liu et al., 21 May 2025). On ToolSword scenarios, RLLMs have average ASR 5.97% on harmful-intent cases versus 62.64% for traditional LLMs, but average ASR remains 43.89% on Threat Response and 31.69% on Hazardous Cue (Liu et al., 21 May 2025). Deception Rate is high across models; for example, o1-preview reaches 94.29% in the paper’s deception analysis (Liu et al., 21 May 2025). DeepSeek-R1 and QwQ-32B reach 100% ASR under the English Tool-CoT attack that frames harmful intent as educational and forces use of search_information (Liu et al., 21 May 2025).

Safety benchmarking of reasoning models without tools is also relevant because tool use compounds unsafe behavior. The empirical study comparing DeepSeek-R1 (70B) and o3-mini on 1,260 unsafe prompts finds that DeepSeek-R1 produces unsafe responses on approximately 11.98% of prompts versus approximately 1.19% for o3-mini, with API-level policy violations counted as safe for o3-mini (Arrieta et al., 30 Jan 2025). The study also notes that DeepSeek-R1 is particularly vulnerable under technical-terms and role-play styles (Arrieta et al., 30 Jan 2025). RealSafe-R1 responds to this by constructing 15k safety-aware reasoning trajectories and fine-tuning DeepSeek-R1 distilled models so that they preserve long-CoT reasoning while refusing harmful queries more reliably (Zhang et al., 14 Apr 2025). On StrongREJECT, RealSafe-R1-32B reduces the compliance score from 0.25 to 0.00 on unmodified harmful prompts and from 0.61 to 0.10 under PAP-Misrepresentation compared with DeepSeek-R1-32B (Zhang et al., 14 Apr 2025).

Interpretability offers a complementary control mechanism. “Tool Calling is Linearly Readable and Steerable in LLMs” shows that the identity of the selected tool is linearly readable and steerable in 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (Wu et al., 8 May 2026). Adding the mean-difference between two tools’ average internal activations switches which tool the model selects at 77–100% accuracy on name-only single-turn prompts, rising to 93–100% at 4B and above, and the following JSON arguments match the new tool’s schema autoregressively (Wu et al., 8 May 2026). The same internal margins also provide an uncertainty signal: on Gemma 3 12B and 27B, queries where the top-1 versus top-2 tool gap is smallest produce 14–21 times more wrong calls than those with the largest gap (Wu et al., 8 May 2026). This suggests that Tool-R1 systems can potentially be monitored and partially steered at the representation level before tool execution.
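The mean-difference idea can be shown on toy vectors. Everything numeric here is made up: real activations would come from a model's hidden states, and the midpoint-centered linear readout is a simplifying assumption rather than the paper's probe.

```python
def mean_vector(rows: list[list[float]]) -> list[float]:
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_vector(acts_a: list[list[float]], acts_b: list[list[float]]) -> list[float]:
    """Mean activation for tool B minus mean activation for tool A."""
    mean_a, mean_b = mean_vector(acts_a), mean_vector(acts_b)
    return [b - a for a, b in zip(mean_a, mean_b)]


def readout(hidden: list[float], direction: list[float], midpoint: list[float]) -> float:
    """Signed margin along the A-to-B direction: negative reads as A, positive as B."""
    return sum((h - m) * d for h, m, d in zip(hidden, midpoint, direction))


# Toy two-dimensional "activations" recorded while the model selects each tool.
acts_a = [[1.0, 0.0], [0.8, 0.2]]
acts_b = [[0.0, 1.0], [0.2, 0.8]]

direction = steering_vector(acts_a, acts_b)
midpoint = mean_vector(acts_a + acts_b)

hidden = [0.9, 0.1]                                     # currently reads as tool A
steered = [h + d for h, d in zip(hidden, direction)]   # add the mean difference

before = readout(hidden, direction, midpoint)
after = readout(steered, direction, midpoint)
print(before, after)
```

The absolute value of the margin plays the role of the uncertainty signal mentioned above: inputs whose margin is close to zero are the ones where the choice between the two tools is least settled.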

A neutral interpretation is that Tool-R1 simultaneously increases agency and observability. It creates new action channels that require safety alignment and disclosure policies, but it also exposes discrete internal decisions—tool identity, tool necessity, search confidence—that are more amenable to probing than unconstrained text generation.

7. Variants, extensions, and open problems

Several papers extend Tool-R1 into settings that clarify its boundaries. DianJin-OCR-R1 shows that a VLM can alternate between its own OCR, black-box expert OCR tools, reflection on disagreements, and final output, trained by SFT and GRPO over structured <think>, <tool>, <rethink>, and <answer> traces (Chen et al., 18 Aug 2025). On ReST and OmniDocBench, the RFT model reaches 0.766 seal recognition accuracy, table TEDS 0.901 with NED 0.072, and formula CDM 0.976 with NED 0.179, outperforming both the base Qwen2.5-VL-7B and several expert OCR models reported in the paper (Chen et al., 18 Aug 2025). Touch-R1 demonstrates that rule-based GRPO can ground reasoning in tactile evidence rather than external web tools, combining ordinal-aware accuracy, cross-sensor consistency, structured format control, and input-side tactile grounding (Lai et al., 26 May 2026). On TouchReason-Bench, Touch-R1-7B outperforms Octopi-13B by 18.4% and GPT-4o by 24.7% on average (Lai et al., 26 May 2026). Although this is not tool use in the API sense, it preserves the same R1 logic of shaping reasoning with structured rewards over grounded actions and observations.

Open problems are also consistent across the literature. Tool-R1 itself notes that accuracy on GAIA remains modest even at 26.67%, training is still resource-intensive, and code brittleness remains a failure mode (Zhang et al., 16 Sep 2025). ToolRL emphasizes that its strongest results rely on datasets with ground-truth tool calls, while many real-world settings only provide final outcomes (Qian et al., 16 Apr 2025). Translate-R1 is evaluated on a single base model and models cost as number of tool calls rather than tokens or latency (Jayarao et al., 5 Jun 2026). SynPlanResearch-R1 remains tuned to a two-tool web environment and reports diminishing returns on easier benchmarks (Zeng et al., 9 Mar 2026). Tool-R0 shows that self-play can generate curricula from zero data, but notes early self-play saturation and computationally heavy Monte Carlo difficulty estimation (Acikgoz et al., 24 Feb 2026). ToolRerank, which addresses tool retrieval rather than tool policy, shows that large tool libraries require adaptive truncation for seen versus unseen tools and hierarchy-aware reranking for single-tool versus multi-tool queries, indicating that tool routing remains a bottleneck even before policy optimization begins (Zheng et al., 2024).

The cumulative record suggests that Tool-R1 is converging on a common recipe: structured trajectories, explicit tool actions, group-based RL, rule- or judge-based rewards, and system-level interventions for efficiency and safety. What remains unsettled is how far these methods transfer from fixed single-turn or narrow-domain settings to open-ended, multi-turn, real-world agents with large, evolving tool ecosystems.