ReTool: Reinforcement Learning for Strategic Tool Use in LLMs (2504.11536v2)

Published 15 Apr 2025 in cs.CL and cs.AI

Abstract: While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, precise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.

ReTool (Feng et al., 15 Apr 2025) is a research paper that addresses the limitations of LLMs in structured problem-solving tasks, such as mathematical reasoning, geometry, or complex equation solving, where computational tools like code interpreters excel. While recent RL-trained LLMs show strong textual reasoning, they often struggle with the precision and symbolic manipulation required for these tasks.

The paper proposes ReTool, a novel reinforcement learning (RL) framework designed to enhance LLMs' strategic tool use capabilities by integrating a code interpreter into the reasoning process. ReTool has two core features:

  1. Dynamic Interleaving: It allows for the real-time execution of code within the natural language reasoning flow.
  2. Automated RL Paradigm: It enables policy rollouts that incorporate multi-turn real-time code execution, allowing the model to learn when and how to invoke tools based on the feedback received from the code interpreter (a minimal rollout sketch follows this list).
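
The interleaving amounts to a generate-execute-resume loop over tagged code blocks. The Python sketch below is illustrative only: `policy.generate_until` and `run_in_sandbox` are hypothetical stand-ins for the model's decoding loop and the external code sandbox, and the tag handling is simplified.

```python
# Illustrative rollout loop for interleaved reasoning and code execution.
# `policy.generate_until` and `run_in_sandbox` are hypothetical helpers
# standing in for the model's decoding loop and the external code sandbox.

CODE_OPEN, CODE_CLOSE = "<code>", "</code>"

def rollout(policy, prompt: str, max_turns: int = 8) -> str:
    trajectory = prompt
    for _ in range(max_turns):
        # Decode until the model closes a code block or finishes its answer.
        segment = policy.generate_until(trajectory, stop=[CODE_CLOSE])
        trajectory += segment
        if not segment.endswith(CODE_CLOSE):
            break  # no further tool call: the final answer has been written
        # Execute the most recent <code> ... </code> snippet in the sandbox.
        code = segment.rsplit(CODE_OPEN, 1)[-1].removesuffix(CODE_CLOSE)
        result = run_in_sandbox(code)  # stdout on success, traceback on error
        # Feed the sandbox output back to the model and let it keep reasoning.
        trajectory += f"<interpreter>{result}</interpreter>"
    return trajectory
```

Because the interpreter output is appended to the context like any other text, the model can condition its next reasoning step on actual execution results instead of its own arithmetic.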

The methodology consists of two main stages:

  1. Cold Start Supervised Fine-Tuning (SFT): This stage builds a foundational capability for tool use. A pipeline automatically constructs a high-quality dataset of code-integrated reasoning traces, $\mathcal{D}_{CI}$: existing mathematical reasoning data is collected and filtered, and textual calculation steps are transformed into corresponding code snippets plus their execution results using a structured prompt template. A two-stage verification process (format and answer verification) ensures data quality; a minimal sketch of these checks follows the list. An LLM is then fine-tuned on this dataset to learn basic tool invocation and analysis of execution results.
  2. Reinforcement Learning for Strategic Tool Use: Using the SFT-tuned model as the initialization, the RL stage further refines the model's tool-use strategy with a PPO algorithm adapted for tool integration. During rollouts, the policy LLM interacts with an external code sandbox: when the model emits code inside <code>...</code> tags, the snippet is executed in the sandbox, and the output (results or errors) is returned to the model wrapped in <interpreter>...</interpreter> tags so that reasoning can continue based on this feedback. A simple rule-based accuracy reward (+1 for a correct final answer, -1 otherwise) guides learning, encouraging the model to discover effective tool-usage patterns without complex reward shaping; the reward and the loss mask are sketched after this list. Implementation details include masking interpreter feedback out of the loss calculation for training stability, reusing the KV cache for memory efficiency during rollouts, and running an asynchronous code sandbox for parallel execution and faster training.
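
The two-stage verification in the cold-start stage can be pictured as two cheap checks, sketched below. The sketch assumes traces mark tool calls with the <code>/<interpreter> tags above and report the final answer in a \boxed{...} expression; the answer-equivalence test in particular is a simplified placeholder for whatever checker the actual pipeline uses.

```python
# Hedged sketch of the two-stage cold-start data filter: a format check on the
# code-integrated trace, then an answer check against the ground truth.
import re

def format_ok(trace: str) -> bool:
    """Every <code> block must be closed and paired with interpreter feedback."""
    return (trace.count("<code>") == trace.count("</code>")
            and trace.count("<interpreter>") == trace.count("</interpreter>")
            and trace.count("<code>") == trace.count("<interpreter>"))

def answer_ok(trace: str, ground_truth: str) -> bool:
    """The final boxed answer must match the reference (exact string match here)."""
    match = re.search(r"\\boxed\{([^}]*)\}", trace)
    return match is not None and match.group(1).strip() == ground_truth.strip()

def keep_example(trace: str, ground_truth: str) -> bool:
    # Only traces passing both checks enter the SFT dataset.
    return format_ok(trace) and answer_ok(trace, ground_truth)
```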

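The outcome reward and the loss treatment of interpreter tokens in the RL stage are similarly simple to state. The sketch below assumes an exact string match for answer checking and a span-level description of the rollout; the paper itself specifies only the rule-based +1/-1 accuracy reward and that interpreter feedback is masked out of the loss.

```python
# Sketch of the rule-based outcome reward and the per-token loss mask that
# zeroes out interpreter feedback. The exact-match answer check and the
# (length, is_interpreter) span encoding are assumptions for illustration.
import torch

def accuracy_reward(predicted_answer: str, ground_truth: str) -> float:
    """Outcome reward: +1 for a correct final answer, -1 otherwise."""
    return 1.0 if predicted_answer.strip() == ground_truth.strip() else -1.0

def interpreter_loss_mask(token_spans) -> torch.Tensor:
    """Weight model-generated tokens with 1 and sandbox-feedback tokens with 0,
    so the policy is never trained to imitate text it did not produce."""
    mask = []
    for length, is_interpreter in token_spans:
        mask.extend([0.0 if is_interpreter else 1.0] * length)
    return torch.tensor(mask)
```

Masking the feedback tokens keeps gradients focused on the model's own decisions, which the paper reports is important for training stability.
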
ReTool was evaluated on the challenging MATH Olympiad benchmarks AIME2024 and AIME2025. Using Qwen2.5-32B-Instruct as the base model, ReTool achieved 67.0% accuracy on AIME2024 with just 400 training steps, significantly outperforming a text-based RL baseline (40.0% accuracy with over 1000 steps). When trained on a stronger backbone, DeepSeek-R1-Distill-Qwen-32B, ReTool reached 72.5% accuracy on AIME2024, surpassing competitive models like OpenAI o1-preview (44.6%) and s1-32B (56.7%). The cold-start model achieved performance similar to the text-based RL baseline, confirming its effectiveness as an initial step.

A comprehensive analysis of the model's behavior during RL training revealed several key insights:

  • Efficiency: The average response length decreased by approximately 40% after RL training, indicating that replacing lengthy textual calculations with code improves token efficiency.
  • Code Proficiency: Metrics like code ratio (proportion of responses containing code) and average code lines increased significantly, showing enhanced code utilization and complexity.
  • Improved Tool Use: The total correct code counts on the test set showed an upward trend, reflecting better execution of generated code. The timing of code invocation shifted earlier in the reasoning process, suggesting the model learned optimal points for tool intervention.
  • Emergent Behaviors: The model demonstrated emergent capabilities such as code self-correction, identifying and fixing errors (like missing imports) based on interpreter feedback, even without explicit training data for this behavior.
  • Diverse Strategies: Analysis of code purpose showed increased diversity after RL training, indicating adaptive tool selection and improved generalizability across different types of problems.

In summary, ReTool successfully integrates reinforcement learning with code interpreter execution to enable LLMs to develop strategic and efficient tool-use capabilities for complex mathematical reasoning, achieving state-of-the-art performance on challenging benchmarks and demonstrating promising emergent behaviors.

Authors (9)
  1. Jiazhan Feng
  2. Shijue Huang
  3. Xingwei Qu
  4. Ge Zhang
  5. Yujia Qin
  6. Baoquan Zhong
  7. Chengquan Jiang
  8. Jinxin Chi
  9. Wanjun Zhong