ReTool (Feng et al., 15 Apr 2025) is a research paper that addresses the limitations of LLMs in structured problem-solving tasks, such as mathematical reasoning, geometry, or complex equation solving, where computational tools like code interpreters excel. While recent RL-trained LLMs show strong textual reasoning, they often struggle with the precision and symbolic manipulation required for these tasks.
The paper proposes ReTool, a novel reinforcement learning (RL) framework designed to enhance LLMs' strategic tool use capabilities by integrating a code interpreter into the reasoning process. ReTool has two core features:
- Dynamic Interleaving: Code is executed in real time and interleaved directly within the natural-language reasoning flow.
- Automated RL Paradigm: Policy rollouts incorporate multi-turn, real-time code execution, so the model learns when and how to invoke tools from the feedback returned by the code interpreter; a minimal rollout sketch follows this list.
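As a rough illustration of how such an interleaved rollout might be wired up, the sketch below alternates generation with sandbox execution. The `policy.generate_until` and `sandbox.run` helpers and the tag strings are assumptions made for illustration, not the paper's exact interface.

```python
# Hedged sketch of a dynamically interleaved rollout: generate text until the
# model closes a code block, run that code in a sandbox, append the sandbox
# output as interpreter feedback, and let generation continue from there.
# `policy.generate_until`, `sandbox.run`, and the tag strings are hypothetical.

CODE_OPEN, CODE_CLOSE = "<code>", "</code>"
FEEDBACK_OPEN, FEEDBACK_CLOSE = "<interpreter>", "</interpreter>"

def interleaved_rollout(policy, sandbox, prompt, max_turns=8):
    context = prompt
    for _ in range(max_turns):
        # Generate until the model either finishes or closes a code block.
        chunk, stopped_on_code = policy.generate_until(context, stop=CODE_CLOSE)
        context += chunk
        if not stopped_on_code:
            break  # the model produced its final natural-language answer
        # Execute the most recent code block in the sandbox.
        code = chunk.rsplit(CODE_OPEN, 1)[-1]
        result = sandbox.run(code)  # stdout on success, error text on failure
        # Feed the execution result back so reasoning can continue from it.
        context += f"{CODE_CLOSE}{FEEDBACK_OPEN}{result}{FEEDBACK_CLOSE}"
    return context
```

The key design point is that the sandbox output is appended to the context exactly like any other text, so the model can react to errors or intermediate results in its next generation step.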
The methodology consists of two main stages:
- Cold Start Supervised Fine-Tuning (SFT): This stage builds a foundational capability for tool use. A pipeline is designed to automatically construct a high-quality dataset of code-integrated reasoning traces. This involves starting with existing mathematical reasoning data, filtering it, and then transforming textual calculation steps into corresponding code snippets and their execution results using a structured prompt template. A two-stage verification process (format and answer verification) ensures data quality; a sketch of this filtering step appears after this list. An LLM is then fine-tuned on this dataset to learn basic tool invocation and analysis of execution results.
- Reinforcement Learning for Strategic Tool Use: Using the SFT-tuned model as an initialization, the RL stage further refines the model's tool-use strategy. The training utilizes the PPO algorithm adapted for tool integration. During rollouts, the policy LLM interacts with an external code sandbox: when the model emits code within designated code tags, the code is executed in the sandbox, and the sandbox output (results or errors) is returned to the model wrapped in separate interpreter-feedback tags, allowing it to continue reasoning from this feedback. A simple rule-based accuracy reward (1 for a correct final answer, -1 otherwise) guides learning, encouraging the model to discover effective tool-usage patterns without complex reward shaping; a sketch of the reward and loss masking appears below. Implementation details include masking interpreter feedback out of the loss calculation for training stability and reusing the KV-cache for memory efficiency during rollouts. An asynchronous code sandbox environment is employed for parallel execution and faster training.
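To make the cold-start pipeline's two-stage verification concrete, the sketch below filters candidate traces with a format check (code and feedback segments are well formed and paired) followed by an answer check (the trace's final answer matches the reference). The tag names and the `extract_final_answer` callable are assumptions for illustration, not the paper's API.

```python
import re

# Hedged sketch of the two-stage verification used to filter candidate
# code-integrated traces: (1) format verification, (2) answer verification.

CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)
FEEDBACK_BLOCK = re.compile(r"<interpreter>(.*?)</interpreter>", re.DOTALL)

def format_ok(trace: str) -> bool:
    """Every code block must be paired with an interpreter-feedback block."""
    n_code = len(CODE_BLOCK.findall(trace))
    n_feedback = len(FEEDBACK_BLOCK.findall(trace))
    return n_code > 0 and n_code == n_feedback

def answer_ok(trace: str, ground_truth: str, extract_final_answer) -> bool:
    """The trace's final answer must match the reference answer."""
    predicted = extract_final_answer(trace)
    return predicted is not None and predicted.strip() == ground_truth.strip()

def verify(traces, ground_truths, extract_final_answer):
    """Keep only traces that pass both verification stages."""
    return [
        t for t, gt in zip(traces, ground_truths)
        if format_ok(t) and answer_ok(t, gt, extract_final_answer)
    ]
```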
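Similarly, the outcome reward and the interpreter-feedback masking described above reduce to a few lines. The sketch below assumes the rollout has already been tokenized and that the token spans covering sandbox feedback are known; it is a hedged reconstruction, not the paper's code.

```python
import torch

def accuracy_reward(predicted: str, ground_truth: str) -> float:
    """Rule-based outcome reward: +1 for a correct final answer, -1 otherwise."""
    return 1.0 if predicted.strip() == ground_truth.strip() else -1.0

def interpreter_loss_mask(num_tokens: int, feedback_spans) -> torch.Tensor:
    """Zero out interpreter-feedback tokens (given as [start, end) index pairs)
    so only model-generated tokens contribute to the policy-gradient loss."""
    mask = torch.ones(num_tokens, dtype=torch.float32)
    for start, end in feedback_spans:
        mask[start:end] = 0.0
    return mask
```

Masking the externally produced feedback tokens keeps text the model never generated from contributing gradient signal, which the paper cites as important for training stability.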
ReTool was evaluated on the challenging math competition benchmarks AIME2024 and AIME2025. Using Qwen2.5-32B-Instruct as the base model, ReTool achieved 67.0% accuracy on AIME2024 with just 400 training steps, significantly outperforming a text-based RL baseline (40.0% accuracy with over 1000 steps). When trained on a stronger backbone, DeepSeek-R1-Distill-Qwen-32B, ReTool reached 72.5% accuracy on AIME2024, surpassing competitive models such as OpenAI o1-preview (44.6%) and s1-32B (56.7%). The cold-start SFT model alone already matched the text-based RL baseline, confirming its effectiveness as an initialization.
A comprehensive analysis of the model's behavior during RL training revealed several key insights:
- Efficiency: The average response length decreased by approximately 40% after RL training, indicating that replacing lengthy textual calculations with code improves token efficiency.
- Code Proficiency: Metrics such as code ratio (the proportion of responses containing code) and average code lines increased significantly, showing heavier and more sophisticated code utilization (a metric sketch follows this list).
- Improved Tool Use: The number of correctly executing code snippets on the test set trended upward, reflecting more reliable generated code. Code invocation also shifted earlier in the reasoning process, suggesting the model learned the most effective points for tool intervention.
- Emergent Behaviors: The model demonstrated emergent capabilities such as code self-correction, identifying and fixing errors (like missing imports) based on interpreter feedback, even without explicit training data for this behavior.
- Diverse Strategies: Analysis of code purpose showed increased diversity after RL training, indicating adaptive tool selection and improved generalizability across different types of problems.
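The simpler code-use metrics in this analysis are straightforward to reproduce over a batch of rollouts. The sketch below computes code ratio and average code lines, assuming responses are plain strings with code delimited by placeholder tags.

```python
import re

CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)  # placeholder tag name

def code_usage_metrics(responses):
    """Code ratio = fraction of responses containing at least one code block;
    average code lines = mean number of code lines among those responses."""
    n_with_code, total_code_lines = 0, 0
    for response in responses:
        blocks = CODE_BLOCK.findall(response)
        if blocks:
            n_with_code += 1
            total_code_lines += sum(block.count("\n") + 1 for block in blocks)
    return {
        "code_ratio": n_with_code / max(len(responses), 1),
        "avg_code_lines": total_code_lines / max(n_with_code, 1),
    }
```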
In summary, ReTool successfully integrates reinforcement learning with code interpreter execution to enable LLMs to develop strategic and efficient tool-use capabilities for complex mathematical reasoning, achieving state-of-the-art performance on challenging benchmarks and demonstrating promising emergent behaviors.