
R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning (2505.21668v1)

Published 27 May 2025 in cs.AI, cs.CL, and cs.SC

Abstract: Despite advances in reasoning and planning of R1-like models, LLMs still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), with the emergent self-checking behavior via code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.

Authors (7)
  1. Yongchao Chen (18 papers)
  2. Yueying Liu (2 papers)
  3. Junwei Zhou (13 papers)
  4. Yilun Hao (12 papers)
  5. Jingquan Wang (7 papers)
  6. Yang Zhang (1129 papers)
  7. Chuchu Fan (81 papers)

Summary

R1-Code-Interpreter: Enhancing LLMs with Symbolic Code Execution

The paper "R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning" presents a significant advancement in integrating code-based reasoning within LLMs, specifically addressing the shortcomings of purely textual reasoning in handling tasks that require symbolic manipulation and computation. The authors introduce the R1-Code-Interpreter, which extends traditional text-only LLMs to autonomously generate and execute code queries during reasoning processes, leveraging both supervised fine-tuning (SFT) and reinforcement learning (RL).

Key Contributions

The paper identifies and tackles several critical challenges in enabling LLMs to switch intelligently between textual reasoning and code execution. The R1-Code-Interpreter framework is designed to discern when to rely on code rather than text, given the lack of explicit cues in input questions. The training process is structured to improve decision-making across diverse tasks, emphasizing the use of a curated set of 144 reasoning and planning tasks distributed among mathematical, spatial, logical, and optimization domains.
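The interleaved reasoning described above can be sketched as a minimal multi-turn loop. The `<code>`/`<answer>` tags, the `llm_generate` stand-in, and the unsandboxed `exec` are illustrative assumptions for this sketch, not the paper's actual interface:

```python
import io
import contextlib

def run_code(snippet: str) -> str:
    """Execute a generated Python snippet and capture stdout.
    (Illustrative only: a real system would run this in an isolated sandbox.)"""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(snippet, {})
    except Exception as exc:
        return f"Error: {exc}"
    return buf.getvalue()

def interpreter_loop(llm_generate, question: str, max_turns: int = 5) -> str:
    """Multi-turn reasoning: the model may emit <code>...</code> blocks,
    whose execution results are fed back before the next generation turn."""
    transcript = question
    for _ in range(max_turns):
        step = llm_generate(transcript)      # model continues the reasoning
        transcript += step
        if "<code>" in step:
            snippet = step.split("<code>")[1].split("</code>")[0]
            result = run_code(snippet)
            transcript += f"\n<output>{result}</output>\n"
        if "<answer>" in step:               # model commits to a final answer
            return step.split("<answer>")[1].split("</answer>")[0]
    return transcript
```

The key design point is that code execution results re-enter the context, so the model can decide turn by turn whether to reason in text or issue another code query.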

  1. Domain Breadth and Task Diversity: The curated set of 144 tasks (107 for training, 37 held out for testing), each with over 200 diverse questions, supports robust training and evaluation across varied task types and complexities, and poses a comprehensive challenge for deploying code-augmented reasoning effectively.
  2. Integration of Code Executors: The paper introduces R1-Code-Interpreter as the first framework combining SFT and RL to enhance LLMs' reasoning capabilities with symbolic code execution. The resulting model, R1-CI-14B, raises average accuracy on the 37 test tasks from 44.0% to 64.1%, a gain of roughly 20 percentage points over the untrained baseline.
  3. Self-Checking Emergent Behavior: An unexpectedly beneficial behavior observed during training was the model's tendency to verify its reasoning outputs through code execution, improving reliability and accuracy. This emergent behavior demonstrates the model's ability to self-correct autonomously.
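The self-checking pattern in item 3 amounts to recomputing a textually derived answer with exact code. A toy sketch (the sum-of-squares task and the function name are invented for illustration):

```python
def self_check(claimed_answer: int) -> bool:
    """Illustrative self-check: the model's textual step claimed an answer;
    a follow-up code query recomputes it exactly. A mismatch would trigger
    another reasoning turn instead of committing to the answer."""
    exact = sum(i * i for i in range(1, 101))  # exact recomputation in code
    return exact == claimed_answer
```

When the check fails, the model can revise its textual reasoning before emitting a final answer, rather than trusting error-prone mental arithmetic.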

Experimental Insights

The experimental results demonstrate significant gains from equipping LLMs with code interpretation capabilities. Notably, R1-CI-14B outperforms GPT-4o in text-only mode (64.1% vs. 58.6% average accuracy on the test tasks) and approaches GPT-4o with a Code Interpreter (70.9%).

  1. GRPO vs. PPO: In training settings, the Group Relative Policy Optimization (GRPO) algorithm was shown to outperform Proximal Policy Optimization (PPO) as a method for optimizing reasoning processes interleaved with code execution.
  2. Warm-start Advantage: An initial SFT phase proved essential: warm-starting RL from an SFT checkpoint yields markedly better cross-domain generalization than cold-start RL, which struggles to integrate symbolic reasoning efficiently.
  3. Task Diversity Constraint: The paper recognizes that task diversity presents a substantial challenge in RL for general-purpose code interpreters, pointing towards the need for efficient RL approaches capable of managing extensive and varied task domains.
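Two of the training choices above can be sketched concretely. GRPO replaces PPO's learned value critic with group-relative reward normalization, and the "masked" variant excludes interpreter-output tokens from the policy loss. Both functions below are simplified illustrations of these ideas, not the paper's implementation:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each sampled response's reward by the
    mean and std of its own group, avoiding PPO's learned value critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mean) / std for r in rewards]

def loss_mask(token_roles: list[str]) -> list[int]:
    """Masked-code-output variant: zero out interpreter-output tokens so
    gradients flow only through tokens the model itself generated."""
    return [0 if role == "interpreter_output" else 1 for role in token_roles]
```

Masking matters here because interpreter outputs are deterministic environment feedback, not model actions, so penalizing or rewarding them adds noise to the policy gradient.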

Future Directions and Implications

The paper lays the groundwork for further research on code-based reasoning models. Future efforts could focus on reducing the computational cost of RL training, exploring larger models to overcome task-complexity bottlenecks, and extending code execution to broader application-specific tasks.

The introduction of the R1-Code-Interpreter with robust multi-turn reasoning capabilities represents a compelling direction for enhancing AI systems' symbolic reasoning abilities. This has profound implications for future LLMs in diverse domains requiring precision and computational rigor, such as scientific computing, automated theorem proving, and program synthesis.
