CoRT: Code-integrated Reasoning within Thinking (2506.09820v2)

Published 11 Jun 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural LLMs. The models and code are available at https://github.com/ChengpengLi1003/CoRT.

Authors (11)
  1. Chengpeng Li (10 papers)
  2. Zhengyang Tang (13 papers)
  3. Ziniu Li (24 papers)
  4. Mingfeng Xue (10 papers)
  5. Keqin Bao (21 papers)
  6. Tian Ding (20 papers)
  7. Ruoyu Sun (70 papers)
  8. Benyou Wang (109 papers)
  9. Xiang Wang (279 papers)
  10. Junyang Lin (99 papers)
  11. Dayiheng Liu (75 papers)

Summary

The paper "CoRT: Code-integrated Reasoning within Thinking" (Li et al., 11 Jun 2025 ) introduces a post-training framework designed to enhance the ability of Large Reasoning Models (LRMs) to effectively and efficiently integrate Code Interpreters (CIs) for complex tasks, particularly mathematical reasoning. The core challenge addressed is the inefficiency and occasional inaccuracy of LRMs when performing precise computations or solving complex equations, functionalities better suited for external tools like CIs. Direct integration of CIs can be inefficient because LRMs, primarily trained on natural language, do not inherently understand how to best leverage external computational power.

The CoRT framework tackles this by modeling the reasoning process as a sequence of natural language steps, program snippets, and execution outputs: $\tau_t = \tau_{t-1} \oplus n_t \oplus p_t \oplus o_t$. In this iterative loop, the model $\pi$ takes the problem $P$ and the current trace $\tau_{t-1}$, generates text $n_t$ and a program $p_t$, and the environment $\mathcal{E}$ executes the program to produce $o_t$; this loop forms the practical basis for integrating computation.
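
To make the loop concrete, here is a minimal Python sketch of one possible rollout orchestration. The model_generate and execute_in_sandbox helpers, the <code>/<interpreter> delimiters, and the stop-string handling are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the code-integrated rollout loop tau_t = tau_{t-1} (+) n_t (+) p_t (+) o_t.
# `model_generate` and `execute_in_sandbox` are hypothetical helpers standing in for the LRM
# call and the persistent code interpreter; the tags and stop string are assumed conventions.

def code_integrated_rollout(problem: str, model_generate, execute_in_sandbox,
                            max_tool_calls: int = 15) -> str:
    trace = problem  # tau_0: the prompt containing the problem P
    for _ in range(max_tool_calls):
        # n_t (and possibly p_t): the model reasons in text and may end with a code block.
        segment = model_generate(trace, stop="</code>")
        trace += segment
        if "<code>" not in segment:
            break  # no tool call: the model finished its reasoning in natural language
        program = segment.split("<code>", 1)[1]           # p_t
        output = execute_in_sandbox(program)              # o_t from environment E
        trace += f"<interpreter>{output}</interpreter>"   # append o_t and continue the loop
    return trace
```

In this sketch the interpreter state is assumed to persist across iterations (mirroring the Jupyter-like environment discussed later), and the loop stops when the model emits no further code block or the tool budget is exhausted.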

A key aspect of CoRT is addressing data scarcity for training LRM-CI interaction. The paper explores two methods for creating training data instances:

  1. Prompt-hint: This initial approach adds a simple hint such as "Okay, let's try to solve this problem step by step using multiple python code calls" right after the model's thinking token <think>. While this significantly increased the rate of code generation (from 50% to 90%), models trained this way exhibited inefficiencies such as delayed code computation (performing calculations in text before writing code) and code result distrust (manually re-verifying CI outputs).

  2. Hint-engineering: To overcome these inefficiencies, this method strategically inserts different, more targeted hints at specific points in the reasoning process. For instance, a hint is inserted when the model begins a tedious manual calculation ("It looks tedious, and we can use python code to simplify the reasoning.") or when it questions a CI output ("We don't need to doubt the accuracy of python calculations."). This teaches the model when and how to delegate computation to the CI efficiently. The paper emphasizes that this approach is effective even with a very small dataset (30 manually created high-quality samples), supporting the "less is more" data quality principle.
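
As a rough illustration of how Hint-Engineering data could be synthesized (not the authors' exact tooling), one can truncate a model's reasoning draft at an annotated position, splice in a targeted hint, and let the model continue from the hinted prefix. The model_generate helper and the character-offset annotation below are hypothetical stand-ins for the paper's manual curation.

```python
# Hypothetical sketch of Hint-Engineering data synthesis: cut the model's reasoning at a
# manually annotated position, splice in a targeted hint, and let the model continue from
# the hinted prefix. In the paper, 30 such samples were curated by hand.

TEDIOUS_CALC_HINT = "It looks tedious, and we can use python code to simplify the reasoning."
DISTRUST_HINT = "We don't need to doubt the accuracy of python calculations."

def synthesize_hinted_trace(model_generate, problem: str,
                            insert_position: int, hint: str) -> str:
    draft = model_generate(problem)                  # an unhinted reasoning draft
    hinted_prefix = draft[:insert_position] + hint   # cut at the annotated point, add the hint
    continuation = model_generate(problem + hinted_prefix)  # model resumes, ideally writing code
    return hinted_prefix + continuation
```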
Based on these data generation strategies, CoRT employs an efficient and scalable training pipeline involving Supervised Fine-Tuning (SFT), Rejection Fine-Tuning (RFT), and Reinforcement Learning (RL):

  * SFT trains models on the generated prompt-hint or hint-engineering data.

  * RFT is applied to 32B models, filtering out trajectories with incorrect answers or inefficient code-usage patterns from a larger dataset (such as STILL3) to further refine the model's behavior.

  * RL is applied to smaller (1.5B) models after strong-to-weak distillation from the 32B models, using the GRPO algorithm with specific modifications for code integration.

Implementing RL for code-integrated reasoning involves several practical considerations:

  * Rollout with Code Interpreter: The RL environment allows multiple interactions between the model and the CI within a single reasoning trajectory, bounded by a maximum tool-usage budget $T$.

  * Persistent Execution Environment: A Jupyter-like environment is used in which state (variables, functions) persists across code blocks within a single problem-solving attempt, improving code efficiency and reducing errors.

  * Output Masking: CI execution results are masked out of the training loss, which enhances stability and prevents model collapse.

  * Reward Design: A dual reward combines an accuracy reward $R_a$, based on the final answer (extracted via rule-based verification from a specified format such as \boxed{}), with a code execution penalty $R_c$ for failed code calls. The total reward is $R = R_a + \omega R_c$, where $\omega$ is a weighting factor. Experiments showed that a small penalty ($\omega = 0.1$) was more effective than a larger one ($\omega = 0.5$), suggesting that modest encouragement of code correctness works better than heavily penalizing exploratory code. A minimal sketch of this reward is given after this list.
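
The following sketch illustrates the dual reward under stated assumptions: the \boxed{} extraction and string comparison are simplified placeholders for the paper's rule-based verifier, and treating $R_c$ as a single binary penalty for any failed code call is an assumption about its exact form.

```python
import re

# Minimal sketch of the dual reward R = R_a + omega * R_c described above.
# `extract_boxed` and the string comparison are simplified stand-ins for the
# paper's rule-based answer verification.

def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in the response, if any (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def compute_reward(response: str, reference_answer: str,
                   num_failed_code_calls: int, omega: float = 0.1) -> float:
    # Accuracy reward R_a: 1 if the extracted final answer matches the reference, else 0.
    predicted = extract_boxed(response)
    r_a = 1.0 if predicted is not None and predicted == reference_answer.strip() else 0.0
    # Code execution penalty R_c: assumed here to be -1 if any code call failed, else 0.
    r_c = -1.0 if num_failed_code_calls > 0 else 0.0
    return r_a + omega * r_c
```

With $\omega = 0.1$, a failed code call costs only 0.1 reward, so exploratory code that still leads to a correct boxed answer remains strongly rewarded; a larger $\omega$ would discourage the model from attempting code at all.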
The practical impact of CoRT is demonstrated through comprehensive evaluations on five challenging mathematical reasoning datasets (AIME24, AIME25, AMC23, MATH500, and OlympiadBench):

  * Hint-Engineering models show significant absolute accuracy improvements (4% for the 32B model and 8% for the 1.5B model) over baseline and prompt-hint models.

  * A major contribution is improved token efficiency: Hint-Engineering models use substantially fewer tokens per problem than prompt-hint or standard CoT models (about 30% fewer for the 32B model and 50% fewer for the 1.5B model on the AIME benchmarks), which is critical for reducing inference cost and latency in real-world applications. The analysis shows that Hint-Engineering is more efficient for both correct and incorrect responses.

  * Analysis of code behavior reveals that Hint-Engineering encourages a more balanced use of code for direct calculation, whereas prompt-hint leans towards verification; this optimized usage pattern contributes to the improved efficiency.

  * The RL stage, particularly for the 1.5B models, significantly boosts Pass@k performance, meaning the models are more likely to solve problems within a limited number of attempts, further improving practical inference efficiency. Ablation studies show that training on harder queries, while slower, leads to better final RL performance, and that the code reward is crucial for achieving optimal results.

Implementation details include training on 4 servers with 8 A100 GPUs each for SFT/RFT/RL, evaluating on single servers with 8 A100s, a maximum sequence length of 32,768 tokens during inference, and a limit of 15 tool calls, with a system message indicating when the limit is reached. The distillation process for the 1.5B models used a curated dataset of 10k problems drawn from various sources.

In summary, CoRT provides a practical framework for building more efficient and capable tool-integrated LRMs for mathematical reasoning. By focusing on high-quality data synthesis via strategic hinting, leveraging RFT for behavioral refinement, and employing tailored RL with an appropriate reward design and environment features, the framework achieves notable performance gains while significantly reducing token usage, making tool-integrated reasoning more viable for practical deployment.