START: Self-taught Reasoner with Tools (2503.04625v2)

Published 6 Mar 2025 in cs.CL

Abstract: Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by an LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.

Summary

  • The paper introduces START, a self-learning framework that integrates external tools into large reasoning models to enhance capabilities and mitigate issues like hallucinations.
  • START employs two techniques, Hint-infer for cueing tool use during inference and Hint-RFT for fine-tuning based on scored and filtered tool-augmented reasoning trajectories.
  • Evaluations show START significantly improves performance on benchmarks like GPQA (63.6%) and AMC23 (95.0%) compared to base models, demonstrating its effectiveness.

The paper "START: Self-taught Reasoner with Tools" introduces START, a novel framework aimed at enhancing the reasoning capabilities of large reasoning models (LRMs) by integrating external tools, addressing common issues such as hallucinations and inefficiencies. START introduces a self-learning framework that comprises two key techniques: Hint-infer and Hint Rejection Sampling Fine-Tuning (Hint-RFT).

Hint-infer Technique:

  • Hint-infer involves the insertion of artificially designed hints during the inference process of an LRM, stimulating its ability to utilize external tools. For example, hints like "Wait, maybe using Python here is a good idea" serve as cues for the model to invoke tool-supported operations.
  • This technique requires no demonstration data and also serves as a simple, effective method for sequential test-time scaling; a minimal sketch of the hint-insertion loop is given after this list.
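
The loop below is a minimal sketch of Hint-infer, not the paper's exact implementation: it assumes a generic `generate(prompt, stop_sequences)` completion function, reuses the hint wording quoted in the paper, and uses illustrative stop and tool markers (`</think>`, a fenced Python block); the actual prompting format of START is not reproduced here.

```python
from typing import Callable, List

HINT = "\nWait, maybe using Python here is a good idea.\n"

def hint_infer(question: str,
               generate: Callable[[str, List[str]], str],
               max_hints: int = 2) -> str:
    """Append a tool-use hint whenever the model is about to finish its
    reasoning without having written any Python code, then let it continue."""
    trajectory = question
    for _ in range(max_hints):
        # Stop just before the model closes its reasoning block.
        completion = generate(trajectory, ["</think>"])
        trajectory += completion
        if "```python" in completion:   # the model already chose to invoke the tool
            break
        trajectory += HINT              # inject the hint and resume generation
    # Final pass: let the model finish its answer without interruption.
    trajectory += generate(trajectory, ["<|im_end|>"])
    return trajectory
```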

Hint-RFT Technique:

  • Hint-RFT combines the principles of Hint-infer with rejection sampling and fine-tuning. Reasoning trajectories with tool invocations generated by an LRM via Hint-infer are scored, filtered, and modified, and the LRM is then fine-tuned on the curated trajectories; a sketch of this data-curation step follows below.
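
The snippet below sketches only the curation half of Hint-RFT under simplifying assumptions: scoring is reduced to answer correctness plus a tool-use check, the dictionary fields (`question`, `reasoning`, `final_answer`, `reference_answer`, `num_tool_calls`) are hypothetical names, and the paper's actual scoring and modification rules are richer than this.

```python
from typing import Dict, List

def keep_trajectory(traj: Dict) -> bool:
    """Keep only tool-augmented trajectories that reach the reference answer."""
    return (
        traj["final_answer"] == traj["reference_answer"]   # correct outcome
        and "```python" in traj["reasoning"]               # actually invoked the Python tool
        and traj["num_tool_calls"] <= 8                    # drop degenerate, tool-spamming runs
    )

def build_sft_dataset(trajectories: List[Dict]) -> List[Dict]:
    """Filter Hint-infer trajectories and keep one curated example per question."""
    seen, dataset = set(), []
    for traj in trajectories:                  # trajectories generated via Hint-infer
        if not keep_trajectory(traj):
            continue
        if traj["question"] in seen:           # deduplicate by question
            continue
        seen.add(traj["question"])
        dataset.append({"prompt": traj["question"], "response": traj["reasoning"]})
    return dataset

# The resulting (prompt, response) pairs are then used for ordinary supervised
# fine-tuning of the base LRM (QwQ-32B in the paper).
```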

START's integration with external tools allows for complex computations, self-checking, exploration of diverse methods, and self-debugging, effectively overcoming limitations of traditional LRMs that rely solely on internal reasoning processes.
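
On the tool-execution side, a hedged sketch of the idea: model-written Python blocks are run externally and their output, including tracebacks, is fed back into the reasoning context so the model can check its result or debug and retry. The subprocess-based runner below is a simplified stand-in for a properly sandboxed interpreter, not the paper's infrastructure.

```python
import subprocess
import sys

def run_python_block(code: str, timeout_s: int = 10) -> str:
    """Execute a model-written Python snippet in a fresh interpreter and
    return what the model should see next: stdout on success, the traceback
    on failure, or a timeout message."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        output = result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        output = f"TimeoutError: execution exceeded {timeout_s} seconds"
    return output.strip()

# The captured output is appended to the trajectory, after which generation
# resumes with the interpreter's result in context.
print(run_python_block("from math import comb; print(comb(10, 3))"))  # prints 120
```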

Performance Evaluation:

  • The START framework was implemented on the QwQ-32B model and evaluated on several benchmarks: PhD-level science QA (GPQA), the competition-level math benchmarks AMC23, AIME24, and AIME25, and the competition-level code benchmark LiveCodeBench.
  • START achieved accuracy rates of 63.6% on GPQA, 95.0% on AMC23, 66.7% on AIME24, 47.1% on AIME25, and 47.3% on LiveCodeBench.
  • These results demonstrate START's significant performance improvement over the base QwQ-32B model and its competitive standing relative to state-of-the-art models like R1-Distill-Qwen-32B and o1-Preview.

In summary, the paper highlights START's effectiveness in incorporating external tools for enhancing reasoning capabilities and provides a methodological advancement in the field of tool-integrated LLMs.
