
AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning (2507.21836v1)

Published 29 Jul 2025 in cs.CL

Abstract: LLMs, when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs). Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies. AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration. Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines, and exhibits superior generalization in tool-use behavior. These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs. The code and data are available at https://github.com/weiyifan1023/AutoTIR.


Summary

  • The paper introduces a novel RL-based framework that enables LLMs to autonomously decide when and which external tools to invoke, effectively balancing intrinsic capabilities with augmented reasoning.
  • It employs a dual reward mechanism—action and output rewards via Group Relative Policy Optimization—to optimize both tool-use efficiency and final answer accuracy.
  • Evaluation demonstrates significant gains in tool selection accuracy and reasoning performance across diverse tasks, validating the framework's scalability and practical impact.

AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning

Introduction

AutoTIR introduces an autonomous framework for enhancing reasoning in LLMs by integrating external tools through reinforcement learning. This methodology diverges from conventional static tool-use templates, providing a dynamic system that judiciously selects tools based on task demands, thereby preserving the core language competence of LLMs while extending their reasoning power.

Overall Framework of AutoTIR

The AutoTIR framework operates by allowing LLMs to autonomously determine when and which tools to invoke, significantly differing from prior approaches with fixed tool invocation strategies (Figure 1). This framework incorporates a hybrid reward mechanism focused on optimizing task-specific correctness and tool-use efficiency. The reward system is bifurcated into:

  1. Action Reward: This component guides the model in deciding whether a tool invocation is necessary, promoting correct tool selection while penalizing unnecessary or incorrect invocations.
  2. Output Reward: This element encourages the model to achieve high accuracy in the final output through effective integration of tool-derived results. (A minimal sketch of this hybrid reward appears after Figure 1 below.)

    Figure 1: Overall framework of AutoTIR. Top: Comparison between AutoTIR and existing paradigms (fixed reasoning strategy vs. autonomous decision). Bottom: GRPO training pipeline that incorporates multiple reasoning actions.
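The summary does not reproduce the exact reward formulation, so the following is a minimal Python sketch of how the two components might be combined. The function names, the format penalty, and the weighting `alpha` are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of AutoTIR's hybrid reward, based on the description above.
# Weights and penalty values are assumptions, not the paper's numbers.

def action_reward(tool_called: bool, tool_needed: bool) -> float:
    """Reward correct tool-use decisions; penalize unnecessary or missing calls."""
    if tool_called == tool_needed:
        return 1.0   # correct decision: invoke when needed, abstain otherwise
    return -1.0      # penalize wrong invocation decisions

def output_reward(prediction: str, reference: str, format_ok: bool) -> float:
    """Reward answer correctness and structured-output adherence."""
    correct = float(prediction.strip().lower() == reference.strip().lower())
    return correct if format_ok else correct - 0.5   # assumed format penalty

def hybrid_reward(tool_called: bool, tool_needed: bool, prediction: str,
                  reference: str, format_ok: bool, alpha: float = 0.5) -> float:
    """Weighted combination of the two components (alpha is an assumed hyperparameter)."""
    return alpha * action_reward(tool_called, tool_needed) + \
           (1 - alpha) * output_reward(prediction, reference, format_ok)
```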

Methodology

AutoTIR employs a reinforcement learning approach where LLMs explore tool-use strategies across diverse tasks. The action space involves deciding on tool invocation dynamically, adapting to the complexity of each task. This flexibility enables a balance between maintaining core linguistic capabilities and leveraging tools for enhanced reasoning.
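To make this action space concrete, here is a hypothetical sketch of how a rollout with explicit markup could be parsed to recover the tool-use decision and the final answer. The tag names (`<think>`, `<tool>`, `<answer>`) are assumptions for illustration and may not match the paper's exact output format.

```python
import re

# Hypothetical rollout markup; tag names are assumed, not taken from the paper.
TOOL_RE = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_rollout(text: str) -> dict:
    """Extract the (optional) tool call and the final answer from a model rollout."""
    tool_match = TOOL_RE.search(text)
    answer_match = ANSWER_RE.search(text)
    return {
        "tool_called": tool_match is not None,
        "tool_query": tool_match.group(1).strip() if tool_match else None,
        "answer": answer_match.group(1).strip() if answer_match else None,
        "format_ok": answer_match is not None,   # minimal adherence check
    }

# Example: a rollout that invokes a search tool before answering.
rollout = ("<think>I need the capital of France.</think>"
           "<tool>search: capital of France</tool>"
           "<answer>Paris</answer>")
print(parse_rollout(rollout))
```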

The RL agent is trained via Group Relative Policy Optimization (GRPO), optimizing decisions by evaluating rewards derived from both tool effectiveness and final answer accuracy. This strategy enables AutoTIR to generalize across multiple task domains, outperforming static tool-use methods.
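As a reference point, the group-relative advantage at the heart of GRPO can be sketched in a few lines: several rollouts are sampled for the same prompt, each is scored (here, by a reward such as the hybrid reward above), and each score is normalized against the group's mean and standard deviation, removing the need for a learned value function. This follows the standard GRPO normalization; the epsilon constant is an assumed numerical safeguard.

```python
import numpy as np

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage: normalize each rollout's reward against the
    mean and std of its sampled group (no value network required)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four rollouts sampled for the same prompt, scored by the hybrid reward.
print(grpo_advantages([1.0, 0.25, -0.5, 1.0]))
```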

Evaluation and Performance Metrics

AutoTIR has been evaluated across various datasets, demonstrating significant improvements over baselines in both knowledge-intensive and mathematical domains. Metrics include Exact Match (EM) for QA tasks, standard Accuracy for logical reasoning, and Soft Accuracy (SAcc) for instruction adherence tasks. The results highlight AutoTIR's ability to effectively utilize tools when beneficial, without compromising fundamental language skills, as shown by maintaining high SAcc scores.
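For concreteness, below is a minimal sketch of how Exact Match and Soft Accuracy are commonly computed. The normalization rules and the per-constraint reading of SAcc are assumptions, not the paper's precise definitions.

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and extra whitespace (a common EM normalization)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """Exact Match: 1.0 iff the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))

def soft_accuracy(satisfied_constraints: int, total_constraints: int) -> float:
    """Assumed reading of Soft Accuracy: fraction of instructions satisfied per example."""
    return satisfied_constraints / max(total_constraints, 1)

print(exact_match("Paris.", "paris"))  # 1.0
print(soft_accuracy(3, 4))             # 0.75
```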

Efficiency and Tool Selection

In dissecting tool utilization efficiency, AutoTIR exhibits superior performance in both tool selection accuracy (TS) and tool productivity (TP), particularly in complex reasoning tasks. This efficacy stems from its ability to minimize unnecessary tool usage while maximizing the contribution of invoked tools to problem solving.

These results are summarized in Table 1 and Figure 2.

Figure 2: Model Performance and Tool Advantage Across Reasoning Task Types.
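The summary does not give formal definitions of TS and TP, so the sketch below encodes one plausible reading: TS as agreement between the chosen action and a reference action, and TP as correctly solved examples per tool invocation. Both definitions are assumptions for illustration.

```python
def tool_selection_accuracy(decisions: list[str], references: list[str]) -> float:
    """TS (assumed definition): fraction of examples where the chosen action
    (a specific tool, or no tool) matches the reference action."""
    hits = sum(d == r for d, r in zip(decisions, references))
    return hits / max(len(references), 1)

def tool_productivity(num_correct_with_tools: int, num_tool_calls: int) -> float:
    """TP (assumed definition): correctly solved examples per tool invocation,
    i.e. how much each call actually contributes to problem solving."""
    return num_correct_with_tools / max(num_tool_calls, 1)

print(tool_selection_accuracy(["search", "none", "code"], ["search", "none", "none"]))  # 2/3
print(tool_productivity(12, 20))  # 0.6
```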

Scalability and Training Dynamics

The training process demonstrates a consistent increase in both action and output rewards, indicating the model's growing proficiency in integrating tools and generating solutions. As training progresses, the response length increases, suggesting more complex reasoning trajectories are being explored (Figure 3).

Figure 3: Avg. reward score and response length during training.

Conclusion

AutoTIR represents a substantial advancement in tool-integrated reasoning for LLMs, offering a flexible, adaptive strategy that respects the model's intrinsic capabilities while enhancing its reasoning power. This method not only improves performance across diverse tasks but also lays the groundwork for future developments in scalable, adaptive AI systems capable of sophisticated, real-time problem-solving across a wide array of domains. By utilizing reinforcement learning, AutoTIR ensures that tool-use strategies evolve organically, optimizing both efficiency and effectiveness in reasoning tasks.
