
Acting Less is Reasoning More! Teaching Model to Act Efficiently (2504.14870v2)

Published 21 Apr 2025 in cs.AI and cs.CL

Abstract: Tool-integrated reasoning (TIR) augments LLMs with the ability to invoke external tools, such as search engines and code interpreters, during long-form reasoning to solve tasks beyond the capabilities of internal reasoning. While reinforcement learning (RL) has shown promise in training such agents, most existing approaches optimize only for final correctness without considering the efficiency or necessity of external tool use. This often leads to excessive tool calling, incurring high computational costs and hindering the development of internal reasoning capabilities - a phenomenon known as \textit{cognitive offloading}. To this end, we propose Optimal Tool Call-controlled Policy Optimization (OTC-PO), a simple yet effective RL-based framework that encourages models to produce accurate answers with minimal tool calls. Our method introduces a tool-integrated reward that jointly considers answer correctness and the tool-use behavior of the model in reaching that answer. To validate the effectiveness, we introduce the metric of \textit{tool productivity}, defined as the ratio between the number of correct answers and the total number of tool calls across all test cases. This metric reflects how efficiently tool usage contributes to successful task completion, with higher values indicating smarter and more autonomous reasoning. We instantiate this framework within both Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), resulting in OTC-PPO and OTC-GRPO. Experiments with Qwen-2.5 and Qwen-Math across multiple QA benchmarks show that our approach reduces tool calls by up to 68.3\% and improves tool productivity by up to 215.4\%, while maintaining comparable answer accuracy.
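
Written out as a formula, the tool productivity metric described in the abstract is a single ratio over the test set; the notation below (an indicator of answer correctness and a per-case tool-call count) is chosen here for clarity rather than taken from the paper.

```latex
% Tool productivity (TP) over a test set of N cases:
% numerator counts correct answers, denominator counts all tool calls issued.
\mathrm{TP} \;=\; \frac{\sum_{i=1}^{N} \mathbf{1}\!\left[\hat{a}_i = a_i\right]}{\sum_{i=1}^{N} \mathrm{TC}_i}
```

Here $\hat{a}_i$ is the model's answer on case $i$, $a_i$ the gold answer, and $\mathrm{TC}_i$ the number of tool calls made on that case.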

Summary

Optimizing Tool Calls in AI with Reinforcement Learning

The paper "Acting Less is Reasoning More! Teaching Model to Act Efficiently" proposes Optimal Tool Call-controlled Policy Optimization (OTC-PO), a reinforcement learning (RL) approach for improving the efficiency and accuracy of tool-integrated reasoning (TIR) in LLMs. Rather than rewarding only correct answers, OTC-PO also governs when and how often the model invokes external tools such as search engines and code interpreters during reasoning.

Overview

TIR significantly expands the capabilities of LLMs by letting them interact with external tools, handling tasks that internal language reasoning alone cannot solve. However, previous approaches largely ignored the efficiency and cost of tool usage, leading either to excessive tool calls that inflate overhead or to insufficient use that compromises answer quality. OTC-PO addresses these challenges by integrating a reward into the RL framework that considers both the correctness of the answer and the efficiency of the tool use behind it.
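
The exact reward used by OTC-PO is not reproduced in this summary, so the snippet below is only a minimal sketch of the idea: a multiplicative shaping term that discounts a correct answer as the number of tool calls grows. The decay form, the `optimal_calls` default, and the `alpha` parameter are illustrative assumptions, not the paper's definition.

```python
def tool_integrated_reward(is_correct: bool, tool_calls: int,
                           optimal_calls: int = 0, alpha: float = 0.5) -> float:
    """Sketch of a reward coupling answer correctness with tool-call economy.

    Assumptions (not taken from the paper's text): a correct answer earns a
    base reward of 1.0, and every tool call beyond an assumed optimum shrinks
    that reward multiplicatively, so correct answers reached with fewer calls
    score highest. Wrong answers earn 0 regardless of tool use.
    """
    correctness = 1.0 if is_correct else 0.0
    excess_calls = max(tool_calls - optimal_calls, 0)
    efficiency = 1.0 / (1.0 + alpha * excess_calls)  # decays as calls grow
    return correctness * efficiency

# Example: a correct answer with 3 tool calls vs. a correct answer with none.
print(tool_integrated_reward(True, 3))  # 0.4
print(tool_integrated_reward(True, 0))  # 1.0
```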

The paper instantiates its framework within Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), yielding OTC-PPO and OTC-GRPO, respectively (a sketch of the group-relative side follows below). By rewarding tool efficiency, the approach optimizes the reasoning trajectory of the model, balancing internal reasoning against external tool calls. Importantly, the optimal number of tool calls is context-dependent: it varies across questions and models, so the policy must adapt to dynamically changing needs.
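
As a rough illustration of the GRPO instantiation, the sketch below normalizes tool-call-aware rewards within a group of sampled trajectories to obtain group-relative advantages. This is generic GRPO-style bookkeeping under assumed reward values, not the OTC-GRPO objective itself.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled trajectory's reward by its group's mean and
    standard deviation (GRPO-style). Generic sketch, not the exact OTC-GRPO
    formulation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Assumed rewards for four sampled answers to one question: correct answers
# reached with fewer tool calls have already been scored higher by a shaped
# reward like the one sketched earlier.
rewards = [1.0, 0.5, 0.0, 0.67]
print(group_relative_advantages(rewards))
```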

Numerical Results

The methodology was tested with Qwen-2.5 and Qwen-Math models across several question-answering (QA) benchmarks. Results show a substantial reduction in tool calls (up to 68.3%) and an increase in tool productivity (up to 215.4%) while preserving answer accuracy; that is, the efficiency gains come without compromising the correctness of the final answers. The authors present this as the first RL-based framework to explicitly optimize tool-use efficiency in TIR.
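
To make the reported relative numbers concrete, the sketch below computes tool productivity and the two percentage changes from raw counts. The counts are hypothetical placeholders chosen purely for illustration, not measurements from the paper.

```python
def tool_productivity(num_correct: int, total_tool_calls: int) -> float:
    """Correct answers per tool call over a benchmark (see metric above)."""
    return num_correct / max(total_tool_calls, 1)

# Hypothetical counts for a baseline policy and an OTC-trained policy.
baseline_tp = tool_productivity(num_correct=600, total_tool_calls=2400)  # 0.25
otc_tp      = tool_productivity(num_correct=590, total_tool_calls=760)   # ~0.78

call_reduction    = 1 - 760 / 2400        # fraction of tool calls saved
productivity_gain = otc_tp / baseline_tp - 1  # relative improvement in TP
print(f"tool calls reduced by {call_reduction:.1%}, "
      f"tool productivity up {productivity_gain:.1%}")
```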

Implications

The integration of OTC-PO into LLM training has both theoretical and practical implications. Theoretically, it extends RL for complex reasoning tasks with a more granular, context-aware reward that accounts for how an answer is reached, not only whether it is correct. Practically, it positions LLMs as more capable and cost-efficient agents: by encouraging accurate yet economical reasoning, OTC-PO can reduce the cost of deploying LLMs in computationally intensive environments where resource utilization is a critical factor.

Future Directions

As TIR becomes more prevalent, continued research could focus on refining reward shaping techniques to further align RL objectives with diverse real-world contexts. Additionally, exploring the scalability of this framework with increasingly complex tool chains and more extensive datasets will be key to understanding its full potential. Furthermore, incorporating more sophisticated cognitive elements into the framework, such as meta-reasoning and self-awareness, could lead to even more efficient AI reasoning processes, enhancing LLMs' decision-making capabilities. As models grow in complexity and capability, balancing efficiency in tool usage with the internal cognitive abilities of LLMs will be a crucial area of focus for future research in AI reasoning.
