Token-Budget-Aware LLM Reasoning

Published 24 Dec 2024 in cs.CL, cs.AI, and cs.LG | (2412.18547v5)

Abstract: Reasoning is critical for LLMs to excel in a wide range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM performance by decomposing problems into intermediate steps, they also incur significant overhead in token usage, leading to increased costs. We find that the reasoning process of current LLMs is unnecessarily lengthy and it can be compressed by including a reasonable token budget in the prompt, but the choice of token budget plays a crucial role in the actual compression effectiveness. We then propose a token-budget-aware LLM reasoning framework that dynamically adjusts the number of reasoning tokens based on the reasoning complexity of each problem. Experiments show that our method effectively reduces token costs in CoT reasoning with only a slight performance reduction, offering a practical solution to balance efficiency and accuracy in LLM reasoning. Code: https://github.com/GeniusHTX/TALE

Summary

The paper introduces TALE, a novel framework that dynamically estimates token budgets to optimize Chain-of-Thought reasoning in LLMs.
It employs a zero-shot estimator and binary search to balance token costs and accuracy, achieving a 68.64% reduction in token usage.
TALE demonstrates versatility across various LLM architectures, ensuring efficient and cost-effective performance in real-world applications.

Token-Budget-Aware LLM Reasoning

This paper introduces a framework called TALE that addresses token redundancy in Chain-of-Thought (CoT) reasoning for LLMs. The key contribution of this research is the dynamic estimation of token budgets that guide the reasoning process in LLMs, thereby balancing efficiency and accuracy.

Introduction to Token-Budget-Aware Framework

The TALE framework leverages token budget estimation to optimize the CoT reasoning process in LLMs. The central premise is that the reasoning processes of current LLMs are unnecessarily lengthy, leading to substantial overhead in token usage. By incorporating token budget constraints into the prompts, TALE aims to reduce token costs while maintaining high performance. This is achieved through dynamic estimation of token budgets based on the complexity of reasoning tasks.

Methodology

Workflow of TALE

TALE consists of two main components: budget estimation and prompt construction. The process begins by estimating a token budget for a given question, after which a token-budget-aware prompt is crafted. This prompt combines the question with the estimated budget, guiding the LLM to generate a concise yet accurate answer.

Figure 1: The workflow of TALE. Given a question, TALE first estimates the token budget using a budget estimator. It then crafts a token-budget-aware prompt by combining the question with the estimated budget. Finally, the prompt is input to the LLM to generate the answer as the final output.
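
The three steps in the workflow can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `estimate_budget` and `llm` callables, and the exact wording of the prompt template are all assumptions made for the example.

```python
from typing import Callable


def build_budget_prompt(question: str, budget: int) -> str:
    """Combine a question with a token budget.

    The template wording below is an assumption for illustration.
    """
    if budget <= 0:
        raise ValueError("budget must be a positive integer")
    return (
        f"{question}\n"
        f"Let's think step by step and use less than {budget} tokens:"
    )


def answer_with_budget(
    question: str,
    estimate_budget: Callable[[str], int],  # hypothetical budget estimator
    llm: Callable[[str], str],              # hypothetical model call
) -> str:
    """Estimate a budget, build the budget-aware prompt, then query the model."""
    budget = estimate_budget(question)
    prompt = build_budget_prompt(question, budget)
    return llm(prompt)
```

Passing the estimator and the model as plain callables keeps the sketch independent of any particular LLM client library.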

Budget Estimation Techniques

TALE uses a zero-shot estimator: the reasoning LLM itself is prompted to estimate the number of tokens required to answer a question. This approach mirrors human reasoning, where the effort required to solve a problem can often be estimated at a glance. For more precise budget estimation, two alternatives are also explored: a regression model, and internalizing budget constraints in the LLM through fine-tuning.
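
A zero-shot estimator of this kind might look like the sketch below. The estimation prompt, the `llm` callable, the fallback value of 256, and the integer-parsing step are assumptions for illustration; the summary does not specify how the model's reply is turned into a number.

```python
import re
from typing import Callable


def zero_shot_budget(
    question: str,
    llm: Callable[[str], str],  # hypothetical model call returning text
    default: int = 256,         # assumed fallback when no number is found
) -> int:
    """Ask the model how many tokens it needs and parse the first integer."""
    prompt = (
        "Task: Analyze the given question and estimate the minimum number of "
        "tokens needed to reason through it and answer it. "
        "Reply with a single integer.\n"
        f"Question: {question}"
    )
    reply = llm(prompt)
    match = re.search(r"\d+", reply)
    if match is None:
        return default
    value = int(match.group())
    # Treat a zero estimate as unusable and fall back to the default.
    return value if value > 0 else default
```

The fallback matters in practice because a free-text reply is not guaranteed to contain a usable number.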

Optimal Token Budget Search

To find the optimal budget, a binary search identifies the smallest budget that still preserves answer correctness, which minimizes token cost. The search reveals a phenomenon termed "Token Elasticity": when the budget is set too small, the LLM fails to follow the constraint, and actual token consumption rises instead of falling.
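
The search can be sketched as a standard binary search over candidate budgets. The `is_feasible` callable is hypothetical: in a real run it would query the LLM with the given budget and check the answer. The sketch also assumes feasibility is monotone (if a budget works, every larger budget works too), which is a simplification; because of token elasticity, a practical feasibility check would presumably also compare measured token usage against the previous attempt.

```python
from typing import Callable


def search_min_budget(
    is_feasible: Callable[[int], bool],  # hypothetical: does this budget keep the answer correct?
    upper: int,                          # e.g. token count of an unconstrained CoT answer
    lower: int = 1,
) -> int:
    """Return the smallest budget in [lower, upper] that is feasible.

    Assumes monotone feasibility; raises if even the upper bound fails.
    """
    if not is_feasible(upper):
        raise ValueError("upper bound itself is not feasible")
    lo, hi = lower, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if is_feasible(mid):
            hi = mid        # mid works, so the minimum is at or below mid
        else:
            lo = mid + 1    # mid fails, so the minimum is above mid
    return hi
```

Each probe costs one or more model calls, so the logarithmic number of probes is the main reason to prefer binary search over a linear sweep.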

Evaluation

Effectiveness of TALE

TALE was evaluated across several datasets, demonstrating a notable reduction in token costs. On average, TALE achieved a 68.64% reduction in token usage with less than a 5% decrease in accuracy compared to vanilla CoT. This efficiency in token usage translates directly to lower computational costs, making TALE a compelling choice for scenarios requiring budget-conscious reasoning.
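
For readers who want to compute the same kind of figure on their own runs, a token reduction percentage is simply the relative drop from the baseline token count. The function name and the numbers in the comment are made up for illustration and are not results from the paper.

```python
def token_reduction_pct(baseline_tokens: float, budgeted_tokens: float) -> float:
    """Percentage of tokens saved relative to the baseline (vanilla CoT) count."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - budgeted_tokens) / baseline_tokens


# Hypothetical example: 400 tokens with vanilla CoT, 100 with a budget.
print(token_reduction_pct(400, 100))  # 75.0
```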

Figure 2: The effects of optimal searched budget. CoT with our optimal searched budget reduces the token costs significantly without influencing the accuracy.

Generalization Across LLMs

TALE's ability to generalize across different LLM architectures was also analyzed. The framework maintained its cost-efficiency while delivering high performance across models like Yi-lightning and GPT-4 variants. This adaptability showcases TALE's potential for diverse applications in real-world LLM deployments.

Conclusion

In conclusion, TALE introduces a novel approach to managing token costs in CoT reasoning by incorporating budget awareness into the prompting process. The method significantly reduces token costs while keeping the accuracy loss small. Future developments may involve refining budget estimation techniques and exploring broader applications of budget-aware reasoning in LLMs.