Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs (2412.21187v2)

Published 30 Dec 2024 in cs.CL

Abstract: The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: How to intelligently and efficiently scale computational resources during testing. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of testsets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.

Summary

  • The paper quantifies the overthinking phenomenon in o1-like LLMs by introducing novel efficiency metrics that measure redundant computation after the first correct answer.
  • It employs MDL-based preference optimization techniques, including SFT, DPO, RPO, and SimPO, to train models to generate concise and correct reasoning paths.
  • Experiments reveal significant token reductions—over 50% in some cases—while largely maintaining or slightly improving reasoning accuracy on benchmark datasets.

This paper investigates the phenomenon of "overthinking" in LLMs that scale inference-time computation, exemplified by OpenAI's o1 and analogous systems (termed "o1-like" models) (2412.21187). While these models achieve strong performance on complex reasoning tasks via extended Chain-of-Thought (CoT) generation and exploration of multiple solution strategies, they often allocate excessive computational resources to simple problems, resulting in significant inefficiency. The research introduces metrics to quantify this inefficiency and proposes mitigation strategies rooted in the Minimum Description Length (MDL) principle to better calibrate computational resource use during inference.

Problem Definition and Quantification of Overthinking

The core issue addressed is the tendency of o1-like models, such as QwQ-32B-Preview and DeepSeek-R1-Preview, to generate excessively long responses containing multiple, often redundant, solution attempts, irrespective of the input problem's intrinsic difficulty. This contrasts with conventional LLMs that typically produce a single, direct solution path. Empirical analysis across mathematical reasoning benchmarks (ASDIV, GSM8K, MATH500) reveals that these models generate, on average, 3.2-3.6 distinct solution attempts per query (Figure 3 in the paper), consuming substantially more tokens (~700-2400 tokens on average) compared to baselines like Llama-3.3-70B-Instruct (~150-600 tokens) (Table 1).

To formally quantify this inefficiency, two novel metrics suites are introduced:

  1. Outcome Efficiency (ξ_O, ξ_O^+): These metrics assess the computational cost incurred after the first correct answer is generated within a response. ξ_O is defined as the ratio of tokens in the shortest prefix yielding the first correct answer to the total tokens generated (Eq 1). ξ_O^+ is a stricter variant that considers only valid responses (Eq 2). Low outcome efficiency indicates significant computation spent on redundant verification or exploration after the solution is reached (a minimal computational sketch follows this list).
  2. Process Efficiency (ξ_P, ξ_P^+): These metrics evaluate the diversity and conciseness of the reasoning process itself. They measure the ratio of tokens contributing to distinct solution perspectives to the total tokens generated (Eq 3, 4). Low process efficiency suggests repetitive or non-contributory reasoning steps within the generated CoT.
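
To make outcome efficiency concrete, below is a minimal sketch (not the authors' implementation) of one plausible way to compute ξ_O over a set of responses. The `first_correct_prefix_tokens` and `total_tokens` helpers are hypothetical stand-ins for tokenization and answer-checking logic that the paper does not specify.

```python
from typing import Callable, List, Optional

def outcome_efficiency(
    responses: List[str],
    first_correct_prefix_tokens: Callable[[str], Optional[int]],  # hypothetical helper: token count
    # of the shortest prefix containing the first correct answer, or None if never correct
    total_tokens: Callable[[str], int],  # hypothetical helper: total tokens in a response
    only_valid: bool = False,  # True approximates the stricter variant, xi_O^+
) -> float:
    """Average ratio of tokens needed to reach the first correct answer to the
    total tokens generated; responses that never become correct contribute 0
    (or are skipped entirely when only_valid=True)."""
    ratios = []
    for resp in responses:
        n_first = first_correct_prefix_tokens(resp)
        if n_first is None:
            if not only_valid:
                ratios.append(0.0)
            continue
        ratios.append(n_first / total_tokens(resp))
    return sum(ratios) / len(ratios) if ratios else 0.0
```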

Calculations show o1-like models exhibit significantly lower efficiency scores compared to standard models. For instance, QwQ-32B-Preview displayed outcome efficiencies (ξ_O) between 41-52% and process efficiencies (ξ_P) between 66-72% on the tested datasets (Table 1), quantitatively confirming the overthinking behavior. Further analysis indicated diminishing returns, where later solution attempts provided marginal accuracy gains (Figure 4) and exhibited reduced diversity (Figure 5).

Mitigation Strategy via Minimum Description Length Principle

The proposed mitigation strategy draws inspiration from Occam's Razor and the Minimum Description Length (MDL) principle, suggesting that the most efficient (shortest) correct reasoning path should be preferred. The objective is formulated as finding the rationale r that minimizes its length ||r|| subject to correctness, given an input x and model parameters θ:

$$\hat{r} = \arg\min_{r} \{\, \|r\| \mid r \sim p(\cdot \mid x;\, \theta) \,\} \quad \text{s.t. } \operatorname{correctness}(r,\, \operatorname{answer}(x))$$

To operationalize this, a self-training paradigm is employed. The base o1-like model (QwQ-32B-Preview) generates multiple (K=10) candidate responses for problems from a suitable dataset (PRM12K). These responses are then used to construct training data for preference optimization, aiming to teach the model to favor efficient reasoning.

Two primary strategies for constructing preference pairs (r_preferred, r_rejected) were explored:

  1. Response-Level Contrast: Pairs are formed by contrasting the shortest correct response generated for a problem against the longest correct response for the same problem (a construction sketch follows this list). Statistical analysis of the generated responses justified this approach (Figures 6, 7).
  2. Solution-Level Contrast: This approach, implemented via SimPO_Solution, likely contrasts concise, correct individual solutions within a response against less efficient solutions, or against the entire lengthy response containing them, encouraging conciseness at a finer granularity.
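
As an illustration of the response-level contrast, the sketch below builds (preferred, rejected) pairs by sampling K candidate responses per problem and pairing the shortest correct response with the longest correct one. The `sample_responses` and `is_correct` callables are hypothetical stand-ins for model sampling and answer checking, and character length is used as a crude proxy for token length; this is a sketch of the idea, not the paper's code.

```python
from typing import Callable, Dict, List

def build_response_level_pairs(
    problems: List[dict],                               # each item: {"question", "answer"}
    sample_responses: Callable[[str, int], List[str]],  # hypothetical: draw K model samples
    is_correct: Callable[[str, str], bool],             # hypothetical: check the final answer
    k: int = 10,
) -> List[Dict[str, str]]:
    pairs = []
    for prob in problems:
        candidates = sample_responses(prob["question"], k)
        correct = [c for c in candidates if is_correct(c, prob["answer"])]
        if len(correct) < 2:
            continue  # need at least two correct responses to form a contrast
        shortest = min(correct, key=len)  # preferred: most concise correct response
        longest = max(correct, key=len)   # rejected: most verbose correct response
        pairs.append({"prompt": prob["question"], "chosen": shortest, "rejected": longest})
    return pairs
```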

These preference pairs are then used to fine-tune the base model using various techniques:

  • Supervised Fine-Tuning (SFT): Training directly on the preferred (shortest correct) responses.
  • Preference Optimization Algorithms: Utilizing methods such as Direct Preference Optimization (DPO), Rank-based Preference Optimization (RPO), and Simple Preference Optimization (SimPO) with the constructed contrastive pairs (a SimPO loss sketch follows this list).
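
For concreteness, here is a minimal PyTorch-style sketch of the SimPO objective as defined in the original SimPO work: a length-normalized implicit reward with no reference model and a target margin γ. The hyperparameter values are illustrative assumptions, not necessarily the configuration used in this paper.

```python
import torch
import torch.nn.functional as F

def simpo_loss(
    chosen_logps: torch.Tensor,    # summed token log-probs of preferred (concise) responses
    rejected_logps: torch.Tensor,  # summed token log-probs of rejected (verbose) responses
    chosen_lens: torch.Tensor,     # token counts of preferred responses
    rejected_lens: torch.Tensor,   # token counts of rejected responses
    beta: float = 2.0,             # reward scale (illustrative value)
    gamma: float = 0.5,            # target reward margin (illustrative value)
) -> torch.Tensor:
    # Length-normalized implicit rewards; unlike DPO, no reference model is required.
    chosen_rewards = beta * chosen_logps / chosen_lens
    rejected_rewards = beta * rejected_logps / rejected_lens
    # Push the concise correct response to beat the verbose one by at least gamma.
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()
```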

Experimental Evaluation and Results

Experiments were conducted using QwQ-32B-Preview as the base model, fine-tuned with the proposed methods, and evaluated on ASDIV, GSM8K, MATH500, AIME, and GPQA datasets.

Key findings include:

  • Efficiency Improvements: Preference optimization methods, particularly SimPO, significantly reduced computational overhead. SimPO_Solution demonstrated the most substantial reductions, decreasing token counts by over 50% and average solution rounds from 3.2 to 1.1 on GSM8K (Tables 2, 3). This effectively shifted the model's behavior towards generating single, concise solutions similar to conventional LLMs, while largely preserving accuracy on easier datasets.
  • Accuracy Maintenance: Most methods maintained or slightly improved accuracy compared to the baseline QwQ-32B-Preview, especially on simpler datasets (ASDIV, GSM8K). For instance, DPO and RPO methods applied at the response level slightly improved accuracy on GSM8K and MATH500 while reducing token counts (Table 2).
  • Trade-offs on Complex Tasks: The highly efficient SimPO_Solution method exhibited a slight degradation in accuracy on the most challenging datasets (MATH500, AIME), suggesting a potential trade-off: excessive pruning of the search space may hinder performance on problems that genuinely require extensive exploration.
  • Reduction in Redundant Computation: The methods successfully reduced "Additional Response" computation (tokens/rounds generated after the first correct answer), indicating improved calibration between computational effort and task completion (Tables 2, 3).
  • State-of-the-Art Claims: The paper reports achieving state-of-the-art performance on MATH500 (value placeholder xx% used) and improved results on GPQA and AIME relative to other open-source models (Table 5), attributing this partly to the mitigation of overthinking and associated phenomena like inversion errors (correct answers being erroneously revised).

Implications for Computational Efficiency

The research carries significant implications for the practical deployment and operational cost of advanced reasoning LLMs:

  1. Reduced Inference Costs: Mitigating overthinking directly translates to lower computational expenditure per query, as fewer tokens are generated. This reduction (potentially >50% with methods like SimPO_Solution) lowers the operational costs associated with GPU time and energy consumption.
  2. Lower Inference Latency: Generating fewer tokens reduces the time required for inference, improving response times and making models more suitable for interactive or real-time applications.
  3. Increased System Throughput: By decreasing the computational load per request, serving infrastructures can handle a higher volume of concurrent requests, enhancing overall system throughput and user capacity.
  4. Efficient Resource Allocation: Models trained to avoid overthinking on simpler tasks free up valuable computational resources. These resources can then be directed towards handling more users or allocated to tackle genuinely complex problems that benefit from extended computational budgets.
  5. Refined Scaling Perspectives: The findings challenge the notion that simply increasing inference-time computation universally improves reasoning. It underscores the necessity of calibrated computation, aligning resource allocation with task complexity, potentially influencing future designs for adaptive computation mechanisms in LLMs.

Conclusion

This work provides a systematic analysis of the "overthinking" inefficiency observed in o1-like LLMs, characterized by excessive computational effort on simple tasks (2412.21187). By introducing novel efficiency metrics (ξ_O, ξ_P) and leveraging the MDL principle via preference optimization techniques (SFT, DPO, RPO, SimPO) trained on self-generated contrastive data, the authors demonstrate effective mitigation strategies. These methods significantly reduce computational overhead (token count, solution rounds) while largely maintaining or even improving reasoning accuracy, particularly on less complex benchmarks, contributing towards more computationally efficient and practical deployment of advanced reasoning models.
