Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
(2501.18585v2)
Published 30 Jan 2025 in cs.CL
Abstract: LLMs such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with a thought switching penalty (TIP) that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
Summary
The paper defines underthinking as o1-like models prematurely abandoning promising reasoning steps, leading to significantly inefficient token use.
It presents analysis showing that incorrect responses use up to 225% more tokens and switch thoughts 418% more often than correct ones, and introduces a novel metric to quantify this underthinking.
The study proposes a decoding strategy with a thought-switching penalty, demonstrating improved accuracy on challenging math and science datasets.
The paper provides an in-depth analysis of a reasoning inefficiency, termed “underthinking,” in LLMs that emulate deep chain-of-thought processes. The authors investigate o1-like models such as QwQ-32B-Preview and DeepSeek-R1-671B and find that these models frequently switch reasoning strategies prematurely. This behavior manifests as an early abandonment of promising reasoning paths, even when the initial thoughts are correct, leading to long generations that fail to converge to a correct final answer.
The paper’s key technical contributions can be summarized as follows:
Characterization of Underthinking:
The paper formally defines underthinking as the phenomenon where o1-like models, faced with challenging math and science problems (as demonstrated on datasets such as MATH500, GPQA Diamond, and AIME), frequently switch between reasoning thoughts without sufficiently developing any one line of thought. Quantitative analysis shows that incorrect responses use up to 225% more tokens and switch thoughts 418% more often than correct responses. Early-stage thoughts in incorrect responses often contain correct reasoning elements that are not fully explored.
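These percentages compare average usage in incorrect versus correct responses. As a point of reference, here is a minimal sketch of how such relative increases can be computed from per-response statistics; the function name and the sample numbers are illustrative, not the paper's data.

```python
def relative_increase_pct(incorrect_vals, correct_vals):
    """Percent increase of the mean statistic (e.g. token count or number of
    thought switches) in incorrect responses relative to correct ones."""
    mean_incorrect = sum(incorrect_vals) / len(incorrect_vals)
    mean_correct = sum(correct_vals) / len(correct_vals)
    return 100.0 * (mean_incorrect - mean_correct) / mean_correct

# Illustrative numbers only: incorrect responses averaging 6500 tokens against
# 2000 tokens for correct ones correspond to a 225% increase.
print(relative_increase_pct([6400, 6600], [1900, 2100]))  # 225.0
```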
Underthinking Metric:
A novel metric is introduced to measure token efficiency in incorrect responses, defined as
$$\xi_{UT} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{\hat{T}_i}{T_i}\right)$$
where:
- $N$ is the number of incorrect responses,
- $T_i$ is the total token count of the $i$-th response, and
- $\hat{T}_i$ is the token count up to and including the first correct reasoning thought.
A higher $\xi_{UT}$ indicates that a larger share of the generated sequence fails to contribute to a correct solution, thereby quantifying inefficient generation due to underthinking.
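To make the metric concrete, the following is a minimal sketch of computing $\xi_{UT}$ from per-response token counts. The function name is hypothetical, and it adopts the convention $\hat{T}_i = T_i$ when a response contains no correct thought (so such a response contributes zero); the paper's exact handling of that case is not reproduced here.

```python
from typing import Sequence

def underthinking_score(total_tokens: Sequence[int],
                        tokens_to_first_correct: Sequence[int]) -> float:
    """xi_UT = (1/N) * sum_i (1 - T_hat_i / T_i) over incorrect responses.

    total_tokens[i]            -- T_i: total tokens in the i-th incorrect response
    tokens_to_first_correct[i] -- T_hat_i: tokens up to and including the first
                                  correct thought (taken as T_i when no thought
                                  is correct, so that term contributes zero)
    """
    n = len(total_tokens)
    return sum(1.0 - t_hat / t
               for t, t_hat in zip(total_tokens, tokens_to_first_correct)) / n

# Example: a 3000-token incorrect response whose first correct thought already
# ends at token 800 wastes most of its budget, so it scores high.
print(underthinking_score([3000, 2500], [800, 2500]))  # ≈ 0.37
```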
Analysis of Thought-Switching Behavior:
By segmenting responses into individual reasoning thoughts, the paper shows that incorrect answers exhibit a significantly higher frequency of thought switching. Even when initial thoughts are correct (as confirmed by auxiliary assessments using distilled versions of DeepSeek-R1-671B), these thoughts are often abandoned in favor of additional, unproductive reasoning steps. Detailed breakdowns in the paper's figures compare the distribution of thought correctness across thought indices within incorrect responses, highlighting the premature transitions.
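This thought-level analysis presupposes a way to split a generation into individual thoughts. Below is a rough sketch under the assumption that thoughts are delimited by surface markers such as “alternatively”; the marker list and example text are illustrative, not the paper's exact segmentation procedure.

```python
import re

# Illustrative thought-switch markers; the paper points to tokens such as
# "alternatively" -- this particular list is an assumption.
SWITCH_PATTERN = re.compile(
    r"\balternatively\b|\banother approach\b|\blet me try a different\b",
    flags=re.IGNORECASE,
)

def segment_thoughts(response: str) -> list[str]:
    """Split a generated solution into reasoning thoughts at switch markers."""
    parts = SWITCH_PATTERN.split(response)
    return [p.strip() for p in parts if p.strip()]

solution = ("Setting x = 2y gives 3y^2 = 12, so y = 2 and x = 4. "
            "Alternatively, substitute directly into the second equation...")
thoughts = segment_thoughts(solution)
print(len(thoughts) - 1)  # number of thought switches in this response: 1
```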
Decoding Strategy with Thought Switching Penalty (TIP):
To mitigate underthinking, the authors propose modifying the standard softmax-based decoding process by applying a penalty to tokens associated with thought transitions. Letting V represent the set of tokens indicating a thought switch (e.g., “alternatively”), the logits are adjusted as:
$$\hat{z}_{t,v} = \begin{cases} z_{t,v} - \alpha, & \text{if } v \in V \text{ and } t < \Psi + \beta, \\ z_{t,v}, & \text{otherwise,} \end{cases}$$
where α (penalty strength) and β (penalty duration) are hyperparameters controlling how strongly premature switching is discouraged. This modification encourages the model to elaborate on its current thought rather than switching early. A grid search over α and β shows that a moderate penalty (for example, α = 3 and β = 600) improves accuracy across the challenging test sets.
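As a minimal sketch of this per-step logit adjustment, the snippet below assumes Ψ marks the position where the current thought begins (an interpretation, since the paper's precise definition of Ψ is not reproduced here); the function name and token-id values are illustrative.

```python
import numpy as np

def tip_adjust(logits: np.ndarray,
               switch_token_ids: set[int],
               t: int, psi: int,
               alpha: float = 3.0, beta: int = 600) -> np.ndarray:
    """Thought-switching penalty applied to the logits z_{t,.} at step t.

    switch_token_ids -- ids of tokens that signal a thought switch
                        (e.g. "alternatively"); tokenizer-specific
    psi              -- assumed here to be where the current thought started
    alpha, beta      -- penalty strength and duration; the paper reports
                        alpha = 3, beta = 600 working well in a grid search
    """
    adjusted = logits.copy()
    if t < psi + beta:                 # still inside the penalty window
        for v in switch_token_ids:
            adjusted[v] -= alpha       # hat{z}_{t,v} = z_{t,v} - alpha
    return adjusted

# Example: penalize a (hypothetical) token id for " alternatively" early in a thought.
vocab_logits = np.zeros(32000)
penalized = tip_adjust(vocab_logits, switch_token_ids={1234}, t=150, psi=0)
print(penalized[1234])  # -3.0
```

In a real decoding loop this adjustment would be applied at every step before sampling, for instance as a custom logits processor, but the core operation is the subtraction above.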
Empirical Validation:
Experimental results indicate that, with the TIP strategy, there is a measurable improvement in accuracy on datasets such as MATH500-Hard, GPQA Diamond, and AIME. For instance, on the AIME dataset, applying TIP improves Pass@1 accuracy from 41.7% to 45.8% while also reducing the underthinking score. This supports the interpretation that guiding the model to commit to and deepen promising reasoning steps leads to more reliable and accurate problem solving.
Overall, the paper contributes a rigorous empirical framework and a practical decoding modification to address and quantify underthinking in o1-like LLMs. By balancing token efficiency with reasoning depth, the work offers a method to enhance problem-solving capabilities without the need for additional model fine-tuning.