Analysis of "Missing Premise Exacerbates Overthinking: Are Reasoning Models Losing Critical Thinking Skill?"
The manuscript "Missing Premise Exacerbates Overthinking: Are Reasoning Models Losing Critical Thinking Skill?" examines a significant inefficiency in reasoning LLMs confronted with ill-posed questions. The authors identify a failure mode they term "MiP-Overthinking": when a question is missing a premise (MiP) needed to answer it, reasoning models fail to recognize that it is unsolvable and instead keep reasoning at length.
Key Findings
The paper shows that, when given ill-posed questions with missing premises, reasoning models generate unnecessarily long responses filled with repetitive, redundant reasoning. The extra tokens neither improve performance nor lead the models to abstain promptly from unsolvable problems. This runs counter to the test-time scaling expectation that longer reasoning traces should yield better conclusions.
Surprisingly, models not specifically trained for reasoning handle these cases better: they more readily recognize ill-posed queries and abstain from answering them. These non-reasoning models produce far shorter responses and quickly point out the missing information, showing more robust behavior under these conditions.
Methodology
The authors formally define "Missing Premise" (MiP) and use the definition to construct datasets designed to elicit this overthinking flaw, combining synthetic questions with modified versions of existing benchmarks such as SVAMP, GSM8K, and MATH500. They then systematically compare several state-of-the-art LLMs trained with different recipes, spanning both open-source and proprietary systems.
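The paper's exact construction pipeline is not reproduced here, but a minimal sketch of the idea, assuming a simple rule of deleting one quantitative premise from an otherwise well-posed word problem, might look as follows (the function name and removal rule are illustrative, not the authors'):

```python
import re

def make_mip_variant(question: str) -> str:
    """Illustrative rule: drop the first sentence that contains a number,
    removing a quantitative premise the answer depends on."""
    sentences = re.split(r"(?<=[.!?])\s+", question.strip())
    kept, removed = [], False
    for sentence in sentences:
        if not removed and re.search(r"\d", sentence):
            removed = True  # this premise is deleted
            continue
        kept.append(sentence)
    return " ".join(kept)

well_posed = ("Ali has 12 apples. He gives 5 apples to his sister. "
              "How many apples does Ali have left?")
print(make_mip_variant(well_posed))
# -> "He gives 5 apples to his sister. How many apples does Ali have left?"
```

The resulting question is unanswerable because the starting quantity is gone, which is precisely the situation the MiP datasets are built to probe.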
Using metrics such as response length, accuracy on well-defined queries, and abstain rate on MiP problems, the researchers draw clear distinctions between reasoning and non-reasoning models. Step-level similarity analysis and word-count distributions further expose inefficiencies in the models' thinking patterns and the absence of critical checking when a premise is missing.
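As a rough illustration of how such metrics could be computed, the sketch below assumes keyword-based abstention detection and token-overlap similarity between reasoning steps; the marker list and the Jaccard measure are assumptions, not the paper's exact definitions:

```python
from itertools import combinations

# Illustrative abstention markers; the paper's detection criterion may differ.
ABSTAIN_MARKERS = ("cannot be determined", "not enough information",
                   "missing information", "unanswerable")

def abstain_rate(responses: list[str]) -> float:
    """Fraction of responses that explicitly flag the question as unanswerable."""
    if not responses:
        return 0.0
    hits = sum(any(m in r.lower() for m in ABSTAIN_MARKERS) for r in responses)
    return hits / len(responses)

def mean_response_length(responses: list[str]) -> float:
    """Average word count, a rough proxy for token usage."""
    return sum(len(r.split()) for r in responses) / len(responses)

def mean_step_similarity(steps: list[str]) -> float:
    """Mean pairwise Jaccard overlap between reasoning steps;
    high values suggest repetitive, circling reasoning."""
    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    pairs = list(combinations(steps, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
```

Comparing these numbers on well-posed versus MiP inputs is what separates models that notice the missing premise from those that loop on it.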
Implications and Speculations
This examination not only pinpoints a critical flaw in current reasoning models but also questions the efficacy of the reinforcement learning and supervised fine-tuning approaches used to train them. While these methods have been effective at extending reasoning capability, they evidently fall short of instilling the critical thinking needed to recognize ill-posed questions.
The findings suggest reassessing LLM training paradigms, for example by adding training signals or stopping criteria that encourage models to halt reasoning when a query is unsolvable. This calls for algorithmic strategies that prioritize recognizing and abstaining from questions with missing information over generating ever longer responses.
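One conceivable form such a strategy could take at inference time is an early-exit guard that halts decoding once the model either flags the question as unanswerable or starts repeating itself. The sketch below is a hypothetical illustration, not a method proposed in the paper, and its thresholds and marker phrases are assumptions:

```python
def _jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def should_stop_reasoning(steps: list[str], repeat_threshold: float = 0.7) -> bool:
    """Hypothetical early-exit check (not from the paper): stop once the model
    signals abstention, or once the newest step largely repeats an earlier one."""
    if not steps:
        return False
    markers = ("cannot be determined", "not enough information",
               "missing information", "unanswerable")
    latest = steps[-1].lower()
    if any(m in latest for m in markers):
        return True
    return any(_jaccard(steps[-1], prev) > repeat_threshold for prev in steps[:-1])
```

A guard like this would run between reasoning steps, truncating the chain of thought and prompting the model to state that a necessary premise is missing rather than continuing to elaborate.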
Moreover, as these systems are deployed more widely, equipping them with critical thinking becomes increasingly important. The paper points to future research on integrating such capabilities, pushing reasoning LLMs toward more robust, error-tolerant behavior.
Conclusion
Through systematic analysis, the paper uncovers a critical inefficiency in reasoning models and underscores the need for improvements in how they are trained. The work serves as both a warning and a guide for future research, encouraging training regimes that foster genuine critical reasoning. Addressing these issues is essential if LLMs are to perform reliably across a diverse range of problem domains.