
Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs (2505.00127v1)

Published 30 Apr 2025 in cs.CL and cs.AI

Abstract: LLMs are increasingly optimized for long reasoning, under the assumption that more reasoning leads to better performance. However, emerging evidence suggests that longer responses can sometimes degrade accuracy rather than improve it. In this paper, we conduct a systematic empirical study of the relationship between reasoning length and answer correctness. We find that LLMs tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones, failing to extend their reasoning when it is most needed. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Furthermore, we investigate the effects of length reduction with a preference optimization algorithm when simply preferring the shorter responses regardless of answer correctness. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy. Our findings highlight generation length as a meaningful signal for reasoning behavior and motivate further exploration into LLMs' self-awareness in reasoning length adaptation.

Summary

Empirical Analysis of Reasoning Length and Accuracy in LLMs

The paper "Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs" addresses the intricate relationship between reasoning length and answer correctness in LLMs. With the advent of reasoning-oriented LLMs such as OpenAI o1 and DeepSeek-R1, there has been a growing emphasis on enhancing System-2 thinking capabilities through extended reasoning chains. This research investigates the assumption that increasing reasoning length invariably leads to improved performance and problem-solving capabilities.

Key Findings

  1. Non-linear Relationship between Reasoning Length and Correctness: The paper reveals a non-monotonic relationship between reasoning length and accuracy. For a fixed question, accuracy initially improves with increased reasoning length, but beyond a certain threshold, excessive reasoning leads to a decline in performance. This pattern was observable across different datasets and models, emphasizing that overly verbose reasoning can introduce compounding errors and degrade accuracy.
  2. Impact of Question Difficulty: The analysis demonstrates that models often generate longer responses for difficult questions, yet fail to consistently adapt to perceived difficulty. For relatively easy questions, models often produce unnecessarily long reasoning (overthinking), while for questions deemed difficult they sometimes fail to extend their reasoning enough (underthinking), underscoring a misjudgment of difficulty levels.
  3. Optimization for Shorter Responses: The paper investigates a preference optimization algorithm that encourages shorter generation lengths irrespective of correctness signals. This approach maintained acceptable accuracy levels while significantly reducing the average token length by approximately 30% to 60%. Interestingly, incorrect responses, typically longer, contributed more to the reduction in token length than correct responses.
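The non-monotonic pattern in finding 1 can be probed with a simple analysis: sample many responses per question, bin them by length, and measure accuracy per bin. The sketch below illustrates this with hypothetical `(token_length, is_correct)` records; the data values and the function name are illustrative, not from the paper.

```python
from statistics import mean

# Hypothetical records for one question: each sampled response's
# token length and whether its final answer was correct.
responses = [
    (220, True), (305, True), (410, True), (520, True),
    (690, True), (830, False), (1100, False), (1450, False),
]

def accuracy_by_length_bin(samples, n_bins=4):
    """Sort samples by length, split into equal-size bins,
    and report (mean length, accuracy) per bin."""
    ordered = sorted(samples)
    size = max(1, len(ordered) // n_bins)
    bins = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    return [
        (mean(length for length, _ in b), mean(correct for _, correct in b))
        for b in bins
    ]

for avg_len, acc in accuracy_by_length_bin(responses):
    print(f"avg length {avg_len:7.1f} tokens -> accuracy {acc:.2f}")
```

With real model samples, a curve that rises and then falls across the bins would reproduce the paper's observation that accuracy degrades beyond a length threshold.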

Practical and Theoretical Implications

The findings have significant implications for both the practical deployment and theoretical understanding of LLMs:

  • Efficiency in Computation:

Preference optimization algorithms that favor shorter responses can enhance computational efficiency without sacrificing accuracy. This can be particularly beneficial in resource-constrained environments or applications requiring swift response generation.
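As a rough illustration of this idea, length-based preference pairs can be built from sampled responses by preferring the shortest sample regardless of correctness. The sketch below is a minimal assumption-laden version: the function name and the record format (a `prompt`/`chosen`/`rejected` dict, as DPO-style trainers commonly expect) are illustrative, not the paper's exact pipeline.

```python
# Hypothetical sketch: build length-based preference pairs, preferring
# the shortest sampled response regardless of whether it is correct.
def build_length_preference_pairs(samples_per_prompt):
    """samples_per_prompt: {prompt: [response strings sampled from the model]}.
    Returns records with chosen = shortest sample, rejected = longest."""
    pairs = []
    for prompt, responses in samples_per_prompt.items():
        if len(responses) < 2:
            continue  # need at least two samples to form a pair
        by_len = sorted(responses, key=len)
        pairs.append({
            "prompt": prompt,
            "chosen": by_len[0],    # shortest sample
            "rejected": by_len[-1], # longest sample
        })
    return pairs

data = {
    "Solve 2 + 2": ["4", "Let me think step by step... the answer is 4."],
}
pairs = build_length_preference_pairs(data)
print(pairs[0]["chosen"])  # -> "4"
```

Because correctness is ignored when forming pairs, this objective pushes down generation length globally; the paper's result is that accuracy nonetheless stays at acceptable levels while average token length drops by roughly 30% to 60%.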

  • Model Calibration and Self-awareness:

The research underscores a need for improved model calibration, particularly in assessing and adjusting reasoning length based on problem difficulty. Enhancing self-awareness in LLMs regarding the optimal reasoning length can mitigate issues of overthinking and underthinking, fostering adaptive thinking processes.
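One crude way to probe this kind of calibration is to estimate per-question difficulty as the empirical error rate over sampled responses and check whether average reasoning length tracks it. The sketch below is a hypothetical probe, not the paper's methodology; the data and function name are illustrative.

```python
from statistics import mean

# Hypothetical calibration probe: difficulty = empirical error rate over
# samples; a well-calibrated model would spend more tokens as it rises.
def length_vs_difficulty(per_question):
    """per_question: {qid: [(token_length, is_correct), ...]}.
    Returns [(difficulty, avg_length)] sorted by difficulty."""
    rows = []
    for qid, samples in per_question.items():
        difficulty = 1.0 - mean(correct for _, correct in samples)
        avg_len = mean(length for length, _ in samples)
        rows.append((difficulty, avg_len))
    return sorted(rows)

data = {
    "easy": [(150, True), (180, True), (900, True)],
    "hard": [(300, False), (350, True), (320, False)],
}
for difficulty, avg_len in length_vs_difficulty(data):
    print(f"difficulty {difficulty:.2f} -> avg length {avg_len:.0f} tokens")
```

In the toy data above, the easy question receives a longer average response than the hard one, mirroring the overthinking/underthinking mismatch the paper describes.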

  • Future Research Directions:

Future research could focus on refining reasoning adaptation mechanisms within LLMs, enabling them to more accurately gauge when to engage in deeper reasoning. Additionally, extending this analysis to a broader set of models and datasets could offer further insights into the generality of these findings.

Conclusion

The paper provides a nuanced understanding of the relationship between reasoning length and answer correctness within LLMs, challenging the assumption that longer reasoning chains invariably lead to improved accuracy. By exploring how models react to varying question difficulties and optimizing for shorter reasoning paths, the authors lay the groundwork for more efficient and adaptive LLMs. This research highlights the importance of balancing reasoning length with accuracy and efficiency, paving the way for more sophisticated AI systems capable of reasoning with genuine adaptability.