- The paper argues that the apparent accuracy collapse of LRMs on complex puzzles stems from artificial token constraints rather than inherent reasoning limits.
- It reveals that flawed evaluation methods, such as unsolvable test cases and premature response truncation, misrepresent the models’ true capabilities.
- The work advocates for refined evaluation frameworks that account for computational difficulty and ensure problem solvability to accurately assess reasoning performance.
Introduction
The paper "Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" critiques the findings presented by Shojaee et al. regarding the performance of Large Reasoning Models (LRMs) on complex planning puzzles. The original claims suggested an "accuracy collapse" in these models beyond a certain complexity threshold. However, the critique identifies these observations as artifacts of experimental design rather than intrinsic limitations of the models.
Recognition and Response to Output Constraints
A central insight of the paper is that LRMs demonstrate awareness of their output constraints, which contradicts the interpretation of their behavior as a reasoning collapse. When solving Tower of Hanoi instances, models truncate their move enumeration as they approach token limits, often noting explicitly that the pattern continues, which indicates they have grasped the solution procedure rather than lost it. The automated evaluation framework fails to differentiate this deliberate truncation under resource constraints from a genuine reasoning limitation.
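To make the scale of the enumeration concrete, the following sketch estimates when a full Tower of Hanoi move list outgrows an output budget; the per-move token cost and the 64,000-token budget are illustrative assumptions, not figures reported in either paper.

```python
# Rough sketch: why exhaustively listing Tower of Hanoi moves collides with
# output-token limits. The per-move token cost and the output budget are
# illustrative assumptions, not measurements from either paper.

TOKENS_PER_MOVE = 10      # assumed cost of printing one move, e.g. "disk 3: A -> C"
OUTPUT_BUDGET = 64_000    # assumed output-token budget of a hypothetical model

for n_disks in range(5, 16):
    n_moves = 2**n_disks - 1                     # minimal solution length for n disks
    est_tokens = n_moves * TOKENS_PER_MOVE
    status = "fits" if est_tokens <= OUTPUT_BUDGET else "exceeds budget"
    print(f"N={n_disks:2d}: {n_moves:6d} moves, ~{est_tokens:7d} tokens ({status})")
```

Under these assumptions the enumeration stops fitting around N=13, which is roughly where a model must choose between truncating its output and overrunning its budget.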
Evaluation Methodologies and Analytical Errors
The evaluation methodology employed by Shojaee et al. does not account for this context awareness, which leads to misinterpretations of model capability. For instance, a model that stops short of enumerating every move of a Tower of Hanoi solution is scored as unable to solve the problem, even when the moves it did produce are correct. This points to a broader issue with automated evaluation pipelines that cannot distinguish a model recognizing and adapting to its token budget from a model that has genuinely failed, leading to misinformed conclusions about reasoning ability.
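One way to avoid that conflation is a truncation-aware grader. The sketch below is a hypothetical Python helper, not the evaluation code used by either paper: it replays a transcript's moves and reports whether grading stopped at an illegal move (a reasoning error) or at a legal but incomplete prefix (consistent with truncation).

```python
# Minimal sketch of a truncation-aware grader for Tower of Hanoi transcripts.
# Rather than scoring only complete move lists, it replays the moves and
# reports why grading stopped.

def grade_hanoi(moves, n_disks):
    """moves: list of (source_peg, target_peg) pairs, pegs numbered 0..2."""
    pegs = [list(range(n_disks, 0, -1)), [], []]   # peg 0 holds disks n..1, largest at bottom
    for step, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return ("illegal move", step)           # moving from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return ("illegal move", step)           # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    if len(pegs[2]) == n_disks:
        return ("solved", len(moves))
    return ("valid but incomplete", len(moves))     # e.g. a truncated enumeration

# A correct 3-disk solution versus the same solution cut off early.
full = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(grade_hanoi(full, 3))        # ('solved', 7)
print(grade_hanoi(full[:4], 3))    # ('valid but incomplete', 4)
```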
Mischaracterization of Complexity in River Crossing Problems
The critique further identifies a confounding factor in the evaluation of River Crossing puzzles: Shojaee et al. tested instances that are mathematically unsolvable (N≥6 with boat capacity b=3) and scored models as failures on them, so even a model that correctly recognizes an instance as unsolvable is penalized. This error underscores the need for careful benchmark design, and in particular for verifying that test cases are actually solvable, since automatic scoring of impossible instances falsely suggests reasoning deficits.
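Checking solvability before grading is mechanical at this scale. The sketch below is a brute-force breadth-first search over bank configurations; the safety rule (an actor may not share a bank or boat with another pair's agent unless their own agent is present) is the usual jealous-husbands-style constraint and is an assumption about the benchmark's exact rules.

```python
# Sketch: breadth-first search that checks whether a River Crossing instance
# admits any legal solution before models are graded on it. The safety rule
# is an assumed jealous-husbands-style constraint, not the benchmark's exact
# specification.

from collections import deque
from itertools import combinations

def solvable(n_pairs, boat_capacity):
    people = frozenset((role, i) for i in range(n_pairs) for role in ("actor", "agent"))

    def safe(group):
        actors = {i for role, i in group if role == "actor"}
        agents = {i for role, i in group if role == "agent"}
        return not agents or actors <= agents    # every present actor has their own agent

    start = (people, 0)                  # (occupants of the starting bank, boat side)
    seen = {start}
    queue = deque([start])
    while queue:
        bank0, boat = queue.popleft()
        if not bank0:                    # starting bank is empty: everyone has crossed
            return True
        here = bank0 if boat == 0 else people - bank0
        for k in range(1, boat_capacity + 1):
            for movers in combinations(here, k):
                group = frozenset(movers)
                new_bank0 = bank0 - group if boat == 0 else bank0 | group
                if safe(group) and safe(new_bank0) and safe(people - new_bank0):
                    state = (new_bank0, 1 - boat)
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
    return False

print(solvable(3, 2))   # the classic 3-pair instance: expected solvable
print(solvable(6, 3))   # the contested N>=6, b=3 case, which the comment reports is unsolvable
```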
Models’ Abbreviated Outputs and Lack of Calibration
The analysis also shows that models sometimes truncate their solutions well before reaching the hard context limit, suggesting that their calibration of the remaining output budget is imperfect. Although models can produce the recursive solution procedure efficiently, they preemptively cut their enumeration short to manage length, and rigid graders read this engineering trade-off as a reasoning failure. The paper argues that evaluations should account for such premature termination, which reflects output management rather than an inability to reason.
Alternative Evaluations and Revised Complexity Claims
Introducing alternative representations for complex problems, such as asking the model to emit a recursive Lua function that generates the solution instead of an exhaustive move list, restores high performance, showing that reasoning capacity remains intact once token-heavy enumeration is removed. The paper also critiques the use of solution length as a proxy for problem complexity, arguing for metrics that capture computational difficulty: Tower of Hanoi requires exponentially many moves, each chosen by a trivial rule, whereas River Crossing solutions are much shorter but demand genuine search and constraint satisfaction at each step.
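The contrast in representation size is easy to see. The comment's alternative prompt asked for a compact Lua generator; the Python analogue below is a sketch rather than the paper's actual prompt or output, but it shows how little text the same solution requires once it is expressed as a procedure.

```python
# Python analogue of the compact recursive representation the comment asked
# models to produce (the paper used Lua; this is an illustrative sketch, not
# the paper's code). The full 2**N - 1 move solution is captured in a few
# lines of procedure instead of an exhaustive enumeration.

def hanoi(n, source="A", target="C", spare="B"):
    """Yield the optimal move sequence for n disks as (disk, from, to) tuples."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)   # clear the n-1 smaller disks
    yield (n, source, target)                        # move the largest disk
    yield from hanoi(n - 1, spare, target, source)   # restack the smaller disks

# The procedure stays the same size for any N; only materialising its output grows.
print(sum(1 for _ in hanoi(15)))   # 32767 moves, i.e. 2**15 - 1
```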
Conclusion
The analysis emphasizes the importance of re-evaluating current methodologies to assess AI reasoning capabilities more accurately. It highlights the need for evaluation frameworks that distinguish reasoning ability from output constraints, verify problem solvability before scoring model performance, and adopt complexity metrics that reflect computational difficulty rather than solution length. The critique thus serves not only to reframe LRMs' apparent limitations but also to refine evaluation methodology for future AI reasoning research. In short, the question should shift from whether LRMs can reason to whether their evaluations are robust and faithful.