The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
(2506.06941v1)
Published 7 Jun 2025 in cs.AI, cs.CL, and cs.LG
Abstract: Recent generations of LLMs have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We find that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths and limitations, and raising questions about their reasoning capabilities.
The Interplay of Problem Complexity and Reasoning Model Capabilities
Understanding the strengths and limitations of reasoning models, particularly LLM-based Large Reasoning Models (LRMs), requires a rigorous examination through the lens of problem complexity. Recent research highlights that while these models demonstrate impressive capabilities on various tasks, their performance is intricately tied to the inherent structure and demands of the problems they attempt to solve. Analyzing this relationship reveals not only the current frontiers of AI reasoning but also fundamental architectural and behavioral constraints.
Defining Problem Complexity for Reasoning Models
Problem complexity in the context of evaluating AI reasoning is characterized through several formalisms, moving beyond simplistic notions of task difficulty. One prominent approach leverages the framework of circuit complexity, categorizing problems based on the type and structure of computational circuits required for their solution. Classes like AC0, TC0, and NC1 define increasing levels of complexity based on factors such as circuit depth, size, gate types (Boolean, threshold), and uniformity conditions (L-uniformity, DLOGTIME-uniformity) (Chen et al., 9 Dec 2024). Problems like arithmetic formula evaluation, Boolean formula value problems, and permutation composition serve as benchmarks, residing in complexity classes like NC1 and thus potentially outside of classes like TC0 (Chen et al., 9 Dec 2024, Merrill et al., 12 Apr 2024).
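To make the permutation-composition benchmark concrete, the sketch below (illustrative, not drawn from the cited papers) states the task over five elements, i.e., the word problem for the symmetric group S5, which is NC1-complete: composing a long sequence of permutations amounts to tracking a single evolving state across the sequence.

```python
# Minimal sketch of the permutation-composition task (word problem for S5).
# Composing a sequence of permutations of {0,...,4} is NC1-complete, so a model
# whose computation is confined to TC0 cannot solve arbitrarily long instances
# with a fixed-depth shortcut; the task inherently requires state tracking.

from itertools import permutations
import random

S5 = list(permutations(range(5)))  # all 120 permutations of five elements

def compose(p, q):
    """Return the permutation 'apply q first, then p'."""
    return tuple(p[q[i]] for i in range(5))

def sequential_compose(seq):
    """Left-fold composition: the 'state' after each prefix is one permutation."""
    state = tuple(range(5))  # identity permutation
    for p in seq:
        state = compose(p, state)
    return state

# A length-64 instance: 64 random permutations; the answer is their composition.
instance = [random.choice(S5) for _ in range(64)]
answer = sequential_compose(instance)
```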
Another perspective characterizes complexity through the structure of sequential processing required. This is often formalized using concepts from automata theory, such as Deterministic Finite Automata (DFAs). In this view, complexity can be measured by the run length (N) of the DFA (the minimum number of sequential steps for a task instance) and the state-space size (k) (the number of states in the DFA). This approach provides a structured way to analyze how reasoning models handle tasks requiring implicit state tracking and navigating decision spaces (Lee et al., 2 Apr 2025).
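As a minimal sketch of how such DFA-framed instances can be parameterized, consider the modular-counter construction below, which is an illustrative assumption rather than a task from the cited work: the state-space size k is the number of DFA states, and the run length N is the number of input symbols that must be processed sequentially.

```python
# Illustrative DFA-framed task: track a running sum modulo k over N symbols.
# k is the state-space size; N is the run length (number of sequential steps).

import random

def make_mod_counter_dfa(k: int):
    """DFA over alphabet {0,...,k-1} whose state is the running sum mod k."""
    def transition(state: int, symbol: int) -> int:
        return (state + symbol) % k
    return transition

def run_dfa(transition, inputs, start: int = 0) -> int:
    """Simulate the DFA for N = len(inputs) steps and return the final state."""
    state = start
    for symbol in inputs:
        state = transition(state, symbol)
    return state

# An instance with k = 7 states and run length N = 50:
k, N = 7, 50
dfa = make_mod_counter_dfa(k)
inputs = [random.randrange(k) for _ in range(N)]
final_state = run_dfa(dfa, inputs)
```

Sweeping N at fixed k probes sequential depth, while sweeping k at fixed N probes the size of the decision space the model must track.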
For specific tasks like logic puzzles or mathematical problems, complexity can be characterized by the problem size (e.g., grid dimensions in a Tents puzzle) or the logical sufficiency of the provided information. The "Missing Premise" (MiP) scenario, for instance, introduces complexity by making a problem ill-posed, where the provided premises are insufficient for a unique solution, requiring the model to identify this logical gap rather than just compute a solution (Fan et al., 9 Apr 2025).
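The first of these notions, problem size as a single complexity knob, can be made concrete with a minimal sketch. The environment below is Tower-of-Hanoi-style and is offered purely as an illustration (the Tents puzzle cited above would be handled analogously): the number of disks n controls difficulty, the optimal solution length grows as 2^n - 1, and the rules and verifier stay fixed across sizes.

```python
# Minimal sketch of a controllable puzzle environment (Tower of Hanoi style).
# The number of disks `n` is the single complexity knob: the optimal solution
# length grows as 2**n - 1 while the rules and the verifier stay identical.

def initial_state(n: int):
    """All n disks on peg 0, largest at the bottom."""
    return [list(range(n, 0, -1)), [], []]

def is_goal(state, n: int) -> bool:
    """Solved when every disk has been moved to peg 2."""
    return state[2] == list(range(n, 0, -1))

def apply_move(state, move):
    """Apply (src, dst); return None if the move is illegal."""
    src, dst = move
    if not state[src]:
        return None
    disk = state[src][-1]
    if state[dst] and state[dst][-1] < disk:
        return None  # cannot place a larger disk on a smaller one
    new_state = [peg.copy() for peg in state]
    new_state[src].pop()
    new_state[dst].append(disk)
    return new_state

def verify_solution(n: int, moves) -> bool:
    """Check a proposed move sequence against the rules, then the goal test."""
    state = initial_state(n)
    for move in moves:
        state = apply_move(state, move)
        if state is None:
            return False
    return is_goal(state, n)
```

Because the verifier is exact, both the final answer and any intermediate state mentioned in a reasoning trace can be checked mechanically, which is what makes such environments useful for trace analysis.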
Strengths and Limitations of Current Reasoning Models
LLMs and other reasoning models exhibit notable strengths on a variety of complex tasks. They demonstrate remarkable text generation capabilities and have shown significant breakthroughs in reasoning performance through advancements in training paradigms (Estermann et al., 19 Mar 2025). Techniques like Chain-of-Thought (CoT), Tree-of-Thought (ToT), and other test-time computation methods enable models to generate intermediate reasoning steps, improving performance on unseen and challenging tasks (Stepanov et al., 15 Apr 2025, Qi et al., 2023, Estermann et al., 19 Mar 2025). Models trained with CoT-based reinforcement learning (CoT-RL) tend to generate longer reasoning chains and often achieve better accuracy (Lee et al., 2 Apr 2025). They show some capacity for algorithmic reasoning and can adapt the amount of computation (reasoning tokens) to problem difficulty up to a certain point (Estermann et al., 19 Mar 2025).
However, significant limitations emerge when confronting problems that push against their computational or architectural boundaries. A fundamental limitation identified through complexity analysis is that standard architectures like Transformers and common State-Space Models (SSMs), such as Mamba, appear to be limited to the complexity class TC0 (Chen et al., 9 Dec 2024, Merrill et al., 12 Apr 2024). This theoretical constraint implies they cannot solve problems that lie outside TC0; in particular, NC1-complete problems such as permutation composition are beyond their reach unless TC0 = NC1, despite recurrent formulations that suggest state-tracking capabilities. This suggests the "state" in these models may be an "illusion" in the sense that it does not confer the expressive power of true recurrent models like RNNs for these specific tasks (Merrill et al., 12 Apr 2024).
Another critical limitation is the phenomenon of overthinking. Reasoning models, especially those trained for explicit step-by-step thinking, tend to generate excessively long and often redundant reasoning chains, even for simple or ill-posed problems (Fan et al., 9 Apr 2025, An et al., 27 May 2025). This "cheap overthinking" (Fan et al., 9 Apr 2025) can be particularly problematic in scenarios like the Missing Premise problem, where the models fail to exhibit critical thinking skills to identify the ill-posed nature of the query and abstain, instead generating ineffective and lengthy deliberations (Fan et al., 9 Apr 2025, Cuadron et al., 12 Feb 2025). Overthinking correlates with decreased performance in agentic tasks (Cuadron et al., 12 Feb 2025) and can lead to a decline in accuracy when reasoning chains exceed an optimal length (Lee et al., 2 Apr 2025).
Furthermore, models struggle with tasks requiring flexible reasoning and strategy adaptation. Evaluations on game-theoretic tasks reveal distinct behaviors across scenarios, with failures observed in complete-information, deterministic games despite competence in probabilistic ones (Duan et al., 19 Feb 2024). Similarly, in clinical problem-solving, LLMs demonstrate inflexibility and a propensity for pattern matching over genuine flexible reasoning, leading to poor performance on scenarios designed to exploit the Einstellung effect (Kim et al., 5 Feb 2025). This inflexibility can also manifest as persistent rumination on previously explored problem formulations, obstructing further exploration (Marjanović et al., 2 Apr 2025).
Linking Strengths and Limitations to Problem Complexity
The capabilities and shortcomings of reasoning models are directly linked to how they interact with problem complexity. The ability of models to leverage increased reasoning tokens (a perceived strength) scales with the sequential complexity (run length N) of tasks characterized by DFAs, suggesting that reasoning helps with implicit state tracking. However, this scaling doesn't hold proportionally with the combinatorial complexity (state-space size k), indicating a potential limitation in handling problems requiring navigation of large decision spaces (Lee et al., 2 Apr 2025).
The observation that reasoning effort (measured in tokens) scales with problem size (e.g., grid size in the Tents puzzle) only up to a certain threshold highlights a crucial limitation (Estermann et al., 19 Mar 2025). Beyond this threshold, logical coherence breaks down and reasoning effort may even decrease, leading to performance collapse (Estermann et al., 19 Mar 2025). This suggests that while models can adapt their computational effort to moderate increases in complexity, their reasoning architecture hits a limit beyond which additional effort no longer translates into successful problem-solving.
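A simple way to surface this threshold empirically is to sweep the complexity knob and record both accuracy and reasoning-token usage per instance. The harness below is a hedged sketch: the three callables it expects, a puzzle generator, a model call returning an answer together with a thinking-token count, and an exact answer verifier, are hypothetical placeholders to be supplied by the experimenter.

```python
# Hedged sketch of an effort-vs-complexity sweep. The three callables are
# hypothetical placeholders: a puzzle generator, a model call that returns
# (answer, thinking_token_count), and an exact answer verifier.

from statistics import mean
from typing import Any, Callable, Dict, Tuple

def sweep(sizes,
          make_instance: Callable[[int], Tuple[Any, Any]],
          solve_with_reasoning: Callable[[Any], Tuple[Any, int]],
          check_answer: Callable[[Any, Any], bool],
          trials: int = 20) -> Dict[int, Dict[str, float]]:
    results = {}
    for n in sizes:
        accuracies, token_counts = [], []
        for _ in range(trials):
            instance, solution_spec = make_instance(n)
            answer, thinking_tokens = solve_with_reasoning(instance)
            accuracies.append(check_answer(answer, solution_spec))
            token_counts.append(thinking_tokens)
        results[n] = {"accuracy": mean(accuracies), "mean_tokens": mean(token_counts)}
    return results

# The reported pattern: mean_tokens rises with n up to some threshold, then
# falls while accuracy collapses, even though the token budget is not exhausted.
```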
The overthinking phenomenon is another direct link to complexity: ill-posed or overly complex problems trigger an inappropriate application of learned step-by-step reasoning patterns. What is a strength for structured, solvable problems becomes a liability for complex, ill-defined ones, where the model lacks the critical thinking to recognize the futility of continued reasoning (Fan et al., 9 Apr 2025).
Architectural limitations identified through complexity analysis, which place these models within TC0, imply an inability to solve problems in higher complexity classes such as NC1 (assuming TC0 ≠ NC1). This theoretical limitation explains empirical failures on tasks requiring computation beyond constant-depth circuits, such as certain state-tracking or sequential computation problems (Chen et al., 9 Dec 2024, Merrill et al., 12 Apr 2024).
The "Illusion" of Thinking
The term "illusion" appears in this domain in multiple contexts. The most prominent is the "illusion of state" in SSMs and Transformers, where despite architectural features or formulations that might suggest strong state-tracking capabilities (like recurrence in SSMs), their formal computational power is limited to TC0, akin to non-recurrent models (Merrill et al., 12 Apr 2024). This means they cannot truly track state for problems requiring computation outside this class, despite the appearance of sequential processing.
Another perspective on "illusion" relates to the perceived versus actual reasoning capabilities. The lengthy and structured reasoning outputs generated by models can create an "illusion of deep thinking" or "genuine critical thinking" (Fan et al., 9 Apr 2025). However, when faced with specific complexities, such as ill-posed problems, the model's behavior (e.g., overthinking, inability to abstain) reveals that this outward appearance of reasoning does not necessarily correspond to a robust underlying understanding of problem solvability or the ability to adapt reasoning strategies (Fan et al., 9 Apr 2025). Similarly, the observation that extending reasoning traces can initially improve performance but then decline, coupled with an increase in output variance, is argued to create an "illusion of improved reasoning" that is merely an artifact of increased uncertainty affecting evaluation metrics (Ghosal et al., 4 Jun 2025).
Practical Implications and Future Directions
Understanding these strengths and limitations through problem complexity is crucial for practical applications. Deployment of LLMs in critical domains requires acknowledging their brittleness when faced with tasks outside their demonstrated complexity envelope or those requiring flexible adaptation. The overthinking phenomenon highlights the need for mechanisms to control reasoning length and incorporate critical thinking or abstention capabilities (Fan et al., 9 Apr 2025, An et al., 27 May 2025). Strategies like adaptive thinking modes that dynamically adjust reasoning effort based on problem difficulty are being explored to balance efficiency and accuracy (Yong et al., 23 May 2025, Zhang et al., 21 May 2025).
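One way such an adaptive mode can be structured is sketched below purely as an illustration; the difficulty estimator, the missing-premise check, the thresholds, and the budget values are all assumptions rather than mechanisms from the cited work.

```python
# Illustrative sketch of an adaptive-thinking router. All three callables are
# hypothetical placeholders, and the thresholds/budgets are arbitrary choices.

from typing import Callable

def answer_adaptively(problem: str,
                      detect_missing_premise: Callable[[str], bool],
                      estimate_difficulty: Callable[[str], float],
                      generate: Callable[..., str]) -> str:
    if detect_missing_premise(problem):
        # Abstain instead of overthinking an ill-posed query.
        return "The problem is underspecified; a required premise appears to be missing."
    difficulty = estimate_difficulty(problem)  # cheap estimate in [0, 1]
    if difficulty < 0.3:
        budget = 0            # answer directly, no extended thinking
    elif difficulty < 0.7:
        budget = 2_000        # moderate deliberation
    else:
        budget = 16_000       # full deliberation, capped
    return generate(problem, max_thinking_tokens=budget)
```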
The theoretical limitations identified by complexity analysis point towards the need for novel architectures or training paradigms that can surpass the TC0 barrier if models are to tackle problems requiring higher complexity computation or true state tracking (Chen et al., 9 Dec 2024, Merrill et al., 12 Apr 2024). While extended SSMs show promise in this regard, challenges remain (Merrill et al., 12 Apr 2024).
Developing better evaluation metrics and frameworks that go beyond final answer accuracy to assess the structure and faithfulness of the reasoning process itself is essential (Xiong et al., 19 May 2025, Lu et al., 26 May 2025). Furthermore, aligning models to different cognitive styles (System 1 vs. System 2 thinking) based on task demands might offer a path towards more flexible and reliable reasoning (Ziabari et al., 18 Feb 2025).
Conclusion
The study of reasoning models through the lens of problem complexity reveals a nuanced picture. While models have achieved impressive reasoning capabilities, fundamental limitations rooted in their architecture and training paradigms become apparent when problem complexity increases beyond certain thresholds. The concept of "illusion" captures the discrepancy between the appearance of sophisticated thinking and the underlying computational or behavioral constraints. Addressing these limitations requires continued research into novel architectures, training methods that foster critical thinking and adaptive reasoning, and more sophisticated evaluation frameworks grounded in complexity theory and cognitive science.