Reasoning LLMs are Wandering Solution Explorers
(2505.20296v1)
Published 26 May 2025 in cs.CL, cs.AI, cs.LG, and cs.MM
Abstract: LLMs have demonstrated impressive reasoning abilities through test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning. However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderers rather than systematic explorers. Through qualitative and quantitative analysis across multiple state-of-the-art LLMs, we uncover persistent issues such as invalid reasoning steps, redundant explorations, and hallucinated or unfaithful conclusions. Our findings suggest that current models can appear competent on simple tasks yet degrade sharply as complexity increases. Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.
This paper, "Reasoning LLMs are Wandering Solution Explorers" (Lu et al., 26 May 2025), argues that despite exhibiting impressive problem-solving capabilities through techniques like Chain-of-Thought (CoT) and tree-based reasoning, current Reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. Instead, the authors characterize them as "wanderers."
The paper formalizes problem-solving as exploration of a solution space defined by a set of states S, a transition structure T, an initial state s_0, and a set of goal states G. An exploration is a trace (a sequence of states) J = (s_{j_0}, …, s_{j_{n-1}}). A trace is valid if every step in it follows T. Goal states are solutions, and dead-ends are non-goal states with no path to unexplored states.
A systematic exploration is defined by three properties:
Validity: The trace must adhere to the problem's transition structure.
Effectiveness: The trace must reach at least one goal state.
Necessity: Every state in the trace must be essential for discovering a solution or eliminating alternatives, avoiding superfluous steps.
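To make these definitions concrete, here is a minimal Python sketch of the validity and effectiveness checks, assuming states are hashable and the transition structure T is given as a successor function; the function names and encoding are illustrative rather than taken from the paper, and the sketch treats a trace as a straight walk through T (backtracking jumps would need extra handling). Necessity is problem-specific and is only noted in a comment.

```python
from typing import Callable, Hashable, Iterable, Sequence

State = Hashable  # any hashable encoding of a problem state


def is_valid(trace: Sequence[State],
             transitions: Callable[[State], Iterable[State]],
             initial: State) -> bool:
    """Validity: the trace starts at s_0 and each consecutive pair of
    states is connected by an allowed transition in T."""
    if not trace or trace[0] != initial:
        return False
    return all(nxt in set(transitions(cur))
               for cur, nxt in zip(trace, trace[1:]))


def is_effective(trace: Sequence[State],
                 is_goal: Callable[[State], bool]) -> bool:
    """Effectiveness: the trace reaches at least one goal state in G."""
    return any(is_goal(s) for s in trace)

# Necessity -- every visited state contributes to finding a solution or
# ruling one out -- depends on the problem and is not checked here.
```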
The authors argue that current RLLMs fail to satisfy these properties, exhibiting "wandering" exploration. They demonstrate, using a depth-first search (DFS) analogy on a binary tree, that wandering behavior leads to an exponential decrease in success probability as problem depth increases. While this degradation might be masked on simple tasks with many solutions (leading to misleading performance "plateaus"), it causes abrupt failure on more complex problems.
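The exponential argument can be illustrated with a toy simulation (not the paper's formal model): a "wanderer" that walks root-to-leaf in a complete binary tree, picking children at random and never backtracking, finds the single goal leaf with probability (1/2)^depth, whereas a systematic DFS with backtracking finds it with probability 1.

```python
import random


def wandering_success_rate(depth: int, trials: int = 20_000) -> float:
    """Estimate the chance that a random root-to-leaf walk (no backtracking)
    in a complete binary tree of the given depth hits the single goal leaf."""
    goal = tuple(random.randrange(2) for _ in range(depth))  # hidden goal path
    hits = sum(
        tuple(random.randrange(2) for _ in range(depth)) == goal
        for _ in range(trials)
    )
    return hits / trials


# A systematic DFS with backtracking succeeds with probability 1;
# the wanderer's success rate decays roughly as 2**-depth.
for d in (2, 4, 6, 8, 10):
    print(d, wandering_success_rate(d))
```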
The paper identifies three general classes of failure modes in wandering explorations:
Invalid Explorations: Violations of the problem's structure (T).
Boundary Violation: Accessing states outside defined limits (e.g., out-of-bounds array indices, invalid game moves like reusing numbers in the 24 game example). Often caused by relying too much on local context.
Procedure Omission: Prematurely stopping or skipping necessary parts of the search space (e.g., failing to enumerate all unique permutations). Attributed to lacking backtracking criteria or global planning.
Incorrect Backtracking: Returning to an incorrect or inconsistent state after hitting a dead-end or completing a branch (e.g., complex search tasks like Permutation with Duplicates). Linked to the linear nature of CoT not modeling stack-based state management.
Unnecessary Explorations: Consuming computational resources without making genuine progress (violating necessity).
State Revisitation: Returning to previously explored states or partial solutions (e.g., repeatedly trying the same equation combinations in the 24 game). Caused by lacking state tracking or due to context window limitations favoring recent tokens.
Infinite Self-Loop: Getting stuck repeating the same sequence of steps indefinitely (e.g., continuous, unproductive attempts at a difficult puzzle). Often results from missing loop-exit strategies or fallback plans.
Evaluation Errors: Failures in processing information during the search, distinct from choosing the next move.
State Staleness: Using outdated information about the problem state (e.g., using points already merged into a cluster in hierarchical clustering). Indicates poor working memory management.
Execution Error: Incorrectly performing calculations or lookups (e.g., arithmetic mistakes in prime factorization). Highlights LLMs' weakness in precise computation and susceptibility to hallucination. Tool integration is suggested as a countermeasure.
Unfaithful Conclusion: The final answer contradicts or incompletely summarizes the reasoning trace that led to it (e.g., reporting only a subset of found solutions). Shows that the final output may not faithfully reflect the entire exploration process.
To audit these reasoning traces systematically, the authors selected eight computational tasks (Counting Elements, Sliding Window Max, Flood Fill, Edit Distance, Hierarchical Clustering Order, Prime Number Factorization, Permutation with Duplicates, and the 24 Game). These tasks were chosen for their controllable size, verifiable traces (decomposable into atomic steps), and standard solving procedures. Strict formatting rules were imposed on the RLLMs' output to enable reliable, rule-based auditing against ground truth.
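As an illustration of what such a rule-based audit might look like for the 24 Game, the sketch below flags three of the failure modes discussed above: boundary violations (operands no longer available in the current multiset of numbers), execution errors (wrong arithmetic), and state revisitation. The step format (a, op, b, result) and the function name are assumptions made for illustration, not the paper's actual audit protocol.

```python
from collections import Counter

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b if b else None}


def audit_24game_trace(start, steps):
    """Audit a 24-game trace given as steps (a, op, b, result), each applied
    to the current multiset of available numbers."""
    issues = []
    state = Counter(start)
    seen = {frozenset(state.items())}
    for i, (a, op, b, result) in enumerate(steps):
        nxt = state.copy()
        for operand in (a, b):
            if nxt[operand] <= 0:  # boundary violation: operand not available
                issues.append((i, "boundary violation", operand))
            nxt[operand] -= 1
        expected = OPS[op](a, b)
        if expected is None or abs(expected - result) > 1e-9:
            issues.append((i, "execution error", result))  # wrong arithmetic
        nxt = +nxt                  # drop non-positive counts
        nxt[result] += 1
        key = frozenset(nxt.items())
        if key in seen:             # same multiset of numbers reached again
            issues.append((i, "state revisitation", sorted(nxt.elements())))
        seen.add(key)
        state = nxt
    return issues


# Example: the second step reuses 7, which was already consumed.
print(audit_24game_trace([4, 7, 8, 8], [(7, "-", 4, 3), (7, "*", 3, 21)]))
```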
Empirical evaluation on the Permutation with Duplicates task across six state-of-the-art RLLMs (including Deepseek-R1, Anthropic Sonnet 3.7, and OpenAI O3) quantitatively demonstrated the wandering behavior. The "solution coverage ratio" (the fraction of all unique permutations that a model actually enumerates) degraded clearly as problem complexity, i.e., the size of the solution space, increased, confirming the exponential deterioration predicted by the theoretical model. While larger models performed better overall, all models eventually showed this decline, reinforcing the argument that current RLLMs lack systematic exploration capabilities.
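A straightforward reading of this metric in plain Python might look as follows; the paper's exact definition may differ, and the coverage_ratio name and input format are illustrative only.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def coverage_ratio(multiset, found):
    """Fraction of all unique permutations of `multiset` that appear among
    the permutations `found` in the model's trace."""
    total = factorial(len(multiset))
    for count in Counter(multiset).values():
        total //= factorial(count)              # n! / prod(k_i!)
    ground_truth = set(permutations(multiset))  # all unique permutations
    found_valid = {tuple(p) for p in found} & ground_truth
    return len(found_valid) / total


# Toy example: [1, 1, 2] has 3 unique permutations; finding two of them
# (one repeated) gives a coverage ratio of 2/3.
print(coverage_ratio([1, 1, 2], [[1, 1, 2], [2, 1, 1], [1, 1, 2]]))
```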
The paper concludes by posing three key research challenges:
Architectural Design: How to build models with inductive biases for structured search, state tracking, and backtracking, potentially integrating external modules.
Training Signals: Developing training paradigms (like process supervision or structured search imitation) that incentivize systematic reasoning rather than just coherent text generation.
Evaluation: Creating new benchmarks and metrics that assess the quality and structure of the reasoning process itself, not just the final answer, to identify when and why reasoning breaks down.
Overall, the paper serves as a critique of the current state of RLLM reasoning, highlighting fundamental limitations in their ability to systematically explore complex solution spaces. It advocates for a shift in focus towards ensuring robustness and systematicity in AI reasoning systems through architectural changes, new training methods, and more rigorous evaluation.