Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens (2505.13775v2)

Published 19 May 2025 in cs.LG and cs.AI

Abstract: Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. In this paper, we critically examine that interpretation by investigating how the semantics of intermediate tokens-often anthropomorphized as "thoughts" or reasoning traces and which are claimed to display behaviors like backtracking, self-verification etc.-actually influence model performance. We train transformer models on formally verifiable reasoning traces and solutions, constraining both intermediate steps and final outputs to align with those of a formal solver (in our case, A* search). By constructing a formal interpreter of the semantics of our problems and intended algorithm, we systematically evaluate not only solution accuracy but also the correctness of intermediate traces, thus allowing us to evaluate whether the latter causally influences the former. We notice that, despite significant improvements on the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions. To further show that trace accuracy is only loosely connected to solution accuracy, we then train models on noisy, corrupted traces which have no relation to the specific problem each is paired with, and find that not only does performance remain largely consistent with models trained on correct data, but in some cases can improve upon it and generalize more robustly on out-of-distribution tasks. These results challenge the assumption that intermediate tokens or "Chains of Thought" induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their mostly correct forms) as evidence of human-like or algorithmic behaviors in LLMs.

Summary

  • The paper reveals that intermediate tokens enhance final answer accuracy even when they lack meaningful semantic content.
  • The paper employs controlled grid-based maze experiments using formal A* search traces to objectively evaluate model performance.
  • The paper finds that training with semantically irrelevant, swapped traces can outperform correct trace training on out-of-distribution tasks.

This paper, "Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens" (2505.13775), critically examines the common interpretation that the performance gains seen in Large Reasoning Models using intermediate tokens, often called "Chain of Thought" (CoT), are due to these tokens reflecting genuine, interpretable reasoning processes. The authors challenge the assumption that the semantics of these intermediate steps are what causally drives improved solution accuracy.

To investigate this, the researchers adopt a controlled, small-scale experimental approach using a well-defined, formal domain: grid-based pathfinding (solving mazes) using the A* search algorithm. Unlike large, opaque models with natural language CoTs that are difficult to verify, this setup allows for systematic evaluation of both the final solution and the intermediate steps against a formal ground truth (the A* algorithm's execution).
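The ground-truth solver in this domain can be illustrated with a minimal A* implementation on a grid; this is a generic sketch of the algorithm, not the paper's exact code, using Manhattan distance as the heuristic:

```python
# Minimal sketch of A* grid search, the formal solver used as ground truth.
# Grid cells are (row, col); grid[r][c] == 1 marks a wall.
import heapq

def astar(grid, start, goal):
    """Return an optimal path from start to goal as a list of cells, or None."""
    rows, cols = len(grid), len(grid[0])
    h = lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1])  # Manhattan heuristic
    open_heap = [(h(start), 0, start, [start])]  # entries: (f, g, node, path)
    closed = set()
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)  # lowest f = g + h first
        if node == goal:
            return path
        if node in closed:
            continue
        closed.add(node)
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
                    and (nr, nc) not in closed:
                heapq.heappush(open_heap, (g + 1 + h((nr, nc)), g + 1,
                                           (nr, nc), path + [(nr, nc)]))
    return None  # goal unreachable
```

The sequence of node creations (pushes) and closings (pops) produced by a run like this is what the paper linearizes into an intermediate-token trace.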

The core methodology involves:

  1. Domain Setup: Using a 30×30 grid maze pathfinding problem, where the task is to find a sequence of actions (plan) from a start to a goal cell, avoiding walls. They use various maze generation algorithms (Wilson's, Kruskal's, DFS for in-distribution and related OOD; Drunkard's Walk, SearchFormer-style for distinct OOD evaluation).
  2. Data Generation: Generating 50,000 maze instances with optimal paths and corresponding A* search execution traces. The A* trace is linearized into a sequence of "create" and "close" actions for nodes, including their cost (g) and heuristic (h) values.
  3. Trace Validation: Constructing a formal validator for these linearized A* traces. This validator simulates the A* process by parsing the sequence of intermediate tokens. It checks if the generated operations are consistent with the A* algorithm's mechanics (e.g., valid neighbor creation, correct open/closed list management, lowest f-value selection for closing nodes). This allows distinguishing valid traces from invalid ones, unlike subjective evaluation of natural language.
  4. Model Training: Training transformer models (a modified Qwen2.5 0.5B with a specific vocabulary size) from scratch on three types of datasets derived from the maze problem instances:
    • Solution-Only: Model is trained to output only the final plan.
    • Correct A* Traces: Model is trained to output the correct A* trace followed by the correct plan, teacher-forcing the intermediate tokens.
    • Swapped A* Traces: Model is trained on the correct plan for a given maze problem, but paired with an A* trace randomly selected from a different, unrelated maze problem. These traces retain the formal structure of A* steps but have no semantic relevance to the problem they are paired with.
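A trace validator of the kind described in step 3 can be sketched as follows. The operation format here (`("create", node, g, h)` and `("close", node)` tuples) is illustrative; the paper's exact serialization may differ:

```python
# Hedged sketch of a validator for linearized A* traces. It replays the
# trace against A*'s mechanics: nodes must be created before being closed,
# and each closed node must have the lowest f = g + h in the open list.
def validate_trace(trace, start, goal):
    if not trace or trace[0][:2] != ("create", start):
        return False                  # the search must begin at the start node
    open_list = {}                    # node -> (g, h)
    closed = set()
    for op in trace:
        if op[0] == "create":
            _, node, g, h = op
            if node in closed:
                return False          # re-creating an already expanded node
            open_list[node] = (g, h)
        elif op[0] == "close":
            _, node = op
            if node not in open_list:
                return False          # closing a node that was never created
            f_node = sum(open_list[node])
            if any(g + h < f_node for g, h in open_list.values()):
                return False          # a cheaper open node should close first
            closed.add(node)
            del open_list[node]
        else:
            return False              # malformed operation
    return goal in closed             # the search must expand the goal
```

A fuller validator would also check that each created node is a legal, unvisited neighbor of the most recently closed node, as the paper's checks on neighbor creation imply.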

The experiments evaluate the models on their ability to produce correct plans (solution accuracy) and, for trace-trained models, the validity of the generated intermediate traces according to the formal validator.
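The swapped-trace condition from the training setup can be sketched as a simple dataset transformation; the field names below are illustrative, not the paper's actual data schema:

```python
# Illustrative construction of the "swapped trace" dataset: each problem
# keeps its own correct plan but is paired with the A* trace of a
# different, unrelated problem (a derangement), so every trace retains
# A*'s formal structure while carrying no semantics for its paired maze.
import random

def make_swapped_dataset(examples, seed=0):
    """examples: list of dicts with 'problem', 'trace', and 'plan' keys."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    while True:  # rejection-sample until no trace stays with its own problem
        rng.shuffle(idx)
        if all(i != j for i, j in enumerate(idx)):
            break
    return [{"problem": ex["problem"],
             "trace": examples[j]["trace"],   # unrelated trace
             "plan": ex["plan"]}              # correct plan is kept
            for ex, j in zip(examples, idx)]
```

The rejection loop is fine at this dataset scale (a random permutation is a derangement with probability roughly 1/e), though it assumes at least two examples.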

Key Findings and Practical Implications:

  • Trace Semantics Not Strongly Correlated with Plan Accuracy: Models trained on correct A* traces show performance improvement over the solution-only baseline, confirming previous findings that intermediate tokens help. However, the trace validity evaluation reveals a loose correlation between producing a formally correct trace and a correct final plan. Models trained on correct traces frequently produce invalid traces, even when the final plan is correct.
  • Reasonless Tokens Can Be Effective: Crucially, models trained on the swapped (semantically irrelevant) A* traces achieve performance levels comparable to, and in some cases better than, models trained on correct A* traces. This is particularly noticeable on out-of-distribution maze types like Drunkard's Walk (26.0% plan validity for swapped vs. 2.5% for correct trace) and DFS (41.7% vs. 30.8%). The swapped trace model, by design, produces traces that are 0% valid according to the formal validator, yet its final output accuracy is high.
  • Intermediate Tokens as Prompt Augmentation: The authors propose that the effectiveness of intermediate tokens might not lie in their semantic content representing a reasoning process, but rather in them serving as a form of "prompt augmentation." Generating a sequence of tokens before the final answer might simply provide the model with additional context or structure that facilitates finding the correct solution, regardless of whether this sequence is human-interpretable or algorithmically sound. This aligns with findings in adversarial prompting literature where non-meaningful inputs can significantly alter model behavior.
  • Caution Against Anthropomorphism: The results caution against anthropomorphizing intermediate tokens or "Chains of Thought" as genuine reasoning or thinking processes. While they may look like reasoning, especially in natural language, their causal link to correct solutions might be structural (providing a form of prompt augmentation or computation sequence) rather than semantic (executing a predictable algorithm).

Implementation Considerations:

  • The paper uses a relatively small transformer model (modified Qwen2.5 0.5B) trained from scratch on a large dataset of maze problems (50,000 examples). This demonstrates that these findings are not limited to massive models but can be observed in controlled settings.
  • The formal validator for A* traces is a critical component, allowing for objective evaluation of intermediate token correctness, which is infeasible for natural language traces. Implementing such validators for specific problem domains is key to understanding the true nature of model execution traces.
  • The comparison across diverse maze generation algorithms highlights the importance of evaluating models on out-of-distribution tasks to assess generalization and robustness, revealing surprising benefits from training on seemingly "reasonless" traces.
  • The concept of training with "swapped" or corrupted traces offers a novel data augmentation strategy or experimental probe for understanding model behavior, suggesting that exploring non-semantic data manipulations could yield performance benefits or mechanistic insights.

In conclusion, the paper strongly suggests that for transformer models trained on task demonstrations including intermediate steps, the functional benefit might stem more from the structured sequence generation itself than from the semantic meaning or algorithmic validity of the intermediate tokens. This challenges prevailing interpretations of CoT and highlights the need for more rigorous, formal methods to understand how models use these intermediate outputs.
