
Chain-of-Thought in LLMs

Updated 9 December 2025
  • Chain-of-thought is a reasoning framework that decomposes tasks into explicit intermediate steps, enabling LLMs to tackle complex compositional problems.
  • Empirical evaluations show that both full and compressed CoT prompting outperform direct answers, achieving >95% accuracy on tasks like multi-digit multiplication and dynamic programming.
  • Intervention studies reveal that modifying intermediate tokens causally affects final outputs, confirming their role as mutable variables in neural sequence computation.

The chain-of-thought (CoT) process in LLMs refers to the explicit generation and serial propagation of intermediate reasoning steps before emitting a final answer, typically as natural-language or symbolic tokens. Unlike direct question–answer mapping, CoT structures the output as a sequence of variable-carrying steps, each forming a mutable state analogous to a program variable, and is foundational for enabling LLMs to solve complex, compositional tasks such as multi-digit arithmetic, dynamic programming, and algorithmic reasoning. The inner mechanism of CoT centers on the decomposition of the joint conditional model P(y, z | x) = P(z | x) · P(y | x, z), where z is the chain of intermediate reasoning steps conditioned on the input x, and y is the answer conditioned on both x and z (Zhu et al., 8 May 2025). This article synthesizes the design, function, analysis, and empirical findings surrounding chain-of-thought in modern sequence models.
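The factorization above can be illustrated in miniature: in log space the decomposition is additive, so a model's score for a (trace, answer) pair splits into a trace term and an answer term. A minimal sketch with hypothetical probabilities:

```python
import math

# CoT decoding proceeds in two autoregressive phases:
#   1. sample/score the trace z given x          -> P(z | x)
#   2. sample/score the answer y given x and z   -> P(y | x, z)
# The joint probability of the pair is their product.
def joint_logprob(logp_trace_given_x: float, logp_answer_given_xz: float) -> float:
    """log P(y, z | x) = log P(z | x) + log P(y | x, z)."""
    return logp_trace_given_x + logp_answer_given_xz

# Illustrative numbers (not from the paper): P(z|x) = 0.8, P(y|x,z) = 0.9.
p_joint = math.exp(joint_logprob(math.log(0.8), math.log(0.9)))  # 0.72
```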

1. Formal Role and Mechanism of Chain-of-Thought

Standard decoder-only Transformers, when equipped with CoT prompting (using markers such as <COT> and </COT>), are forced to output the full reasoning trace z as explicit intermediate steps. In complex compositional problems—multi-digit multiplication, dynamic programming, and recursive planning—every subsequent step depends causally on the computed intermediate result in the previous step. Formally, instead of directly maximizing P(y | x), the model outputs z and then y by maximizing P(y, z | x), which instantiates the trace as a computational graph of dependent sub-calculations. This "unrolls" what is exponentially complex in direct mapping into polynomial-time serial computation that the Transformer can manage (Zhu et al., 8 May 2025).
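The "unrolling" can be made concrete with schoolbook multiplication, where each digitwise product and each carry becomes one serial step of the trace. A sketch (the step strings and trace format here are illustrative, not the paper's exact protocol):

```python
def multiplication_cot(a: int, b: int) -> list[str]:
    """Emit a schoolbook-multiplication trace as explicit intermediate steps,
    showing how CoT unrolls an exponentially hard direct mapping into a
    polynomial number of serial sub-calculations."""
    steps, partials = [], []
    for i, db in enumerate(reversed(str(b))):      # each digit of b
        carry, partial_digits = 0, []
        for da in reversed(str(a)):                # each digit of a
            prod = int(da) * int(db) + carry
            carry, digit = divmod(prod, 10)
            partial_digits.append(digit)
            steps.append(f"{da}*{db}+carry -> digit {digit}, carry {carry}")
        if carry:
            partial_digits.append(carry)
        partial = int("".join(map(str, reversed(partial_digits)))) * 10**i
        partials.append(partial)
        steps.append(f"partial product {i}: {partial}")
    steps.append(f"answer: {sum(partials)}")
    return steps
```

Every step reads the carry written by the previous step, so each trace token is causally downstream of earlier ones — the serial dependency structure the section describes.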

2. Experimental Paradigms and Token Identification

Rigorous empirical evaluation demonstrates that CoT is a necessary and sufficient protocol for compositional tasks. On synthetic benchmarks, such as long multiplication and DP table filling, plain prompting yields 0% accuracy for problems beyond trivial instances; full CoT achieves >95% on 5×5 multiplication and large DP grids. Three variants are tested:

  • Plain prompting: No intermediate tokens; direct answer.
  • Full CoT: Natural-language trace including both semantic scaffolding and variable-like tokens.
  • Compressed CoT: Only tokens directly storing intermediate numeric results ("result tokens")—semantic tokens dropped.

In multiplication, every token encoding digitwise result or carry operation is flagged; semantic phrases (e.g., "carry," "calculate") are omitted for compressed CoT. In DP, evolving cell values in the DP table are the result tokens.
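Compressed CoT can be sketched as a filter over the full trace that keeps only tokens storing numeric values; the toy tokenization and filtering rule below are illustrative assumptions, not the paper's exact procedure:

```python
def compress_cot(trace_tokens: list[str]) -> list[str]:
    """Drop semantic scaffolding ("calculate", "carry", ...) and keep only
    tokens that store intermediate numeric results -- a sketch of the
    compressed-CoT variant."""
    return [tok for tok in trace_tokens if tok.lstrip("-").isdigit()]

full = ["calculate", "7", "*", "8", "=", "56", "carry", "5", "write", "6"]
compressed = compress_cot(full)   # only the result tokens survive
```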

| Prompt Variant | 5×5 Multiplication Accuracy | 4×5 DP Accuracy | Key Feature |
|---|---|---|---|
| Plain prompt | 0% | 0% | No reasoning trace |
| Full CoT | 98.7% | >95% | Textual trace of all computation |
| Compressed CoT | 99.2% | Matches or surpasses full CoT | Only intermediate-value tokens |

Compressed CoT, retaining only variable-like tokens, matches or exceeds full CoT performance, indicating that these "variable tokens"—rather than the semantic scaffolding around them—are what is essential (Zhu et al., 8 May 2025).

3. Latent Embedding of Intermediate Results

How the model stores intermediate results is probed via the introduction of latent <LAT> tokens encoded as one-hot vectors representing full numbers (not text tokens). Transformer modules are augmented to read and write from such latent vectors via linear projections. On multiplication up to 5×5, latent-CoT accuracy matches full textual CoT; on DP, merging more states into a single token decreases accuracy (by ~9%) due to model capacity limits. Thus, intermediate steps do not need to be textually exposed but must be accessible via the internal state, confirming that CoT tokens function as abstract variables—potentially in latent embedding space—provided the read/write mechanism carries sufficient fidelity.
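The read/write mechanism can be sketched with plain linear algebra: a write projection maps a one-hot number into the hidden space, and a linear read-out decodes it back. The dimensions and the use of a pseudoinverse as the reader are illustrative choices, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_VALUES, D_MODEL = 100, 128   # hypothetical sizes

# Write: project a one-hot numeric value into the model's hidden space.
W_write = rng.standard_normal((D_MODEL, NUM_VALUES))
# Read: a linear read-out; the pseudoinverse gives an exact decoder here
# because D_MODEL >= NUM_VALUES, i.e. capacity is sufficient.
W_read = np.linalg.pinv(W_write)

def write_latent(value: int) -> np.ndarray:
    """Produce the hidden vector of a <LAT> token storing `value`."""
    onehot = np.zeros(NUM_VALUES)
    onehot[value] = 1.0
    return W_write @ onehot

def read_latent(h: np.ndarray) -> int:
    """Decode the stored number back out of the latent vector."""
    return int(np.argmax(W_read @ h))
```

When the number of representable values exceeds the hidden dimension, no exact linear read-out exists—consistent with the accuracy drop observed when more states are merged into a single token.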

4. Intervention and Causal Tracing of CoT Tokens

Controlled intervention experiments assess whether CoT tokens causally determine answers. After randomly substituting a "result token" (digit or DP cell value) mid-trace and re-generating downstream tokens, models propagate the numerical change through the solution path:

  • Multiplication: 73.8% of carry edits yield predictable shifts in final quantity.
  • DP: ~91% success rate.
  • Largest failure mode is shortcut errors—e.g., multiplying by 1 yields a direct copy, bypassing the intended computation path.

An explicit error classification (addition, reconstruction, copy error, shortcut error) documents points of divergence, empirically corroborating the variable-passing interpretation.
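The intervention protocol can be mimicked on a toy running-sum DP (an illustrative stand-in, not the paper's task): overwrite one result token mid-trace and regenerate everything downstream; in a faithful variable-passing trace the edit propagates predictably to the final answer.

```python
def dp_trace(values: list[int]) -> list[int]:
    """Running-sum DP: cell[i] = cell[i-1] + values[i]."""
    cells, acc = [], 0
    for v in values:
        acc += v
        cells.append(acc)
    return cells

def intervene(values: list[int], trace: list[int], pos: int, new_value: int) -> list[int]:
    """Overwrite the result token at `pos`, then regenerate downstream cells,
    mirroring the causal-tracing protocol."""
    cells = trace[:pos] + [new_value]
    for v in values[pos + 1:]:
        cells.append(cells[-1] + v)
    return cells

vals = [3, 1, 4, 1, 5]
base = dp_trace(vals)                   # [3, 4, 8, 9, 14]
edited = intervene(vals, base, 2, 100)  # the edit shifts the final answer by 100 - 8
```

A "shortcut error" in this framing would be a model that ignores the edited cell and copies an earlier value instead, breaking the expected shift.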

5. Complexity Limits and Computational Distance

There exists an intrinsic limit to how much computation can be "packed" between CoT tokens. Systematically merging adjacent latent tokens—collapsing multiple DP cell values into fewer, higher-dimensional tokens—increases the computational distance each token must carry. Quantitative probing with a linear decoder finds a sharp drop in both element-wise and token-level recoverability once numeric states exceed two digits or the complexity threshold is breached. This demonstrates a finite, model-dependent capacity for compositional computation between variable-passing points (Zhu et al., 8 May 2025).
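The linear-decoder probe can be sketched as follows: fit a least-squares read-out from token embeddings back to the stored states, and observe recoverability collapse once the number of distinct states exceeds the embedding dimension. Sizes are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def probe_recoverability(num_states: int, d_model: int, n_samples: int = 500) -> float:
    """Fit a least-squares linear decoder from random token embeddings back to
    one-hot states and report recovery accuracy. Once `num_states` exceeds
    `d_model`, a rank-limited linear read-out can no longer separate them."""
    E = rng.standard_normal((num_states, d_model))   # one embedding per state
    states = rng.integers(0, num_states, size=n_samples)
    X = E[states]                                    # probed hidden vectors
    Y = np.eye(num_states)[states]                   # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)        # linear decoder
    preds = (X @ W).argmax(axis=1)
    return float((preds == states).mean())

acc_small = probe_recoverability(num_states=50,  d_model=64)   # within capacity
acc_large = probe_recoverability(num_states=500, d_model=64)   # over capacity
```

The "within capacity" probe recovers states essentially perfectly, while the over-capacity probe degrades sharply—the same qualitative signature as the recoverability drop reported above.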

6. Theoretical Implications

Chain-of-thought tokens are not mere linguistic scaffolding; they act as mutable program variables internal to the Transformer. The causal effect of these tokens is direct: perturbations yield corresponding shifts in final output, except in shortcut scenarios where the model exploits structural regularities. There is a practical upper bound on the complexity of transformations between such variables—exceeding this bound leads to reasoning breakdown. Embedding reasoning traces as latent variables is possible, and only the explicit "value-passing" steps are indispensable for serial logic propagation.

7. Future Directions and Open Problems

Possible research avenues include the automated identification of variable-like tokens for broader task domains, formal characterization of computational and precision bounds for Transformer architectures in variable-passing settings, and the development of more efficient CoT protocols that compress trivial computation while preserving essential variable-passing structure. Understanding and leveraging model capacity limits, faithfulness constraints, and shortcut behavior remain open technical challenges. The realization that CoT fundamentally operates as a variable-passing mechanism within neural sequence models reframes the prompting paradigm and provides substantial interpretive leverage for advanced reasoning applications (Zhu et al., 8 May 2025).

References (1)
