Efficiently Representing Algorithms With Chain-of-Thought Transformers

Published 18 Jun 2026 in cs.LG, cs.AI, and cs.CL | (2606.19697v1)

Abstract: The increasing popularity of reasoning models (LLMs that output a series of reasoning or thought tokens before producing an answer) is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the Word RAM model with random-access memory and unit-cost operations on $\mathcal{O}(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: Can CoT transformers efficiently simulate Word RAM algorithms? For instance, can they sort $n$ items in $\mathcal{O}(n \log n)$ steps or run Dijkstra's algorithm in $\mathcal{O}(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: continuous CoT, where reasoning takes the form of vectors rather than tokens, and a hybrid architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT can efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a "flat" instruction set, and only logarithmic for multiplication-free flat instructions, in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.


Authors (4)

Summary

The paper establishes that chain-of-thought transformers can simulate Word RAM algorithms with only polylogarithmic overhead per step.
It presents three architectures that efficiently handle random memory access and bit-serialization: polylog-width transformers with rightmost unique hard attention (UHAT), fixed-width transformers with continuous CoT, and transformer-RNN hybrids.
The work bridges theoretical models and practical LLM design by highlighting the minimal architectural extensions needed for efficient algorithmic reasoning.

Efficient Simulation of Word RAM Algorithms by Chain-of-Thought Transformers

Introduction

The paper "Efficiently Representing Algorithms With Chain-of-Thought Transformers" (2606.19697) examines how efficiently transformer architectures augmented with chain-of-thought (CoT) reasoning can simulate algorithms expressed in the Word RAM model. Claims about algorithmic reasoning in transformers and LLMs have so far relied on Turing machine (TM) simulations, which are theoretically universal but inefficient for practical textbook algorithms because TMs lack random memory access and incur high simulation overhead. This work asks whether CoT-based transformers can simulate general algorithms at close to their standard (Word RAM) complexities, bridging an important gap between expressivity and efficiency.

Background and Problem Statement

The Turing machine is not a natural model for algorithmic reasoning because of its sequential memory access and the quadratic overhead it incurs when simulating RAM algorithms. In contrast, the Word RAM model aligns well with textbook algorithms by providing random-access memory and constant-time operations on word-sized data. Simulating an $n$-step Word RAM algorithm on a TM, or through prior CoT constructions, can therefore cost up to $\mathcal{O}(n^2)$ steps, which makes such approaches impractical for tasks like sorting ($\mathcal{O}(n \log n)$ on RAM; $\Omega(n^2)$ on TMs [HENNIE1965553]) or Dijkstra's algorithm ($\mathcal{O}(E + V\log V)$ on RAM).

The core research question is: Can CoT transformers simulate arbitrary Word RAM algorithms at a per-step overhead that is only polylogarithmic in the input size $n$, rather than quadratic?

Main Results

The paper provides formal constructions showing that, for several important classes of transformer or hybrid architectures, CoT reasoning enables efficient and faithful simulation of Word RAM algorithms up to a poly-logarithmic overhead. This is achieved in three architectures of increasing uniformity and practicality:

Polylogarithmic-Width Transformers with Rightmost Unique Hard Attention (UHAT): These models, with fixed finite precision and $\mathcal{O}(\log^2 n)$ width, can simulate any $t$-step Word RAM algorithm with word size $w = \mathcal{O}(\log n)$ using $\mathcal{O}(t w^2)$ CoT tokens. For instruction sets excluding multiplication/division/modulo, the overhead tightens to $\mathcal{O}(w)$ tokens per step, matching the information-theoretic lower bound for discrete token representations (one token per bit per step).
Fixed-Width Transformers with Continuous Chain-of-Thought (CoT): The reasoning steps are represented as vectors rather than tokens, and each decoding step emits a soft token alongside a discrete output token. Such models can efficiently serialize/deserialize word-sized operands and achieve the same polylogarithmic per-step overhead, removing the need for growing model width and yielding uniform parameterization.
Hybrid Architectures (Transformer + Linear RNN): By adding a single linear RNN layer (e.g., RWKV-like or DeltaNet), bit-serialization of operands (writing and reading word values bit-by-bit) is handled in the RNN hidden state, allowing fixed-width transformers to efficiently implement Word RAM simulation within the same poly-logarithmic overhead.

For all three, the per-step overhead is polylogarithmic in the input size $n$: specifically, $\mathcal{O}(\log^2 n)$ for "full" arithmetic instructions (multiplication/division) and $\mathcal{O}(\log n)$ for simple instruction sets. These results contrast sharply with previous TM-based CoT simulations, where the incurred overhead is always at least quadratic in $n$.
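
As a rough worked example, the total CoT length follows from multiplying the step count $t$ by the per-step overhead, with $w = \mathcal{O}(\log n)$ and constants ignored. The sorting instance assumes a standard $\mathcal{O}(n \log n)$-step Word RAM sort; it is an illustration derived from the stated bounds, not a figure quoted from the paper.

$$
\begin{aligned}
T_{\text{full}}(t) &= t \cdot \mathcal{O}(w^2) = \mathcal{O}(t \log^2 n), \\
T_{\text{simple}}(t) &= t \cdot \mathcal{O}(w) = \mathcal{O}(t \log n), \\
t = \mathcal{O}(n \log n) \;&\Longrightarrow\; T_{\text{full}} = \mathcal{O}(n \log^3 n), \quad T_{\text{simple}} = \mathcal{O}(n \log^2 n).
\end{aligned}
$$

Under these assumptions both totals grow more slowly than $n^2$.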

Technical Contributions

The work’s simulation techniques incorporate several crucial design elements:

Unified Memory Representation: All program state (registers, memory, program counter) is collapsed into a random-access memory block, with state persisted in the CoT transcript through memory blocks encoding address–value pairs.
Attention Routing: For polylog-width transformers, equality over word-sized addresses is realized via inner products in the residual stream, and rightmost UHAT ensures random memory access can be simulated at low overhead.
Bit-Serialization Protocol: The challenge of representing word-sized values as binary tokens in low-width/fixed-width settings is handled through iterative serialization: bits are extracted via division and modulo 2, with the intermediate value carried either by the persistent soft tokens of continuous CoT or by recurrent states.
Recursion Invariant: Bounded-value normalization ensures at every serialization step the operand fits the current transcript position, which is critical for iterative bit extraction and recovery.
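
The bit-serialization idea in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's construction: the function names `serialize` and `deserialize` are hypothetical, a plain Python integer stands in for the soft token or recurrent state, and the bit order (least significant first) is chosen for convenience.

```python
from typing import List


def serialize(value: int, word_bits: int) -> List[int]:
    """Emit word_bits bits of value, least significant first, via repeated mod/div by 2."""
    bits = []
    for _ in range(word_bits):
        bits.append(value % 2)   # the current lowest bit becomes the next emitted token
        value //= 2              # the carried state shrinks by one bit
    return bits


def deserialize(bits: List[int]) -> int:
    """Rebuild the word with the linear recurrence h <- 2*h + b, most significant bit first."""
    h = 0
    for b in reversed(bits):     # serialize() emitted LSB first, so read in reverse
        h = 2 * h + b
    return h


word_bits = 8                    # illustrative word size
bits = serialize(173, word_bits)
restored = deserialize(bits)     # equals 173 again
```

If the input is below `2**word_bits`, the carried value after `k` iterations is below `2**(word_bits - k)`, which is one simple way to read the bounded-value invariant. The update `h <- 2*h + b` is linear in the state, which is why a linear recurrent layer is a plausible home for it.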

Together, these elements support a simulation loop in which each RAM step triggers a fixed sequence of layer and attention operations: load the program counter, dereference operands (possibly nested), execute the instruction, and commit memory updates. Each of these is mapped to a small number of CoT tokens and/or reasoning vectors.
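
The loop described above can be mimicked by a toy interpreter whose only state is an append-only list of address-value records. Everything here is an assumption made for illustration: the instruction set (`set`, `add`, `load`, `jlt`, `halt`), the reserved address `PC`, and the function names are hypothetical, and the backward scan in `read` merely stands in for a single rightmost hard-attention lookup.

```python
from typing import List, Tuple

Record = Tuple[int, int]            # (address, value) pair appended to the transcript
Instr = Tuple[str, int, int, int]   # (op, dst, a, b)
PC = -1                             # hypothetical reserved address for the program counter


def read(transcript: List[Record], addr: int) -> int:
    """Return the latest value written to addr (rightmost match), or 0 if never written."""
    for a, v in reversed(transcript):
        if a == addr:
            return v
    return 0


def step(program: List[Instr], transcript: List[Record], word_bits: int) -> bool:
    """Execute one toy instruction; return False once the program halts."""
    mask = (1 << word_bits) - 1
    pc = read(transcript, PC)                    # load program counter
    op, dst, a, b = program[pc]                  # fetch instruction
    next_pc = pc + 1
    if op == "halt":
        return False
    if op == "set":                              # mem[dst] = constant a
        transcript.append((dst, a & mask))
    elif op == "add":                            # mem[dst] = mem[a] + mem[b]
        transcript.append((dst, (read(transcript, a) + read(transcript, b)) & mask))
    elif op == "load":                           # mem[dst] = mem[mem[a]] (nested dereference)
        transcript.append((dst, read(transcript, read(transcript, a))))
    elif op == "jlt":                            # if mem[a] < mem[b], jump to dst
        if read(transcript, a) < read(transcript, b):
            next_pc = dst
    else:
        raise ValueError(f"unknown op: {op}")
    transcript.append((PC, next_pc))             # commit the new program counter
    return True


def run(program: List[Instr], transcript: List[Record],
        word_bits: int = 8, max_steps: int = 1000) -> List[Record]:
    for _ in range(max_steps):
        if not step(program, transcript, word_bits):
            break
    return transcript


# Sum the three values stored at addresses 100..102 into address 2.
program = [
    ("set", 0, 100, 0),   # pointer i = 100
    ("set", 1, 103, 0),   # end = 103
    ("set", 4, 1, 0),     # constant one
    ("load", 3, 0, 0),    # tmp = mem[i]
    ("add", 2, 2, 3),     # acc += tmp
    ("add", 0, 0, 4),     # i += 1
    ("jlt", 3, 0, 1),     # loop while i < end
    ("halt", 0, 0, 0),
]
transcript = run(program, [(PC, 0), (100, 5), (101, 7), (102, 9)])
total = read(transcript, 2)   # 5 + 7 + 9 = 21
```

Each executed instruction appends at most two records, so the transcript grows linearly in the number of RAM steps. In the constructions summarized here each record would additionally be spelled out as bits, which is where the extra logarithmic factors come from.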

Implications of the Results

Theoretical Implications

Expressivity vs. Efficiency: The work establishes that transformers, with minimal architectural extensions (rightmost UHAT or RNN layers), match the algorithmic efficiency of the Word RAM model within a polylogarithmic factor. This closes the gap between coarse universality/existence results and practical algorithmic simulation.
Minimal Sufficient Extensions: The necessity of rightmost UHAT, continuous CoT, or RNN recurrence for achieving efficient random access is highlighted; weaker attention mechanisms (e.g., leftmost UHAT, average hard attention alone) are shown to be insufficient for the efficient attention routing required for RAM simulation.
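
One intuition for why the rightmost variant matters, offered as an illustration rather than as the paper's argument: when memory is stored as an append-only transcript, an address that has been overwritten matches several records, and only the last one holds its current contents. The helper names below are hypothetical.

```python
from typing import List, Optional, Tuple

Record = Tuple[int, int]  # (address, value)


def rightmost_match(transcript: List[Record], addr: int) -> Optional[int]:
    """Value of the last record whose address equals addr, or None if absent."""
    for a, v in reversed(transcript):
        if a == addr:
            return v
    return None


def leftmost_match(transcript: List[Record], addr: int) -> Optional[int]:
    """Value of the first record whose address equals addr, or None if absent."""
    for a, v in transcript:
        if a == addr:
            return v
    return None


# Address 7 is written twice, with a write to address 3 in between.
history = [(7, 10), (3, 99), (7, 42)]
latest = rightmost_match(history, 7)   # 42, the current contents
stale = leftmost_match(history, 7)     # 10, a value that was later overwritten
```

With leftmost selection, a read would keep returning the first write unless the simulation spent extra work invalidating old records.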

Practical Implications

Algorithmic Reasoning in LLMs: Models with explicit CoT or architectural augmentation (e.g., via recurrence or continuous, soft-state reasoning steps) are, in principle, capable of executing textbook algorithms with minimal overhead. This bolsters the theoretical foundation for using CoT-style prompting, architectural tweaks, or hybrid RNN-attention models in algorithmic LMs.
Choice of Model Architecture: Fixed-width transformers are insufficient without additional mechanisms for random-access simulation; practical design of efficient algorithmic models will likely require recurrence (as in RWKV/RNN–Transformer hybrids) or forms of scratch-space/persistent vector tokens.

Limitations and Future Directions

Certain realism gaps remain. The constructions rely on idealized components (perfect hard attention, precise vector arithmetic, unbounded sequence length) and do not address sample efficiency, learning dynamics, or robustness under training. It remains open to what extent the minimal extensions identified are required in practice, and whether efficiently trainable transformers can approach these theoretical bounds.

Future Research Directions

Sample-Efficient Training: Extending these theoretical constructions to architectures and learning protocols that enable efficient training remains open.
Tradeoffs in Practical Modeling: Characterizing the overhead induced by more realistic constraints (soft, noisy, or approximate attention, finite-length context, imperfect precision) is an important next step.
Empirical Study of Algorithmic Generalization: Directly testing whether current architectures that use the proposed hybrid recurrence or continuous CoT can learn and generalize textbook algorithms with the predicted efficiency.

Conclusion

This work establishes that, for multiple classes of transformers and hybrid models leveraging chain-of-thought reasoning, Word RAM algorithms can be simulated with only polylogarithmic overhead per step—far more efficiently than prior Turing-completeness-based constructions. The results highlight the necessity of specific architectural extensions (rightmost UHAT, continuous CoT, or hybrid recurrence) to achieve efficient random access, and clarify the algorithmic boundaries of transformer expressivity under chain-of-thought reasoning. These insights inform both theoretical understanding and practical design of LLMs capable of efficient algorithmic reasoning.