The Expressive Power of Transformers with Chain of Thought (2310.07923v5)

Published 11 Oct 2023 in cs.LG, cs.CC, cs.CL, and cs.LO

Abstract: Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.

An Analytical Review: The Expressive Power of Transformers with Chain of Thought

The paper "The Expressive Power of Transformers with Chain of Thought" presents a rigorous theoretical analysis of the computational capabilities of transformer models when augmented with intermediate steps of reasoning, denoted as a "chain of thought" (CoT). This technique allows transformers to process intermediate tokens before generating the final output, and the central inquiry of this work is to determine how this impacts the computational power of decoder-only transformer models.

Theoretical Foundations and Questions

Recent studies have exposed limitations of standard transformers that map inputs directly to outputs without intermediate processing steps. These limitations are evident in problems like finite-state machine simulation or graph connectivity, which require forms of sequential reasoning that such models provably cannot perform (under standard complexity conjectures). The question the authors formalize is whether allowing transformers to generate and condition on sequences of intermediate tokens fundamentally increases their computational power.

To this end, the paper rigorously characterizes the conditions under which intermediate steps provide additional computational power, and asks whether such steps can bridge the gap between the capabilities of standard transformers and the complex, sequential reasoning tasks that otherwise lie beyond their reach.

Main Findings

  1. Impact of Step Complexity: The paper establishes that the computational power of a transformer is fundamentally determined by the number of intermediate steps t(n), taken as a function of the input length n.
    • With a logarithmic number of intermediate steps, transformers extend their capabilities only slightly beyond the class TC^0, and still cannot solve relatively simple problems that lie just outside it.
    • A linear number of intermediate steps (assuming projected pre-norm) lets transformers simulate finite automata and hence recognize all regular languages, while remaining within the context-sensitive languages; see the sketch after this list for the automaton intuition.
    • A polynomial number of steps (with generalized pre-norm) makes transformer decoders recognize exactly the class P of polynomial-time solvable problems, the first exact characterization of a class of transformers in terms of a standard complexity class.
  2. Lower and Upper Bounds: The authors provide both lower bounds (constructions using log-precision transformers with projected/generalized pre-norm) and upper bounds (based on time- and space-bounded simulations) on the computational power of these models. The crux of the argument is that a sufficient number of intermediate steps lets a transformer decoder simulate a Turing machine step by step for a polynomially bounded number of steps, a strictly richer regime than standard transformers can reach without intermediate generation (under standard complexity conjectures). These bounds are summarized informally after this list.
  3. Saturation and Pre-Norm: The derivations employ architectural assumptions such as strict causal masking and projected (generalized) pre-norm, and illustrate how these features can be leveraged in the constructions that establish completeness over particular classes of problems.
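
To make these regimes easier to compare, the results stated in the abstract can be summarized informally as follows. The shorthand CoT(t(n)), for decoder-only transformers allowed t(n) intermediate decoding steps on inputs of length n, is introduced here for readability and is not the paper's exact notation; the precision, uniformity, and pre-norm caveats are spelled out in the paper itself.

```latex
% Informal summary (the CoT(.) shorthand and simplifications are ours, not the
% paper's exact statements). REG = regular languages, CSL = context-sensitive
% languages, P = polynomial time. Logarithmically many steps extend standard
% transformers only slightly beyond TC^0; for longer chains of thought:
\begin{align*}
  \mathsf{REG} \;\subseteq\; \mathsf{CoT}(O(n)) \;&\subseteq\; \mathsf{CSL}
    && \text{(linear steps; projected pre-norm for the lower bound)} \\
  \mathsf{CoT}(\mathrm{poly}(n)) \;&=\; \mathsf{P}
    && \text{(polynomial steps; generalized pre-norm)}
\end{align*}
```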
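
As a concrete illustration of the linear-step regime, the following is a minimal Python sketch of the intuition only, not the paper's transformer construction: a decoder that appends one "state" token per input symbol and conditions on the token it just wrote is, in effect, running a finite automaton, which is why a linear chain of thought suffices for regular languages. The example DFA (even number of b's) and all identifiers are hypothetical.

```python
# Intuition sketch: a chain of thought with one state token per input symbol
# amounts to running a DFA. This is NOT the paper's construction, just an
# illustration of why linearly many intermediate tokens cover regular languages.

# Hypothetical DFA over {a, b}: accepts strings with an even number of b's.
DFA = {
    "start": "even",
    "accept": {"even"},
    "delta": {
        ("even", "a"): "even", ("even", "b"): "odd",
        ("odd", "a"): "odd",   ("odd", "b"): "even",
    },
}

def chain_of_thought_run(dfa, input_string):
    """Emit one intermediate 'state' token per input symbol, then answer
    based on the last emitted token (mirroring a decoder's scratchpad)."""
    scratchpad = [dfa["start"]]          # intermediate tokens written so far
    for symbol in input_string:
        prev_state = scratchpad[-1]      # condition only on the last token
        scratchpad.append(dfa["delta"][(prev_state, symbol)])
    return scratchpad, scratchpad[-1] in dfa["accept"]

if __name__ == "__main__":
    cot, accepted = chain_of_thought_run(DFA, "abbab")
    print("scratchpad:", " ".join(cot))  # even even odd even even odd
    print("accepted:", accepted)         # False: "abbab" has three b's
```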

Implications and Future Directions

The authors’ findings have substantial implications for the design and application of transformer models, particularly in areas requiring complex, hierarchical reasoning. Practically, transformers equipped with CoT can be expected to perform more robustly on tasks that inherently involve sequential logic. Theoretically, the results suggest that mechanisms such as CoT can substantially extend the class of problems learning systems can tackle, in particular those that require procedural reasoning.

Future research may focus on harnessing these theoretical advances in practice, investigating learning dynamics under CoT, and developing training strategies that realize the benefits already observed empirically. Additionally, as the authors speculate, developing learning-theoretic analyses that account for CoT is a promising direction that could meaningfully shape machine learning methodology.

Overall, this paper provides a detailed formal framework to understand how transformer limitations on sequential reasoning may be mitigated through thoughtful architectural enhancements. Such insights could catalyze a new wave of research into models capable of handling tasks that demand deeper and more complex forms of cognition.

Authors (2)
  1. William Merrill (36 papers)
  2. Ashish Sabharwal (84 papers)