The Expressive Power of Transformers with Chain of Thought (2310.07923v5)

Published 11 Oct 2023 in cs.LG, cs.CC, cs.CL, and cs.LO

Abstract: Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.

An Analytical Review: The Expressive Power of Transformers with Chain of Thought

The paper "The Expressive Power of Transformers with Chain of Thought" presents a rigorous theoretical analysis of the computational capabilities of transformer models when augmented with intermediate steps of reasoning, denoted as a "chain of thought" (CoT). This technique allows transformers to process intermediate tokens before generating the final output, and the central inquiry of this work is to determine how this impacts the computational power of decoder-only transformer models.

Theoretical Foundations and Questions

Recent studies have exposed limitations of standard transformers that map inputs directly to outputs without intermediate processing steps. These limitations are evident in problems like finite-state machine simulation or graph connectivity, which require forms of sequential reasoning that such models provably cannot perform (under standard complexity conjectures). The question the authors formalize is whether allowing transformers to generate and condition on sequences of intermediate tokens fundamentally increases their computational power.

To this end, the paper rigorously characterizes the conditions under which intermediate steps provide additional computational power, and asks whether such steps can bridge the gap between the capabilities of standard transformers and the complex, sequential reasoning tasks that otherwise lie beyond their reach.

Main Findings

  1. Impact of Step Complexity: The paper establishes that the computational power of a transformer is fundamentally determined by the number of intermediate steps t(n), taken as a function of the input length n.
    • With a logarithmic number of intermediate steps, transformers extend their capabilities only slightly beyond the class TC^0, and still cannot solve relatively simple problems that lie just outside it.
    • A linear number of intermediate steps (assuming projected pre-norm) lets transformers simulate finite automata and hence recognize all regular languages, while remaining within the context-sensitive languages; see the sketch after this list for the automaton intuition.
    • A polynomial number of steps (with generalized pre-norm) makes transformer decoders recognize exactly the class P of polynomial-time solvable problems, the first exact characterization of a class of transformers in terms of a standard complexity class.
  2. Lower and Upper Bounds: The authors provide both lower bounds (constructions using log-precision transformers with projected/generalized pre-norm) and upper bounds (based on time- and space-bounded simulations) on the computational power of these models. The crux of the argument is that a sufficient number of intermediate steps lets a transformer decoder simulate a Turing machine step by step for a polynomially bounded number of steps, a strictly richer regime than standard transformers can reach without intermediate generation (under standard complexity conjectures). These bounds are summarized informally after this list.
  3. Saturation and Pre-Norm: The derivations employ architectural assumptions such as strict causal masking and projected (generalized) pre-norm, and illustrate how these features can be leveraged in the constructions that establish completeness over particular classes of problems.
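
To make these regimes easier to compare, the results stated in the abstract can be summarized informally as follows. The shorthand CoT(t(n)), for decoder-only transformers allowed t(n) intermediate decoding steps on inputs of length n, is introduced here for readability and is not the paper's exact notation; the precision, uniformity, and pre-norm caveats are spelled out in the paper itself.

```latex
% Informal summary (the CoT(.) shorthand and simplifications are ours, not the
% paper's exact statements). REG = regular languages, CSL = context-sensitive
% languages, P = polynomial time. Logarithmically many steps extend standard
% transformers only slightly beyond TC^0; for longer chains of thought:
\begin{align*}
  \mathsf{REG} \;\subseteq\; \mathsf{CoT}(O(n)) \;&\subseteq\; \mathsf{CSL}
    && \text{(linear steps; projected pre-norm for the lower bound)} \\
  \mathsf{CoT}(\mathrm{poly}(n)) \;&=\; \mathsf{P}
    && \text{(polynomial steps; generalized pre-norm)}
\end{align*}
```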
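
As a concrete illustration of the linear-step regime, the following is a minimal Python sketch of the intuition only, not the paper's transformer construction: a decoder that appends one "state" token per input symbol and conditions on the token it just wrote is, in effect, running a finite automaton, which is why a linear chain of thought suffices for regular languages. The example DFA (even number of b's) and all identifiers are hypothetical.

```python
# Intuition sketch: a chain of thought with one state token per input symbol
# amounts to running a DFA. This is NOT the paper's construction, just an
# illustration of why linearly many intermediate tokens cover regular languages.

# Hypothetical DFA over {a, b}: accepts strings with an even number of b's.
DFA = {
    "start": "even",
    "accept": {"even"},
    "delta": {
        ("even", "a"): "even", ("even", "b"): "odd",
        ("odd", "a"): "odd",   ("odd", "b"): "even",
    },
}

def chain_of_thought_run(dfa, input_string):
    """Emit one intermediate 'state' token per input symbol, then answer
    based on the last emitted token (mirroring a decoder's scratchpad)."""
    scratchpad = [dfa["start"]]          # intermediate tokens written so far
    for symbol in input_string:
        prev_state = scratchpad[-1]      # condition only on the last token
        scratchpad.append(dfa["delta"][(prev_state, symbol)])
    return scratchpad, scratchpad[-1] in dfa["accept"]

if __name__ == "__main__":
    cot, accepted = chain_of_thought_run(DFA, "abbab")
    print("scratchpad:", " ".join(cot))  # even even odd even even odd
    print("accepted:", accepted)         # False: "abbab" has three b's
```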

Implications and Future Directions

The authors’ findings have substantial implications for the design and application of transformer models, particularly in areas requiring complex, hierarchical reasoning. Practically, transformers equipped with CoT can be expected to perform more robustly on tasks that inherently involve sequential logic. Theoretically, the results suggest that mechanisms such as CoT can substantially extend the class of problems learning systems can tackle, in particular those that require procedural reasoning.

Future research may focus on harnessing these theoretical advances in practice, investigating learning dynamics under CoT, and developing training strategies that realize the benefits already observed empirically. Additionally, as the authors speculate, developing learning-theoretic analyses that account for CoT is a promising direction that could meaningfully shape machine learning methodology.

Overall, this paper provides a detailed formal framework to understand how transformer limitations on sequential reasoning may be mitigated through thoughtful architectural enhancements. Such insights could catalyze a new wave of research into models capable of handling tasks that demand deeper and more complex forms of cognition.

Authors (2)
  1. William Merrill (36 papers)
  2. Ashish Sabharwal (84 papers)