Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (2402.12875v4)

Published 20 Feb 2024 in cs.LG, cs.CC, and stat.ML

Abstract: Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of LLMs on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $\mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.

Overview of "Chain of Thought Empowers Transformers to Solve Inherently Serial Problems"

The paper "Chain of Thought Empowers Transformers to Solve Inherently Serial Problems" by Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma provides a comprehensive examination of how Chain of Thought (CoT) reasoning augments the capabilities of decoder-only transformers. Specifically, the research explores understanding the expressiveness that CoT brings to transformer models, which traditionally excelled in parallel computation but struggled with inherently serial tasks.

Transformers are renowned for their efficacy in handling parallel computations due to their self-attention mechanism. However, this same architectural strength introduces limitations when facing tasks requiring serial computations. The paper theoretically and empirically demonstrates that incorporating CoT allows transformers to overcome these limitations by iteratively generating intermediate steps before arriving at the final answer.
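
To make this iterative mechanism concrete, the following is a minimal sketch of the autoregressive loop that CoT decoding relies on; the `model` callable and the `<answer_end>` stop token are hypothetical placeholders, not the paper's implementation.

```python
def generate_with_cot(model, prompt_tokens, max_steps):
    """Autoregressive CoT decoding: every generated token is appended to the
    context, so step t can condition on all earlier intermediate steps. This
    feedback loop is the source of the extra serial computation."""
    context = list(prompt_tokens)
    for _ in range(max_steps):
        next_token = model(context)       # one forward pass per CoT step
        context.append(next_token)        # intermediate step becomes new input
        if next_token == "<answer_end>":  # hypothetical stop token
            break
    return context
```

Without the loop (a single forward pass), the amount of serial computation is fixed by the number of layers; with $T$ loop iterations, it grows with $T$, which is the effect the paper formalizes.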

Theoretical Insights

Expressiveness of Chain of Thought

The analysis centers on the expressiveness of transformer models, that is, the class of computations they can represent. The authors use the language of circuit complexity to characterize this expressiveness, demonstrating that constant-depth transformers with and without CoT can solve different classes of problems.

  • Without CoT: Constant-depth transformers with constant-bit precision can only solve problems within the $\mathsf{AC}^0$ complexity class. This limitation is due to their inherent lack of serial computational capabilities, which restricts them to parallel tasks.
  • With CoT: The introduction of CoT allows constant-depth transformers, even with constant-bit precision, to emulate larger classes of computations. In particular, with $T$ steps of CoT, these transformers can solve any problem solvable by boolean circuits of size $T$ (summarized schematically below).
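
Read together, these two results can be summarized schematically, writing $\mathsf{CoT}[T]$ informally for the class of problems solvable by constant-depth, constant-precision decoder-only transformers with $O(\log n)$ embedding size and $T$ steps of CoT (a paraphrase of the paper's statements, not its exact notation):

$$\mathsf{CoT}[0] \subseteq \mathsf{AC}^0 \subsetneq \mathsf{TC}^0, \qquad \mathsf{SIZE}[T] \subseteq \mathsf{CoT}[T],$$

where $\mathsf{SIZE}[T]$ denotes the problems decidable by boolean circuits with at most $T$ gates. Taking $T$ polynomial in $n$ yields the $\mathsf{P/poly}$ correspondence discussed next.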

This theoretical analysis culminates in showing that a constant-depth transformer with CoT, $O(\log n)$ embedding size, and constant-bit precision matches the computational power of $\mathsf{P/poly}$ when the number of CoT steps is polynomial in $n$, making it capable of solving complex, serial tasks that go beyond the capabilities of the $\mathsf{TC}^0$ and $\mathsf{AC}^0$ classes.
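
The intuition behind the circuit-simulation result can be illustrated with a toy evaluator: if each CoT step writes down the value of one gate, a circuit with $T$ gates is fully evaluated after $T$ steps. The sketch below is a plain-Python analogy of this gate-per-step idea, not the paper's transformer construction.

```python
# Toy illustration: evaluating a boolean circuit one gate per "CoT step".
# A gate is (op, input_wire_indices); wires 0..n-1 hold the circuit inputs,
# and the i-th gate writes its output to wire n + i.

def evaluate_circuit(inputs, gates):
    wires = list(inputs)                      # values computed so far
    for op, idx in gates:                     # one gate == one serial step
        vals = [wires[i] for i in idx]
        if op == "AND":
            wires.append(all(vals))
        elif op == "OR":
            wires.append(any(vals))
        elif op == "NOT":
            wires.append(not vals[0])
        else:
            raise ValueError(f"unknown gate {op}")
    return wires[-1]                          # output of the last gate

# Example: (x0 AND x1) OR (NOT x2), written as three gates.
gates = [("AND", [0, 1]), ("NOT", [2]), ("OR", [3, 4])]
print(evaluate_circuit([True, False, True], gates))  # prints False
```

In the analogy, each appended wire value plays the role of a generated CoT token: later gates may read it, which is exactly the serial dependency a single forward pass of a constant-depth transformer cannot express.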

Empirical Validation

The empirical validation aligns with the theoretical insights. The paper evaluates transformer models on several inherently serial tasks:

  1. Modular Addition ($C_p$): This task involves adding numbers modulo a prime $p$, where CoT substantially enhances the transformer’s performance on longer input sequences.
  2. Permutation Composition ($S_5$): A complex task where performance without CoT parallels random guessing, while transformers with CoT achieve significantly higher accuracy.
  3. Iterated Squaring: Known for its difficulty in parallel computation, this task benefits enormously from CoT, showcasing high prediction accuracy.
  4. Circuit Value Problem (CVP): A $\mathsf{P}$-complete problem, demonstrating the necessity of serial computation which CoT successfully provides.

Each of these tasks illustrates the marked improvement in transformer performance when CoT is enabled, especially on problems that demand sequential computational steps.
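
As a concrete illustration of the serial structure these tasks share, consider permutation composition: a model asked only for the final product of a sequence of $S_5$ permutations must compress the entire computation into one forward pass, whereas a CoT-style answer can emit the running prefix products one at a time. The sketch below is illustrative only and mirrors the task format rather than the paper's exact experimental pipeline.

```python
from itertools import accumulate

def compose(p, q):
    """Compose two permutations given as tuples: (p o q)[i] = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(q)))

# A short sequence of permutations of {0, ..., 4}, i.e. elements of S_5.
perms = [(1, 0, 2, 3, 4), (0, 2, 1, 3, 4), (4, 1, 2, 3, 0)]

# Without CoT, the model must output only the final product in one shot.
# With CoT, it can emit every prefix product as an intermediate step,
# each depending only on the previous prefix and one new permutation.
prefix_products = list(accumulate(perms, compose))
print(prefix_products)       # the chain-of-thought steps
print(prefix_products[-1])   # the label a no-CoT model must predict directly
```

Each prefix product is a single, local update of the previous one, which is precisely the kind of step one CoT token can record.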

Practical and Theoretical Implications

The implications of this work are profound for both practical applications and theoretical advancements:

  • Practical Implications: Transformer models augmented with CoT reasoning steps can handle a wider range of tasks more efficiently. This can be particularly beneficial in fields requiring complex symbolic reasoning and arithmetic operations, potentially improving performance in areas such as code generation or multi-step problem solving in AI.
  • Theoretical Implications: The linkage of CoT with computational complexity theory provides a robust framework for understanding and enhancing the capabilities of AI models. By delineating the problem classes that CoT-enabled transformers can solve, this research paves the way for targeted architectural improvements and more nuanced training methods.

Future Directions

Future research may delve into optimizing the architecture for even more efficient serial computation, exploring the trade-off between depth and width of transformer layers when using CoT, and assessing the impact of different CoT prompting formats. Additionally, investigating the role of CoT in other transformer variants or multitask learning scenarios could provide further insights.

In conclusion, "Chain of Thought Empowers Transformers to Solve Inherently Serial Problems" not only provides a deep theoretical understanding of transformers’ enhanced capabilities with CoT but also empirically validates these theories through rigorous experimentation. The integration of CoT introduces a significant advancement in the expressiveness and applicability of transformer models, marking a pivotal step in the evolution of AI capabilities.

Authors (4)
  1. Zhiyuan Li (304 papers)
  2. Hong Liu (394 papers)
  3. Denny Zhou (65 papers)
  4. Tengyu Ma (117 papers)
Citations (57)