Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (2402.12875v4)

Published 20 Feb 2024 in cs.LG, cs.CC, and stat.ML

Abstract: Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of LLMs on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $\mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.

Overview of "Chain of Thought Empowers Transformers to Solve Inherently Serial Problems"

The paper "Chain of Thought Empowers Transformers to Solve Inherently Serial Problems" by Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma provides a comprehensive examination of how Chain of Thought (CoT) reasoning augments the capabilities of decoder-only transformers. Specifically, the research explores understanding the expressiveness that CoT brings to transformer models, which traditionally excelled in parallel computation but struggled with inherently serial tasks.

Transformers are renowned for their efficacy in handling parallel computations due to their self-attention mechanism. However, this same architectural strength introduces limitations when facing tasks requiring serial computations. The paper theoretically and empirically demonstrates that incorporating CoT allows transformers to overcome these limitations by iteratively generating intermediate steps before arriving at the final answer.
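
To make this iterative mechanism concrete, the following is a minimal sketch of the autoregressive loop that CoT decoding relies on; the `model` callable and the `<answer_end>` stop token are hypothetical placeholders, not the paper's implementation.

```python
def generate_with_cot(model, prompt_tokens, max_steps):
    """Autoregressive CoT decoding: every generated token is appended to the
    context, so step t can condition on all earlier intermediate steps. This
    feedback loop is the source of the extra serial computation."""
    context = list(prompt_tokens)
    for _ in range(max_steps):
        next_token = model(context)       # one forward pass per CoT step
        context.append(next_token)        # intermediate step becomes new input
        if next_token == "<answer_end>":  # hypothetical stop token
            break
    return context
```

Without the loop (a single forward pass), the amount of serial computation is fixed by the number of layers; with $T$ loop iterations, it grows with $T$, which is the effect the paper formalizes.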

Theoretical Insights

Expressiveness of Chain of Thought

The analysis centers on the expressiveness of transformer models, that is, the class of computations they can represent. The authors use the language of circuit complexity to characterize this expressiveness, demonstrating that constant-depth transformers with and without CoT can solve different classes of problems.

  • Without CoT: Constant-depth transformers with constant-bit precision can only solve problems within the $\mathsf{AC}^0$ complexity class. This limitation is due to their inherent lack of serial computational capabilities, which restricts them to parallel tasks.
  • With CoT: The introduction of CoT allows constant-depth transformers, even with constant-bit precision, to emulate larger classes of computations. In particular, with $T$ steps of CoT, these transformers can solve any problem solvable by boolean circuits of size $T$ (summarized schematically below).
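
Read together, these two results can be summarized schematically, writing $\mathsf{CoT}[T]$ informally for the class of problems solvable by constant-depth, constant-precision decoder-only transformers with $O(\log n)$ embedding size and $T$ steps of CoT (a paraphrase of the paper's statements, not its exact notation):

$$\mathsf{CoT}[0] \subseteq \mathsf{AC}^0 \subsetneq \mathsf{TC}^0, \qquad \mathsf{SIZE}[T] \subseteq \mathsf{CoT}[T],$$

where $\mathsf{SIZE}[T]$ denotes the problems decidable by boolean circuits with at most $T$ gates. Taking $T$ polynomial in $n$ yields the $\mathsf{P/poly}$ correspondence discussed next.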

This theoretical analysis culminates in showing that a constant-depth transformer with CoT, $O(\log n)$ embedding size, and constant-bit precision matches the computational power of $\mathsf{P/poly}$ when the number of CoT steps is polynomial in $n$, making it capable of solving complex, serial tasks that go beyond the capabilities of the $\mathsf{TC}^0$ and $\mathsf{AC}^0$ classes.
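
The intuition behind the circuit-simulation result can be illustrated with a toy evaluator: if each CoT step writes down the value of one gate, a circuit with $T$ gates is fully evaluated after $T$ steps. The sketch below is a plain-Python analogy of this gate-per-step idea, not the paper's transformer construction.

```python
# Toy illustration: evaluating a boolean circuit one gate per "CoT step".
# A gate is (op, input_wire_indices); wires 0..n-1 hold the circuit inputs,
# and the i-th gate writes its output to wire n + i.

def evaluate_circuit(inputs, gates):
    wires = list(inputs)                      # values computed so far
    for op, idx in gates:                     # one gate == one serial step
        vals = [wires[i] for i in idx]
        if op == "AND":
            wires.append(all(vals))
        elif op == "OR":
            wires.append(any(vals))
        elif op == "NOT":
            wires.append(not vals[0])
        else:
            raise ValueError(f"unknown gate {op}")
    return wires[-1]                          # output of the last gate

# Example: (x0 AND x1) OR (NOT x2), written as three gates.
gates = [("AND", [0, 1]), ("NOT", [2]), ("OR", [3, 4])]
print(evaluate_circuit([True, False, True], gates))  # prints False
```

In the analogy, each appended wire value plays the role of a generated CoT token: later gates may read it, which is exactly the serial dependency a single forward pass of a constant-depth transformer cannot express.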

Empirical Validation

The empirical validation aligns with the theoretical insights. The paper evaluates transformer models on several inherently serial tasks:

  1. Modular Addition ($C_p$): This task involves adding numbers modulo a prime $p$, where CoT substantially enhances the transformer’s performance on longer input sequences.
  2. Permutation Composition ($S_5$): A complex task where performance without CoT parallels random guessing, while transformers with CoT achieve significantly higher accuracy.
  3. Iterated Squaring: Known for its difficulty in parallel computation, this task benefits enormously from CoT, showcasing high prediction accuracy.
  4. Circuit Value Problem (CVP): A $\mathsf{P}$-complete problem, demonstrating the necessity of serial computation which CoT successfully provides.

Each of these tasks illustrates the marked improvement in transformer performance when CoT is enabled, especially on problems that demand sequential computational steps.
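
As a concrete illustration of the serial structure these tasks share, consider permutation composition: a model asked only for the final product of a sequence of $S_5$ permutations must compress the entire computation into one forward pass, whereas a CoT-style answer can emit the running prefix products one at a time. The sketch below is illustrative only and mirrors the task format rather than the paper's exact experimental pipeline.

```python
from itertools import accumulate

def compose(p, q):
    """Compose two permutations given as tuples: (p o q)[i] = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(q)))

# A short sequence of permutations of {0, ..., 4}, i.e. elements of S_5.
perms = [(1, 0, 2, 3, 4), (0, 2, 1, 3, 4), (4, 1, 2, 3, 0)]

# Without CoT, the model must output only the final product in one shot.
# With CoT, it can emit every prefix product as an intermediate step,
# each depending only on the previous prefix and one new permutation.
prefix_products = list(accumulate(perms, compose))
print(prefix_products)       # the chain-of-thought steps
print(prefix_products[-1])   # the label a no-CoT model must predict directly
```

Each prefix product is a single, local update of the previous one, which is precisely the kind of step one CoT token can record.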

Practical and Theoretical Implications

The implications of this work are profound for both practical applications and theoretical advancements:

  • Practical Implications: Transformer models augmented with CoT reasoning steps can handle a wider range of tasks more efficiently. This can be particularly beneficial in fields requiring complex symbolic reasoning and arithmetic operations, potentially improving performance in areas such as code generation or multi-step problem solving in AI.
  • Theoretical Implications: The linkage of CoT with computational complexity theory provides a robust framework for understanding and enhancing the capabilities of AI models. By delineating the problem classes that CoT-enabled transformers can solve, this research paves the way for targeted architectural improvements and more nuanced training methods.

Future Directions

Future research may delve into optimizing the architecture for even more efficient serial computation, exploring the trade-off between depth and width of transformer layers when using CoT, and assessing the impact of different CoT prompting formats. Additionally, investigating the role of CoT in other transformer variants or multitask learning scenarios could provide further insights.

In conclusion, "Chain of Thought Empowers Transformers to Solve Inherently Serial Problems" not only provides a deep theoretical understanding of transformers’ enhanced capabilities with CoT but also empirically validates these theories through rigorous experimentation. The integration of CoT introduces a significant advancement in the expressiveness and applicability of transformer models, marking a pivotal step in the evolution of AI capabilities.

Authors (4)
  1. Zhiyuan Li (304 papers)
  2. Hong Liu (394 papers)
  3. Denny Zhou (65 papers)
  4. Tengyu Ma (117 papers)
Citations (57)