Let's Think Dot by Dot: Hidden Computation in Transformer Language Models (2404.15758v1)
Abstract: Chain-of-thought responses from LLMs improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about LLMs engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.
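To make the contrast in the abstract concrete, here is a minimal sketch of the three prompt formats it compares: answering immediately, answering after a chain of thought, and answering after meaningless filler tokens ('.'). The helper names, the example task, and the filler length are illustrative assumptions, not the paper's exact data pipeline.

```python
# Sketch of the three prompt formats contrasted in the abstract.
# All names and the example question are illustrative, not the paper's setup.

def immediate_prompt(question: str) -> str:
    # No intermediate tokens: the model must produce the answer directly.
    return f"{question}\nAnswer:"

def chain_of_thought_prompt(question: str, steps: list[str]) -> str:
    # Meaningful intermediate tokens that spell out the computation.
    return f"{question}\n" + " ".join(steps) + "\nAnswer:"

def filler_prompt(question: str, n_filler: int = 12) -> str:
    # Meaningless intermediate tokens: each '.' carries no task information,
    # but every extra position gives the transformer another column of
    # parallel computation before the answer token is emitted.
    return f"{question}\n" + " ".join(["."] * n_filler) + "\nAnswer:"

if __name__ == "__main__":
    q = "Do any three of these numbers sum to 0? 3 -5 2 -2 4"
    print(filler_prompt(q))
    # -> Do any three of these numbers sum to 0? 3 -5 2 -2 4
    #    . . . . . . . . . . . .
    #    Answer:
```

The point of the filler format is that the intermediate tokens are interchangeable and content-free, so any performance gain over the immediate format must come from the extra computation the added positions allow, not from information written into the chain. As one illustration of the abstract's quantifier-depth characterization, a question like the 3SUM-style one above is naturally expressed by a first-order formula with a small number of nested existential quantifiers, roughly $\exists i\,\exists j\,\exists k.\ x_i + x_j + x_k = 0$ (an illustrative rendering, not the paper's formal statement).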