Let's Think Dot by Dot: Hidden Computation in Transformer Language Models (2404.15758v1)

Published 24 Apr 2024 in cs.CL and cs.AI

Abstract: Chain-of-thought responses from LLMs improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about LLMs engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.


Summary

  • The paper shows that filler tokens enable near-perfect performance on algorithmic tasks like 2SUM and 3SUM.
  • The paper demonstrates that intermediate filler tokens facilitate hidden parallel computation, boosting performance as task complexity grows.
  • The paper reveals that harnessing filler tokens requires dense, task-specific supervision to fully exploit transformers' computational potential.

Unveiling the Computational Power of Intermediate Tokens in Transformers

Introduction

Transformer-based LLMs typically solve complex tasks either by answering directly or by producing chain-of-thought reasoning before the answer. Recent empirical findings, however, point to a gap between the chain-of-thought text a model emits and the computation it actually performs internally, raising questions about what the visible reasoning steps reflect. This paper examines how transformers perform when meaningless filler tokens, such as sequences of dots ('......'), replace functional chain-of-thought tokens. Surprisingly, transformers can achieve high accuracy on specific tasks with these filler tokens alone, departing from the conventional view that intermediate tokens matter because of their content.
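For concreteness, here is a minimal sketch of the three prompting regimes the paper contrasts: answering immediately, answering after a genuine chain of thought, and answering after a comparable budget of meaningless filler tokens. The instance, question wording, and filler budget below are assumptions for illustration, not the paper's exact experimental format.

```python
# Illustrative sketch only; the token strings below are assumptions,
# not the prompt format used in the paper's experiments.
instance = "3 1 4 1 5 9 2 6"
question = "Do two entries sum to 10?"

# 1. Immediate answer: no intermediate tokens before the prediction.
direct_prompt = f"{instance} {question} Answer:"

# 2. Chain of thought: meaningful intermediate tokens spell out the work.
cot_prompt = f"{instance} {question} 3+1=4, 3+4=7, ..., 1+9=10, so yes. Answer:"

# 3. Filler tokens: a comparable budget of intermediate positions, but every
#    token is a meaningless '.', so any useful computation must happen in the
#    hidden states above these positions rather than in the tokens themselves.
filler_prompt = f"{instance} {question} " + ". " * 20 + "Answer:"

for prompt in (direct_prompt, cot_prompt, filler_prompt):
    print(prompt)
```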

Main Findings

The paper presents significant insights into the use of filler tokens within transformer models. Here are the key findings:

  1. Performance with Filler Tokens: Despite their lack of meaningful content, filler tokens enable transformers to solve complex algorithmic tasks like 2SUM and 3SUM with perfect or near-perfect accuracy under certain conditions. This contrasts sharply with the performance of these models when no intermediate tokens are provided.
  2. Mechanism of Action: Controlled experiments show that the hidden-state computation carried out above the filler-token positions performs the intermediate work. This hidden computation is crucial for solving tasks that would otherwise exceed the model's capability when it answers directly or follows a standard chain of thought lacking parallelizable structure (a sketch of this parallel structure follows the list).
  3. Scale and Complexity Dependencies: The benefit of filler tokens grows with problem scale: as inputs become longer or higher dimensional, the model given filler tokens increasingly outperforms the model that answers directly.
  4. Expressivity within Computational Limits: The expressive power of transformers using filler tokens remains within the circuit complexity class TC^0. Although filler tokens do not allow transformers to solve problems outside this class, they significantly enhance the model's power to address complex within-class problems by enabling parallelizable computation.
  5. Learning from Dense Supervision: The findings also underscore the difficulty in training transformers to leverage filler tokens effectively. Dense, task-specific supervision is required, as models fail to learn adequately from filler tokens when only exposed to standard or adaptive chain-of-thought data.
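As a minimal illustration of the parallel structure mentioned in point 2, here is a reference solver for a 3SUM-style task. The mod-10 formulation is an assumption about the exact task variant; the point of the sketch is only that every candidate triple can be checked independently of the others, which is the kind of computation that can be distributed across filler-token positions.

```python
from itertools import combinations

def three_sum_mod(xs, p=10):
    """Return True if some triple of entries sums to 0 mod p.
    The mod-10 formulation is an assumed variant of the paper's 3SUM task."""
    checks = [(xs[i] + xs[j] + xs[k]) % p == 0
              for i, j, k in combinations(range(len(xs)), 3)]
    # Every element of `checks` is independent of the others: this is the
    # parallelizable structure that hidden computation over filler tokens
    # can exploit.
    return any(checks)

print(three_sum_mod([3, 1, 4, 1, 5, 9]))  # True: 1 + 4 + 5 = 10, which is 0 mod 10
```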

Theoretical Implications

The utility of non-informative filler tokens pushes the boundary of what counts as usable computation in transformer models. It raises theoretical questions about the limits of computation in these architectures and points to a previously overlooked route through which they can process information. Although filler tokens carry no content of their own, the extra sequence positions they occupy give the model additional hidden state in which to compute, and that hidden computation is evidently enough to improve final task performance.
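As a concrete instance of the characterization stated in the abstract, the 3SUM-style task can be written as a first-order formula whose quantifier depth comes from three nested existentials over positions; the modulus and the ordering constraint below are illustrative assumptions about the exact task variant:

$$\exists i\, \exists j\, \exists k \;\bigl( i < j < k \;\wedge\; x_i + x_j + x_k \equiv 0 \pmod{10} \bigr)$$

Each assignment of (i, j, k) can be verified independently of the others, and it is for problems with this kind of parallel quantified structure that the paper argues additional token positions, even meaningless ones, supply usable computation.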

Future Directions

Given the surprising utility of filler tokens, future research should explore several avenues:

  • Richer Task Structures: Investigate whether the benefits of filler tokens observed in mathematical tasks translate effectively to more complex and varied natural language processing tasks.
  • Optimization and Learning: Develop learning algorithms and training regimes that can better harness the computational benefits of filler tokens without extensive supervision.
  • Architectural Innovations: Examine if modifications to the transformer architecture could further capitalize on the hidden computational capabilities highlighted by filler tokens.
  • Ethical and Transparency Considerations: Address the potential challenges that arise from using intermediate tokens whose computational contributions are not transparent, especially in high-stakes applications.

Conclusion

This paper provides robust empirical evidence and a theoretical framework that challenges the existing paradigms about intermediate token utility in transformers. The ability of transformers to utilize non-informative filler tokens to achieve high performance on complex tasks not only broadens our understanding of their computational capabilities but also sets the stage for innovative approaches to model design and deployment in AI systems.
