
The pitfalls of next-token prediction (2403.06963v2)

Published 11 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern and correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference, crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. Finally, we provide preliminary evidence that this failure can be resolved using a simple modification that predicts multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available under https://github.com/gregorbachmann/Next-Token-Failures


Summary

  • The paper argues that teacher-forcing, not just autoregressive inference, can fail on "lookahead" tasks that require planning ahead during next-token prediction.
  • It introduces a minimal path-finding task on path-star graphs that exposes two failure modes: the Clever Hans cheat and the Indecipherable Token failure.
  • Empirical results on Transformers and Mamba exhibit both failures, and predicting multiple future tokens at once provides preliminary evidence of a fix.

Exploring the Limits of Next-Token Prediction in LLMs

The Distinct Phases of Next-Token Prediction

In recent years, next-token prediction (NTP) has become the central paradigm for training generative language models (LMs), notably driving the success of models such as GPT-3. NTP is the task of predicting the probability distribution of the next token in a sequence, given all previous tokens. This single objective underpins both the training phase, through a mechanism known as teacher-forcing, and the inference phase, via autoregressive generation. Despite the widespread adoption and success of this method, concerns linger about its adequacy for tasks requiring complex planning or reasoning.
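
To make the two phases concrete, here is a minimal sketch assuming a PyTorch-style `model` that maps a `(batch, length)` tensor of token ids to per-position next-token logits; the interface is illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def teacher_forced_loss(model, tokens):
    # Training: the ground-truth prefix tokens[:, :t] conditions the
    # prediction of tokens[:, t] at every position, in one parallel pass.
    logits = model(tokens[:, :-1])               # (batch, T-1, vocab)
    targets = tokens[:, 1:]                      # (batch, T-1)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

@torch.no_grad()
def autoregressive_decode(model, prefix, num_steps):
    # Inference: each new token is conditioned on the model's own
    # previous outputs, so early mistakes propagate forward.
    tokens = prefix
    for _ in range(num_steps):
        logits = model(tokens)                   # (batch, t, vocab)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```

Only training ever sees the ground truth; at inference the model consumes its own outputs, which is where the familiar compounding-error criticism focuses.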

At its core, the chain rule of probability assures us that any sequence-generation task can be decomposed into a series of NTP tasks. Yet a deep-rooted skepticism persists: errors can compound during autoregressive inference, calling into question the model's capability for intricate planning.
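
Formally, the factorization is exact: for any sequence $s = (s_1, \ldots, s_T)$,

$$p(s_1, \ldots, s_T) = \prod_{t=1}^{T} p(s_t \mid s_1, \ldots, s_{t-1}),$$

so nothing is lost in principle by modeling one token at a time. The question the paper raises is whether this factorization is learnable via teacher-forcing, not whether it is expressive enough.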

Unveiling a More Profound Issue

A deeper, less explored issue lies not in autoregressive inference but in the training phase itself: teacher-forcing. The consensus presumes that teacher-forcing successfully teaches the model to predict the next token accurately, making any shortcomings a matter of execution rather than learning. We argue instead that in certain "lookahead tasks," where an early token of the answer can only be produced by planning over tokens that appear later, teacher-forcing can fundamentally fail to learn the required planning mechanism in the first place.

The Path-Star Example: A Case of Inherent Failure

To crystallize this concern, we introduce a minimal task exemplifying the failure of NTP under teacher-forcing: a path-finding problem on directed "path-star" graphs, in which several paths radiate from a central node and the model must output the unique path from the center to a designated leaf. The task highlights two endemic issues. The first, the Clever Hans cheat, describes the model's reliance on spurious shortcuts made available by exposing partial ground truth during training: once a token of the correct path is revealed, subsequent tokens can be read off the graph's edges, which simplifies fitting the training data but destroys generalization. The second, the Indecipherable Token failure, is a consequence of the first: because the shortcuts absorb most of the learning signal, the genuinely critical tokens receive too little effective supervision to be learned at all.
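
As a hypothetical illustration of the setup (the names and encoding here are ours, not the released code at https://github.com/gregorbachmann/Next-Token-Failures), a task instance might be generated like this:

```python
import random

def make_path_star_instance(num_arms=5, arm_len=4, num_nodes=50, rng=random):
    # Build a "path-star" graph: num_arms disjoint paths of length
    # arm_len, all emanating from a single central node.
    nodes = rng.sample(range(num_nodes), 1 + num_arms * arm_len)
    center, rest = nodes[0], nodes[1:]
    arms = [rest[i * arm_len:(i + 1) * arm_len] for i in range(num_arms)]
    edges = []
    for arm in arms:
        edges.append((center, arm[0]))
        edges.extend(zip(arm, arm[1:]))
    rng.shuffle(edges)                       # hide the answer's ordering
    goal_arm = rng.choice(arms)
    target_path = [center] + goal_arm        # unique center-to-leaf path
    # Prefix: shuffled edge list plus (start, goal); target: the path.
    prefix = [t for e in edges for t in e] + [center, goal_arm[-1]]
    return prefix, target_path
```

Under teacher-forcing, every target token after the first arm node can be copied from the edge list given the previous ground-truth token; only the branching decision at the center truly requires lookahead, and that is precisely the token the cheat starves of supervision.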

Empirical Validation and Beyond

Our empirical investigations across architectures (Transformers and Mamba) demonstrate both failure modes, even though the task is conceptually straightforward to learn. Remarkably, alternative training objectives that sidestep standard teacher-forcing, specifically ones that train the model to predict several future tokens at once, show potential to overcome these failures, albeit in limited settings so far.
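
One way such an objective could look, as a minimal sketch rather than the paper's exact "teacherless" formulation: assume a hypothetical model with `horizon` separate output heads, where head k at position t predicts the token at t + k, so supervision on a future token cannot lean on the ground-truth tokens in between:

```python
import torch
import torch.nn.functional as F

def multi_token_loss(model, tokens, horizon=4):
    # Assumed interface: model returns a list of `horizon` logits tensors,
    # one per future offset, each of shape (batch, T - horizon, vocab).
    inputs = tokens[:, :-horizon]
    head_logits = model(inputs)
    loss = 0.0
    for k, logits in enumerate(head_logits, start=1):
        # Head k is supervised on tokens shifted k steps into the future.
        targets = tokens[:, k:inputs.size(1) + k]
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss / horizon
```

Because every head must commit to its prediction without seeing the intervening ground truth, the Clever Hans shortcut is unavailable and the learning signal is forced back onto the tokens that require planning.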

Reflecting on Next-Token Prediction's Future

The failure of NTP in even such a simple scenario raises pertinent questions about its efficacy on more complex, real-world tasks such as creative writing or advanced reasoning. This insight invites exploration of alternatives to NTP, through both theoretical analysis and empirical investigation, to better understand and improve models' planning and generalization capabilities.

As we venture into this exploration, the lessons from the path-star example and the notion of teacherless training present a promising avenue, suggesting a shift toward training paradigms that encourage models to learn complex planning without the pitfalls of NTP. It is an open invitation for the community to probe the mechanics of LM training and push the boundaries of what these remarkable systems can achieve.
