The pitfalls of next-token prediction (2403.06963v2)
Abstract: Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern, correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism by which teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. Finally, we provide preliminary evidence that this failure can be resolved using a simple modification that predicts multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available at https://github.com/gregorbachmann/Next-Token-Failures
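To make the abstract's distinction concrete, here is a minimal, hedged sketch in PyTorch. It is not the paper's implementation or architecture (the paper experiments with Transformer and Mamba models): the tiny GRU language model, the `K_AHEAD` constant, and the helpers `teacher_forced_loss` and `autoregressive_generate` are all illustrative assumptions. The sketch contrasts teacher-forced training, where the ground-truth prefix is always supplied, with autoregressive inference, where the model conditions on its own outputs, and shows one simple way a model could be asked to predict multiple tokens in advance via extra output heads.

```python
# Minimal illustrative sketch -- NOT the paper's implementation.
# Contrasts teacher-forced training with autoregressive inference and adds a
# simple "predict K tokens in advance" objective via extra output heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, K_AHEAD = 64, 128, 4   # K_AHEAD: assumed number of future tokens predicted per position

class TinyLM(nn.Module):
    """Toy causal language model (a GRU stands in for Transformer/Mamba)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        # Head j predicts the token (j + 1) positions ahead; head 0 is the
        # ordinary next-token head.
        self.heads = nn.ModuleList([nn.Linear(HIDDEN, VOCAB) for _ in range(K_AHEAD)])

    def forward(self, tokens):                    # tokens: (batch, time)
        hidden, _ = self.rnn(self.embed(tokens))  # (batch, time, HIDDEN)
        return [head(hidden) for head in self.heads]

def teacher_forced_loss(model, seq):
    """Training: the ground-truth prefix is always fed in, and head j is
    scored against the token j + 1 positions ahead of the current one."""
    logits = model(seq[:, :-1])
    total = 0.0
    for j, lg in enumerate(logits):
        targets = seq[:, 1 + j:]                  # tokens j + 1 steps ahead
        lg = lg[:, : targets.size(1)]             # drop positions with no target
        total = total + F.cross_entropy(lg.reshape(-1, VOCAB), targets.reshape(-1))
    return total / len(logits)

@torch.no_grad()
def autoregressive_generate(model, prefix, steps):
    """Inference: each new token comes from head 0 and is fed back in, so any
    early mistake becomes part of the conditioning context."""
    seq = prefix.clone()
    for _ in range(steps):
        next_logits = model(seq)[0][:, -1]        # head 0, last position
        next_tok = next_logits.argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq

if __name__ == "__main__":
    model = TinyLM()
    batch = torch.randint(0, VOCAB, (8, 32))      # random token ids as stand-in data
    print("teacher-forced loss:", teacher_forced_loss(model, batch).item())
    print("generated shape:", tuple(autoregressive_generate(model, batch[:, :4], steps=10).shape))
```

Head 0 recovers the standard next-token objective; the additional heads are one crude way to realize a "predict several tokens ahead" loss of the kind the abstract advocates, under the stated assumptions.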