Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs (2402.14903v1)
Abstract: Tokenization, the division of input text into tokens, is an often-overlooked aspect of the LLM pipeline and can be a source of useful or harmful inductive biases. Historically, LLMs have relied on byte pair encoding without special consideration for particular input domains. With the increased use of LLMs for reasoning, various number-specific tokenization schemes have been adopted: popular models like LLaMA and PaLM opt for single-digit tokenization, while GPT-3.5 and GPT-4 have separate tokens for every 1-, 2-, and 3-digit number. In this work, we study the effect of this choice on numerical reasoning through arithmetic tasks. We compare left-to-right and right-to-left tokenization for GPT-3.5 and GPT-4, finding that right-to-left tokenization (enforced by comma-separating numbers at inference time) substantially improves performance. Furthermore, we find that model errors under standard left-to-right tokenization follow stereotyped patterns, suggesting that model computations are systematic rather than approximate. We show that the model can easily convert between tokenizations, allowing chain-of-thought-inspired approaches to recover performance on left-to-right tokenized inputs. We also find that the gap between tokenization directions narrows as models are scaled, possibly indicating that larger models are better able to override this tokenization-dependent inductive bias. In summary, our work performs the first study of how number tokenization choices lead to differences in model performance on arithmetic tasks, accompanied by a thorough analysis of error patterns. We hope this work inspires practitioners to more carefully ablate number-tokenization choices when working towards general models of numerical reasoning.
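To make the left-to-right versus right-to-left distinction concrete, here is a minimal sketch using the open-source tiktoken library and the cl100k_base encoding (the GPT-3.5/GPT-4 tokenizer). It is not part of the paper's released code; the exact token boundaries it prints depend on the tokenizer version, but the expected behavior is that a bare number is chunked into up-to-3-digit tokens from the left, while comma separators enforce the familiar right-to-left grouping.

```python
# Sketch: default (left-to-right) number tokenization vs. comma-enforced
# right-to-left grouping, using tiktoken's cl100k_base encoding.
# Illustrative only; the paper's experiments query the models directly.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def show_tokens(text: str) -> None:
    """Print how `text` is split into tokens by the encoding."""
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r:>14} -> {pieces}")

# Default tokenization chunks digits left-to-right into 1-3 digit tokens,
# e.g. "1234567" is expected to split as ["123", "456", "7"].
show_tokens("1234567")

# Comma separators force right-to-left grouping of the digits,
# e.g. "1,234,567" is expected to split as ["1", ",", "234", ",", "567"].
show_tokens("1,234,567")
```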
- Aaditya K. Singh
- DJ Strouse