Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs (2402.14903v1)

Published 22 Feb 2024 in cs.CL and cs.LG

Abstract: Tokenization, the division of input text into input tokens, is an often overlooked aspect of the LLM pipeline and could be the source of useful or harmful inductive biases. Historically, LLMs have relied on byte pair encoding, without care to specific input domains. With the increased use of LLMs for reasoning, various number-specific tokenization schemes have been adopted, with popular models like LLaMa and PaLM opting for single-digit tokenization while GPT-3.5 and GPT-4 have separate tokens for each 1-, 2-, and 3-digit number. In this work, we study the effect this choice has on numerical reasoning through the use of arithmetic tasks. We consider left-to-right and right-to-left tokenization for GPT-3.5 and -4, finding that right-to-left tokenization (enforced by comma separating numbers at inference time) leads to largely improved performance. Furthermore, we find that model errors when using standard left-to-right tokenization follow stereotyped error patterns, suggesting that model computations are systematic rather than approximate. We show that the model is able to convert between tokenizations easily, thus allowing chain-of-thought-inspired approaches to recover performance on left-to-right tokenized inputs. We also find the gap between tokenization directions decreases when models are scaled, possibly indicating that larger models are better able to override this tokenization-dependent inductive bias. In summary, our work performs the first study of how number tokenization choices lead to differences in model performance on arithmetic tasks, accompanied by a thorough analysis of error patterns. We hope this work inspires practitioners to more carefully ablate number tokenization-related choices when working towards general models of numerical reasoning.

Authors (2)
  1. Aaditya K. Singh (14 papers)
  2. DJ Strouse (15 papers)
Citations (30)

Summary

Exploring the Impact of Tokenization on Arithmetic Performance in LLMs

Introduction to Tokenization in LLMs

Tokenization, the segmentation of input text into discrete units called tokens, is a crucial yet often underexamined stage of the LLM pipeline. It determines how a model represents its input and can introduce inductive biases that significantly affect performance on tasks such as numerical reasoning. In this paper, focused on state-of-the-art GPT models (GPT-3.5 and GPT-4), the researchers study how the direction in which numbers are tokenized (left-to-right vs. right-to-left) influences model accuracy on arithmetic tasks.
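To make the bias concrete, the snippet below is a minimal sketch, assuming the openly available tiktoken library and its cl100k_base encoding (the vocabulary used by GPT-3.5/GPT-4-era models) as a stand-in for the tokenizers discussed in the paper. It simply prints the string pieces a digit sequence is split into; the exact splits are determined by the tokenizer's learned merges, not by place value.

```python
# Minimal sketch: inspect how a GPT-3.5/4-style BPE tokenizer segments numbers.
# Assumes `pip install tiktoken`; cl100k_base is the GPT-3.5/GPT-4-era encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def show_tokens(text: str) -> list[str]:
    """Return the string pieces the tokenizer splits `text` into."""
    return [enc.decode([tok]) for tok in enc.encode(text)]

for s in ["7", "42", "813", "7925", "7925318"]:
    print(f"{s!r:12} -> {show_tokens(s)}")

# Typical behaviour: long digit strings are chunked into 1-3 digit tokens
# starting from the left, e.g. '7925318' -> ['792', '531', '8'].
```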

Methodological Overview

The paper evaluates GPT-3.5 and GPT-4 on few-shot arithmetic addition tasks while manipulating the tokenization direction of the input numbers. As a control, commas are inserted into the operands to enforce right-to-left (R2L) tokenization, which is contrasted with the models' standard left-to-right (L2R) tokenization. The research further examines potential confounds, such as the effect of adding "thinking tokens" and the use of different delimiters to enforce R2L tokenization. The findings are supported by a range of controlled experiments and a thorough analysis of error patterns.
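The comma-based control can be reproduced in a few lines: formatting each operand with thousands separators aligns the digit groups from the right, which changes how the tokenizer chunks them. The sketch below again assumes tiktoken's cl100k_base encoding and Python's built-in comma formatting as stand-ins; the paper's exact prompt templates may differ.

```python
# Sketch: compare default left-to-right (L2R) chunking with the comma trick
# that enforces right-to-left (R2L) digit grouping at inference time.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pieces(text: str) -> list[str]:
    return [enc.decode([tok]) for tok in enc.encode(text)]

n = 7925318
plain = str(n)      # "7925318"   -> digit groups formed from the left
comma = f"{n:,}"    # "7,925,318" -> 3-digit groups aligned from the right

print("L2R:", pieces(plain))
print("R2L:", pieces(comma))

# A few-shot addition prompt in the R2L (comma-separated) format would then
# present both operands and the answer with thousands separators, e.g.:
prompt = f"Question: What is {1234567:,} + {89012:,}?\nAnswer: {1234567 + 89012:,}"
print(prompt)
```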

Key Findings

  • Enhanced Performance with R2L Tokenization: Introducing commas to force R2L tokenization yields a substantial boost on arithmetic tasks, with accuracy improvements of up to 20%. The benefit appears across model versions, though the gap between tokenization directions narrows as models are scaled up.
  • Systematic Error Patterns in L2R Tokenization: The error analysis reveals a pronounced, systematic degradation in performance when the answer's digit length differs from that of the addends under L2R tokenization, suggesting that the tokenization direction induces systematic but flawed computation rather than approximate guessing.
  • Mitigating the Performance Disparity: The paper presents a simple mitigation strategy in which models are prompted to convert a problem from the less preferred tokenization (L2R) into the preferred one (R2L) before answering. This largely recovers performance, illustrating the models' ability to translate between tokenization formats (a prompt sketch follows this list).
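The mitigation can be illustrated with a two-step prompt: the model is first asked to rewrite the plainly written (L2R-tokenized) numbers with comma separators, and only then to add them, so the arithmetic is carried out in the preferred R2L format. The template below is a hypothetical sketch of this chain-of-thought-inspired recipe, not the paper's verbatim prompt.

```python
# Hypothetical sketch of a convert-then-add prompt; the paper's exact
# few-shot examples and wording may differ.
def convert_then_add_prompt(a: int, b: int) -> str:
    # One worked example demonstrating the conversion step, then the query.
    example = (
        "Question: What is 9072431 + 65208?\n"
        "Rewrite with commas: 9,072,431 + 65,208\n"
        "Answer: 9,137,639\n\n"
    )
    query = (
        f"Question: What is {a} + {b}?\n"
        "Rewrite with commas:"
    )
    return example + query

print(convert_then_add_prompt(7925318, 48076))
```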

Implications and Future Directions

The findings highlight that tokenization schemes can induce strong inductive biases in models, affecting tasks that require numerical reasoning. This has both practical and theoretical implications, suggesting that model practitioners should carefully consider tokenization strategies during model training and development. Additionally, the identification of systematic error patterns opens avenues for further research into the underlying algorithms and mechanisms that LLMs employ for arithmetic reasoning.

As models continue to evolve, understanding the subtle yet impactful role of tokenization becomes imperative. Future work could explore alternative tokenization methods, such as tokenizer-free models or continuous number encoding schemes, to mitigate tokenization-induced biases. Furthermore, conducting large-scale ablation studies with varying tokenization strategies could offer deeper insights into optimizing LLMs for numerical reasoning and beyond.

Acknowledgements and Contributions

This paper serves as a testament to the intricate dynamics of tokenization in influencing LLM performance. By providing a systematic examination of tokenization-dependent effects in GPT models, the research contributes significantly to the ongoing discourse on optimizing LLMs for complex reasoning tasks. The thoughtful design and execution of the experiments, coupled with comprehensive error pattern analyses, offer valuable insights that can guide future advancements in LLM development and application.
