Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs (2402.14903v1)

Published 22 Feb 2024 in cs.CL and cs.LG

Abstract: Tokenization, the division of input text into input tokens, is an often overlooked aspect of the LLM pipeline and could be the source of useful or harmful inductive biases. Historically, LLMs have relied on byte pair encoding, without care to specific input domains. With the increased use of LLMs for reasoning, various number-specific tokenization schemes have been adopted, with popular models like LLaMa and PaLM opting for single-digit tokenization while GPT-3.5 and GPT-4 have separate tokens for each 1-, 2-, and 3-digit number. In this work, we study the effect this choice has on numerical reasoning through the use of arithmetic tasks. We consider left-to-right and right-to-left tokenization for GPT-3.5 and -4, finding that right-to-left tokenization (enforced by comma separating numbers at inference time) leads to largely improved performance. Furthermore, we find that model errors when using standard left-to-right tokenization follow stereotyped error patterns, suggesting that model computations are systematic rather than approximate. We show that the model is able to convert between tokenizations easily, thus allowing chain-of-thought-inspired approaches to recover performance on left-to-right tokenized inputs. We also find the gap between tokenization directions decreases when models are scaled, possibly indicating that larger models are better able to override this tokenization-dependent inductive bias. In summary, our work performs the first study of how number tokenization choices lead to differences in model performance on arithmetic tasks, accompanied by a thorough analysis of error patterns. We hope this work inspires practitioners to more carefully ablate number tokenization-related choices when working towards general models of numerical reasoning.

Authors (2)
  1. Aaditya K. Singh (14 papers)
  2. DJ Strouse (15 papers)
Citations (30)

Summary

Exploring the Impact of Tokenization on Arithmetic Performance in LLMs

Introduction to Tokenization in LLMs

Tokenization, the segmentation of input text into discrete units called tokens, is a crucial yet often underexamined stage of the LLM pipeline. It determines how a model represents its input and can introduce inductive biases that significantly affect performance on tasks such as numerical reasoning. In this paper, focused on state-of-the-art GPT models (GPT-3.5 and GPT-4), the researchers study how the direction in which numbers are tokenized (left-to-right vs. right-to-left) influences model accuracy on arithmetic tasks.
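To make the bias concrete, the snippet below is a minimal sketch, assuming the openly available tiktoken library and its cl100k_base encoding (the vocabulary used by GPT-3.5/GPT-4-era models) as a stand-in for the tokenizers discussed in the paper. It simply prints the string pieces a digit sequence is split into; the exact splits are determined by the tokenizer's learned merges, not by place value.

```python
# Minimal sketch: inspect how a GPT-3.5/4-style BPE tokenizer segments numbers.
# Assumes `pip install tiktoken`; cl100k_base is the GPT-3.5/GPT-4-era encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def show_tokens(text: str) -> list[str]:
    """Return the string pieces the tokenizer splits `text` into."""
    return [enc.decode([tok]) for tok in enc.encode(text)]

for s in ["7", "42", "813", "7925", "7925318"]:
    print(f"{s!r:12} -> {show_tokens(s)}")

# Typical behaviour: long digit strings are chunked into 1-3 digit tokens
# starting from the left, e.g. '7925318' -> ['792', '531', '8'].
```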

Methodological Overview

The paper evaluates GPT-3.5 and GPT-4 on few-shot arithmetic addition tasks while manipulating the tokenization direction of the input numbers. As a control, commas are inserted into the operands to enforce right-to-left (R2L) tokenization, which is contrasted with the models' standard left-to-right (L2R) tokenization. The research further examines potential confounds, such as the effect of adding "thinking tokens" and the use of different delimiters to enforce R2L tokenization. The findings are supported by a range of controlled experiments and a thorough analysis of error patterns.
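The comma-based control can be reproduced in a few lines: formatting each operand with thousands separators aligns the digit groups from the right, which changes how the tokenizer chunks them. The sketch below again assumes tiktoken's cl100k_base encoding and Python's built-in comma formatting as stand-ins; the paper's exact prompt templates may differ.

```python
# Sketch: compare default left-to-right (L2R) chunking with the comma trick
# that enforces right-to-left (R2L) digit grouping at inference time.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pieces(text: str) -> list[str]:
    return [enc.decode([tok]) for tok in enc.encode(text)]

n = 7925318
plain = str(n)      # "7925318"   -> digit groups formed from the left
comma = f"{n:,}"    # "7,925,318" -> 3-digit groups aligned from the right

print("L2R:", pieces(plain))
print("R2L:", pieces(comma))

# A few-shot addition prompt in the R2L (comma-separated) format would then
# present both operands and the answer with thousands separators, e.g.:
prompt = f"Question: What is {1234567:,} + {89012:,}?\nAnswer: {1234567 + 89012:,}"
print(prompt)
```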

Key Findings

  • Enhanced Performance with R2L Tokenization: Introducing commas to force R2L tokenization yields a substantial boost on arithmetic tasks, with accuracy improvements of up to 20%. The benefit appears across model versions, though the gap between tokenization directions narrows as models are scaled up.
  • Systematic Error Patterns in L2R Tokenization: The error analysis reveals a pronounced, systematic degradation in performance when the answer's digit length differs from that of the addends under L2R tokenization, suggesting that the tokenization direction induces systematic but flawed computation rather than approximate guessing.
  • Mitigating the Performance Disparity: The paper presents a simple mitigation strategy in which models are prompted to convert a problem from the less preferred tokenization (L2R) into the preferred one (R2L) before answering. This largely recovers performance, illustrating the models' ability to translate between tokenization formats (a prompt sketch follows this list).
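The mitigation can be illustrated with a two-step prompt: the model is first asked to rewrite the plainly written (L2R-tokenized) numbers with comma separators, and only then to add them, so the arithmetic is carried out in the preferred R2L format. The template below is a hypothetical sketch of this chain-of-thought-inspired recipe, not the paper's verbatim prompt.

```python
# Hypothetical sketch of a convert-then-add prompt; the paper's exact
# few-shot examples and wording may differ.
def convert_then_add_prompt(a: int, b: int) -> str:
    # One worked example demonstrating the conversion step, then the query.
    example = (
        "Question: What is 9072431 + 65208?\n"
        "Rewrite with commas: 9,072,431 + 65,208\n"
        "Answer: 9,137,639\n\n"
    )
    query = (
        f"Question: What is {a} + {b}?\n"
        "Rewrite with commas:"
    )
    return example + query

print(convert_then_add_prompt(7925318, 48076))
```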

Implications and Future Directions

The findings highlight that tokenization schemes can induce strong inductive biases in models, affecting tasks that require numerical reasoning. This has both practical and theoretical implications, suggesting that model practitioners should carefully consider tokenization strategies during model training and development. Additionally, the identification of systematic error patterns opens avenues for further research into the underlying algorithms and mechanisms that LLMs employ for arithmetic reasoning.

As models continue to evolve, understanding the subtle yet impactful role of tokenization becomes imperative. Future work could explore alternative tokenization methods, such as tokenizer-free models or continuous number encoding schemes, to mitigate tokenization-induced biases. Furthermore, conducting large-scale ablation studies with varying tokenization strategies could offer deeper insights into optimizing LLMs for numerical reasoning and beyond.

Acknowledgements and Contributions

This paper serves as a testament to the intricate dynamics of tokenization in influencing LLM performance. By providing a systematic examination of tokenization-dependent effects in GPT models, the research contributes significantly to the ongoing discourse on optimizing LLMs for complex reasoning tasks. The thoughtful design and execution of the experiments, coupled with comprehensive error pattern analyses, offer valuable insights that can guide future advancements in LLM development and application.
