
Greed is All You Need: An Evaluation of Tokenizer Inference Methods (2403.01289v2)

Published 2 Mar 2024 in cs.CL

Abstract: While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes, performed on a novel intrinsic evaluation suite we curated for English, combining measures rooted in morphology, cognition, and information theory. We show that for the most commonly used tokenizers, greedy inference performs surprisingly well; and that SaGe, a recently-introduced contextually-informed tokenizer, outperforms all others on morphological alignment.

Evaluating Tokenizer Inference Methods: A Controlled Analysis

Introduction

NLP systems routinely convert raw text into sequences of subword tokens using algorithms like Byte-Pair Encoding (BPE), WordPiece, or UnigramLM. Although much attention has been devoted to optimizing these tokenization algorithms, the inference method, the procedure that actually segments text into tokens from a given vocabulary, has remained under-explored. The paper presents a controlled analysis of seven tokenizer inference methods across four algorithms (BPE, UnigramLM, WordPiece, and SaGe) and three vocabulary sizes. This analysis uncovered surprising findings about the efficacy of these methods and outlined their implications for future developments in the field.

Investigation into Inference Methods

Subword tokenization plays a pivotal role in how text data is represented for NLP models. The paper examined not just the well-known tokenizer vocabularies but also the associated inference methods, which dictate how text is broken down into the tokens provided by those vocabularies. The inquiry centered on:

  • Greedy inference methods, which select one token at each step according to a fixed criterion (e.g., longest prefix, longest suffix, or longest token overall).
  • Merge rules-based inference methods, which start from individual characters and iteratively merge them according to a learned list of merge rules, as in BPE.
  • Likelihood-based inference methods, which use token likelihoods to find the most probable segmentation of a word.
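The first family is easy to illustrate. A minimal sketch of greedy longest-prefix inference (the vocabulary and function name here are invented for illustration, not taken from the paper's code):

```python
def greedy_longest_prefix(word, vocab):
    """Greedily take the longest vocabulary token that prefixes the
    remaining text; fall back to a single character if nothing matches."""
    tokens = []
    i = 0
    while i < len(word):
        # Scan candidate prefixes from longest to shortest.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as-is (a production tokenizer
            # would map it to an UNK token or use byte fallback).
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"un", "believ", "able", "a", "ble"}
print(greedy_longest_prefix("unbelievable", vocab))  # ['un', 'believ', 'able']
```

Longest-suffix and longest-token variants differ only in the scan order: they consume the word from the right, or repeatedly pick the longest matching token anywhere in the remaining string.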

Performance was measured with a suite of intrinsic evaluations spanning alignment with gold-standard morphological segmentation, cognitive plausibility, and information-theoretic measures.

Benchmarking Results and Insights

The evaluation showed that greedy inference methods, despite their simplicity, performed remarkably well across a variety of metrics. This was particularly evident in their alignment with morphological segmentation, suggesting an unexpected strength in handling complex word forms. Among the evaluated tokenizers, SaGe, a recently introduced contextually informed tokenizer, achieved the best morphological alignment, suggesting that its context-aware vocabulary construction helps capture the subtleties of word structure.

In terms of vocabulary-size influence, the paper showed how the relative performance of inference methods shifts as the vocabulary grows, providing insight into their robustness across the three vocabulary sizes tested.

Implications and Future Directions

The implications of these findings are manifold:

  • Decoupling Tokenization and Inference: The paper underscores the potential benefits of decoupling vocabulary creation from the inference method, advocating for the flexibility to choose the most suitable inference method depending on the task.
  • Greedy Methods’ Surprising Efficacy: The success of greedy inference methods calls for a reassessment of their role in tokenizer design, potentially encouraging their adoption in scenarios where complex tokenization algorithms were previously thought necessary.
  • Advancements in Tokenizer Design: The standout performance of SaGe offers promising directions for future tokenizer designs, particularly for applications requiring nuanced understanding of language morphology.

In conclusion, by providing a controlled analysis of tokenizer inference methods, this paper enables more informed choices in tokenizer selection and design. It highlights the often-overlooked importance of inference methods and opens the door for future investigations into more efficient and effective NLP systems. As this research shows, the continued refinement of tokenization strategies matters for the advancement of LLMs and their applications, at both the theoretical and the practical level.

Authors (4)
  1. Omri Uzan
  2. Craig W. Schmidt
  3. Chris Tanner
  4. Yuval Pinter