
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance (2403.06265v2)

Published 10 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Despite it being the cornerstone of BPE, the most common tokenization algorithm, the importance of compression in the tokenization process is still unclear. In this paper, we argue for the theoretical importance of compression, that can be viewed as 0-gram language modeling where equal probability is assigned to all tokens. We also demonstrate the empirical importance of compression for downstream success of pre-trained LLMs. We control the compression ability of several BPE tokenizers by varying the amount of documents available during their training: from 1 million documents to a character-based tokenizer equivalent to no training data at all. We then pre-train English LLMs based on those tokenizers and fine-tune them over several tasks. We show that there is a correlation between tokenizers' compression and models' downstream performance, suggesting that compression is a reliable intrinsic indicator of tokenization quality. These correlations are more pronounced for generation tasks (over classification) or for smaller models (over large ones). We replicated a representative part of our experiments on Turkish and found similar results, confirming that our results hold for languages with typological characteristics dissimilar to English. We conclude that building better compressing tokenizers is a fruitful avenue for further research and for improving overall model performance.

Unpacking Tokenization: A Close Look at Text Compression and Model Performance

Introduction

The paper "Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance" explores the significance of text compression in the tokenization process and its correlation with the downstream success of pre-trained LLMs. The authors argue that text compression can be viewed as a form of $0$-gram LLMing where all tokens are assigned equal probability. By manipulating the compression ability of Byte Pair Encoding (BPE) tokenizers through varying the amount of training data—ranging from a character-level tokenizer (equivalent to zero training data) to tokenizers trained on 1 million documents—the authors endeavor to elucidate the intrinsic quality of tokenizers and their extrinsic impact on model performance across several tasks and languages.

Methodology

The authors compared tokenizers by controlling the "support," i.e., the amount of training data available to them. This approach allowed for an exploration of how tokenizer compression abilities impact LLM performance across different tasks. The English language was the primary focus, with models pre-trained on the C4 corpus and fine-tuned on a combination of classification and generation tasks. For intrinsic evaluation, tokenizers' ability to compress text was measured, while extrinsic evaluation focused on performance across selected NLP tasks. Additionally, Turkish was selected for a subset of experiments to confirm whether findings hold across languages with different typological characteristics.
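
As a rough illustration of this setup (a minimal sketch using the Hugging Face tokenizers library, not the authors' actual training code; the vocabulary size, document counts, and corpus variables below are placeholders), one can train BPE tokenizers on increasing amounts of support and compare how well they compress held-out text:

    # Sketch: train BPE tokenizers with varying "support" and measure compression.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    def train_bpe(documents, vocab_size=32_000):
        """Train a BPE tokenizer on an iterable of raw documents."""
        tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
        tokenizer.pre_tokenizer = Whitespace()
        trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
        tokenizer.train_from_iterator(documents, trainer=trainer)
        return tokenizer

    def chars_per_token(tokenizer, held_out_texts):
        """Average characters per token on held-out text: higher = better compression."""
        chars = sum(len(t) for t in held_out_texts)
        tokens = sum(len(tokenizer.encode(t).tokens) for t in held_out_texts)
        return chars / tokens

    # corpus and held_out stand in for, e.g., C4 documents (placeholders):
    # for support in [1_000, 10_000, 100_000, 1_000_000]:
    #     tok = train_bpe(corpus[:support])
    #     print(support, chars_per_token(tok, held_out))

Compression here is measured as characters per token on held-out text, so higher values mean shorter token sequences; the paper's central question is how this intrinsic number relates to downstream model performance.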

Findings

Compression Ability: The paper found a direct correlation between a tokenizer's compression ability and the amount of supporting data it was trained on. Tokenizers trained with minimal data segmented text into significantly longer token sequences than those trained with adequate data; in general, the more supporting data, the better the compression.

Extrinsic Performance: The experiments demonstrated a monotonic relationship between the amount of supporting data a tokenizer had and the downstream performance of models built on it. This correlation was stronger for generation tasks and more pronounced in smaller models.
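
The strength of such a relationship can be quantified with a simple correlation between per-tokenizer compression and downstream scores. The sketch below is our illustration of that computation, using clearly hypothetical placeholder values rather than the paper's reported numbers:

    # Hypothetical placeholder values, NOT the paper's data: compression
    # (characters per token) for several tokenizers and the downstream score
    # of a model pre-trained with each one.
    from scipy.stats import pearsonr, spearmanr

    compression = [1.2, 2.1, 2.8, 3.3, 3.5]        # hypothetical
    task_scores = [0.48, 0.55, 0.61, 0.63, 0.64]   # hypothetical

    r, _ = pearsonr(compression, task_scores)
    rho, _ = spearmanr(compression, task_scores)
    print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")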

Language Generalization: The patterns observed in English held true when tested on Turkish, suggesting that the importance of text compression in tokenization is not language-specific.

Analysis

The paper breaks new ground by quantitatively demonstrating the effect of tokenization, and specifically of its compression capability, on the performance of LLMs. The results suggest that tokenization quality matters most for generative tasks and for smaller models. This stands to reason, as generative tasks require extensive use of the tokenizer, and smaller models have less capacity to compensate for poor tokenization.

Interestingly, the intrinsic and extrinsic evaluations of tokenization quality presented in this paper reveal a clear path for future research and development: building better compressing tokenizers could lead to improved overall model performance. It was also noted that the amount of tokenizer support directly affects compression efficiency, pointing to the potential benefit of increasing the dataset size used for tokenizer training.

Conclusion

This paper contributes a novel perspective on the crucial role of tokenization in the development of LLMs by showcasing the intrinsic value of compression as an indicator of tokenizer quality and its correlation with downstream task performance. The findings across English and Turkish emphasize the importance of compression in tokenization and suggest beneficial directions for future tokenizer development. As larger and more complex models continue to evolve, understanding the foundational elements, such as tokenization, becomes imperative for improving efficiency and effectiveness in natural language processing tasks.

Future Work

While this paper provides significant insights, it also opens avenues for future research, including expanding the experiments to other languages and exploring other intrinsic measures of tokenization quality. Additionally, investigating the impact of tokenization on larger models could further refine our understanding of its role in the performance of LLMs.

Authors (6)
  1. Omer Goldman
  2. Avi Caciularu
  3. Matan Eyal
  4. Kris Cao
  5. Idan Szpektor
  6. Reut Tsarfaty