Tokenizer Choice For LLM Training: Negligible or Crucial? (2310.08754v4)

Published 12 Oct 2023 in cs.LG

Abstract: The recent success of LLMs has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model's downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary roughly three times larger than English tokenizers. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
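
The two tokenizer metrics named in the abstract, fertility and parity, can be computed directly from tokenized text. The following minimal sketch (not taken from the paper) illustrates both using a Hugging Face tokenizer; the "gpt2" checkpoint, the whitespace-based word count, and the toy parallel sentences are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch: tokenizer fertility and parity (illustrative, not the paper's code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any pretrained tokenizer works here

def fertility(texts):
    """Average number of subword tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def parity(texts_lang_a, texts_lang_b):
    """Ratio of token counts on parallel sentences; values near 1.0 indicate the
    tokenizer treats both languages roughly equally efficiently."""
    tokens_a = sum(len(tokenizer.tokenize(t)) for t in texts_lang_a)
    tokens_b = sum(len(tokenizer.tokenize(t)) for t in texts_lang_b)
    return tokens_a / tokens_b

# Toy parallel English/German sentences, purely for illustration.
en = ["The cat sits on the mat."]
de = ["Die Katze sitzt auf der Matte."]
print(f"fertility(en) = {fertility(en):.2f}")
print(f"parity(en, de) = {parity(en, de):.2f}")
```

Lower fertility means fewer tokens per word, and hence lower training and inference cost for that language, which is why the paper's finding that these metrics do not reliably predict downstream performance is notable.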

Authors (21)
  1. Mehdi Ali (11 papers)
  2. Michael Fromm (24 papers)
  3. Klaudia Thellmann (4 papers)
  4. Richard Rutmann (4 papers)
  5. Max Lübbering (4 papers)
  6. Johannes Leveling (4 papers)
  7. Katrin Klug (2 papers)
  8. Jan Ebert (11 papers)
  9. Niclas Doll (1 paper)
  10. Jasper Schulze Buschhoff (3 papers)
  11. Charvi Jain (2 papers)
  12. Alexander Arno Weber (4 papers)
  13. Lena Jurkschat (2 papers)
  14. Hammam Abdelwahab (3 papers)
  15. Chelsea John (2 papers)
  16. Pedro Ortiz Suarez (15 papers)
  17. Malte Ostendorff (23 papers)
  18. Samuel Weinbach (11 papers)
  19. Rafet Sifa (32 papers)
  20. Stefan Kesselheim (16 papers)
Citations (31)