Analyzing Cognitive Plausibility of Subword Tokenization (2310.13348v1)

Published 20 Oct 2023 in cs.CL

Abstract: Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on the performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation of the tokenizer output with the response time and accuracy of human performance on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and a worse coverage of derivational morphemes, in contrast with prior work.
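The evaluation paradigm can be sketched in a few lines: take a per-word statistic from the tokenizer output and correlate it with human lexical decision performance. The snippet below is a minimal illustration with hypothetical data (the study itself uses real lexical decision datasets and trained tokenizers); the choice of subword count as the statistic and the sample values are assumptions for the example.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-word data: number of subwords the tokenizer splits
# each word into, and the mean human response time (ms) for the same
# word in a lexical decision task.
subword_counts = [1, 1, 2, 3, 3, 4]
response_times = [520, 540, 580, 640, 660, 700]

r = pearson(subword_counts, response_times)
print(f"correlation between subword count and response time: {r:.3f}")
```

A strong positive correlation here would mean words the tokenizer fragments heavily are also words humans recognize slowly, which is the intuition behind treating the correlation as a cognitive plausibility signal and comparing it across tokenizers (e.g. BPE, WordPiece, UnigramLM), languages, and vocabulary sizes.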

Authors (2)
  1. Lisa Beinborn
  2. Yuval Pinter
Citations (10)