Tokenization Preference for Human and Machine Learning Model: An Annotation Study (2304.10813v3)

Published 21 Apr 2023 in cs.CL

Abstract: Is the tokenization preferred by humans also preferred by machine-learning (ML) models? This study examines the relationship between tokenization preferred by humans (measured by appropriateness and readability) and tokenization preferred by ML models (measured by performance on an NLP task). The question texts of a Japanese commonsense question-answering dataset were tokenized with six different tokenizers, and the performance of human annotators and ML models on the resulting questions was compared. We further analyze the relations among answer accuracy for humans and ML models, the appropriateness of a tokenization for humans, and human response time per question. The quantitative results show that the tokenizations preferred by humans and by ML models are not necessarily the same. They also suggest that existing tokenization methods based on language models could be a good compromise for both humans and ML models.
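
As a rough illustration of the setup described above (not the authors' code), the sketch below tokenizes one Japanese sentence with two of the tokenizer families the study compares: dictionary-based morphological analysis with MeCab (via the fugashi wrapper) and subword segmentation with SentencePiece. The example sentence and the SentencePiece model path are placeholders, not items from the paper's dataset.

# Tokenize the same Japanese text two ways to show how segmentations differ.
# Requires: pip install fugashi[unidic-lite] sentencepiece
from fugashi import Tagger
import sentencepiece as spm

text = "犬はなぜ人に吠えるのですか"  # placeholder question-style sentence

# Dictionary-based morphological analysis (MeCab).
tagger = Tagger()
mecab_tokens = [word.surface for word in tagger(text)]
print("MeCab:", mecab_tokens)

# Data-driven subword segmentation (SentencePiece); assumes a pretrained
# unigram or BPE model file, whose path here is hypothetical.
sp = spm.SentencePieceProcessor(model_file="ja_subword.model")
sp_tokens = sp.encode(text, out_type=str)
print("SentencePiece:", sp_tokens)

MeCab typically yields linguistically motivated word units, while SentencePiece yields statistically derived subwords; the paper's question is which of these humans find more readable versus which helps a model's task performance.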
