Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement (2403.13754v1)
Abstract: The relationship between LLM tokenization and performance is an open area of research. Here, we investigate how different tokenization schemes affect number agreement in Spanish plurals. We find that morphologically aligned tokenization performs on par with other tokenization schemes, even when induced artificially for words that would not be tokenized that way during training. We then present exploratory analyses showing that LLM embeddings of differently tokenized plurals have similar distributions along the embedding-space axis that maximally distinguishes singular from plural nouns. Our results suggest that morphologically aligned tokenization is a viable approach and that existing models already generalize some morphological patterns to new items, but also that morphological tokenization is not strictly required for good performance.
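The exploratory analysis described above can be illustrated with a minimal sketch: find a single direction in embedding space that separates singular from plural nouns, then compare where differently tokenized plurals fall along that direction. The snippet below uses synthetic data and a difference-of-class-means direction as a simple stand-in; the paper's actual models, embeddings, and axis-finding method are not specified here, so every detail of this sketch is an assumption.

```python
# Hypothetical sketch (synthetic data, not the paper's pipeline):
# estimate a "number axis" from singular vs. plural noun embeddings,
# then compare projections of two plural tokenization variants onto it.
import numpy as np

rng = np.random.default_rng(0)
dim = 768  # assumed hidden size, typical of BERT-style models

# Stand-ins for contextual embeddings of singular and plural nouns;
# the +0.5 shift fakes a systematic singular/plural difference.
singular = rng.normal(0.0, 1.0, size=(200, dim))
plural = rng.normal(0.0, 1.0, size=(200, dim)) + 0.5

def number_axis(sing: np.ndarray, plur: np.ndarray) -> np.ndarray:
    """Difference-of-class-means direction, a simple proxy for the
    axis that maximally distinguishes singular from plural nouns."""
    w = plur.mean(axis=0) - sing.mean(axis=0)
    return w / np.linalg.norm(w)

w = number_axis(singular, plural)

# Hypothetical embeddings for two tokenizations of the same plurals,
# e.g. a morphologically aligned split ("flor" + "##es") vs. default BPE.
morph_plural = rng.normal(0.0, 1.0, size=(100, dim)) + 0.5
default_plural = rng.normal(0.0, 1.0, size=(100, dim)) + 0.5

for name, emb in [("morph-aligned", morph_plural), ("default BPE", default_plural)]:
    proj = emb @ w  # scalar position of each embedding along the number axis
    print(f"{name}: mean projection {proj.mean():.2f}, std {proj.std():.2f}")
```

Under this kind of analysis, the paper's finding would correspond to the two tokenization variants producing similar projection distributions along the singular/plural axis.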