XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models (2301.10472v2)

Published 25 Jan 2023 in cs.CL and cs.LG

Abstract: Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This "vocabulary bottleneck" limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested, ranging from natural language inference (XNLI) and question answering (MLQA, XQuAD, TyDiQA) to named entity recognition (WikiAnn). XLM-V is particularly effective on low-resource language tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.
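
The tokenizer behavior described in the abstract can be inspected directly. Below is a minimal sketch, not from the paper, that compares tokenized sequence lengths under XLM-R and XLM-V. It assumes the transformers library and the publicly released Hugging Face checkpoints xlm-roberta-base and facebook/xlm-v-base; the sample sentences are illustrative and are not exact translations of one another.

```python
from transformers import AutoTokenizer

# Assumed public checkpoints: both tokenizers expose the same SentencePiece-based
# interface, but XLM-V's vocabulary is roughly one million tokens vs. ~250k for XLM-R.
xlmr = AutoTokenizer.from_pretrained("xlm-roberta-base")
xlmv = AutoTokenizer.from_pretrained("facebook/xlm-v-base")

# Illustrative sentences, including lower-resource languages (not parallel text).
samples = {
    "English": "The committee approved the new budget yesterday.",
    "Swahili": "Kamati iliidhinisha bajeti mpya jana.",
    "Quechua": "Kunan p'unchaw llaqtaypi llamk'ani.",
}

for lang, text in samples.items():
    r_tokens = xlmr.tokenize(text)  # subword pieces under the shared 250k vocabulary
    v_tokens = xlmv.tokenize(text)  # subword pieces under the 1M-token vocabulary
    print(f"{lang:8s} XLM-R: {len(r_tokens):3d} tokens | XLM-V: {len(v_tokens):3d} tokens")
    print(f"{'':8s} XLM-V pieces: {v_tokens}")
```

In line with the abstract's claim, the per-language vocabulary capacity of the one-million-token vocabulary should generally yield fewer, more word-like pieces, with the effect most visible on low-resource languages such as those covered by MasakhaNER and Americas NLI.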

References (35)
  1. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131.
  2. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856.
  3. Alexei Baevski and Michael Auli. 2018. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853.
  4. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297.
  5. Improving multilingual models with language-clustered vocabularies. EMNLP.
  6. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.
  7. Canine: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the Association for Computational Linguistics, 10:73–91.
  8. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
  9. XNLI: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053.
  10. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  11. AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. arXiv preprint arXiv:2104.08726.
  12. Larger-scale transformers for multilingual masked language modeling. arXiv preprint arXiv:2105.00572.
  13. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
  14. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pages 4411–4421. PMLR.
  15. Efficient softmax approximation for GPUs. In International Conference on Machine Learning, pages 1302–1310. PMLR.
  16. Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  17. Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
  18. Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
  19. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.
  20. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668.
  21. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
  22. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  23. Fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.
  24. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958.
  25. Massively multilingual transfer for NER. arXiv preprint arXiv:1902.00193.
  26. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  27. How good is your tokenizer? on the monolingual performance of multilingual language models. arXiv preprint arXiv:2012.15613.
  28. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  29. Andrew Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.
  30. Improving pre-trained multilingual models with vocabulary expansion. arXiv preprint arXiv:1909.12440.
  31. Wikipedia. 2023. Table of General Standard Chinese Characters — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Table%20of%20General%20Standard%20Chinese%20Characters&oldid=1123968033. [Online; accessed 05-January-2023].
  32. The Multi-Genre NLI corpus.
  33. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306.
  34. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
  35. Allocating large vocabulary capacity for cross-lingual language model pre-training. EMNLP.
Authors (8)
  1. Davis Liang (15 papers)
  2. Hila Gonen (30 papers)
  3. Yuning Mao (34 papers)
  4. Rui Hou (56 papers)
  5. Naman Goyal (37 papers)
  6. Marjan Ghazvininejad (33 papers)
  7. Luke Zettlemoyer (225 papers)
  8. Madian Khabsa (38 papers)
Citations (61)