What Drives Performance in Multilingual Language Models? (2404.19159v1)

Published 29 Apr 2024 in cs.CL

Abstract: This study investigates the factors influencing the performance of multilingual language models (MLLMs) across diverse languages. We study 6 MLLMs, including masked language models, autoregressive models, and instruction-tuned LLMs, on the SIB-200 dataset, a topic classification dataset encompassing 204 languages. Our analysis considers three scenarios: ALL languages, SEEN languages (present in the model's pretraining data), and UNSEEN languages (not present or documented in the model's pretraining data in any meaningful way). We examine the impact of factors such as pretraining data size, general resource availability, language family, and script type on model performance. Decision tree analysis reveals that pretraining data size is the most influential factor for SEEN languages. However, interestingly, script type and language family are crucial for UNSEEN languages, highlighting the importance of cross-lingual transfer learning. Notably, model size and architecture do not significantly alter the most important features identified. Our findings provide valuable insights into the strengths and limitations of current MLLMs, and we hope they will help guide the development of more effective and equitable multilingual NLP systems.
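The decision-tree analysis described above ranks factors (pretraining data size, script type, language family) by how well they split languages into performance groups. A minimal sketch of that idea is the information gain computed at a tree's root; the toy language rows and the "high"/"low" performance buckets below are purely illustrative, not data from the paper:

```python
import math
from collections import Counter

# Hypothetical toy rows mirroring the paper's factors: each row is
# (pretraining-data size, script type, language family) -> performance bucket.
# All values are made up for illustration.
ROWS = [
    ("large", "Latin",      "Indo-European", "high"),
    ("large", "Cyrillic",   "Indo-European", "high"),
    ("small", "Latin",      "Niger-Congo",   "low"),
    ("small", "Devanagari", "Indo-European", "low"),
]
FEATURES = ["pretrain_size", "script", "family"]

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature_idx):
    """Entropy reduction from splitting the rows on one feature column."""
    labels = [r[-1] for r in rows]
    base = entropy(labels)
    # Group labels by the feature's value, then take the size-weighted entropy.
    by_value = {}
    for r in rows:
        by_value.setdefault(r[feature_idx], []).append(r[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in by_value.values())
    return base - weighted

gains = {f: information_gain(ROWS, i) for i, f in enumerate(FEATURES)}
print(gains)
```

With this toy data, pretraining-data size splits the languages perfectly (gain 1.0 bit), so it would be chosen at the root, loosely mirroring the paper's finding for SEEN languages; a full analysis would fit a complete decision tree and read off feature importances.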

References (32)
  1. SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. arXiv preprint arXiv:2309.07445.
  2. All translation tools are not equal: Investigating the quality of language translation for forced migration. In 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10.
  3. MEGA: Multilingual evaluation of generative AI. arXiv preprint arXiv:2303.12528.
  4. PaLM 2 technical report.
  5. XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 258–266, Marseille, France. European Language Resources Association.
  6. Systematic inequalities in language technology performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, Dublin, Ireland. Association for Computational Linguistics.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  8. Tyler A Chang and Benjamin K Bergen. 2023. Language model behavior: A comprehensive survey. arXiv preprint arXiv:2303.11504.
  9. PaLM: Scaling language modeling with Pathways.
  10. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
  11. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  12. Ethnologue. 2022. What are the largest language families?
  13. DialectBench: An NLP benchmark for dialects, varieties, and closely-related languages.
  14. Glot500: Scaling multilingual corpora and language models to 500 languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1082–1117, Toronto, Canada. Association for Computational Linguistics.
  15. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
  16. ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv preprint arXiv:2304.05613.
  17. Few-shot learning with multilingual language models.
  18. Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the Association for Computational Linguistics, 8:726–742.
  19. Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pages 50–60.
  20. Crosslingual generalization through multitask finetuning.
  21. OpenAI. 2023. GPT-4 technical report.
  22. Towards a common understanding of contributing factors for cross-lingual transfer in multilingual language models: A review. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5877–5891, Toronto, Canada. Association for Computational Linguistics.
  23. Surangika Ranathunga and Nisansa de Silva. 2022. Some languages are more equal than others: Probing deeper into the linguistic disparity in the NLP world. arXiv preprint arXiv:2210.08523.
  24. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  25. mGPT: Few-shot learners go multilingual. arXiv preprint arXiv:2204.07580.
  26. LLaMA: Open and efficient foundation language models.
  27. Don Vaughan. 2020. The world’s 5 most commonly used writing systems.
  28. All languages matter: On the multilingual safety of large language models. arXiv preprint arXiv:2310.00905.
  29. Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? arXiv preprint arXiv:2005.09093.
  30. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
  31. BigTrans: Augmenting large language models with multilingual translation capability over 100 languages. arXiv preprint arXiv:2305.18098.
  32. Multilingual machine translation with large language models: Empirical results and analysis. arXiv preprint arXiv:2304.04675.
Authors (2)
  1. Sina Bagheri Nezhad
  2. Ameeta Agrawal