Exploring the Maze of Multilingual Modeling (2310.05404v2)
Abstract: Multilingual LLMs have gained significant attention in recent years, enabling the development of applications that serve diverse linguistic contexts. In this paper, we present a comprehensive evaluation of three popular multilingual LLMs: mBERT, XLM-R, and GPT-3. We assess their performance across a diverse set of languages, focusing on how resource availability (both general and model-specific), language family, script type, and word order affect model performance on two distinct tasks: text classification and text generation. Our findings reveal that while the amount of language-specific pretraining data plays a crucial role in model performance, other factors, such as general resource availability, language family, and script type, are also important. We hope that our study contributes to a deeper understanding of multilingual LLMs and helps enhance their performance across languages and linguistic contexts.
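The references below include Fisher's exact test and the Mann-Whitney U test, which are the standard tools for the kind of factor-level comparison the abstract describes (e.g., does script type relate to per-language accuracy?). The snippet below is a minimal sketch of such an analysis, not the authors' code: the per-language scores, the Latin vs. non-Latin grouping, and the 0.75 threshold are invented purely for illustration.

```python
# Minimal sketch: comparing per-language scores across one factor (script type).
# All values below are placeholder data, not results from the paper.
from scipy.stats import fisher_exact, mannwhitneyu

# Hypothetical per-language accuracy scores, grouped by script type.
latin_scores = [0.81, 0.72, 0.84, 0.76, 0.80]
non_latin_scores = [0.62, 0.77, 0.58, 0.66, 0.69]

# Mann-Whitney U: do the two groups of scores differ in distribution?
u_stat, u_p = mannwhitneyu(latin_scores, non_latin_scores, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.4f}")

# Fisher's exact test on a 2x2 contingency table: count how many languages
# in each group score above vs. below an (assumed) 0.75 threshold.
threshold = 0.75
table = [
    [sum(s >= threshold for s in latin_scores),
     sum(s < threshold for s in latin_scores)],
    [sum(s >= threshold for s in non_latin_scores),
     sum(s < threshold for s in non_latin_scores)],
]
odds_ratio, f_p = fisher_exact(table)
print(f"Fisher's exact: odds ratio = {odds_ratio:.2f}, p = {f_p:.4f}")
```

The same pattern extends to the other factors studied (language family, word order, resource tier) by regrouping the scores accordingly.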
- David Ifeoluwa Adelani et al. 2023. SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. arXiv preprint arXiv:2309.07445.
- Kabir Ahuja et al. 2023. MEGA: Multilingual evaluation of generative AI. arXiv preprint arXiv:2303.12528.
- Rohan Anil et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
- Tom Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Tyler A Chang and Benjamin K Bergen. 2023. Language model behavior: A comprehensive survey. arXiv preprint arXiv:2303.11504.
- Aakanksha Chowdhery et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
- Alexis Conneau et al. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
- Marta R. Costa-jussà et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
- Chunyuan Deng et al. 2023. Investigating data contamination in modern benchmarks for large language models. arXiv preprint arXiv:2311.09783.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online (v2020.3). Zenodo.
- Ethnologue. 2022. What are the largest language families?
- Ronald A. Fisher. 1922. On the interpretation of χ² from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85(1):87–94.
- Shahriar Golchin and Mihai Surdeanu. 2023. Time travel in LLMs: Tracing data contamination in large language models. arXiv preprint arXiv:2308.08493.
- Ayyoob ImaniGooghari et al. 2023. Glot500: Scaling multilingual corpora and language models to 500 languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1082–1117, Toronto, Canada. Association for Computational Linguistics.
- Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
- Viet Dac Lai et al. 2023. ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv preprint arXiv:2304.05613.
- Yinhan Liu et al. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
- Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60.
- OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Surangika Ranathunga and Nisansa de Silva. 2022. Some languages are more equal than others: Probing deeper into the linguistic disparity in the NLP world. arXiv preprint arXiv:2210.08523.
- Teven Le Scao et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- Oleh Shliazhko et al. 2022. mGPT: Few-shot learners go multilingual. arXiv preprint arXiv:2204.07580.
- Hugo Touvron et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Don Vaughan. 2020. The world’s 5 most commonly used writing systems.
- Wenxuan Wang et al. 2023. All languages matter: On the multilingual safety of large language models. arXiv preprint arXiv:2310.00905.
- Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? arXiv preprint arXiv:2005.09093.
- Linting Xue et al. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
- Wenhao Zhu et al. 2023. Multilingual machine translation with large language models: Empirical results and analysis. arXiv preprint arXiv:2304.04675.
- Sina Bagheri Nezhad
- Ameeta Agrawal