The Less the Merrier? Investigating Language Representation in Multilingual Models (2310.13228v1)

Published 20 Oct 2023 in cs.CL

Abstract: Multilingual language models offer a way to incorporate multiple languages into one model and utilize cross-language transfer learning to improve performance for different NLP tasks. Despite progress in multilingual models, not all languages are supported equally well, particularly in low-resource settings. In this work, we investigate the linguistic representation of different languages in multilingual models. We start by asking which languages are supported in popular multilingual models and which languages are left behind. Then, for included languages, we look at models' learned representations based on language family and dialect and try to understand how models' learned representations for (1) seen and (2) unseen languages vary across different language groups. In addition, we test and analyze performance on downstream tasks such as text generation and Named Entity Recognition. We observe from our experiments that community-centered models -- models that focus on languages of a given family or geographical location and are built by communities who speak them -- perform better at distinguishing between languages in the same family for low-resource languages. Our paper contributes to the literature in understanding multilingual models and their shortcomings and offers insights on potential ways to improve them.
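The abstract describes comparing models' learned representations for seen and unseen languages within the same family. As a minimal sketch of the kind of probing this involves (not the authors' actual pipeline), the code below embeds a few sentences with an off-the-shelf multilingual encoder and compares mean-pooled embeddings across one high-resource and two related low-resource languages. The model name, example sentences, and language choices are illustrative assumptions.

```python
# Minimal sketch: probe a multilingual encoder's sentence representations
# and compare them across languages with cosine similarity.
# Assumptions: xlm-roberta-base as the encoder; short greeting sentences
# in Amharic, Tigrinya (related Ethio-Semitic languages), and English.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "xlm-roberta-base"  # any multilingual encoder could be substituted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = {
    "amh": "ሰላም እንዴት ነህ",         # Amharic
    "tir": "ሰላም ከመይ ኣለኻ",         # Tigrinya
    "eng": "Hello, how are you?",  # English
}

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer into a single sentence embedding."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, dim)

embeddings = {lang: embed(s) for lang, s in sentences.items()}
for a in embeddings:
    for b in embeddings:
        if a < b:  # each unordered pair once
            sim = torch.cosine_similarity(embeddings[a], embeddings[b]).item()
            print(f"cosine({a}, {b}) = {sim:.3f}")
```

On this kind of probe, the paper's finding would predict that a community-centered model trained on Ethio-Semitic languages separates the Amharic and Tigrinya embeddings more cleanly than a general-purpose multilingual model does.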

Authors (3)
  1. Hellina Hailu Nigatu (6 papers)
  2. Atnafu Lambebo Tonja (27 papers)
  3. Jugal Kalita (64 papers)