
MaLA-500: Massive Language Adaptation of Large Language Models (2401.13303v2)

Published 24 Jan 2024 in cs.CL

Abstract: LLMs have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel LLM designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages. We release MaLA-500 at https://huggingface.co/MaLA-LM

Introduction

The development of LLMs such as LLaMA, Mistral, and ChatGPT has significantly advanced natural language processing, particularly for English and other high-resource languages. Their effectiveness drops for low-resource languages, however, owing to data scarcity and limited model capacity. "MaLA-500: Massive Language Adaptation of LLMs" addresses this gap by extending language coverage to 534 languages through vocabulary extension and continued pretraining on a large new corpus, Glot500-c. Evaluation on SIB-200 demonstrates the improved in-context learning that MaLA-500 provides.

Methodology

The methodology rests on four components: high-quality data, a strong foundation model, vocabulary extension, and continued pretraining. LLaMA 2, pretrained on 2 trillion tokens, serves as the base model. Vocabulary extension integrates a newly trained multilingual tokenizer with LLaMA 2's existing one, improving encoding efficiency across a wide range of languages and sharply reducing segmentation length, with the largest gains for languages written in non-Latin scripts. The sketch below illustrates one way such a tokenizer merge can be implemented.
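A minimal, illustrative sketch (not the authors' released code) of merging a new multilingual SentencePiece model into LLaMA 2's tokenizer and resizing the embedding matrix; the checkpoint identifier and file names are assumptions.

```python
# Hedged sketch: extend LLaMA 2's vocabulary with pieces from a new
# multilingual SentencePiece model. Paths/identifiers below are assumptions.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer, AutoModelForCausalLM

BASE_MODEL = "meta-llama/Llama-2-7b-hf"   # base checkpoint (assumed identifier)
NEW_SPM = "glot500_tokenizer.model"       # SentencePiece model trained on Glot500-c (assumed)

base_tok = LlamaTokenizer.from_pretrained(BASE_MODEL)

# Parse both tokenizers' SentencePiece protos.
base_proto = sp_pb2.ModelProto()
base_proto.ParseFromString(base_tok.sp_model.serialized_model_proto())
new_proto = sp_pb2.ModelProto()
with open(NEW_SPM, "rb") as f:
    new_proto.ParseFromString(f.read())

# Append only the pieces that LLaMA 2's vocabulary does not already contain.
known = {p.piece for p in base_proto.pieces}
for piece in new_proto.pieces:
    if piece.piece not in known:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece, new_piece.score = piece.piece, 0.0
        base_proto.pieces.append(new_piece)

with open("extended_tokenizer.model", "wb") as f:
    f.write(base_proto.SerializeToString())

# Load the extended tokenizer and grow the embedding matrix to match it.
ext_tok = LlamaTokenizer(vocab_file="extended_tokenizer.model")
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
model.resize_token_embeddings(len(ext_tok))
```

New embedding rows created this way start untrained, which is one reason continued pretraining on the multilingual corpus is needed afterwards.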

Continued pretraining uses LoRA to keep training efficient, gradually adapting the model to new data across a broad set of languages. Training relies on established frameworks and memory-efficient redundancy optimizers such as ZeRO, keeping the process efficient and limiting its environmental footprint.
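A minimal sketch of how LoRA-based continued pretraining can be set up with the PEFT library; the rank, target modules, and other hyperparameters are illustrative assumptions, not necessarily the paper's settings.

```python
# Hedged sketch of parameter-efficient continued pretraining with LoRA (PEFT).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank adapter dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
    modules_to_save=["embed_tokens", "lm_head"],  # keep (extended) embeddings trainable (assumed)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapters plus the saved modules are trainable

# The wrapped model can then be trained with a standard causal-LM loop
# (e.g., transformers.Trainer) over the multilingual corpus.
```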

Evaluation

"Mala-500" has been meticulously evaluated against the backdrop of contemporary LLMs on the SIB-200 dataset. The model's unparalleled 3-shot in-context learning performance is reflected in its substantial lead over peers. The nuanced analysis showcases MaLA-500’s robustness, significantly minimizing languages with poor performance while elevating those with an accuracy surpassing 60%. Further, the model's adaptability shines across varying levels of data resourcefulness, underpinning the utility of vocabulary extension and corroborating its correlation with performance gains. The experiment additionally sheds light on the link between the number of in-context shots and accuracy, delineating how MaLA-500 reaches optimal performance with 6-10 shots.

Related Work and Conclusion

The related work spans multilingual models such as mBERT, XLM-R, and mGPT. Concurrent efforts such as Glot500-m and SERENGETI reflect a growing ambition to cover ever more languages. MaLA-500 distinguishes itself through its broad language coverage, achieved via continued pretraining on an open model.

In conclusion, the paper marks a substantial step for LLMs, extending coverage to an unprecedented number of languages while accounting for computational and environmental costs. The public release of the model weights opens the way for broader research and applications. The paper acknowledges limitations, including the composition of the training data with respect to high-resource languages and the cap on model size, and it raises ethical concerns about the potential propagation of biases. MaLA-500 thus lays groundwork for further research, ethical diligence, and continued technological inclusivity.

References (47)
  1. SERENGETI: Massively multilingual language models for Africa. arXiv preprint arXiv:2212.10785.
  2. SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. CoRR, abs/2309.07445.
  3. MEGA: multilingual evaluation of generative AI. CoRR, abs/2303.12528.
  4. MEGAVERSE: benchmarking large language models across languages, modalities, models and tasks. CoRR, abs/2311.07463.
  5. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. CoRR, abs/2302.04023.
  6. Instruct-align: Teaching novel languages to LLMs through alignment-based cross-lingual instruction. CoRR, abs/2305.13627.
  7. Parsing with multilingual bert, a small treebank, and a small corpus. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1324–1334. Association for Computational Linguistics.
  8. Monolingual or multilingual instruction tuning: Which makes a better Alpaca. CoRR, abs/2309.08958.
  9. Improving language plasticity via pretraining with active forgetting. CoRR, abs/2307.01163.
  10. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.
  11. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.
  12. Efficient and effective text encoding for Chinese LLaMA and Alpaca. CoRR, abs/2304.08177.
  13. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  14. Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages. CoRR, abs/2309.04679.
  15. Abteen Ebrahimi and Katharina Kann. 2021. How to adapt your pretrained multilingual model to 1600 languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4555–4567. Association for Computational Linguistics.
  16. Fahim Faisal and Antonios Anastasopoulos. 2022. Phylogeny-inspired adaptation of multilingual models to new languages. CoRR, abs/2205.09634.
  17. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
  18. Glot500: Scaling multilingual corpora and language models to 500 languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1082–1117, Toronto, Canada. Association for Computational Linguistics.
  19. Mistral 7b. CoRR, abs/2310.06825.
  20. Mixtral of experts. arXiv preprint arXiv:2401.04088.
  21. Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 66–71. Association for Computational Linguistics.
  22. MADLAD-400: A multilingual and document-level large audited dataset. arXiv preprint arXiv:2309.04662.
  23. ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. CoRR, abs/2304.05613.
  24. Few-shot learning with multilingual language models. CoRR, abs/2112.10668.
  25. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
  26. When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 448–462. Association for Computational Linguistics.
  27. Can multilingual language models transfer to an unseen dialect? A case study on North African Arabizi. CoRR, abs/2005.00318.
  28. Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, EACL 2021, Online, April 19-23, 2021, pages 80–90. Association for Computational Linguistics.
  29. MAD-X: an adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 7654–7673. Association for Computational Linguistics.
  30. Unks everywhere: Adapting multilingual language models to new scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 10186–10203. Association for Computational Linguistics.
  31. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
  32. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
  33. BLOOM: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  34. mGPT: Few-shot learners go multilingual. CoRR, abs/2204.07580.
  35. UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations.
  36. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971.
  37. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
  38. Udapter: Language adaptation for truly universal dependency parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 2302–2315. Association for Computational Linguistics.
  39. Expanding pretrained models to thousands more languages via lexicon-based adaptation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 863–877. Association for Computational Linguistics.
  40. Extending multilingual BERT to low-resource languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 2649–2656. Association for Computational Linguistics.
  41. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
  42. A paradigm shift in machine translation: Boosting translation performance of large language models. CoRR, abs/2309.11674.
  43. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498.
  44. Bigtrans: Augmenting large language models with multilingual translation capability over 100 languages. CoRR, abs/2305.18098.
  45. BLOOM+1: adding language support to BLOOM for zero-shot prompting. CoRR, abs/2212.09535.
  46. LLaMA beyond English: An empirical study on language capability transfer. arXiv preprint arXiv:2401.01055.
  47. Extrapolating large language models to non-English by aligning languages. CoRR, abs/2308.04948.
Authors (5)
  1. Peiqin Lin (15 papers)
  2. Shaoxiong Ji (39 papers)
  3. Jörg Tiedemann (41 papers)
  4. André F. T. Martins (113 papers)
  5. Hinrich Schütze (250 papers)
Citations (12)