MaLA-500: Massive Language Adaptation of Large Language Models (2401.13303v2)
Abstract: Large language models (LLMs) have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel LLM designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting text in low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages, respectively. We release MaLA-500 at https://huggingface.co/MaLA-LM
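The training recipe named in the abstract (vocabulary extension followed by continued pretraining of LLaMA 2 on Glot500-c) can be illustrated with a minimal, hypothetical sketch using the Hugging Face `transformers` and `peft` libraries. The checkpoint name, placeholder tokens, and LoRA settings below are illustrative assumptions rather than the paper's actual configuration; the reference list cites LoRA, but the abstract does not state the exact parameter-efficient setup used for MaLA-500.

```python
# Minimal sketch (assumptions, not the MaLA-500 recipe): extend the LLaMA 2
# vocabulary with new subword pieces and resize the embeddings before
# continued pretraining on a multilingual corpus such as Glot500-c.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# New subword pieces, e.g. learned with SentencePiece on multilingual text;
# training the extended SentencePiece model itself is omitted here.
new_pieces = ["▁ŋgo", "▁tlh", "▁ʔa"]  # placeholder tokens
tokenizer.add_tokens(new_pieces)

# Grow the input/output embedding matrices so the new token ids get rows.
model.resize_token_embeddings(len(tokenizer))

# Parameter-efficient continued pretraining with LoRA adapters; the enlarged
# embedding and output layers are kept fully trainable so the new rows learn.
lora_config = LoraConfig(
    r=8,                                   # assumed rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # assumed target modules
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# A causal-LM Trainer over the multilingual corpus would then continue pretraining.
```

In this sketch the freshly added embedding rows are trained via `modules_to_save`, since LoRA adapters alone would leave the extended vocabulary untrained.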
- SERENGETI: Massively multilingual language models for Africa. CoRR, abs/2212.10785.
- SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. CoRR, abs/2309.07445.
- MEGA: multilingual evaluation of generative AI. CoRR, abs/2303.12528.
- MEGAVERSE: benchmarking large language models across languages, modalities, models and tasks. CoRR, abs/2311.07463.
- A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. CoRR, abs/2302.04023.
- Instruct-align: Teaching novel languages to LLMs through alignment-based cross-lingual instruction. CoRR, abs/2305.13627.
- Parsing with multilingual BERT, a small treebank, and a small corpus. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1324–1334. Association for Computational Linguistics.
- Monolingual or multilingual instruction tuning: Which makes a better Alpaca. CoRR, abs/2309.08958.
- Improving language plasticity via pretraining with active forgetting. CoRR, abs/2307.01163.
- ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.
- Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.
- Efficient and effective text encoding for Chinese LLaMA and Alpaca. CoRR, abs/2304.08177.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages. CoRR, abs/2309.04679.
- Abteen Ebrahimi and Katharina Kann. 2021. How to adapt your pretrained multilingual model to 1600 languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4555–4567. Association for Computational Linguistics.
- Fahim Faisal and Antonios Anastasopoulos. 2022. Phylogeny-inspired adaptation of multilingual models to new languages. CoRR, abs/2205.09634.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Glot500: Scaling multilingual corpora and language models to 500 languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1082–1117, Toronto, Canada. Association for Computational Linguistics.
- Mistral 7B. CoRR, abs/2310.06825.
- Mixtral of experts. CoRR, abs/2401.04088.
- Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 66–71. Association for Computational Linguistics.
- MADLAD-400: A multilingual and document-level large audited dataset. CoRR, abs/2309.04662.
- ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. CoRR, abs/2304.05613.
- Few-shot learning with multilingual language models. CoRR, abs/2112.10668.
- Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
- When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 448–462. Association for Computational Linguistics.
- Can multilingual language models transfer to an unseen dialect? A case study on North African Arabizi. CoRR, abs/2005.00318.
- Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, EACL 2021, Online, April 19-23, 2021, pages 80–90. Association for Computational Linguistics.
- MAD-X: an adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 7654–7673. Association for Computational Linguistics.
- UNKs everywhere: Adapting multilingual language models to new scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 10186–10203. Association for Computational Linguistics.
- ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
- DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
- BLOOM: A 176B-parameter open-access multilingual language model. CoRR, abs/2211.05100.
- mGPT: Few-shot learners go multilingual. CoRR, abs/2204.07580.
- UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations.
- LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
- UDapter: Language adaptation for truly universal dependency parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 2302–2315. Association for Computational Linguistics.
- Expanding pretrained models to thousands more languages via lexicon-based adaptation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 863–877. Association for Computational Linguistics.
- Extending multilingual BERT to low-resource languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 2649–2656. Association for Computational Linguistics.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
- A paradigm shift in machine translation: Boosting translation performance of large language models. CoRR, abs/2309.11674.
- mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498.
- BigTrans: Augmenting large language models with multilingual translation capability over 100 languages. CoRR, abs/2305.18098.
- BLOOM+1: adding language support to BLOOM for zero-shot prompting. CoRR, abs/2212.09535.
- LLaMA beyond English: An empirical study on language capability transfer. CoRR, abs/2401.01055.
- Extrapolating large language models to non-English by aligning languages. CoRR, abs/2308.04948.
Authors: Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann, André F. T. Martins, Hinrich Schütze