EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models (2409.17892v2)

Published 26 Sep 2024 in cs.CL

Abstract: In this work, we introduce EMMA-500, a large-scale multilingual LLM continue-trained on texts across 546 languages designed for enhanced multilingual performance, focusing on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks. Our results highlight the effectiveness of continual pre-training in expanding LLMs' language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability. We release the MaLA corpus, EMMA-500 model weights, scripts, and model generations.

Summary

  • The paper introduces a continually pre-trained, massively multilingual language model that excels in low-resource language tasks.
  • The methodology employs a diverse MaLA corpus with 74 billion tokens from 939 languages to ensure high-quality, balanced data representation.
  • The model outperforms baselines on benchmarks like Glot500-c and XCOPA, demonstrating significant advancements in cross-lingual capabilities.

EMMA-500: Enhancing Massively Multilingual Adaptation of LLMs

The paper introduces EMMA-500, a multilingual LLM trained on texts from 546 languages to enhance multilingual capabilities, especially for low-resource languages. The model is obtained by continual pre-training of Llama 2 7B on the MaLA corpus, a curated collection spanning diverse text domains and a broad range of languages, and it delivers considerable performance improvements across a wide set of benchmarks.

Overview and Methodology

The primary goals of this research encompass:

  1. Compilation of MaLA Corpus:
    • The corpus contains approximately 74 billion tokens from 939 languages.
    • Data sources span a wide range of domains, including web crawls, scientific papers, and books.
    • Deduplication and extensive pre-processing were carried out to ensure data quality and diversity.
  2. Continual Pre-Training:
    • The MaLA corpus facilitated the continual pre-training of the Llama 2 7B model.
    • Diverse data sources, including high-resource languages and code data, were incorporated to create a balanced and comprehensive training dataset.
    • The model was trained on the Leonardo supercomputer using effective optimization and memory-management techniques (a minimal continual pre-training sketch follows this list).
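
To make the training setup concrete, below is a minimal sketch of what continual pre-training of Llama 2 7B on a multilingual text mixture could look like with HuggingFace Transformers. The dataset file, hyperparameters, and sequence length are illustrative assumptions, not the authors' actual configuration (which ran on the Leonardo supercomputer with its own distributed training stack).

```python
# Minimal continual pre-training sketch (illustrative; not the authors' actual setup).
# Assumes a local JSONL file with a "text" field standing in for the MaLA mixture.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"      # base model that EMMA-500 continues from
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token    # Llama 2 defines no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical stand-in for the MaLA mixture (web crawls, papers, books, code, ...).
raw = load_dataset("json", data_files="mala_mixture.jsonl", split="train")

def tokenize(batch):
    # Standard causal-LM preprocessing: tokenize and truncate to the context window.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="emma500-continual",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,   # illustrative; real runs use large global batches
    learning_rate=2e-5,               # assumed value, not taken from the paper
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM loss
)
trainer.train()
```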

Evaluation Metrics and Benchmarks

The authors evaluated the model on several established benchmarks, covering:

  1. Intrinsic Evaluation:
    • Negative log-likelihood (NLL) was computed on Glot500-c and the Parallel Bible Corpus, where EMMA-500 achieved lower values than the baselines (a per-text NLL sketch follows this list).
  2. Task-Specific Benchmarks:
    • Tested across diverse tasks such as text classification, commonsense reasoning, machine translation, and open-ended generation.
    • Benchmarks include SIB-200, Taxi-1500, XCOPA, FLORES-200, and the PolyWrite dataset.
  3. Emergent Capabilities:
    • EMMA-500 was consistently proficient in low-resource languages, demonstrating effective cross-lingual transfer and task generalization.
    • In high-resource languages its performance was competitive but did not exceed that of cutting-edge LLMs such as Llama 3 and Gemma 2.
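
As a rough illustration of the intrinsic evaluation, the sketch below computes the average per-token negative log-likelihood of a text under a causal LM; the checkpoint path, example sentence, and truncation length are assumptions, and the paper's actual evaluation pipeline may differ.

```python
# Sketch: per-text negative log-likelihood under a causal LM (illustrative; not the
# paper's exact evaluation code). Lower NLL means the model assigns the text higher
# probability, the criterion used in the Glot500-c / Parallel Bible Corpus evaluation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/emma-500-checkpoint"   # placeholder; substitute the released weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def nll(text: str) -> float:
    """Average negative log-likelihood per token of `text`."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    # With labels == input_ids, the model returns the mean token-level cross-entropy.
    out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

print(nll("Kaikki ihmiset syntyvät vapaina ja tasavertaisina arvoltaan ja oikeuksiltaan."))
```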

Numerical Results and Analysis

EMMA-500 surpassed various models, including those specifically tuned for multilingual tasks:

  • Intrinsic Evaluation:
    • On Glot500-c, EMMA-500 achieved the lowest NLL, outperforming notable models like MaLA-500 and LLaMAX.
  • Commonsense Reasoning:
    • The model also outperformed most baselines on XCOPA and XStoryCloze, demonstrating robust understanding across languages.
  • Machine Translation:
    • FLORES-200 results showed EMMA-500 ahead of the baselines in both X-to-English and English-to-X directions, indicating strong multilingual generation capabilities (a minimal metric-scoring sketch follows this list).
  • Code Generation:
    • EMMA-500 showed competitive code-generation ability on the MultiPL-E benchmark, avoiding the catastrophic forgetting that affected other continually pre-trained models.
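
Translation quality on FLORES-200 is typically reported with automatic metrics such as BLEU or chrF++. The snippet below is a minimal scoring sketch using sacrebleu's chrF++ implementation, with toy hypothesis and reference lists as placeholders; it is not necessarily the exact metric configuration used in the paper.

```python
# Sketch: scoring system outputs against references with chrF++ via sacrebleu
# (illustrative; the paper's exact FLORES-200 evaluation settings may differ).
from sacrebleu.metrics import CHRF

# Toy system outputs and references for one translation direction, one string per sentence.
hypotheses = ["The cat sits on the mat.", "She reads a book."]
references = ["The cat is sitting on the mat.", "She is reading a book."]

chrf = CHRF(word_order=2)                      # word_order=2 gives chrF++ (adds word bigrams)
score = chrf.corpus_score(hypotheses, [references])
print(score)                                   # e.g. "chrF2++ = 63.21"
```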

Practical and Theoretical Implications

The continual pre-training approach validated by EMMA-500 underscores several insights:

  • Data Diversity and Quality:
    • The inclusion of various data domains and types enhances the model’s adaptability across different tasks.
  • Low-Resource Language Performance:
    • EMMA-500's success indicates the potential of continual pre-training to bridge resource gaps between high- and low-resource languages.
  • Multilingual Capabilities:
    • The model serves as a benchmark for inclusive language representation, which is pivotal for global NLP applications.

Potential Directions for Future Research

Building on the advancements introduced by EMMA-500, future research could explore:

  • Enhanced Base Models:
    • Adapting newer models like Llama 3 as the base for continual pre-training could yield even better results.
  • Native Multilingual Data:
    • Developing native text datasets for low-resource languages to avoid biases introduced by translations.
  • Instruction Tuning:
    • Incorporating instruction-tuning datasets to improve task-specific performances and broader applicability.

Conclusion

EMMA-500 represents a significant step towards equitable and effective multilingual LLMs. Its demonstrated strengths across various NLP benchmarks and resource categories highlight the efficacy of a well-curated, massively multilingual pre-training corpus. This work lays the groundwork for more sophisticated and inclusive LLMs, addressing a critical need in the field of computational linguistics.
