Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
The paper "Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement" presents Marco-LLM, a multilingual LLM developed by the MarcoPolo Team at Alibaba International Digital Commerce. This model addresses the limitations of conventional LLMs that predominantly excel with high-resource languages such as English. The paper proposes a robust framework aimed at enhancing the performance of LLMs across diverse linguistic landscapes, particularly focusing on low-resource languages.
Key Contributions
Marco-LLM combines a two-stage continual pretraining strategy with extensive post-training to achieve strong cross-lingual capabilities:
- Two-Stage Continual Pretraining Strategy:
- Data Mixture Optimization: Initial training uses a balanced mix of multilingual data to preserve performance in high-resource languages while building capability in low-resource ones. The first stage allocates a large share of high-resource language data (49%) to prevent catastrophic forgetting, while the second stage raises the low-resource share of the mixture from 9% to 15% (see the configuration sketch after this list).
- Learning Rate Management: The maximum learning rate is set to 1e-5 in the first stage to support multilingual acquisition without degrading existing capabilities; the second stage uses a lower learning rate to consolidate these gains.
- Supervised Fine-Tuning and Preference Alignment:
- Data Collection and Multilingual Dataset Creation: The authors curate a large-scale training dataset drawn from diverse multilingual sources, including high-quality knowledge data and synthetic data, and ensure adequate linguistic coverage through efficient data-cleaning and deduplication.
- Post-training with Supervised Fine-Tuning and Preference Optimization: Post-training combines multilingual supervised fine-tuning with preference alignment to refine performance, in particular sharpening the model's handling of multilingual instructions and preference feedback (a minimal sketch of a standard preference-optimization loss follows this list).
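A minimal sketch of how the two pretraining stages described above could be expressed as configurations. Only the 49% high-resource share, the 9% to 15% low-resource increase, and the 1e-5 stage-1 peak learning rate come from the summary above; the field names, the stage-2 learning rate, and the remaining mixture shares are illustrative assumptions, not the authors' actual setup.

```python
# Hypothetical configuration for the two-stage continual pretraining described above.
# Field names, the stage-2 learning rate, and the "remaining" mixture share are
# illustrative assumptions; only the 49%, 9% -> 15%, and 1e-5 values are from the paper.

from dataclasses import dataclass


@dataclass
class StageConfig:
    name: str
    max_learning_rate: float  # peak learning rate for this stage
    data_mixture: dict        # language group -> sampling proportion


# Stage 1: keep a large share of high-resource data to avoid catastrophic forgetting.
stage_1 = StageConfig(
    name="continual_pretrain_stage_1",
    max_learning_rate=1e-5,        # reported stage-1 peak learning rate
    data_mixture={
        "high_resource": 0.49,     # reported high-resource share
        "low_resource": 0.09,      # reported initial low-resource share
        "remaining": 0.42,         # assumed: parallel/other data fills the rest
    },
)

# Stage 2: shift weight toward low-resource languages and lower the learning rate.
stage_2 = StageConfig(
    name="continual_pretrain_stage_2",
    max_learning_rate=5e-6,        # assumed value; the paper only says it is decreased
    data_mixture={
        "high_resource": 0.43,     # assumed: reduced to make room for low-resource data
        "low_resource": 0.15,      # reported stage-2 low-resource share
        "remaining": 0.42,         # assumed
    },
)

# Sanity check: each mixture should sum to 1.
for stage in (stage_1, stage_2):
    assert abs(sum(stage.data_mixture.values()) - 1.0) < 1e-6, stage.name
```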
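The summary does not name the specific preference-optimization algorithm, so the following is a sketch under the assumption that a standard objective such as Direct Preference Optimization (DPO) is used; the method actually used for Marco-LLM may differ.

```python
# Minimal DPO-style preference-alignment loss (assumed objective, not necessarily
# the exact method used for Marco-LLM).

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is the summed log-probability of a response under the trainable
    policy or the frozen reference model; beta controls how strongly the policy is
    kept close to the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the reward margin of the preferred response over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```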
Evaluation and Results
Marco-LLM's performance is validated using a comprehensive array of multilingual benchmarks:
- General Knowledge and Multilingual Understanding: On MMMLU, Marco-LLM outperforms state-of-the-art models such as Qwen2 and Llama3 across 29 languages, with average score improvements of up to 5.9 points. It also excels on benchmarks such as AGIEval, Belebele, and CEval, achieving the highest scores in several categories and demonstrating broad capability on complex language tasks.
- Machine Translation Tasks: The model shows enhanced translation capability in both any-to-any and English-centric setups, as evidenced by gains on the Flores and multilingual MT benchmarks, outperforming comparable LLMs on cross-lingual tasks.
- Preference Alignment Evaluation: On the multilingual MT-bench, Marco-chat-7B achieves a win rate higher than its loss rate across all evaluated languages, demonstrating strong language comprehension and contextual adaptation (a sketch of how such win/loss rates are typically tallied follows below).
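For context, multilingual MT-bench comparisons of this kind are typically scored by a judge model issuing pairwise win/tie/loss verdicts per language. The sketch below shows one way such rates could be tallied; the language codes and verdicts are purely illustrative and are not results from the paper.

```python
# Illustrative tally of per-language win/loss rates from pairwise judge verdicts,
# as used in MT-bench-style comparisons; not the authors' evaluation code or data.

from collections import Counter, defaultdict

# Each record: (language_code, verdict) for the model under test vs. a baseline.
verdicts = [
    ("kk", "win"), ("kk", "win"), ("kk", "loss"),
    ("ne", "win"), ("ne", "tie"), ("ne", "win"),
]

per_language = defaultdict(Counter)
for lang, verdict in verdicts:
    per_language[lang][verdict] += 1

for lang, counts in per_language.items():
    total = sum(counts.values())
    win_rate = counts["win"] / total
    loss_rate = counts["loss"] / total
    print(f"{lang}: win={win_rate:.2f}, loss={loss_rate:.2f}, "
          f"win > loss: {win_rate > loss_rate}")
```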
Implications and Future Directions
This work exemplifies how targeted continual training and data strategies can significantly improve LLM scalability and efficiency across languages. With its comprehensive approach to multilingual language modeling, Marco-LLM not only offers a versatile solution for diverse applications, particularly those bridging high- and low-resource languages, but also sets a benchmark for future research in LLM development.
The Marco-LLM framework shows clear potential for extending coverage to additional languages and for deeper exploration of multilingual reasoning, further enriching linguistic diversity. Improving model efficiency in low-resource settings will also be paramount, paving the way for more scalable AI solutions deployable across diverse linguistic and cultural environments worldwide.