Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
The paper "Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement" presents Marco-LLM, a multilingual LLM developed by the MarcoPolo Team at Alibaba International Digital Commerce. This model addresses the limitations of conventional LLMs that predominantly excel with high-resource languages such as English. The paper proposes a robust framework aimed at enhancing the performance of LLMs across diverse linguistic landscapes, particularly focusing on low-resource languages.
Key Contributions
Marco-LLM combines a two-stage continual pretraining strategy with extensive post-training to achieve strong cross-lingual capabilities:
- Two-Stage Continual Pretraining Strategy:
- Data Mixture Optimization: Initial training uses a balanced mix of multilingual data to preserve performance in high-resource languages while building capability in low-resource ones. The first stage allocates a large share of high-resource language data (49%) to prevent catastrophic forgetting, while the second stage raises the low-resource share of the mixture from 9% to 15% (see the configuration sketch after this list).
- Learning Rate Management: The maximum learning rate is set to 1e-5 in the first stage to support multilingual acquisition without degrading existing capabilities; the second stage uses a lower learning rate to consolidate these gains.
- Supervised Fine-Tuning and Preference Alignment:
- Data Collection and Multilingual Dataset Creation: The authors curate a large-scale training dataset drawn from diverse multilingual sources, including high-quality knowledge data and synthetic data, and ensure adequate linguistic coverage through efficient data-cleaning and deduplication.
- Post-training with Supervised Fine-Tuning and Preference Optimization: Post-training combines multilingual supervised fine-tuning with preference alignment to refine performance, in particular sharpening the model's handling of multilingual instructions and preference feedback (a minimal sketch of a standard preference-optimization loss follows this list).
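A minimal sketch of how the two pretraining stages described above could be expressed as configurations. Only the 49% high-resource share, the 9% to 15% low-resource increase, and the 1e-5 stage-1 peak learning rate come from the summary above; the field names, the stage-2 learning rate, and the remaining mixture shares are illustrative assumptions, not the authors' actual setup.

```python
# Hypothetical configuration for the two-stage continual pretraining described above.
# Field names, the stage-2 learning rate, and the "remaining" mixture share are
# illustrative assumptions; only the 49%, 9% -> 15%, and 1e-5 values are from the paper.

from dataclasses import dataclass


@dataclass
class StageConfig:
    name: str
    max_learning_rate: float  # peak learning rate for this stage
    data_mixture: dict        # language group -> sampling proportion


# Stage 1: keep a large share of high-resource data to avoid catastrophic forgetting.
stage_1 = StageConfig(
    name="continual_pretrain_stage_1",
    max_learning_rate=1e-5,        # reported stage-1 peak learning rate
    data_mixture={
        "high_resource": 0.49,     # reported high-resource share
        "low_resource": 0.09,      # reported initial low-resource share
        "remaining": 0.42,         # assumed: parallel/other data fills the rest
    },
)

# Stage 2: shift weight toward low-resource languages and lower the learning rate.
stage_2 = StageConfig(
    name="continual_pretrain_stage_2",
    max_learning_rate=5e-6,        # assumed value; the paper only says it is decreased
    data_mixture={
        "high_resource": 0.43,     # assumed: reduced to make room for low-resource data
        "low_resource": 0.15,      # reported stage-2 low-resource share
        "remaining": 0.42,         # assumed
    },
)

# Sanity check: each mixture should sum to 1.
for stage in (stage_1, stage_2):
    assert abs(sum(stage.data_mixture.values()) - 1.0) < 1e-6, stage.name
```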
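The summary does not name the specific preference-optimization algorithm, so the following is a sketch under the assumption that a standard objective such as Direct Preference Optimization (DPO) is used; the method actually used for Marco-LLM may differ.

```python
# Minimal DPO-style preference-alignment loss (assumed objective, not necessarily
# the exact method used for Marco-LLM).

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is the summed log-probability of a response under the trainable
    policy or the frozen reference model; beta controls how strongly the policy is
    kept close to the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the reward margin of the preferred response over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```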
Evaluation and Results
Marco-LLM's performance is validated using a comprehensive array of multilingual benchmarks:
- General Knowledge and Multilingual Understanding: On MMMLU, Marco-LLM outperforms state-of-the-art models such as Qwen2 and Llama3 across 29 languages, with average score improvements of up to 5.9 points. It also excels on benchmarks such as AGIEval, Belebele, and CEval, achieving the highest scores in several categories and demonstrating broad capability on complex language tasks.
- Machine Translation Tasks: The model shows enhanced translation capability in both any-to-any and English-centric setups, as evidenced by gains on the Flores and multilingual MT benchmarks, outperforming comparable LLMs on cross-lingual tasks.
- Preference Alignment Evaluation: On the multilingual MT-bench, Marco-chat-7B achieves a win rate higher than its loss rate across all evaluated languages, demonstrating strong language comprehension and contextual adaptation (a sketch of how such win/loss rates are typically tallied follows below).
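For context, multilingual MT-bench comparisons of this kind are typically scored by a judge model issuing pairwise win/tie/loss verdicts per language. The sketch below shows one way such rates could be tallied; the language codes and verdicts are purely illustrative and are not results from the paper.

```python
# Illustrative tally of per-language win/loss rates from pairwise judge verdicts,
# as used in MT-bench-style comparisons; not the authors' evaluation code or data.

from collections import Counter, defaultdict

# Each record: (language_code, verdict) for the model under test vs. a baseline.
verdicts = [
    ("kk", "win"), ("kk", "win"), ("kk", "loss"),
    ("ne", "win"), ("ne", "tie"), ("ne", "win"),
]

per_language = defaultdict(Counter)
for lang, verdict in verdicts:
    per_language[lang][verdict] += 1

for lang, counts in per_language.items():
    total = sum(counts.values())
    win_rate = counts["win"] / total
    loss_rate = counts["loss"] / total
    print(f"{lang}: win={win_rate:.2f}, loss={loss_rate:.2f}, "
          f"win > loss: {win_rate > loss_rate}")
```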
Implications and Future Directions
This work exemplifies how targeted continual training and data strategies can significantly improve LLM scalability and efficiency across languages. With its comprehensive approach to multilingual language modeling, Marco-LLM not only offers a versatile solution for diverse applications, particularly those bridging high- and low-resource languages, but also sets a benchmark for future research in LLM development.
The Marco-LLM framework shows clear potential for extending coverage to additional languages and for deeper exploration of multilingual reasoning, further enriching linguistic diversity. Improving model efficiency in low-resource settings will also be paramount, paving the way for more scalable AI solutions deployable across diverse linguistic and cultural environments worldwide.