MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Published 21 Aug 2024 in cs.CL | (2408.11396v1)

Abstract: LLMs are often English-centric due to the disproportionate distribution of languages in their pre-training data. Enhancing non-English language capabilities through post-pretraining often results in catastrophic forgetting of the original languages' abilities. Previous methods either achieve good expansion with severe forgetting or slight forgetting with poor expansion, indicating the challenge of balancing language expansion while preventing forgetting. In this paper, we propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing) to alleviate this problem. MoE-LPR employs a two-stage training approach to enhance the multilingual capability. First, the model is post-pretrained into a Mixture-of-Experts (MoE) architecture by upcycling, where all the original parameters are frozen and new experts are added. In this stage, we focus on improving the ability of the expanded languages, without using any original language data. Then, the model reviews the knowledge of the original languages with replay data amounting to less than 1% of the post-pretraining data, where we incorporate language priors routing to better recover the abilities of the original languages. Evaluations on multiple benchmarks show that MoE-LPR outperforms other post-pretraining methods. Freezing original parameters preserves original language knowledge while adding new experts preserves the learning ability. Reviewing with LPR enables effective utilization of multilingual knowledge within the parameters. Additionally, the MoE architecture maintains the same inference overhead while increasing total model parameters. Extensive experiments demonstrate MoE-LPR's effectiveness in improving expanded languages and preserving original language proficiency with superior scalability. Code and scripts are freely available at https://github.com/zjwang21/MoE-LPR.git.


Summary

  • The paper presents a two-stage MoE training strategy with language priors routing that enhances non-English performance while preventing catastrophic forgetting.
  • It preserves original language abilities by freezing key parameters and using less than 1% of initial language data for routing, ensuring stability in expanded languages like Greek and Turkish.
  • Experimental results show superior benchmark scores and scalability over methods like Full Fine-tuning, LoRA, and LLaMA-Pro with stable inference overhead.

MoE-LPR: Enhancing Multilingual Capabilities of LLMs

Introduction

The ubiquity of LLMs such as GPT and Llama has highlighted their immense capabilities across numerous tasks, yet their English-centric nature limits their performance in multilingual environments. To address the challenge of expanding language coverage without catastrophically forgetting existing capabilities, "MoE-LPR: Multilingual Extension of LLMs through Mixture-of-Experts with Language Priors Routing" (2408.11396) introduces MoE-LPR, a Mixture-of-Experts (MoE) architecture with Language Priors Routing (LPR). This approach aims to enhance non-English language capabilities efficiently while preserving proficiency in originally strong languages such as English.

Methodology

Mixture-of-Experts Architecture

MoE-LPR implements a two-stage training strategy to expand LLMs' multilingual capabilities. In the first stage, the dense model is upcycled into an MoE architecture: new experts are added while all existing parameters are frozen to preserve original language knowledge. A router selectively dispatches input tokens to appropriate experts, and a load-balancing loss prevents routing collapse and ensures training stability (Figure 1).

Figure 1: Overall framework of MoE-LPR. A two-stage strategy is performed to enhance multilingual capability.
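The upcycling described above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: the original FFN becomes a frozen expert at index 0, new trainable experts are added, and a Switch-style load-balancing auxiliary loss discourages routing collapse. All class names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoELayer(nn.Module):
    """Original FFN becomes frozen expert 0; new trainable experts are added."""
    def __init__(self, original_ffn: nn.Module, d_model: int,
                 n_new_experts: int, top_k: int = 2):
        super().__init__()
        for p in original_ffn.parameters():   # freeze original knowledge
            p.requires_grad = False
        new = [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                             nn.Linear(4 * d_model, d_model))
               for _ in range(n_new_experts)]
        self.experts = nn.ModuleList([original_ffn] + new)
        self.router = nn.Linear(d_model, len(self.experts), bias=False)
        self.top_k = top_k

    def forward(self, x):                     # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        topv, topi = probs.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(-1, keepdim=True)  # renormalize selected weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot, None] * expert(x[mask])
        # Switch-style load-balancing loss: fraction of tokens whose top choice
        # is each expert, times that expert's mean router probability.
        n_exp = len(self.experts)
        frac = F.one_hot(topi[:, 0], n_exp).float().mean(0)
        lb_loss = n_exp * (frac * probs.mean(0)).sum()
        return out, lb_loss

ffn = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))
layer = UpcycledMoELayer(ffn, d_model=16, n_new_experts=3)
y, lb = layer(torch.randn(8, 16))
```

Because expert 0's parameters carry `requires_grad=False`, only the new experts and the router receive gradients in stage 1, which is what keeps original-language knowledge intact.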

Language Priors Routing

To address forgetting during language expansion, MoE-LPR applies Language Priors Routing in the review stage. Using less than 1% of the original-language data, the LPR loss steers the router to prioritize the frozen original expert for original-language tokens. Unlike traditional replay methods that require large corpora, this recovers original-language abilities without diminishing expanded-language performance (Figure 2).

Figure 2: MoE-LPR performs the best in both expanded and original languages. Expanded languages are those the model is initially weak in and aims to improve; original languages are those the model is relatively strong in and prone to catastrophically forgetting.
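One plausible formulation of the LPR objective, assuming the frozen original expert sits at index 0 and a boolean mask marks original-language tokens, is a cross-entropy term pushing the router toward that expert. The function name and signature below are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def lpr_loss(router_logits: torch.Tensor, is_original: torch.Tensor) -> torch.Tensor:
    """Push the router to send original-language tokens to the frozen expert 0.

    router_logits: (tokens, n_experts) raw router outputs
    is_original:   (tokens,) bool mask, True for original-language tokens
    """
    if not is_original.any():
        return router_logits.new_zeros(())
    # Target class 0 = frozen original expert, applied only to masked tokens.
    target = torch.zeros(int(is_original.sum()), dtype=torch.long,
                         device=router_logits.device)
    return F.cross_entropy(router_logits[is_original], target)

logits = torch.randn(10, 4)
mask = torch.tensor([True] * 5 + [False] * 5)
loss = lpr_loss(logits, mask)
```

In training this term would be added, with some weight, to the language-modeling loss over the small replay set; expanded-language tokens are untouched, so stage-1 gains are preserved.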

Experimental Results

Comparative Performance

MoE-LPR's effectiveness was validated across multiple benchmarks, demonstrating superior performance in expanded languages such as Greek, Hungarian, and Turkish while robustly preserving abilities in English, Chinese, and Spanish. It consistently outperformed baselines such as Full Fine-tuning, LoRA, and LLaMA-Pro, by up to 2.7 points on expanded languages and 0.88 points on original languages (Figure 3).

Figure 3: Router scores of the frozen expert for English (original language) tokens in the Belebele benchmark.

Scaling and Generalization

Unlike methods whose inference overhead grows with every added parameter, MoE-LPR maintains a stable per-token overhead while increasing total capacity, showcasing superior scalability and cost-effectiveness. Additionally, MoE-LPR generalizes well, effectively preventing catastrophic forgetting even in untrained languages such as French and Portuguese (Figure 4).

Figure 4: Average scores in expanded and original languages with varying numbers of documents for review.
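The stable-overhead claim can be checked with a back-of-envelope count: under top-k routing, the FFN parameters active per token stay fixed while total parameters grow with the number of experts. The dimensions below are hypothetical, not taken from the paper.

```python
# Illustrative parameter count: total grows with experts, active does not.
def ffn_params(d_model: int, d_ff: int) -> int:
    # Up- and down-projection weight matrices of a simple two-layer FFN.
    return 2 * d_model * d_ff

d_model, d_ff, top_k = 4096, 11008, 2
for n_experts in (1, 4, 8):
    total = n_experts * ffn_params(d_model, d_ff)
    active = min(top_k, n_experts) * ffn_params(d_model, d_ff)
    print(f"{n_experts} experts: total={total:,}  active/token={active:,}")
```

With top-2 routing, every configuration past two experts activates the same number of FFN parameters per token, so compute per token is flat even as capacity scales.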

Conclusion

MoE-LPR offers a promising method for expanding the multilingual capabilities of LLMs, effectively balancing the enhancement of new language proficiencies with the retention of original abilities. Its scalable, resource-efficient strategy, coupled with strong language generalization, positions MoE-LPR as a valuable contribution to multilingual NLP. Future work could explore further deployments, optimizations, and enhanced expert integration to deepen LLM proficiency across broader language spectrums.
