Medical LLMs have the potential to improve global healthcare access, but deploying them in local languages, especially low-resource ones, is hindered by data scarcity. The paper "Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts" (Zheng et al., 14 Oct 2024) addresses this challenge by proposing efficient methods for scaling medical LLMs to a large number of languages.
The authors first construct a high-quality medical dataset covering 12 major languages, drawing from diverse sources such as books, papers, encyclopedias, dialogues, exams, websites, and practical guidelines. They use ChatGPT to reformat raw texts into QA pairs and apply quality checks, including monolingual training and ablation studies (e.g., confirming the value of including math and code data for reasoning). This dataset significantly improves the performance of fine-tuned dense models on medical benchmarks across the 12 languages.
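A minimal sketch of that reformatting step is shown below, assuming the OpenAI Python client; the prompt wording, the `raw_passages` input, and the JSON handling are illustrative placeholders rather than the authors' actual pipeline.

```python
# Sketch of the QA-reformatting step, assuming the OpenAI Python client.
# The prompt wording, `raw_passages`, and the JSON handling are illustrative,
# not the authors' pipeline (error handling and retries are omitted).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def passage_to_qa(passage: str) -> list[dict]:
    """Ask the model to rewrite one medical passage as QA pairs."""
    prompt = (
        "Rewrite the following medical text as question-answer pairs. "
        "Return a JSON list of objects with 'question' and 'answer' keys.\n\n"
        "Text:\n" + passage
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return json.loads(response.choices[0].message.content)

raw_passages = ["Aspirin is a nonsteroidal anti-inflammatory drug used to ..."]
qa_dataset = [qa for passage in raw_passages for qa in passage_to_qa(passage)]
```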
To improve efficiency and scalability for multilingual models, the paper explores the use of Mixture of Experts (MoE) architectures. They propose Hybrid-k routing, a novel routing strategy for MoE layers that combines language-specific experts with cross-lingual routing. This approach aims to leverage language-dependent knowledge while also enabling the transfer of general medical knowledge across languages. Hybrid-k routing ensures that the expert corresponding to the input token's language is activated, while also allowing dynamic routing to other experts based on the router's scoring, potentially replacing lower-scoring vanilla Top-k experts. Experiments show that MoE models with Hybrid-k routing achieve better performance and generalization to minor languages compared to dense models and MoE models with vanilla Top-k or strict language-specific routing.
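A minimal PyTorch sketch of this selection rule, assuming each token carries a precomputed index of its language-specific expert: take the vanilla Top-k by router score, then force the language expert into the set by replacing the lowest-scoring slot when it is missing. Names and shapes are illustrative, not the authors' implementation.

```python
# Sketch of Hybrid-k expert selection for one MoE layer (PyTorch).
import torch

def hybrid_k_routing(router_logits: torch.Tensor,
                     lang_expert_id: torch.Tensor,
                     k: int = 2):
    """router_logits: (num_tokens, num_experts); lang_expert_id: (num_tokens,).
    Returns expert indices and normalized weights, both (num_tokens, k)."""
    scores = torch.softmax(router_logits, dim=-1)
    top_scores, top_idx = scores.topk(k, dim=-1)           # vanilla Top-k choice

    # Force-include the language-specific expert: if it is missing from the
    # Top-k set, overwrite the lowest-scoring slot with it.
    in_topk = (top_idx == lang_expert_id.unsqueeze(-1)).any(dim=-1)
    lowest_slot = top_scores.argmin(dim=-1)
    rows = torch.arange(top_idx.size(0))
    top_idx[rows[~in_topk], lowest_slot[~in_topk]] = lang_expert_id[~in_topk]

    top_scores = scores.gather(-1, top_idx)                # re-read after the swap
    weights = top_scores / top_scores.sum(dim=-1, keepdim=True)
    return top_idx, weights

# Tiny example: 4 tokens, 8 experts, each token's language expert precomputed.
logits = torch.randn(4, 8)
expert_ids, weights = hybrid_k_routing(logits, torch.tensor([0, 0, 3, 5]))
```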
The paper explores interpreting the multilingual information flow within the MoE using a circuit-based paradigm. By analyzing how tokens from different languages are routed to experts across layers, they observe a phenomenon called "Spread Out in the End." This refers to the observation that earlier layers exhibit shared routing patterns across languages, indicating cross-lingual integration, while later layers show language-specific divergence, with tokens primarily routed to experts specializing in their respective languages.
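One way such an analysis could be reproduced, assuming per-layer expert assignments can be logged during a forward pass, is sketched below; the overlap measure is a simple illustrative choice rather than the paper's exact metric.

```python
# Sketch of a routing-pattern analysis across layers and languages.
import torch

def expert_usage(expert_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Distribution over experts for one language at one layer."""
    counts = torch.bincount(expert_ids, minlength=num_experts).float()
    return counts / counts.sum()

def cross_lingual_overlap(usage_by_lang: dict[str, torch.Tensor]) -> float:
    """Average pairwise overlap between languages' expert-usage distributions
    at a given layer; 1.0 means identical routing across languages."""
    langs = list(usage_by_lang)
    overlaps = [
        torch.minimum(usage_by_lang[a], usage_by_lang[b]).sum().item()
        for i, a in enumerate(langs) for b in langs[i + 1:]
    ]
    return sum(overlaps) / len(overlaps)

# Expected picture per the paper: overlap stays high in early layers and drops
# in the final layers, where tokens spread out to language-specific experts.
```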
Inspired by this "Spread Out in the End" phenomenon, the authors propose the Post-MoE architecture. This architecture applies sparse MoE layers only in the final layers of the model, while keeping earlier layers dense. This design choice leverages the observed specialization in later layers while maintaining efficient processing in earlier, more cross-lingual layers. Experiments with different base models (Qwen2-0.5B and Qwen2-1.5B) show that applying MoE in the last few layers (specifically, the last two layers yielded the best balance in their experiments) significantly improves performance, particularly multilingual generalization.
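A schematic sketch of this layout, with attention and normalization omitted and a plain top-1 router standing in for Hybrid-k for brevity:

```python
# Schematic Post-MoE stack: dense FFNs in every block except the last few.
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Plain feed-forward block used in the earlier (dense) layers."""
    def __init__(self, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, x):
        return self.net(x)

class MoEFFN(nn.Module):
    """Sparse MoE feed-forward block with a learned router (top-1 here;
    the Hybrid-k rule from the earlier sketch would replace `argmax`)."""
    def __init__(self, hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)
        self.experts = nn.ModuleList([DenseFFN(hidden) for _ in range(num_experts)])

    def forward(self, x):
        choice = self.router(x).argmax(dim=-1)          # (batch, seq) expert ids
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = choice == e
            if mask.any():
                out[mask] = expert(x[mask])             # route only masked tokens
        return out

def build_post_moe_ffns(num_layers: int, hidden: int, num_experts: int,
                        num_moe_layers: int = 2) -> nn.ModuleList:
    """FFN modules per layer: MoE only in the last `num_moe_layers` blocks."""
    return nn.ModuleList([
        MoEFFN(hidden, num_experts) if layer >= num_layers - num_moe_layers
        else DenseFFN(hidden)
        for layer in range(num_layers)
    ])
```

With, say, `num_layers=24` and `num_moe_layers=2`, only the last two blocks carry experts, matching the configuration the paper found to work best.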
Building upon the Post-MoE architecture, the paper introduces an efficient method for scaling to 50 languages without a proportional increase in model parameters. They group the 50 languages into 7 language families based on linguistic priors and propose a Mixture of Language Family Experts. Instead of having an expert per language, the MoE layers feature experts dedicated to language families. Tokens from languages within a family are routed to the corresponding language family expert, still using the Hybrid-k routing mechanism within this family context. For low-resource minor languages, training data is synthesized by translating English medical data.
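A small sketch of how family-level routing might be wired up; the grouping below is illustrative and not the paper's exact 7-family partition.

```python
# Illustrative mapping from language codes to family-level expert ids.
# The grouping is an example, NOT the paper's exact 7-family partition.
LANGUAGE_FAMILY = {
    "en": "germanic", "de": "germanic",
    "fr": "romance", "es": "romance", "pt": "romance", "it": "romance",
    "ru": "slavic",
    "ar": "afro_asiatic",
    "hi": "indo_aryan",
    "zh": "sino_tibetan",
    "ja": "other", "ko": "other",
}
FAMILY_EXPERT_ID = {family: i for i, family in
                    enumerate(sorted(set(LANGUAGE_FAMILY.values())))}

def family_expert_for(lang_code: str) -> int:
    """Index of the family expert forced into the Top-k set for this language."""
    return FAMILY_EXPERT_ID[LANGUAGE_FAMILY[lang_code]]
```

Tokens then go through the same Hybrid-k selection as before, with `lang_expert_id` filled from `family_expert_for` instead of a per-language index.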
The resulting models, named Apollo-MoE (built on the Qwen2-0.5B, 1.5B, and 7B base models), are evaluated on a benchmark covering 12 major and 38 minor languages (for the minor languages, the medical-clinical subset of MMLU translated with Google Translate). The results show that Apollo-MoE models outperform other open-source medical LLMs of similar size on both major and minor languages. The 10B Apollo-MoE model achieved particularly strong results, exceeding 69% accuracy on major languages and 58% on minor languages, surpassing 8B open-source models. The method is also relatively data-efficient for minor languages, with performance saturating at around 2,000 translated samples per language.
Practical Implementation Considerations:
- Data Curation: The process involves collecting data from diverse medical sources and using LLMs (like GPT-3.5-turbo) for reformatting and enhancing quality (e.g., generating QA pairs). Implementing this requires careful data sourcing, cleaning, and prompt engineering for Q&A generation. Data leakage checks are crucial during this phase.
- MoE Integration: Implementing MoE layers involves modifying the standard transformer architecture, replacing Feed-Forward Networks (FFNs) with MoE blocks. Sparse Upcycling (Komatsuzaki et al., 2022) can be used to initialize the MoE layers from a pre-trained dense model efficiently (see the initialization sketch after this list).
- Hybrid Routing: The Hybrid-k routing logic needs to be implemented within the MoE layer (see the routing sketch above). This involves:
  - Identifying the language of the input token (e.g., using a language identification tool or pre-computed labels per document).
  - Computing router scores for all experts (as in vanilla Top-k).
  - Identifying the language-specific expert(s).
  - Selecting the Top-k experts, ensuring the language-specific expert(s) are included and replacing lower-scoring experts if they were not in the initial Top-k.
- Post-MoE Architecture: This requires strategically placing MoE layers only in the final transformer blocks (e.g., the last 2-4 layers as explored in the paper). The choice of how many layers to make sparse might depend on the base model size and should potentially be tuned.
- Language Family Experts: For scaling to many languages, define language families based on linguistic knowledge. Implement the MoE layer with experts corresponding to these families. The Hybrid-k routing would then operate within the context of routing to these family experts.
- Training: Fine-tuning requires significant computational resources (e.g., 8x A800 GPUs). The training involves optimizing the router weights and the expert weights. Careful tuning of learning rates, especially for the router, is necessary.
- Evaluation: For low-resource languages, relying on machine translation of benchmarks (like MMLU) is a practical approach when dedicated benchmarks are unavailable. This requires setting up a translation pipeline and prompt-based evaluation.
- Deployment: MoE models offer computational advantages during inference by only activating a subset of parameters per token. However, they can have higher memory requirements than dense models of comparable activated parameters due to loading all experts. The Post-MoE architecture might offer a balance here.
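As noted in the MoE Integration item above, here is a minimal sketch of sparse-upcycling-style initialization: each expert starts as a copy of the pre-trained dense FFN and only the router is trained from scratch. The function signature and init constants are assumptions, not the paper's code.

```python
# Sparse-upcycling-style initialization (sketch).
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, hidden: int,
                       num_experts: int) -> nn.ModuleDict:
    """Build an MoE block whose experts all start from the dense FFN weights."""
    experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
    router = nn.Linear(hidden, num_experts)
    nn.init.normal_(router.weight, std=0.02)   # fresh router, small init
    nn.init.zeros_(router.bias)
    return nn.ModuleDict({"router": router, "experts": experts})
```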
The paper demonstrates a practical path towards building medical LLMs that can serve a wide range of languages efficiently, leveraging MoE architectures and insights into multilingual information flow. While achieving parity with the largest closed-source models remains a goal, the proposed techniques provide a strong foundation for democratizing access to medical AI.