Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral (2403.01851v1)

Published 4 Mar 2024 in cs.CL and cs.AI

Abstract: Mixtral, a representative sparse mixture of experts (SMoE) LLM, has received significant attention due to its unique model design and superior performance. Based on Mixtral-8x7B-v0.1, in this paper, we propose Chinese-Mixtral and Chinese-Mixtral-Instruct with improved Chinese language abilities by adopting further pre-training and instruction fine-tuning. Experimental results show that our Chinese-Mixtral and Chinese-Mixtral-Instruct successfully improve Chinese understanding and generation performance while retaining the original English abilities. Then, we discuss several key questions when performing language adaptation on LLMs, including the necessity of extending the language-specific vocabulary and the choice of the initialization model (foundation model vs. instruction model), by providing empirical results and analysis. We also present the visualizations of each expert to examine their importance on downstream tasks. Our resources are publicly available through https://github.com/ymcui/Chinese-Mixtral.

Enhancing Chinese Language Performance in Mixtral Models Without Vocabulary Extension

Introduction to Chinese Mixtral

The advent of Mixtral, a sparse mixture of experts (SMoE) LLM, marks a significant step forward in the field of NLP. This paper extends Mixtral's capabilities into the Chinese language domain, introducing the Chinese-Mixtral and Chinese-Mixtral-Instruct models. These versions preserve Mixtral's original architecture while improving its performance on Chinese understanding and generation tasks, without extending the model's vocabulary, and they retain the original English proficiency, yielding a bilingual model. Crucially, the paper examines key considerations in language adaptation for LLMs, such as the impact of language-specific vocabulary extension and the choice of initialization model (foundation vs. instruction model).

The Architecture and Training of Chinese Mixtral

Chinese-Mixtral retains the original architectural specifications of Mixtral, employing the same transformer foundation while specializing in Chinese language tasks. Each layer replaces the dense feed-forward block with a sparse Mixture-of-Experts (SMoE) layer containing eight distinct "experts" (groups of feed-forward parameters), of which only two are activated for each token by a learned router. This structure enables efficient parameter use and optimizes computational resource allocation. Training incorporates an auxiliary load-balancing loss to encourage even routing among experts, addressing potential skew in parameter utilization. Further pre-training and instruction fine-tuning use QLoRA, with the embedding and LM head layers additionally trained, keeping the Chinese adaptation memory-efficient. A minimal sketch of such a routing layer follows.
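
The following is a minimal PyTorch sketch of a Mixtral-style SMoE block with top-2 routing and a Switch-Transformer-style auxiliary load-balancing loss. The class name, default dimensions, expert MLP shape, and loss scaling are illustrative assumptions, not the paper's implementation (Mixtral-8x7B itself uses a model dimension of 4096, a feed-forward dimension of 14336, and eight experts per layer).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy Mixtral-style MoE layer: each token is routed to its top-2 experts."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.n_experts, self.top_k = n_experts, top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)       # router distribution per token
        top_p, top_i = probs.topk(self.top_k, dim=-1)   # keep the two best experts
        top_p = top_p / top_p.sum(-1, keepdim=True)     # renormalize their weights

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (top_i == e).nonzero(as_tuple=True)
            if rows.numel():                            # run expert e only on its tokens
                out[rows] += top_p[rows, slots].unsqueeze(-1) * expert(x[rows])

        # Auxiliary load-balancing loss: fraction of routing slots assigned to each
        # expert times its mean router probability, scaled by the number of experts.
        dispatch = F.one_hot(top_i, self.n_experts).float().sum(1)  # (n_tokens, n_experts)
        frac = dispatch.mean(0) / self.top_k
        aux_loss = self.n_experts * (frac * probs.mean(0)).sum()
        return out, aux_loss

x = torch.randn(16, 512)
y, aux_loss = SparseMoE()(x)    # aux_loss is added to the language-modeling loss
```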

Experimental Insights and Results

The effectiveness of Chinese-Mixtral and its instruction-tuned counterpart was evaluated on a comprehensive suite of benchmarks. Despite not expanding the original Mixtral vocabulary, the models delivered strong results on Chinese datasets such as C-Eval and CMMLU while retaining English capabilities, demonstrating robust understanding and generation in both languages. Notably, instruction fine-tuning in Chinese-Mixtral-Instruct significantly enhanced performance across tasks, underscoring the value of specialized fine-tuning in cross-lingual LLM adaptation.
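
To make the evaluation setup concrete, here is a hedged sketch of how multiple-choice benchmarks such as C-Eval and CMMLU are commonly scored with a causal LM: each candidate answer letter is appended to the prompt and the option with the highest log-probability is selected. The checkpoint identifier and prompt template are assumptions for illustration, not the paper's exact harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "hfl/chinese-mixtral"   # assumed checkpoint id; substitute the released weights
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

def answer(question: str, options: dict) -> str:
    """Pick the option letter to which the model assigns the highest log-probability."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\n答案："
    scores = {}
    for letter in options:
        ids = tok(prompt + letter, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            logits = model(ids).logits
        # log-probability of the final token (the answer letter) given the prompt
        scores[letter] = torch.log_softmax(logits[0, -2].float(), dim=-1)[ids[0, -1]].item()
    return max(scores, key=scores.get)

print(answer("中国的首都是哪座城市？", {"A": "上海", "B": "北京", "C": "广州", "D": "深圳"}))
```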

Key Findings and Considerations

This investigation sheds light on several critical aspects of adapting LLMs to new languages:

  • Vocabulary Extension: Contrary to common practice, extending the model's vocabulary with language-specific tokens was found not to be essential for achieving high performance on language-specific tasks. This suggests that the encoding efficiency gained by vocabulary extension does not necessarily translate into better downstream performance (a tokenizer-level comparison is sketched after this list).
  • Choice of Initialization Model: The paper suggests a preference for using the foundation model as the starting point for language adaptation over an instruction-tuned model. This approach appears to better preserve the model's comprehensive language abilities and facilitates effective language transfer.
  • Long-Context Abilities: Interestingly, Mixtral models, including the Chinese adaptations, demonstrated an inherent ability to handle context lengths beyond their specified design, indicating a versatile long-context capacity that may negate the need for additional fine-tuning for long-context handling.
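
The vocabulary finding can be probed directly at the tokenizer level. The sketch below simply counts how many tokens the original Mixtral tokenizer versus an extended Chinese tokenizer need for the same text; the extended tokenizer path is a hypothetical placeholder, and the comparison only illustrates encoding efficiency, not downstream accuracy.

```python
from transformers import AutoTokenizer

text = "大型语言模型的中文适配并不一定需要扩充词表。"

tokenizers = {
    "original Mixtral": AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1"),
    "extended Chinese": AutoTokenizer.from_pretrained("path/to/extended-chinese-tokenizer"),  # hypothetical
}

for name, tok in tokenizers.items():
    n_tokens = len(tok(text).input_ids)
    print(f"{name}: {n_tokens} tokens ({n_tokens / len(text):.2f} tokens per character)")
```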

Visualization and Expert Analysis

The paper presents a visualization analysis that highlights the distinct roles and relative importance of each expert within the model, especially when processing Chinese language tasks. This analysis offers insight into the inner workings of the SMoE architecture, revealing the balance and specialization among experts that underpin the model's overall performance; one way to gather such routing statistics is sketched below.
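
As one concrete way to approximate such an analysis, the sketch below aggregates how often the top-2 router selects each expert on a probe sentence, using the router logits exposed by the Hugging Face Mixtral implementation. The checkpoint id is an assumption, and routing frequency is only a proxy for the paper's notion of expert importance.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "hfl/chinese-mixtral"   # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

ids = tok("今天天气真不错，我们一起去公园散步吧。", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    out = model(ids, output_router_logits=True)      # one router-logit tensor per MoE layer

n_experts = model.config.num_local_experts
freq = torch.zeros(len(out.router_logits), n_experts)
for layer, logits in enumerate(out.router_logits):   # logits: (n_tokens, n_experts)
    top2 = logits.float().cpu().topk(2, dim=-1).indices
    freq[layer] = torch.bincount(top2.flatten(), minlength=n_experts).float()

freq = freq / freq.sum(dim=-1, keepdim=True)         # per-layer expert selection frequencies
print(freq)
```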

Conclusion and Future Directions

The development of Chinese-Mixtral and Chinese-Mixtral-Instruct represents a significant advancement in the adaptation of LLMs for Chinese language processing. These models maintain efficiency and performance without necessitating vocabulary extension, challenging prevailing assumptions in the field. The insights gleaned on initialization models and the inherent long-context abilities of Mixtral open new avenues for research and application in multilingual NLP. By making these resources publicly available, this work encourages further exploration and collaboration within the open-source community, promising continued innovation in LLM adaptation and beyond.

Authors
  1. Yiming Cui
  2. Xin Yao