
Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning (2312.14557v2)

Published 22 Dec 2023 in cs.CL

Abstract: Existing research has demonstrated that refining LLMs with machine-generated instruction-following data endows these models with impressive zero-shot capabilities on novel tasks, without requiring human-authored instructions. In this paper, we systematically investigate, preprocess, and integrate three Chinese instruction-following datasets with the aim of enhancing the Chinese conversational capabilities of the Mixtral-8x7B sparse Mixture-of-Experts model. Through instruction fine-tuning on this carefully processed dataset, we successfully construct the instruction-tuned Mixtral-8x7B model named "Aurora." To assess Aurora's performance, we use three widely recognized benchmarks: C-Eval, MMLU, and CMMLU. Empirical studies validate the effectiveness of instruction fine-tuning applied to the Mixtral-8x7B sparse Mixture-of-Experts model. This work is pioneering in executing instruction fine-tuning on a sparse Mixture-of-Experts model, marking a significant breakthrough in enhancing the capabilities of this model architecture. Our code, data, and model are publicly available at https://github.com/WangRongsheng/Aurora

Analyzing the Instruction-Tuning Methodology for Enhancing Chinese Conversational Capabilities in Mixtral-8x7B

The paper "Aurora: Activating Chinese Chat Capability for Mixtral-8x7B Sparse Mixture-of-Experts through Instruction-Tuning" represents a significant contribution to the ongoing research in maximizing the potential of LLMs for multilingual applications, particularly focusing on Chinese conversational tasks. The authors meticulously explore the enhancement of Mixtral-8x7B, a sparse Mixture-of-Experts (MoE) model, by leveraging instruction-tuning techniques to improve its zero-shot capabilities for engaging in Chinese-based dialogue.

Core Contributions and Methodology

The research introduces a systematic approach to extending the Chinese conversational capabilities of the Mixtral-8x7B model. The model takes its name from its eight 7B-scale experts: a router dynamically selects two experts for each input token, so only a fraction of the total parameters is active at a time, keeping computation efficient. To address the model's limitations on native Chinese tasks, the paper adds value through several key contributions:
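
To make the routing mechanism concrete, below is a minimal, illustrative sketch of top-2 routing over eight expert feed-forward networks in PyTorch. It is not the actual Mixtral implementation (the real expert blocks are gated SwiGLU MLPs and the production code is more heavily optimized); the class name and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Illustrative sparse MoE layer: each token is processed by 2 of 8 experts."""

    def __init__(self, hidden_size: int, ffn_size: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.SiLU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        logits = self.router(x)                             # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)                # normalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the two selected experts run for each token, which is why the per-token compute cost is far lower than the total parameter count would suggest.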

  1. Dataset Integration and Fine-Tuning: The authors compile and preprocess three distinct Chinese instruction-following datasets: alpaca_data_zh_51k, alpaca_gpt4_data_zh, and sharegpt_70k. The datasets are rigorously cleaned and unified into a multi-domain, high-quality conversational corpus of 176,678 interactions, which is then used to fine-tune Mixtral-8x7B for better alignment with Chinese dialogue (a hedged data-formatting sketch follows this list).
  2. Model Development and Evaluation: The fine-tuned Mixtral-8x7B, named "Aurora," is evaluated on the widely used C-Eval, MMLU, and CMMLU benchmarks, which span many subjects and difficulty levels and therefore provide a robust test of its capabilities. The empirical results show clear improvements in Aurora's ability to understand and respond to Chinese dialogue prompts.
  3. Novel Instruction-Tuning Application: This work pioneers instruction-tuning of a sparse Mixture-of-Experts model. The approach uses Low-Rank Adaptation (LoRA) to update a small set of adapter weights while the base model is held in 4-bit precision, sharply reducing GPU memory requirements (see the fine-tuning sketch after this list). The results substantiate that instruction-tuning applies effectively to sparse models, expanding their applicability to diverse linguistic contexts.
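
The data-formatting sketch referenced in item 1 is shown below. It assumes the common public layouts of Alpaca-style and ShareGPT-style data (instruction/input/output fields versus conversations turn lists) and hypothetical file names; the paper's exact preprocessing pipeline may differ.

```python
import json

def from_alpaca(record):
    """Alpaca format: instruction / input / output -> a single-turn dialogue."""
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n" + record["input"]
    return [{"role": "user", "content": prompt},
            {"role": "assistant", "content": record["output"]}]

def from_sharegpt(record):
    """ShareGPT format: a list of {'from': 'human'|'gpt', 'value': ...} turns."""
    role_map = {"human": "user", "gpt": "assistant"}
    return [{"role": role_map[t["from"]], "content": t["value"]}
            for t in record["conversations"] if t["from"] in role_map]

def build_corpus(alpaca_files, sharegpt_files):
    corpus = []
    for path in alpaca_files:
        with open(path, encoding="utf-8") as f:
            corpus += [from_alpaca(r) for r in json.load(f)]
    for path in sharegpt_files:
        with open(path, encoding="utf-8") as f:
            corpus += [from_sharegpt(r) for r in json.load(f)]
    # Basic cleaning: drop dialogues that contain an empty turn.
    return [d for d in corpus if all(turn["content"].strip() for turn in d)]

# Hypothetical usage with the three datasets named in the paper:
# corpus = build_corpus(["alpaca_data_zh_51k.json", "alpaca_gpt4_data_zh.json"],
#                       ["sharegpt_70k.json"])
```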

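The fine-tuning sketch referenced in item 3 follows. It shows a QLoRA-style setup with Hugging Face `transformers`, `peft`, and `bitsandbytes`: the base Mixtral weights are loaded in 4-bit precision and only low-rank adapters are trained. The rank, target modules, and other hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "mistralai/Mixtral-8x7B-v0.1"

# Load the frozen base model in 4-bit NF4 to keep GPU memory manageable.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; assumed rank/targets, not the paper's.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are updated during training
```

Training then proceeds with a standard causal-language-modeling loop (for example, transformers.Trainer or trl.SFTTrainer) over the merged instruction corpus.
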
Implications and Future Directions

Aurora's enhancements highlight the practical utility of instruction-tuning sparse MoE models for language-specific tasks. By combining comprehensive datasets with parameter-efficient weight adaptation, Aurora achieves competitive performance across diverse linguistic benchmarks. The paper sets a precedent for extending multilingual capabilities in sparse models and encourages the development of LLMs that, like Aurora, align more effectively with users' conversational needs.

From a theoretical perspective, this paper supports the growing body of evidence that instruction-tuning significantly augments LLMs' generalization abilities. It invites speculation that future advancements in this domain could include dynamically adaptive models capable of real-time multilingual translation and interaction. The paper elucidates a promising trajectory for enhancing LLMs' capabilities through efficient resource optimization and effective utilization of localized datasets.

Overall, this research not only advances the field of multilingual LLM applications but also paves the way for more sophisticated implementations of instruction-tuning methodologies, fostering greater inclusivity in natural language processing across diverse linguistic landscapes.

Authors (11)
  1. Rongsheng Wang (16 papers)
  2. Haoming Chen (17 papers)
  3. Ruizhe Zhou (2 papers)
  4. Yaofei Duan (5 papers)
  5. Kunyan Cai (3 papers)
  6. Han Ma (33 papers)
  7. Jiaxi Cui (13 papers)
  8. Jian Li (667 papers)
  9. Patrick Cheong-Iao Pang (6 papers)
  10. Yapeng Wang (10 papers)
  11. Tao Tan (54 papers)
Citations (2)