Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation (2305.15011v2)

Published 24 May 2023 in cs.CL

Abstract: Instruction tuning has shown great promise in improving the performance of LLMs. However, research on multilingual instruction tuning has been limited due to the scarcity of high-quality instruction-response datasets across different languages. To bridge this gap, we present Bactrian-X, a comprehensive multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. Leveraging this dataset, we train a set of adapters using low-rank adaptation (LoRA), which are lightweight components that seamlessly integrate with LLMs. These adapters have a substantially lower parameter count than the base model, making them easily replaceable and usable as plug-ins for different languages or language groups. Extensive experiments in various multilingual evaluation settings demonstrate that models derived from LoRA-based training over Bactrian-X outperform both the vanilla models and existing instruction-tuned models. The code and models are publicly available at https://github.com/mbzuai-nlp/bactrian-x

Bactrian-X: Advancing Multilingual Instruction-Following Models

The paper "Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation" details the development of Bactrian-X, a substantial multilingual dataset containing 3.4 million instruction-response pairs spanning 52 languages, aimed at enhancing the multilingual capabilities of LLMs through instruction tuning. The paper leverages Low-Rank Adaptation (LoRA) to efficiently fine-tune LLMs with this dataset, providing insights into lightweight adaptation methodologies for multilingual contexts.

Key Contributions

The paper makes several notable contributions to the field of multilingual AI and NLP:

  1. Multilingual Instruction Dataset: Bactrian-X consists of instructions drawn from existing English datasets (Alpaca and Dolly), automatically translated into the target languages with the Google Translate API and paired with responses generated by ChatGPT (sketched in the first code example after this list). The dataset addresses the long-standing scarcity of high-quality multilingual instruction-response data.
  2. Parameter-Efficient Fine-Tuning: Using LoRA, models are fine-tuned through adapters with a far smaller parameter count than the base model, allowing seamless integration with existing LLMs such as BLOOM and LLaMA without full model updates (second sketch after this list).
  3. Evaluation and Results: Bactrian-X models outperform both vanilla and existing instruction-tuned models across multiple zero-shot language-understanding tasks, such as XCOPA and multilingual sentiment analysis, with the strongest results coming from larger base models such as the 13B-parameter LLaMA.
  4. Open-Ended Question Assessment: Using GPT-4 as an automatic evaluator for open-ended generation (third sketch after this list), the authors show that Bactrian-X models improve markedly over models such as Alpaca and BLOOMZ, particularly on languages or domains unseen during pre-training.
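
The construction recipe in item 1 can be illustrated with a short sketch. The snippet below assumes the Google Cloud Translate client and the OpenAI Python SDK; the model name, language code, and record fields are illustrative, not the authors' exact pipeline.

```python
# Sketch of the Bactrian-X construction recipe: translate an English
# instruction, then generate a response in the target language with a
# ChatGPT-class model. Library choices and model names are assumptions.
from google.cloud import translate_v2 as translate
from openai import OpenAI

translator = translate.Client()
chat = OpenAI()

def build_pair(instruction_en: str, target_lang: str) -> dict:
    # 1) Translate the English instruction (e.g. from Alpaca or Dolly).
    translated = translator.translate(
        instruction_en, target_language=target_lang
    )["translatedText"]

    # 2) Ask a ChatGPT-class model to answer in the target language.
    response = chat.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the ChatGPT API used by the authors
        messages=[{"role": "user", "content": translated}],
    )
    return {
        "instruction": translated,
        "output": response.choices[0].message.content,
        "lang": target_lang,
    }

# Example: pair = build_pair("Explain why the sky is blue.", "id")  # Indonesian
```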
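
The parameter-efficiency claim in item 2 is easiest to see with Hugging Face PEFT. The base checkpoint and LoRA hyperparameters (rank, alpha, target modules) below are illustrative defaults, not necessarily the paper's settings.

```python
# Minimal LoRA setup with Hugging Face PEFT: only the low-rank adapter
# matrices are trainable; the base model weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # illustrative base

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# The trainable adapter is a small fraction of the base model's parameters,
# which is what makes it cheap to train and easy to swap per language.
```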
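
Item 4 follows the now-common LLM-as-judge pattern; the sketch below shows the general shape of such an evaluation call. The rubric wording and scoring scale are assumptions, not the paper's exact prompt.

```python
# Illustrative GPT-4-as-judge call for open-ended generation: ask the judge
# to score two candidate answers to the same question.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are evaluating two assistants' answers to the same question.\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Score each answer from 1 to 10 for helpfulness and correctness, "
        "then briefly justify the scores."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```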

Implications and Future Directions

This paper highlights the potential of multilingual instruction datasets to enhance LLMs' abilities across diverse languages. Bactrian-X underscores the importance of broader multilingual training data and efficient adaptation techniques such as LoRA in expanding the capabilities of LLMs.

  • Practical Implications: Because LoRA adapters are small and swappable, the approach can plausibly scale to languages beyond those seen in pre-training, broadening applicability in global NLP applications (see the sketch after this list).
  • Theoretical Directions: Future research might explore the extension of this methodology to different model architectures, gauging the efficacy of such instruction-following models in various linguistic and cultural contexts.
  • AI Advancements: The paper offers a framework for future improvements in AI generalizability by focusing on multilingual readiness and efficiency, a necessity as AI systems are increasingly deployed globally.
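
To make the plug-in idea concrete, the sketch below attaches a language-specific LoRA adapter to a frozen base model with PEFT. The adapter repository name is a placeholder; the actual adapters are linked from the Bactrian-X repository.

```python
# Swapping in a language-specific LoRA adapter at inference time.
# The adapter id is a placeholder; real adapters are listed at
# https://github.com/mbzuai-nlp/bactrian-x
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "huggyllama/llama-7b"        # illustrative base checkpoint
adapter_id = "your-org/bactrian-lora"  # placeholder adapter repository

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_id)  # plug the adapter into the frozen base

prompt = "Instruction: Summarize the benefits of multilingual instruction tuning.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because only the adapter differs per language, serving many languages mostly means storing and loading many small adapter files over a single shared base model.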

In summary, the Bactrian-X dataset and corresponding model innovations present substantial progress in NLP by equipping LLMs with more adaptable, multilingual capabilities. Through a focus on efficiency and scope, this work sets a precedent for multi-faceted growth in multilingual AI research, aiming to provide more equitable LLM capabilities across diverse linguistic landscapes.

Authors (5)
  1. Haonan Li (43 papers)
  2. Fajri Koto (47 papers)
  3. Minghao Wu (31 papers)
  4. Alham Fikri Aji (94 papers)
  5. Timothy Baldwin (125 papers)
Citations (71)