Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation (2305.15011v2)

Published 24 May 2023 in cs.CL

Abstract: Instruction tuning has shown great promise in improving the performance of LLMs. However, research on multilingual instruction tuning has been limited due to the scarcity of high-quality instruction-response datasets across different languages. To bridge this gap, we present Bactrian-X, a comprehensive multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. Leveraging this dataset, we train a set of adapters using low-rank adaptation (LoRA), which are lightweight components that seamlessly integrate with LLMs. These adapters have a substantially lower parameter count than the base model, making them easily replaceable and usable as plug-ins for different languages or language groups. Extensive experiments in various multilingual evaluation settings demonstrate that models derived from LoRA-based training over Bactrian-X outperform both the vanilla models and existing instruction-tuned models. The code and models are publicly available at https://github.com/mbzuai-nlp/bactrian-x

Bactrian-X: Advancing Multilingual Instruction-Following Models

The paper "Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation" details the development of Bactrian-X, a substantial multilingual dataset containing 3.4 million instruction-response pairs spanning 52 languages, aimed at enhancing the multilingual capabilities of LLMs through instruction tuning. The paper leverages Low-Rank Adaptation (LoRA) to efficiently fine-tune LLMs with this dataset, providing insights into lightweight adaptation methodologies for multilingual contexts.

Key Contributions

The paper makes several notable contributions to the field of multilingual AI and NLP:

  1. Multilingual Instruction Dataset: Bactrian-X consists of instructions drawn from existing English datasets (Alpaca and Dolly), automatically translated into the target languages with the Google Translate API and paired with responses generated by ChatGPT (sketched in the first code example after this list). The dataset addresses the long-standing scarcity of high-quality multilingual instruction-response data.
  2. Parameter-Efficient Fine-Tuning: Using LoRA, models are fine-tuned through adapters with a far smaller parameter count than the base model, allowing seamless integration with existing LLMs such as BLOOM and LLaMA without full model updates (second sketch after this list).
  3. Evaluation and Results: Bactrian-X models outperform both vanilla and existing instruction-tuned models across multiple zero-shot language-understanding tasks, such as XCOPA and multilingual sentiment analysis, with the strongest results coming from larger base models such as the 13B-parameter LLaMA.
  4. Open-Ended Question Assessment: Using GPT-4 as an automatic evaluator for open-ended generation (third sketch after this list), the authors show that Bactrian-X models improve markedly over models such as Alpaca and BLOOMZ, particularly on languages or domains unseen during pre-training.
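
The construction recipe in item 1 can be illustrated with a short sketch. The snippet below assumes the Google Cloud Translate client and the OpenAI Python SDK; the model name, language code, and record fields are illustrative, not the authors' exact pipeline.

```python
# Sketch of the Bactrian-X construction recipe: translate an English
# instruction, then generate a response in the target language with a
# ChatGPT-class model. Library choices and model names are assumptions.
from google.cloud import translate_v2 as translate
from openai import OpenAI

translator = translate.Client()
chat = OpenAI()

def build_pair(instruction_en: str, target_lang: str) -> dict:
    # 1) Translate the English instruction (e.g. from Alpaca or Dolly).
    translated = translator.translate(
        instruction_en, target_language=target_lang
    )["translatedText"]

    # 2) Ask a ChatGPT-class model to answer in the target language.
    response = chat.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the ChatGPT API used by the authors
        messages=[{"role": "user", "content": translated}],
    )
    return {
        "instruction": translated,
        "output": response.choices[0].message.content,
        "lang": target_lang,
    }

# Example: pair = build_pair("Explain why the sky is blue.", "id")  # Indonesian
```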
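
The parameter-efficiency claim in item 2 is easiest to see with Hugging Face PEFT. The base checkpoint and LoRA hyperparameters (rank, alpha, target modules) below are illustrative defaults, not necessarily the paper's settings.

```python
# Minimal LoRA setup with Hugging Face PEFT: only the low-rank adapter
# matrices are trainable; the base model weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # illustrative base

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# The trainable adapter is a small fraction of the base model's parameters,
# which is what makes it cheap to train and easy to swap per language.
```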
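
Item 4 follows the now-common LLM-as-judge pattern; the sketch below shows the general shape of such an evaluation call. The rubric wording and scoring scale are assumptions, not the paper's exact prompt.

```python
# Illustrative GPT-4-as-judge call for open-ended generation: ask the judge
# to score two candidate answers to the same question.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are evaluating two assistants' answers to the same question.\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Score each answer from 1 to 10 for helpfulness and correctness, "
        "then briefly justify the scores."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```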

Implications and Future Directions

This paper highlights the potential of multilingual instruction datasets to enhance LLMs' abilities across diverse languages. Bactrian-X underscores the importance of broader multilingual training data and efficient adaptation techniques such as LoRA in expanding the capabilities of LLMs.

  • Practical Implications: Because LoRA adapters are small and swappable, the approach can plausibly scale to languages beyond those seen in pre-training, broadening applicability in global NLP applications (see the sketch after this list).
  • Theoretical Directions: Future research might explore the extension of this methodology to different model architectures, gauging the efficacy of such instruction-following models in various linguistic and cultural contexts.
  • AI Advancements: The paper offers a framework for future improvements in AI generalizability by focusing on multilingual readiness and efficiency, a necessity as AI systems are increasingly deployed globally.
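
To make the plug-in idea concrete, the sketch below attaches a language-specific LoRA adapter to a frozen base model with PEFT. The adapter repository name is a placeholder; the actual adapters are linked from the Bactrian-X repository.

```python
# Swapping in a language-specific LoRA adapter at inference time.
# The adapter id is a placeholder; real adapters are listed at
# https://github.com/mbzuai-nlp/bactrian-x
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "huggyllama/llama-7b"        # illustrative base checkpoint
adapter_id = "your-org/bactrian-lora"  # placeholder adapter repository

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_id)  # plug the adapter into the frozen base

prompt = "Instruction: Summarize the benefits of multilingual instruction tuning.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because only the adapter differs per language, serving many languages mostly means storing and loading many small adapter files over a single shared base model.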

In summary, the Bactrian-X dataset and corresponding model innovations present substantial progress in NLP by equipping LLMs with more adaptable, multilingual capabilities. Through a focus on efficiency and scope, this work sets a precedent for multi-faceted growth in multilingual AI research, aiming to provide more equitable LLM capabilities across diverse linguistic landscapes.

Authors (5)
  1. Haonan Li (43 papers)
  2. Fajri Koto (47 papers)
  3. Minghao Wu (31 papers)
  4. Alham Fikri Aji (94 papers)
  5. Timothy Baldwin (125 papers)
Citations (71)