Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions? (2402.13703v3)

Published 21 Feb 2024 in cs.CL

Abstract: The adaptation of multilingual pre-trained LLMs into eloquent and helpful assistants is essential to facilitate their use across different language regions. In that spirit, we are the first to conduct an extensive study of the performance of multilingual models instruction-tuned on different language compositions, evaluated on parallel instruction-tuning benchmarks across a selection of the most spoken Indo-European languages. We systematically examine the effects of language and instruction dataset size on a mid-sized and a large multilingual LLM by instruction-tuning them on parallel instruction-tuning datasets. Our results demonstrate that instruction-tuning on parallel instead of monolingual corpora benefits cross-lingual instruction-following capabilities by up to 9.9%. Furthermore, we show that the Superficial Alignment Hypothesis does not hold in general, as the investigated multilingual 7B parameter model presents a counter-example requiring large-scale instruction-tuning datasets. Finally, we conduct a human annotation study to understand the alignment between human-based and GPT-4-based evaluation within multilingual chat scenarios.
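
The study's central manipulation is whether the instruction-tuning corpus is monolingual or parallel across languages. As a rough illustration only (this is not the authors' code; the model name, prompt template, and toy examples are assumptions), the sketch below fine-tunes a small multilingual causal LM on instruction/response pairs that are exact translations of one another:

```python
# Minimal sketch of the core idea: instruction-tune a multilingual causal LM on a
# *parallel* instruction dataset, i.e. the same instruction/response pair rendered
# in several languages. Model name, template, and data below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # small multilingual stand-in for the paper's models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Parallel data: one instruction/response pair translated into several languages.
parallel_data = [
    {"instruction": "Summarize the text in one sentence.", "response": "The text argues ...", "lang": "en"},
    {"instruction": "Fasse den Text in einem Satz zusammen.", "response": "Der Text argumentiert ...", "lang": "de"},
    {"instruction": "Résume le texte en une phrase.", "response": "Le texte soutient ...", "lang": "fr"},
]

def encode(example):
    # Simple chat-style template; the paper's actual template may differ.
    text = (f"### Instruction:\n{example['instruction']}\n"
            f"### Response:\n{example['response']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for example in parallel_data:
    batch = encode(example)
    # Standard causal-LM fine-tuning objective; the model shifts the labels internally.
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Evaluation then amounts to prompting the tuned model with held-out instructions in each language and comparing monolingual against parallel tuning runs; this comparison is where the paper reports gains of up to 9.9% in cross-lingual instruction following.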

Authors (7)
  1. Alexander Arno Weber (4 papers)
  2. Klaudia Thellmann (4 papers)
  3. Jan Ebert (11 papers)
  4. Nicolas Flores-Herr (10 papers)
  5. Jens Lehmann (80 papers)
  6. Michael Fromm (24 papers)
  7. Mehdi Ali (11 papers)
Citations (2)