Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages (2402.12204v1)
Abstract: While LLMs have been pre-trained on multilingual corpora, their performance in most languages still lags behind that in a few resource-rich languages. One common approach to mitigating this issue is to translate training data from resource-rich languages into other languages and then continue training. However, relying solely on translated data while ignoring the original capabilities of LLMs across languages is not always effective, and we show that it limits the performance of cross-lingual knowledge transfer. In this work, we propose SDRRL, a method based on Self-Distillation from Resource-Rich Languages that effectively improves multilingual performance by leveraging the internal capabilities of LLMs in resource-rich languages. We evaluate SDRRL on different LLMs (LLaMA-2 and SeaLLM) and source languages across various comprehension and generation tasks. Experimental results demonstrate that SDRRL can significantly enhance multilingual capabilities while minimizing the impact on original performance in resource-rich languages.
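The abstract describes distilling the model's own outputs in a resource-rich language into training data for other languages. Below is a minimal sketch of how such sequence-level self-distillation pairs might be constructed with Hugging Face Transformers; the base model name, the `translate_to_target` helper, and the pairing scheme are illustrative assumptions, not the paper's exact SDRRL pipeline.

```python
# Illustrative sketch only: generate a "teacher" response in the resource-rich
# language (English) with the model itself, then pair it with a translated
# instruction to form a cross-lingual training example for continued fine-tuning.
# The model name and the translate_to_target callable are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def build_self_distilled_example(instruction_en, translate_to_target):
    """Return one (instruction, response) pair built from the model's own
    English output and a translation of the instruction."""
    inputs = tokenizer(instruction_en, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Keep only the newly generated tokens as the teacher response.
    response_en = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    instruction_tgt = translate_to_target(instruction_en)  # hypothetical helper
    return {"instruction": instruction_tgt, "response": response_en}
```

Under this reading, the self-generated pairs would be mixed with the original resource-rich data for continued training, which matches the abstract's emphasis on preserving the model's existing capabilities in resource-rich languages.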
- The Belebele benchmark: a parallel reading comprehension dataset in 122 language variants.
- xCoT: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning.
- Breaking language barriers in multilingual mathematical reasoning: Insights and observations.
- Distilling knowledge learned in BERT for text generation.
- Improving pretrained cross-lingual language models via self-labeled word alignment.
- Free Dolly: Introducing the world’s first truly open instruction-tuned LLM.
- No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
- Do multilingual language models think better in English?
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- Mitchell A. Gordon and Kevin Duh. 2019. Explaining sequence-level knowledge distillation as data-augmentation for neural machine translation.
- Knowledge distillation: A survey. International Journal of Computer Vision, 129:1789–1819.
- The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
- XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics.
- Distilling the knowledge in a neural network.
- Hiyouga. 2023. LLaMA Factory. https://github.com/hiyouga/LLaMA-Factory.
- Not all languages are created equal in LLMs: Improving multilingual capability by cross-lingual-thought prompting.
- Zero-shot cross-lingual transfer of prompt-based tuning with a unified multilingual prompt.
- Mistral 7B. arXiv preprint arXiv:2310.06825.
- Mixtral of experts.
- Turning English-centric LLMs into polyglots: How much multilinguality is needed?
- Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation.
- Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining.
- Align after pre-train: Improving multilingual generative models with cross-lingual alignment.
- Bactrian-X: Multilingual replicable instruction-following models with low-rank adaptation.
- Pretrained language models for text generation: A survey.
- Label supervised LLaMA finetuning.
- Few-shot learning with multilingual language models.
- Pre-training multilingual neural machine translation by leveraging alignment information.
- MKQA: A linguistically diverse benchmark for multilingual open domain question answering.
- CPT: Cross-modal prefix-tuning for speech-to-text translation. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6217–6221.
- Zhuoyuan Mao and Yen Yu. 2024. Tuning LLMs with contrastive alignment instructions for machine translation in unseen, low-resource languages.
- SeaLLMs – Large language models for Southeast Asia.
- LEXTREME: A multi-lingual and multi-task benchmark for the legal domain. In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics.
- Jeroen Ooms. 2024. cld3: Google’s Compact Language Detector 3. R package version 1.6.0.
- OpenAI. 2022. ChatGPT. https://openai.com/chatgpt.
- OpenAI. 2023. GPT-4 technical report.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Saurabh Pahune and Manoj Chandrasekharan. 2023. Several categories of large language models (llms): A short survey. International Journal for Research in Applied Science and Engineering Technology, 11(7):615–633.
- Revisiting self-distillation.
- Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
- Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages.
- Leonardo Ranaldi and Giulia Pucci. 2023. Does the english matter? elicit cross-lingual abilities of large language models. In Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), pages 173–183.
- Leonardo Ranaldi and Fabio Massimo Zanzotto. 2023. Empowering multi-step reasoning across languages via tree-of-thoughts.
- Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 3505–3506, New York, NY, USA. Association for Computing Machinery.
- COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Advfusion: Multilingual adapter-based knowledge transfer for code summarization.
- Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing.
- Multilingual instruction tuning with just a pinch of multilinguality.
- Knowledge distillation for multilingual unsupervised neural machine translation.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- LLaMA: Open and efficient foundation language models.
- Llama 2: Open foundation and fine-tuned chat models.
- Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).
- OpenChat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Andrea Wen-Yi and David Mimno. 2023. Hyperpolyglot llms: Cross-lingual interpretability in token embeddings. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- BLOOM: A 176B-parameter open-access multilingual language model.
- mT5: A massively multilingual pre-trained text-to-text transformer.
- Snapshot distillation: Teacher-student optimization in one generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- BigTranslate: Augmenting large language models with multilingual translation capability over 100 languages.
- LangBridge: Multilingual reasoning without multilingual supervision.
- WeChat neural machine translation systems for WMT21. In Proceedings of the Sixth Conference on Machine Translation, pages 243–254, Online. Association for Computational Linguistics.
- Improving massively multilingual neural machine translation and zero-shot translation.
- Self-distillation: Towards efficient and compact neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4388–4403.
- Be your own teacher: Improve the performance of convolutional neural networks via self distillation.
- Continual knowledge distillation for neural machine translation.
- Yuanchi Zhang and Yang Liu. 2021. DirectQuote: A dataset for direct quotation extraction and attribution in news articles. arXiv preprint arXiv:2110.07827.
- LLaMA beyond English: An empirical study on language capability transfer.
- Question translation training for better multilingual reasoning.
Authors: Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, Yang Liu