CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment (2404.11932v2)
Abstract: Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models are usually suboptimal in other languages, particularly those that are linguistically distant from English. This performance discrepancy mainly stems from the imbalanced distribution of training data across languages during the pre-training and instruction-tuning stages. To address this problem, we propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction-tuning data. Our method leverages the compressed representation shared across languages to efficiently enhance the model's task-solving capabilities and multilingual proficiency within a single tuning process. In addition, we introduce a multi-task and multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental results demonstrate that our method substantially improves performance across tasks and languages, and we provide extensive insights into the impact of cross-lingual data volume and the integration of translation data on multilingual consistency and accuracy.
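The abstract describes CrossIn as tuning on a mixed composition of cross-lingual instruction data, optionally augmented with translation data. The sketch below illustrates one plausible way such a mixture could be assembled: monolingual English examples, cross-lingual examples whose instruction and response are in different languages, and explicit translation examples. The data, mixing ratios, and helper function here are illustrative assumptions, not the paper's released implementation.

```python
import random

# Hypothetical seed data: English instruction-response pairs plus pre-translated
# copies in target languages (e.g., produced by a translation pipeline).
EN_PAIRS = [
    {"instruction": "Name the capital of France.", "response": "Paris."},
    {"instruction": "Summarize: The cat sat on the mat.", "response": "A cat rested on a mat."},
]
TRANSLATIONS = {  # assumed translations, keyed by language code, aligned by index
    "zh": [
        {"instruction": "说出法国的首都。", "response": "巴黎。"},
        {"instruction": "总结：猫坐在垫子上。", "response": "一只猫趴在垫子上休息。"},
    ],
    "es": [
        {"instruction": "Nombra la capital de Francia.", "response": "París."},
        {"instruction": "Resume: El gato se sentó en la alfombra.", "response": "Un gato descansaba sobre una alfombra."},
    ],
}


def build_crossin_mixture(en_pairs, translations, cross_ratio=0.5, translation_ratio=0.2, seed=0):
    """Assemble a mixed instruction-tuning set containing:
    - monolingual English examples,
    - cross-lingual examples (instruction in language A, response in language B),
    - translation examples (translate an English response into a target language).
    The ratios are illustrative hyperparameters, not values from the paper."""
    rng = random.Random(seed)
    langs = list(translations.keys())
    mixture = []

    for i, en in enumerate(en_pairs):
        # Always keep the original English example.
        mixture.append({"instruction": en["instruction"], "response": en["response"], "type": "en"})

        # With some probability, add a cross-lingual variant:
        # instruction in one language, response in another.
        if rng.random() < cross_ratio:
            src, tgt = rng.sample(["en"] + langs, 2)
            instr = en["instruction"] if src == "en" else translations[src][i]["instruction"]
            resp = en["response"] if tgt == "en" else translations[tgt][i]["response"]
            mixture.append({"instruction": instr, "response": resp, "type": f"cross_{src}-{tgt}"})

        # Optionally add a translation example to reinforce cross-lingual alignment.
        if rng.random() < translation_ratio:
            tgt = rng.choice(langs)
            mixture.append({
                "instruction": f"Translate into {tgt}: {en['response']}",
                "response": translations[tgt][i]["response"],
                "type": f"translate_en-{tgt}",
            })

    rng.shuffle(mixture)
    return mixture


if __name__ == "__main__":
    # Print the mixed examples that would feed a single instruction-tuning run.
    for ex in build_crossin_mixture(EN_PAIRS, TRANSLATIONS):
        print(ex["type"], "|", ex["instruction"], "->", ex["response"])
```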
Authors: Geyu Lin, Bin Wang, Zhengyuan Liu, Nancy F. Chen