
LLaMA Beyond English: An Empirical Study on Language Capability Transfer (2401.01055v2)

Published 2 Jan 2024 in cs.CL and cs.AI

Abstract: In recent times, substantial advancements have been witnessed in LLMs, exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g. LLaMA) are pretrained on English-dominant corpora, which limits their performance in other, non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a comprehensive evaluation of the model's response quality is conducted, considering aspects such as accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality. Furthermore, the experimental outcomes across thirteen low-resource languages exhibit similar trends. We anticipate that the conclusions revealed by these experiments will aid the community in developing non-English LLMs.

Introduction

Advances in LLMs have led to breakthroughs in tasks like reasoning, learning from experience, and following instructions. Yet, despite these advances, the overwhelming focus on English corpora has limited LLMs' abilities in other languages.

This paper explores how to transfer the capabilities of LLMs, specifically LLaMA, to non-English languages at minimal cost. Drawing on more than 1440 GPU hours of experiments, the research evaluates vocabulary extension, further pretraining, and instruction tuning as the key factors influencing the transfer process. Testing on both knowledge benchmarks and instruction-following tasks provides a holistic assessment of the model's language capabilities.
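
To make the study design concrete, here is a hedged sketch of the kind of ablation grid the paper varies; the specific factor values and run naming are illustrative assumptions, not figures taken from the paper.

```python
# Illustrative sketch (assumptions, not the paper's code): enumerate the transfer
# factors under study and the benchmarks each configuration would be scored on.
from itertools import product

vocab_extension = [False, True]                      # extend LLaMA's tokenizer with Chinese tokens?
pretrain_tokens = [0, int(0.5e9), int(30e9)]         # further-pretraining budgets (assumed values)
instruction_examples = [1_000, 100_000, 1_000_000]   # instruction-tuning set sizes (assumed values)

knowledge_benchmarks = ["C-Eval", "MMLU", "AGIEval", "GAOKAO-Bench"]
response_criteria = ["accuracy", "fluency", "informativeness",
                     "logical coherence", "harmlessness"]  # LLM-Eval dimensions

for extend, tokens, n_sft in product(vocab_extension, pretrain_tokens, instruction_examples):
    # each run would be evaluated on the knowledge benchmarks and LLM-Eval criteria above
    run_name = f"vocab_ext={extend}_pretrain={tokens:.1e}_sft={n_sft}"
    print(run_name)
```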

Analyzing Transfer Factors

The paper reveals unexpected findings regarding vocabulary extension. Despite theories suggesting its usefulness, extending the vocabulary shows no clear advantage for transferring language capabilities. Surprisingly, vocabulary-extended models further pretrained on 30 billion Chinese tokens perform worse than original-vocabulary LLaMA models further pretrained on just 0.5 billion tokens.
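
For readers unfamiliar with the mechanics, the following is a minimal sketch of what vocabulary extension involves, assuming the Hugging Face transformers stack; the checkpoint name and the hard-coded token list are placeholders, and this is not the paper's actual pipeline.

```python
# Minimal vocabulary-extension sketch (assumed Hugging Face API usage, placeholder model id).
from transformers import LlamaForCausalLM, LlamaTokenizer

model_id = "huggyllama/llama-7b"          # placeholder checkpoint, not prescribed by the paper
tokenizer = LlamaTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id)

# In practice the new tokens come from a tokenizer trained on a Chinese corpus;
# a tiny hard-coded list stands in for that here.
new_tokens = ["你好", "世界", "语言模型"]
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix so the new ids get (randomly initialised) vectors.
# Those vectors only become useful after further pretraining, which is one reason
# extension alone buys little, consistent with the paper's results.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```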

In terms of training scale, the results indicate that improving language generation qualities such as fluency and logical coherence depends far more on a substantial amount of instruction tuning than on large-scale further pretraining. For model knowledge such as factual accuracy, however, neither additional pretraining on Chinese nor vocabulary expansion substantially improves the LLaMA models' performance.
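
A minimal sketch of the difference in training signal, under common supervised fine-tuning conventions (the prompt template and the -100 ignore index are assumptions, not the paper's exact recipe): instruction tuning computes loss only on response tokens, whereas further pretraining treats every token as a prediction target.

```python
# Sketch of instruction-tuning example construction (assumed template and conventions).
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss in common SFT setups

def build_sft_example(tokenizer, instruction: str, response: str, max_len: int = 512):
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"  # assumed template
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + response_ids)[:max_len]
    # Mask the prompt so gradients flow only through the response; in further
    # pretraining, by contrast, labels would simply equal input_ids.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```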

Maintaining English Proficiency

Another aspect considered is the impact of focused language-transfer training on an LLM's original English capabilities. Models trained exclusively on Chinese data show a drop in English proficiency, suggesting a trade-off between acquiring a new language and retaining existing capabilities. The remedy appears to be multilingual joint training, which preserves English skills while extending the model to new languages.
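
A rough sketch of what multilingual joint training can look like at the data level, assuming a simple fixed mixing ratio (the ratio and variable names are illustrative, not the paper's numbers):

```python
# Mix Chinese and English instruction data so English ability keeps being rehearsed.
import random

def mix_datasets(chinese_examples, english_examples, english_fraction=0.2, seed=0):
    """Return a shuffled mix in which roughly `english_fraction` of examples are English."""
    rng = random.Random(seed)
    n_en = int(len(chinese_examples) * english_fraction / (1.0 - english_fraction))
    mixed = list(chinese_examples) + rng.sample(list(english_examples),
                                                min(n_en, len(english_examples)))
    rng.shuffle(mixed)
    return mixed

# Example: roughly 4:1 Chinese-to-English mix (an assumed ratio for illustration).
mixed = mix_datasets([f"zh_{i}" for i in range(400)], [f"en_{i}" for i in range(200)])
print(len(mixed), sum(x.startswith("en_") for x in mixed))  # 500 total, ~100 English
```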

Expanding to Multiple Languages

This research also extends its findings beyond Chinese, encompassing 13 low-resource languages to validate the transfer process's effectiveness across diverse linguistic landscapes. The results are consistent, showcasing that the LLaMA model can quickly adapt to new languages with suitable instruction tuning, regardless of the resource level of the target language.

Conclusion and Implications

Overall, the paper concludes that effective language capability transfer to non-English languages can be achieved with significantly less data than previously thought necessary. The research also underlines the internalized cross-lingual alignment in LLMs, observed through code-switching instances in the model's responses, which may play a role in the transferability of language capabilities. These insights have the potential to guide the development of more capable and efficient multilingual LLMs, lowering the barriers for languages with fewer resources.
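
As a loose illustration of how code-switching in model responses might be observed, a simple script-mixing heuristic is sketched below; this is an assumed analysis aid, not the paper's methodology.

```python
# Flag responses that mix Han characters with Latin-script words (a rough heuristic).
import re

HAN = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")

def is_code_switched(text: str) -> bool:
    """True if the text contains both Chinese characters and Latin-script words."""
    return bool(HAN.search(text)) and bool(LATIN_WORD.search(text))

print(is_code_switched("这个模型的 zero-shot 表现很好"))  # True
print(is_code_switched("模型表现很好"))                  # False
```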

Authors (6)
  1. Jun Zhao
  2. Zhihao Zhang
  3. Qi Zhang
  4. Tao Gui
  5. Xuanjing Huang
  6. Luhui Gao