
LLaMA Beyond English: An Empirical Study on Language Capability Transfer (2401.01055v2)

Published 2 Jan 2024 in cs.CL and cs.AI

Abstract: In recent times, substantial advancements have been witnessed in LLMs, exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g. LLaMA) are pretrained on English-dominant corpora, which limits their performance in other, non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a comprehensive evaluation of the model's response quality is conducted, considering aspects such as accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality. Furthermore, the experimental outcomes across thirteen low-resource languages exhibit similar trends. We anticipate that the conclusions revealed by these experiments will aid the community in developing non-English LLMs.

Introduction

Advances in LLMs have led to breakthroughs in tasks like reasoning, learning from experience, and following instructions. Yet, despite these advances, the overwhelming focus on English corpora has limited LLMs' abilities in other languages.

This paper explores how to transfer the capabilities of LLMs, specifically LLaMA, to non-English languages at minimal cost. Drawing on more than 1440 GPU hours of experiments, the research evaluates vocabulary extension, further pretraining, and instruction tuning as the key factors influencing the transfer process. Testing on both knowledge benchmarks and instruction-following tasks provides a holistic assessment of the model's language capabilities.
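
To make the study design concrete, here is a hedged sketch of the kind of ablation grid the paper varies; the specific factor values and run naming are illustrative assumptions, not figures taken from the paper.

```python
# Illustrative sketch (assumptions, not the paper's code): enumerate the transfer
# factors under study and the benchmarks each configuration would be scored on.
from itertools import product

vocab_extension = [False, True]                      # extend LLaMA's tokenizer with Chinese tokens?
pretrain_tokens = [0, int(0.5e9), int(30e9)]         # further-pretraining budgets (assumed values)
instruction_examples = [1_000, 100_000, 1_000_000]   # instruction-tuning set sizes (assumed values)

knowledge_benchmarks = ["C-Eval", "MMLU", "AGIEval", "GAOKAO-Bench"]
response_criteria = ["accuracy", "fluency", "informativeness",
                     "logical coherence", "harmlessness"]  # LLM-Eval dimensions

for extend, tokens, n_sft in product(vocab_extension, pretrain_tokens, instruction_examples):
    # each run would be evaluated on the knowledge benchmarks and LLM-Eval criteria above
    run_name = f"vocab_ext={extend}_pretrain={tokens:.1e}_sft={n_sft}"
    print(run_name)
```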

Analyzing Transfer Factors

The paper reveals unexpected findings regarding vocabulary extension. Despite theories suggesting its usefulness, extending the vocabulary shows no clear advantage for transferring language capabilities. Surprisingly, vocabulary-extended models further pretrained on 30 billion Chinese tokens perform worse than original-vocabulary LLaMA models further pretrained on just 0.5 billion tokens.
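
For readers unfamiliar with the mechanics, the following is a minimal sketch of what vocabulary extension involves, assuming the Hugging Face transformers stack; the checkpoint name and the hard-coded token list are placeholders, and this is not the paper's actual pipeline.

```python
# Minimal vocabulary-extension sketch (assumed Hugging Face API usage, placeholder model id).
from transformers import LlamaForCausalLM, LlamaTokenizer

model_id = "huggyllama/llama-7b"          # placeholder checkpoint, not prescribed by the paper
tokenizer = LlamaTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id)

# In practice the new tokens come from a tokenizer trained on a Chinese corpus;
# a tiny hard-coded list stands in for that here.
new_tokens = ["你好", "世界", "语言模型"]
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix so the new ids get (randomly initialised) vectors.
# Those vectors only become useful after further pretraining, which is one reason
# extension alone buys little, consistent with the paper's results.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```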

In terms of training scale, the results indicate that improving language generation qualities such as fluency and logical coherence depends far more on a substantial amount of instruction tuning than on large-scale further pretraining. For model knowledge such as factual accuracy, however, neither additional pretraining on Chinese nor vocabulary expansion substantially improves the LLaMA models' performance.
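
A minimal sketch of the difference in training signal, under common supervised fine-tuning conventions (the prompt template and the -100 ignore index are assumptions, not the paper's exact recipe): instruction tuning computes loss only on response tokens, whereas further pretraining treats every token as a prediction target.

```python
# Sketch of instruction-tuning example construction (assumed template and conventions).
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss in common SFT setups

def build_sft_example(tokenizer, instruction: str, response: str, max_len: int = 512):
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"  # assumed template
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + response_ids)[:max_len]
    # Mask the prompt so gradients flow only through the response; in further
    # pretraining, by contrast, labels would simply equal input_ids.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```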

Maintaining English Proficiency

Another aspect considered is the impact of focused language-transfer training on an LLM's original English capabilities. Models trained exclusively on Chinese data show a drop in English proficiency, suggesting a trade-off between acquiring a new language and retaining existing capabilities. The remedy appears to be multilingual joint training, which preserves English skills while extending the model to new languages.
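
A rough sketch of what multilingual joint training can look like at the data level, assuming a simple fixed mixing ratio (the ratio and variable names are illustrative, not the paper's numbers):

```python
# Mix Chinese and English instruction data so English ability keeps being rehearsed.
import random

def mix_datasets(chinese_examples, english_examples, english_fraction=0.2, seed=0):
    """Return a shuffled mix in which roughly `english_fraction` of examples are English."""
    rng = random.Random(seed)
    n_en = int(len(chinese_examples) * english_fraction / (1.0 - english_fraction))
    mixed = list(chinese_examples) + rng.sample(list(english_examples),
                                                min(n_en, len(english_examples)))
    rng.shuffle(mixed)
    return mixed

# Example: roughly 4:1 Chinese-to-English mix (an assumed ratio for illustration).
mixed = mix_datasets([f"zh_{i}" for i in range(400)], [f"en_{i}" for i in range(200)])
print(len(mixed), sum(x.startswith("en_") for x in mixed))  # 500 total, ~100 English
```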

Expanding to Multiple Languages

This research also extends its findings beyond Chinese, encompassing 13 low-resource languages to validate the transfer process's effectiveness across diverse linguistic landscapes. The results are consistent, showcasing that the LLaMA model can quickly adapt to new languages with suitable instruction tuning, regardless of the resource level of the target language.

Conclusion and Implications

Overall, the paper concludes that effective language capability transfer to non-English languages can be achieved with significantly less data than previously thought necessary. The research also underlines the internalized cross-lingual alignment in LLMs, observed through code-switching instances in the model's responses, which may play a role in the transferability of language capabilities. These insights have the potential to guide the development of more capable and efficient multilingual LLMs, lowering the barriers for languages with fewer resources.
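
As a loose illustration of how code-switching in model responses might be observed, a simple script-mixing heuristic is sketched below; this is an assumed analysis aid, not the paper's methodology.

```python
# Flag responses that mix Han characters with Latin-script words (a rough heuristic).
import re

HAN = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")

def is_code_switched(text: str) -> bool:
    """True if the text contains both Chinese characters and Latin-script words."""
    return bool(HAN.search(text)) and bool(LATIN_WORD.search(text))

print(is_code_switched("这个模型的 zero-shot 表现很好"))  # True
print(is_code_switched("模型表现很好"))                  # False
```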

Authors (6)
  1. Jun Zhao
  2. Zhihao Zhang
  3. Qi Zhang
  4. Tao Gui
  5. Xuanjing Huang
  6. Luhui Gao