Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca (2304.08177v3)

Published 17 Apr 2023 in cs.CL, cs.HC, and cs.LG

Abstract: LLMs, such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards AGI. Nonetheless, the high costs associated with training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several LLMs, such as LLaMA, have been open-sourced by the community, these predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with capabilities for understanding and generating Chinese text and its ability to follow instructions. We achieve this by extending LLaMA's existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets, significantly enhancing the model's ability to comprehend and execute instructions. Our experimental results indicate that the newly proposed model markedly enhances the original LLaMA's proficiency in understanding and generating Chinese content. Additionally, the results on the C-Eval dataset yield competitive performance among the models with several times the size of ours. We have made our pre-trained models, training scripts, and other resources available through GitHub, fostering open research for our community. Chinese LLaMA series: \url{https://github.com/ymcui/Chinese-LLaMA-Alpaca} and Chinese Llama-2 series: \url{https://github.com/ymcui/Chinese-LLaMA-Alpaca-2}

Overview

The paper "Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca" presents a significant advancement in adapting LLMs for Chinese language understanding and generation. This work aims to address the limitations of existing LLMs, including LLaMA, which are primarily trained on English-centric corpora, thereby lacking efficient support for Chinese language tasks.

Proposed Method and Model Adaptations

The authors propose a comprehensive approach to enhance LLaMA’s proficiency with Chinese text by undertaking several key steps:

  1. Vocabulary Extension: The original LLaMA vocabulary contains fewer than a thousand Chinese tokens, so Chinese text is encoded inefficiently, often falling back to byte-level pieces. The authors address this by extending LLaMA's vocabulary with 20,000 additional Chinese tokens, which substantially reduces the number of tokens needed to encode a given Chinese sequence and improves both encoding efficiency and the model's semantic handling of Chinese (see the tokenizer sketch after this list).
  2. Low-Rank Adaptation (LoRA): To keep training affordable, the authors use LoRA, which freezes the original model weights and injects trainable low-rank matrices into selected layers, drastically reducing the number of parameters that must be updated. LoRA is used for both the secondary pre-training and the instruction fine-tuning stages (a minimal sketch follows this list).
  3. Secondary Pre-Training: Using the extended vocabulary, the model is further pre-trained on roughly 20GB of Chinese text; the resulting model is referred to as Chinese LLaMA. This step adapts the English-centric base model to better understand and generate Chinese.
  4. Fine-Tuning with Instruction Data: Chinese LLaMA is then fine-tuned on a blend of instruction-following datasets, yielding the Chinese Alpaca models. This fine-tuning improves the model's ability to produce context-aware, instruction-following responses in Chinese (the prompt format is sketched after this list).
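
To make the encoding-efficiency claim concrete, here is a minimal sketch that counts tokens for the same Chinese sentence under the original and the vocabulary-extended tokenizer. The checkpoint paths are placeholders, and the exact counts depend on the released tokenizers.

```python
# Compare how many tokens the original LLaMA tokenizer and a Chinese-extended
# tokenizer need for the same Chinese sentence. Paths are placeholders.
from transformers import LlamaTokenizer

original = LlamaTokenizer.from_pretrained("path/to/original-llama-tokenizer")
extended = LlamaTokenizer.from_pretrained("path/to/chinese-extended-tokenizer")

text = "人工智能正在改变自然语言处理的研究方式。"

orig_ids = original.encode(text, add_special_tokens=False)
ext_ids = extended.encode(text, add_special_tokens=False)

# With few Chinese pieces in its vocabulary, the original tokenizer falls back
# to byte-level pieces (often several tokens per character); the extended
# vocabulary covers most characters and common words as single tokens.
print(f"original tokenizer: {len(orig_ids)} tokens")
print(f"extended tokenizer: {len(ext_ids)} tokens")
```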
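
The sketch below shows one common way to attach LoRA adapters to a LLaMA-style causal LM with the Hugging Face peft library; the target modules, hyperparameters, and paths are illustrative assumptions, not the paper's exact configuration.

```python
# Wrap a causal LM with LoRA adapters: the base weights stay frozen and only
# the low-rank matrices (plus, here, the embeddings and LM head, which must
# grow to cover the newly added Chinese tokens) are trained.
import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = LlamaTokenizer.from_pretrained("path/to/chinese-extended-tokenizer")  # placeholder
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf", torch_dtype=torch.float16)

# Grow the embedding matrix to cover the ~20,000 newly added Chinese tokens.
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=8,                       # rank of the low-rank update (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    modules_to_save=["embed_tokens", "lm_head"],  # train these in full for the new tokens
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of all weights is trainable
```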
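
For the instruction fine-tuning stage, each example is typically flattened into a single prompt/response string. The sketch below uses the Stanford Alpaca template; whether the Chinese Alpaca models use this exact English preamble is an assumption, but the instruction/input/response structure is the relevant part.

```python
# Format an instruction-following example in the Stanford Alpaca style before
# tokenization. The exact preamble used by Chinese Alpaca is an assumption.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_example(instruction: str, response: str, context: str = "") -> str:
    """Concatenate the prompt and the target response into one training string."""
    if context:
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return prompt + response

print(build_example(
    instruction="将下面的句子翻译成英文。",
    context="人工智能正在改变自然语言处理研究。",
    response="AI is reshaping natural language processing research.",
))
```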

Experimental Results

The enhanced models, Chinese LLaMA and Chinese Alpaca, were evaluated through various benchmarks:

  1. Instruction-Following Tasks:
    • Overall Performance: The Chinese Alpaca models showed substantial improvements over their predecessors on instruction-following tasks such as question answering, reasoning, dialogue, and text generation. Notably, the Alpaca-33B and Alpaca-Plus-13B variants were competitive even with significantly larger models.
    • Task-Specific Results: The models performed particularly well on translation and ethical response generation, with Alpaca-33B achieving the highest scores in these areas.
  2. C-Eval Benchmark:
    • Performance Comparison: The Chinese LLaMA and Alpaca models outperformed the original LLaMA on the C-Eval dataset, particularly in zero-shot settings, underscoring the effectiveness of the Chinese adaptation and instruction tuning.
    • Instruction Following vs. Pure LLMs: Instruction-following models (Chinese Alpaca) surpassed pure LLMs (Chinese LLaMA), highlighting the advantage of fine-tuning for task-specific adaptability.
  3. Quantization Impact:
    • Inference Efficiency: Quantization at different bit widths was tested to gauge its impact on performance. The findings indicate that 6-bit and 8-bit quantization keeps perplexity close to that of the original FP16 models, making these settings practical for deployment with reduced memory and compute (a toy quantization sketch follows this list).
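
The 6-bit and 8-bit results concern post-training weight quantization of the kind performed by llama.cpp. Below is a toy sketch of symmetric per-block quantization, assuming a simple shared-scale scheme; it illustrates why 8-bit storage loses little precision but is not llama.cpp's actual format.

```python
# Toy symmetric per-block weight quantization: each block of weights shares one
# floating-point scale and is stored as low-bit integers; dequantization
# multiplies back by the scale. Illustrative only, not llama.cpp's format.
import numpy as np

def quantize_block(w: np.ndarray, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1                          # e.g. 127 for 8-bit
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)   # one scale per block
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)   # one block of weights
for bits in (8, 6, 4):
    q, s = quantize_block(w, bits)
    err = np.abs(w - dequantize_block(q, s)).max()
    print(f"{bits}-bit: max abs reconstruction error = {err:.6f}")
```

The reconstruction error grows as the bit width shrinks, which is consistent with the reported finding that 6-bit and 8-bit models stay close to FP16 perplexity.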

Implications and Future Directions

The methodology proposed in this paper offers a blueprint for extending existing LLMs to better serve underrepresented languages by improving vocabulary support and leveraging low-rank adaptations for efficient training. The successful application of these techniques demonstrates the potential for applying similar strategies to other languages, promoting greater inclusivity and utility of LLMs across diverse linguistic contexts.

Theoretical Contributions:

  • The findings suggest that extending the vocabulary and fine-tuning with specific datasets can significantly enhance the model’s capability in understanding and generating text in non-English languages.
  • The demonstrated effectiveness of LoRA provides a scalable approach for deploying LLMs with limited computational resources, an area critical for research institutions and smaller enterprises.

Practical Contributions:

  • The pre-trained models and resources released on GitHub serve as valuable assets for the NLP community, fostering further research and development in multilingual models.
  • The practical implications include improving access to advanced LLMs for non-English speaking regions, which can lead to broader applications in real-world scenarios like education, administrative services, and cross-cultural communication.

Conclusion

This paper delineates a methodical approach to significantly enhance the Chinese language capabilities of LLaMA and Alpaca models. By extending the vocabulary, employing effective adaptation techniques like LoRA, and fine-tuning with comprehensive datasets, the authors have set a precedent for adapting LLMs to non-English languages. The results achieved in various benchmarks underscore the potential of these models and provide a path forward for future advancements in multilingual natural language processing.

Authors (3)
  1. Yiming Cui (80 papers)
  2. Ziqing Yang (29 papers)
  3. Xin Yao (139 papers)
Citations (260)