Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca (2304.08177v3)

Published 17 Apr 2023 in cs.CL, cs.HC, and cs.LG

Abstract: LLMs, such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards AGI. Nonetheless, the high costs associated with training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several LLMs, such as LLaMA, have been open-sourced by the community, these predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with capabilities for understanding and generating Chinese text and its ability to follow instructions. We achieve this by extending LLaMA's existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets, significantly enhancing the model's ability to comprehend and execute instructions. Our experimental results indicate that the newly proposed model markedly enhances the original LLaMA's proficiency in understanding and generating Chinese content. Additionally, the results on the C-Eval dataset yield competitive performance among the models with several times the size of ours. We have made our pre-trained models, training scripts, and other resources available through GitHub, fostering open research for our community. Chinese LLaMA series: \url{https://github.com/ymcui/Chinese-LLaMA-Alpaca} and Chinese Llama-2 series: \url{https://github.com/ymcui/Chinese-LLaMA-Alpaca-2}

Overview

The paper "Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca" presents a significant advancement in adapting LLMs for Chinese language understanding and generation. This work aims to address the limitations of existing LLMs, including LLaMA, which are primarily trained on English-centric corpora, thereby lacking efficient support for Chinese language tasks.

Proposed Method and Model Adaptations

The authors propose a comprehensive approach to enhance LLaMA’s proficiency with Chinese text by undertaking several key steps:

  1. Vocabulary Extension: The original LLaMA vocabulary contains fewer than a thousand Chinese tokens, so Chinese text is encoded inefficiently, often falling back to byte-level pieces. The authors address this by extending LLaMA's vocabulary with 20,000 additional Chinese tokens, which substantially reduces the number of tokens needed to encode a given Chinese sequence and improves both encoding efficiency and the model's semantic handling of Chinese (see the tokenizer sketch after this list).
  2. Low-Rank Adaptation (LoRA): To keep training affordable, the authors use LoRA, which freezes the original model weights and injects trainable low-rank matrices into selected layers, drastically reducing the number of parameters that must be updated. LoRA is used for both the secondary pre-training and the instruction fine-tuning stages (a minimal sketch follows this list).
  3. Secondary Pre-Training: Using the extended vocabulary, the model is further pre-trained on roughly 20GB of Chinese text; the resulting model is referred to as Chinese LLaMA. This step adapts the English-centric base model to better understand and generate Chinese.
  4. Fine-Tuning with Instruction Data: Chinese LLaMA is then fine-tuned on a blend of instruction-following datasets, yielding the Chinese Alpaca models. This fine-tuning improves the model's ability to produce context-aware, instruction-following responses in Chinese (the prompt format is sketched after this list).
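
To make the encoding-efficiency claim concrete, here is a minimal sketch that counts tokens for the same Chinese sentence under the original and the vocabulary-extended tokenizer. The checkpoint paths are placeholders, and the exact counts depend on the released tokenizers.

```python
# Compare how many tokens the original LLaMA tokenizer and a Chinese-extended
# tokenizer need for the same Chinese sentence. Paths are placeholders.
from transformers import LlamaTokenizer

original = LlamaTokenizer.from_pretrained("path/to/original-llama-tokenizer")
extended = LlamaTokenizer.from_pretrained("path/to/chinese-extended-tokenizer")

text = "人工智能正在改变自然语言处理的研究方式。"

orig_ids = original.encode(text, add_special_tokens=False)
ext_ids = extended.encode(text, add_special_tokens=False)

# With few Chinese pieces in its vocabulary, the original tokenizer falls back
# to byte-level pieces (often several tokens per character); the extended
# vocabulary covers most characters and common words as single tokens.
print(f"original tokenizer: {len(orig_ids)} tokens")
print(f"extended tokenizer: {len(ext_ids)} tokens")
```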
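
The sketch below shows one common way to attach LoRA adapters to a LLaMA-style causal LM with the Hugging Face peft library; the target modules, hyperparameters, and paths are illustrative assumptions, not the paper's exact configuration.

```python
# Wrap a causal LM with LoRA adapters: the base weights stay frozen and only
# the low-rank matrices (plus, here, the embeddings and LM head, which must
# grow to cover the newly added Chinese tokens) are trained.
import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = LlamaTokenizer.from_pretrained("path/to/chinese-extended-tokenizer")  # placeholder
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf", torch_dtype=torch.float16)

# Grow the embedding matrix to cover the ~20,000 newly added Chinese tokens.
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=8,                       # rank of the low-rank update (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    modules_to_save=["embed_tokens", "lm_head"],  # train these in full for the new tokens
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of all weights is trainable
```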
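
For the instruction fine-tuning stage, each example is typically flattened into a single prompt/response string. The sketch below uses the Stanford Alpaca template; whether the Chinese Alpaca models use this exact English preamble is an assumption, but the instruction/input/response structure is the relevant part.

```python
# Format an instruction-following example in the Stanford Alpaca style before
# tokenization. The exact preamble used by Chinese Alpaca is an assumption.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_example(instruction: str, response: str, context: str = "") -> str:
    """Concatenate the prompt and the target response into one training string."""
    if context:
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return prompt + response

print(build_example(
    instruction="将下面的句子翻译成英文。",
    context="人工智能正在改变自然语言处理研究。",
    response="AI is reshaping natural language processing research.",
))
```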

Experimental Results

The enhanced models, Chinese LLaMA and Chinese Alpaca, were evaluated through various benchmarks:

  1. Instruction-Following Tasks:
    • Overall Performance: The Chinese Alpaca models showed substantial improvements over their predecessors on instruction-following tasks such as question answering, reasoning, dialogue, and text generation. Notably, the Alpaca-33B and Alpaca-Plus-13B variants were competitive even with significantly larger models.
    • Task-Specific Results: The models performed particularly well on translation and ethical response generation, with Alpaca-33B achieving the highest scores in these areas.
  2. C-Eval Benchmark:
    • Performance Comparison: The Chinese LLaMA and Alpaca models outperformed the original LLaMA on the C-Eval dataset, particularly in zero-shot settings, underscoring the effectiveness of the Chinese adaptation and instruction tuning.
    • Instruction Following vs. Pure LLMs: Instruction-following models (Chinese Alpaca) surpassed pure LLMs (Chinese LLaMA), highlighting the advantage of fine-tuning for task-specific adaptability.
  3. Quantization Impact:
    • Inference Efficiency: Quantization at different bit widths was tested to gauge its impact on performance. The findings indicate that 6-bit and 8-bit quantization keeps perplexity close to that of the original FP16 models, making these settings practical for deployment with reduced memory and compute (a toy quantization sketch follows this list).
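
The 6-bit and 8-bit results concern post-training weight quantization of the kind performed by llama.cpp. Below is a toy sketch of symmetric per-block quantization, assuming a simple shared-scale scheme; it illustrates why 8-bit storage loses little precision but is not llama.cpp's actual format.

```python
# Toy symmetric per-block weight quantization: each block of weights shares one
# floating-point scale and is stored as low-bit integers; dequantization
# multiplies back by the scale. Illustrative only, not llama.cpp's format.
import numpy as np

def quantize_block(w: np.ndarray, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1                          # e.g. 127 for 8-bit
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)   # one scale per block
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)   # one block of weights
for bits in (8, 6, 4):
    q, s = quantize_block(w, bits)
    err = np.abs(w - dequantize_block(q, s)).max()
    print(f"{bits}-bit: max abs reconstruction error = {err:.6f}")
```

The reconstruction error grows as the bit width shrinks, which is consistent with the reported finding that 6-bit and 8-bit models stay close to FP16 perplexity.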

Implications and Future Directions

The methodology proposed in this paper offers a blueprint for extending existing LLMs to better serve underrepresented languages by improving vocabulary support and leveraging low-rank adaptations for efficient training. The successful application of these techniques demonstrates the potential for applying similar strategies to other languages, promoting greater inclusivity and utility of LLMs across diverse linguistic contexts.

Theoretical Contributions:

  • The findings suggest that extending the vocabulary and fine-tuning with specific datasets can significantly enhance the model’s capability in understanding and generating text in non-English languages.
  • The demonstrated effectiveness of LoRA provides a scalable approach for deploying LLMs with limited computational resources, an area critical for research institutions and smaller enterprises.

Practical Contributions:

  • The pre-trained models and resources released on GitHub serve as valuable assets for the NLP community, fostering further research and development in multilingual models.
  • The practical implications include improving access to advanced LLMs for non-English speaking regions, which can lead to broader applications in real-world scenarios like education, administrative services, and cross-cultural communication.

Conclusion

This paper delineates a methodical approach to significantly enhance the Chinese language capabilities of LLaMA and Alpaca models. By extending the vocabulary, employing effective adaptation techniques like LoRA, and fine-tuning with comprehensive datasets, the authors have set a precedent for adapting LLMs to non-English languages. The results achieved in various benchmarks underscore the potential of these models and provide a path forward for future advancements in multilingual natural language processing.

Authors (3)
  1. Yiming Cui (80 papers)
  2. Ziqing Yang (29 papers)
  3. Xin Yao (139 papers)
Citations (260)