
Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model (2404.04167v5)

Published 5 Apr 2024 in cs.CL and cs.AI

Abstract: In this study, we introduce CT-LLM, a 2B LLM that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile LLMs.

Pretraining a Chinese-Centric LLM (CT-LLM)

Introduction to CT-LLM

The development of LLMs has traditionally leveraged extensive English datasets, driving advances in natural language understanding and generation while overshadowing the linguistic diversity of human languages. Addressing this gap, the recently introduced Chinese Tiny LLM (CT-LLM), a 2 billion parameter model, shifts the focus toward prioritizing the Chinese language from the outset. Unlike conventional models, CT-LLM was pretrained from scratch on a corpus of 1,200 billion tokens, roughly 800 billion of which are Chinese. The model challenges prevailing norms in LLM training, demonstrating strong capabilities on Chinese language tasks and suggesting a broader scope for training methodologies that embrace linguistic diversity.

Methodology Behind CT-LLM

Dataset Composition

The pretraining dataset for CT-LLM, released as the Massive Appropriate Pretraining Chinese Corpus (MAP-CC), was assembled to ensure vast and diverse coverage of Chinese text, comprising 840.48 billion Chinese tokens, 314.88 billion English tokens, and 99.3 billion code tokens. To refine dataset quality, data filtering employed heuristic rules tailored specifically to Chinese text, addressing the variance in data diversity and quality observed in earlier corpora.
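The paper's exact cleaning pipeline is not reproduced here, but the sketch below illustrates the kind of heuristic, rule-based filtering commonly applied to Chinese web text; the thresholds and helper functions are illustrative assumptions, not the authors' actual rules.

```python
import re

# Illustrative heuristic filters for Chinese web text; the thresholds below
# are assumptions for demonstration, not the values used to build MAP-CC.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")

def chinese_ratio(text: str) -> float:
    """Fraction of characters that are CJK ideographs."""
    if not text:
        return 0.0
    return sum(1 for ch in text if CJK_RE.match(ch)) / len(text)

def repetition_ratio(text: str, n: int = 10) -> float:
    """Fraction of duplicated character n-grams, a cheap repetition signal."""
    grams = [text[i:i + n] for i in range(0, max(len(text) - n, 0))]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def keep_document(text: str) -> bool:
    """Apply simple quality heuristics; returns True if the document survives."""
    if len(text) < 200:                # too short to be useful
        return False
    if chinese_ratio(text) < 0.3:      # not Chinese-dominant
        return False
    if repetition_ratio(text) > 0.5:   # heavily repeated boilerplate
        return False
    return True

sample = "人工智能正在改变自然语言处理。" * 30
print(keep_document(sample))  # False: the repetition heuristic rejects it
```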

Model Architecture and Training

CT-LLM adopts a transformer-based architecture featuring multi-head attention, SwiGLU activations in its feed-forward layers, and rotary position embeddings (RoPE), configured to optimize performance on Chinese text. The tokenizer design and vocabulary size were chosen to encode numerical data effectively and to accommodate the nuances of the Chinese language.
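As a rough sketch of these components, the PyTorch snippet below implements a SwiGLU feed-forward block and a half-split variant of rotary position embeddings; the dimensions and wiring are illustrative assumptions rather than CT-LLM's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: (SiLU(x W_gate) * x W_up) W_down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to x of shape (batch, seq, heads, head_dim)."""
    b, s, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 4, 64)                  # (batch, seq, heads, head_dim)
print(rotary_embedding(q).shape)              # torch.Size([1, 8, 4, 64])
print(SwiGLU(dim=256, hidden=1024)(torch.randn(1, 8, 256)).shape)
```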

Supervised Fine-Tuning (SFT) and Human Preference Learning

SFT was performed on both Chinese and English data to strengthen the model's multilingual capabilities. Several Chinese-to-English data ratios were compared, with the results showing strong proficiency on Chinese-language tasks. Additionally, Direct Preference Optimization (DPO) was used to align the model more closely with human preferences, with an emphasis on generating harmless and helpful responses.
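The DPO objective itself is compact enough to sketch directly. The snippet below shows the standard DPO loss computed from summed log-probabilities of chosen and rejected responses under the policy and a frozen reference model; the numbers are placeholders, and this is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is the log-prob gap between the chosen and rejected response."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Placeholder summed log-probabilities for a batch of 4 preference pairs.
policy_chosen = torch.tensor([-12.3, -9.8, -15.1, -11.0])
policy_rejected = torch.tensor([-14.0, -10.5, -15.0, -13.2])
ref_chosen = torch.tensor([-13.0, -10.0, -14.8, -11.5])
ref_rejected = torch.tensor([-13.5, -10.2, -15.2, -12.8])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```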

Evaluation and Benchmarks

CT-LLM was evaluated across multiple benchmarks, demonstrating strong ability in Chinese language processing as well as in multilingual tasks. A newly introduced benchmark, the Chinese Hard Case Benchmark (CHC-Bench), designed specifically to measure instruction understanding in Chinese, further confirmed the model's competence. Alignment with human preferences additionally improved the safety and helpfulness of the model's responses.
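As an illustration of how multiple-choice benchmarks of this kind are often scored, the sketch below ranks answer options by the sum of their token log-probabilities under a causal LM; the checkpoint name and question are placeholders, and this is not the paper's evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def option_logprob(model, tokenizer, prompt: str, option: str) -> float:
    """Sum the log-probabilities the model assigns to the option tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    option_ids = tokenizer(option, add_special_tokens=False, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(full_ids).logits                    # (1, seq, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for positions 1..seq-1
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[pos - 1, full_ids[0, pos]].item()
    return total

# "gpt2" is only a placeholder checkpoint; swap in any causal LM of interest.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "中国的首都是哪座城市？\nA. 上海\nB. 北京\nC. 广州\nD. 深圳\n答案："
options = ["A", "B", "C", "D"]
scores = [option_logprob(model, tokenizer, question, opt) for opt in options]
print(options[scores.index(max(scores))])
```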

Implications and Future Directions

By diverging from predominantly English-focused training methodologies, CT-LLM paves the way for more inclusive and versatile LLMs. Its strong performance in understanding and generating Chinese text underscores the potential of LLMs built around other languages. Moreover, the open-sourcing of CT-LLM's full training process, including the dataset and benchmarks, invites further exploration and innovation in the field, potentially advancing multilingual LLMs and their applications across diverse linguistic landscapes. Future research might explore the scalability of such models, the integration of even greater linguistic diversity, and refined methods for aligning LLMs with human preferences across cultural contexts.

Authors (14)
  1. Xinrun Du
  2. Zhouliang Yu
  3. Songyang Gao
  4. Ding Pan
  5. Yuyang Cheng
  6. Ziyang Ma
  7. Ruibin Yuan
  8. Xingwei Qu
  9. Jiaheng Liu
  10. Tianyu Zheng
  11. Xinchen Luo
  12. Guorui Zhou
  13. Wenhu Chen
  14. Ge Zhang