Pretraining a Chinese-Centric LLM (CT-LLM)
Introduction to CT-LLM
The development of LLMs has traditionally leaned on extensive English datasets, driving advances in natural language understanding and generation. However, this practice tends to overshadow the linguistic diversity inherent in human languages. Addressing this gap, the recently introduced Chinese Tiny LLM (CT-LLM), a 2 billion parameter model, marks a shift toward prioritizing the Chinese language from the outset. Unlike conventional models, CT-LLM was pretrained on a comprehensive corpus of approximately 1,200 billion tokens, the majority of which are Chinese. This model challenges prevailing norms in LLM training, showing strong capabilities on Chinese language tasks and suggesting a broader scope for training methodologies that embrace linguistic diversity.
Methodology Behind CT-LLM
Dataset Composition
The training dataset for CT-LLM was assembled to provide broad and diverse coverage of Chinese text, comprising 840.48 billion Chinese tokens, 314.88 billion English tokens, and 99.3 billion code tokens. To refine dataset quality, filtering relied on heuristic rules tailored specifically to Chinese text, addressing the diversity and quality-variance issues observed in the corpora behind earlier models.
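The paper's full rule set is not reproduced here, but the general shape of heuristic filtering for Chinese web text can be sketched as follows. Everything in this sketch (the thresholds, the Chinese-character-ratio and repetition checks, and the `keep_document` helper) is illustrative, not the filtering pipeline CT-LLM actually used.

```python
import re

# Hypothetical thresholds -- illustrative only, not the paper's actual cutoffs.
MIN_CHARS = 50              # drop very short documents
MIN_CHINESE_RATIO = 0.3     # require a minimum share of Chinese characters
MAX_REPETITION_RATIO = 0.3  # drop documents dominated by repeated lines

CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")

def chinese_ratio(text: str) -> float:
    """Fraction of characters in the main CJK Unified Ideographs block."""
    if not text:
        return 0.0
    return len(CJK_PATTERN.findall(text)) / len(text)

def repetition_ratio(text: str) -> float:
    """Share of non-empty lines that duplicate an earlier line in the same document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)

def keep_document(text: str) -> bool:
    """Apply simple heuristic quality rules to one document."""
    if len(text) < MIN_CHARS:
        return False
    if chinese_ratio(text) < MIN_CHINESE_RATIO:
        return False
    if repetition_ratio(text) > MAX_REPETITION_RATIO:
        return False
    return True

# Usage: filtered = [doc for doc in corpus if keep_document(doc)]
```

In practice, rules like these are typically combined with corpus-level deduplication and model-based quality scoring rather than used on their own.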
Model Architecture and Training
CT-LLM utilizes a transformer-based architecture, with modifications including multi-head attention, SwiGLU activations, and rotary position embeddings (RoPE), to optimize performance for the Chinese language. The tokenizer design and vocabulary size were chosen to encode numerical data effectively and to accommodate the nuances of the Chinese language.
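For readers unfamiliar with these components, a minimal PyTorch sketch of SwiGLU and RoPE is shown below. It is a generic illustration of the named building blocks, not CT-LLM's released code; the module names, tensor shapes, and default base value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit followed by a down-projection."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def rope_frequencies(head_dim: int, seq_len: int, base: float = 10000.0) -> torch.Tensor:
    """Precompute complex rotation factors for rotary position embeddings."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)             # (seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(angles), angles)   # complex rotation per position/pair

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors; x has shape (batch, seq_len, n_heads, head_dim)."""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs = freqs[: x.shape[1]].unsqueeze(0).unsqueeze(2)  # broadcast over batch and heads
    return torch.view_as_real(x_complex * freqs).flatten(-2).type_as(x)
```

The appeal of RoPE is that positions are encoded by rotating query and key vectors in pairs of dimensions, so relative positional information falls out of the attention dot products themselves.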
Supervised Fine-Tuning (SFT) and Human Preferences Learning
SFT was performed on both Chinese and English data to strengthen the model's multilingual capabilities. The model was fine-tuned with various ratios of Chinese to English data, with results indicating strong proficiency on Chinese language tasks. Additionally, Direct Preference Optimization (DPO) was used to align the model more closely with human preferences, focusing on generating harmless and helpful responses.
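The DPO objective itself is standard: the policy is trained to prefer the chosen response over the rejected one, measured relative to a frozen reference (typically the SFT model). A minimal sketch of the loss, assuming sequence-level log-probabilities have already been summed per response outside this function, might look like the following; the beta value and tensor shapes are illustrative, not the settings used for CT-LLM.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x) from the frozen reference model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # illustrative temperature on the implicit reward
) -> torch.Tensor:
    """Standard DPO objective over a batch of (chosen, rejected) response pairs."""
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```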
Evaluation and Benchmarks
CT-LLM underwent evaluation across multiple benchmarks, demonstrating strong ability in Chinese language processing and multilingual tasks. The introduction of a new benchmark, the Chinese Hard Case Benchmark (CHC-Bench), designed specifically to measure instruction understanding in Chinese, further confirmed the model's adeptness. Successful alignment with human preferences also marked notable progress toward safer and more user-friendly LLMs.
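The exact composition and scoring protocol of CHC-Bench are not reproduced here. As a generic illustration only, the sketch below shows a common log-likelihood protocol for multiple-choice Chinese benchmarks, assuming a Hugging Face-style causal LM and examples with `question`, `choices`, and `answer` fields (all hypothetical names).

```python
import torch
import torch.nn.functional as F

def score_choice(model, tokenizer, prompt: str, choice: str, device: str = "cuda") -> float:
    """Sum of token log-probabilities the model assigns to `choice` given `prompt`.
    Assumes tokenizing prompt and prompt+choice yields a consistent prefix."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(full_ids).logits                   # (1, seq_len, vocab)
    choice_len = full_ids.shape[1] - prompt_ids.shape[1]  # number of choice tokens
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)  # predict token t+1 from position t
    targets = full_ids[:, 1:]
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, -choice_len:].sum().item()

def accuracy(model, tokenizer, examples) -> float:
    """`examples`: iterable of dicts with 'question', 'choices', and integer 'answer' index."""
    correct = 0
    for ex in examples:
        scores = [score_choice(model, tokenizer, ex["question"], c) for c in ex["choices"]]
        correct += int(scores.index(max(scores)) == ex["answer"])
    return correct / len(examples)
```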
Implications and Future Directions
By diverging from predominantly English-focused training methodologies, CT-LLM paves the way for more inclusive and versatile LLMs. Its strong performance in understanding and generating Chinese text underscores the potential of LLMs dedicated to other languages. Moreover, the open-sourcing of CT-LLM's training process, including the full dataset and benchmarks, invites further exploration and innovation, potentially leading to advances in multilingual LLMs and their applications across diverse linguistic landscapes. Future research might explore the scalability of such models, the integration of even greater linguistic diversity, and the refinement of methodologies for aligning LLMs with human preferences across different cultural contexts.