
Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model (2404.04167v5)

Published 5 Apr 2024 in cs.CL and cs.AI

Abstract: In this study, we introduce CT-LLM, a 2B LLM that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile LLMs.

Pretraining a Chinese-Centric LLM (CT-LLM)

Introduction to CT-LLM

The development of LLMs has traditionally leveraged extensive English datasets, driving advances in natural language understanding and generation while overshadowing the linguistic diversity of human languages. Addressing this gap, the recently introduced Chinese Tiny LLM (CT-LLM), a 2 billion parameter model, shifts the focus toward prioritizing the Chinese language from the outset. Unlike conventional models, CT-LLM was pretrained from scratch on a corpus of 1,200 billion tokens, roughly 800 billion of which are Chinese. The model challenges prevailing norms in LLM training, demonstrating strong capabilities on Chinese language tasks and suggesting a broader scope for training methodologies that embrace linguistic diversity.

Methodology Behind CT-LLM

Dataset Composition

The pretraining dataset for CT-LLM, released as the Massive Appropriate Pretraining Chinese Corpus (MAP-CC), was assembled to ensure vast and diverse coverage of Chinese text, comprising 840.48 billion Chinese tokens, 314.88 billion English tokens, and 99.3 billion code tokens. To refine dataset quality, data filtering employed heuristic rules tailored specifically to Chinese text, addressing the variance in data diversity and quality observed in earlier corpora.
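The paper's exact cleaning pipeline is not reproduced here, but the sketch below illustrates the kind of heuristic, rule-based filtering commonly applied to Chinese web text; the thresholds and helper functions are illustrative assumptions, not the authors' actual rules.

```python
import re

# Illustrative heuristic filters for Chinese web text; the thresholds below
# are assumptions for demonstration, not the values used to build MAP-CC.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")

def chinese_ratio(text: str) -> float:
    """Fraction of characters that are CJK ideographs."""
    if not text:
        return 0.0
    return sum(1 for ch in text if CJK_RE.match(ch)) / len(text)

def repetition_ratio(text: str, n: int = 10) -> float:
    """Fraction of duplicated character n-grams, a cheap repetition signal."""
    grams = [text[i:i + n] for i in range(0, max(len(text) - n, 0))]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def keep_document(text: str) -> bool:
    """Apply simple quality heuristics; returns True if the document survives."""
    if len(text) < 200:                # too short to be useful
        return False
    if chinese_ratio(text) < 0.3:      # not Chinese-dominant
        return False
    if repetition_ratio(text) > 0.5:   # heavily repeated boilerplate
        return False
    return True

sample = "人工智能正在改变自然语言处理。" * 30
print(keep_document(sample))  # False: the repetition heuristic rejects it
```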

Model Architecture and Training

CT-LLM adopts a transformer-based architecture featuring multi-head attention, SwiGLU activations in its feed-forward layers, and rotary position embeddings (RoPE), configured to optimize performance on Chinese text. The tokenizer design and vocabulary size were chosen to encode numerical data effectively and to accommodate the nuances of the Chinese language.
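As a rough sketch of these components, the PyTorch snippet below implements a SwiGLU feed-forward block and a half-split variant of rotary position embeddings; the dimensions and wiring are illustrative assumptions rather than CT-LLM's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: (SiLU(x W_gate) * x W_up) W_down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to x of shape (batch, seq, heads, head_dim)."""
    b, s, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 4, 64)                  # (batch, seq, heads, head_dim)
print(rotary_embedding(q).shape)              # torch.Size([1, 8, 4, 64])
print(SwiGLU(dim=256, hidden=1024)(torch.randn(1, 8, 256)).shape)
```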

Supervised Fine-Tuning (SFT) and Human Preference Learning

SFT was performed on both Chinese and English data to strengthen the model's multilingual capabilities. Several Chinese-to-English data ratios were compared, with the results showing strong proficiency on Chinese-language tasks. Additionally, Direct Preference Optimization (DPO) was used to align the model more closely with human preferences, with an emphasis on generating harmless and helpful responses.
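The DPO objective itself is compact enough to sketch directly. The snippet below shows the standard DPO loss computed from summed log-probabilities of chosen and rejected responses under the policy and a frozen reference model; the numbers are placeholders, and this is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is the log-prob gap between the chosen and rejected response."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Placeholder summed log-probabilities for a batch of 4 preference pairs.
policy_chosen = torch.tensor([-12.3, -9.8, -15.1, -11.0])
policy_rejected = torch.tensor([-14.0, -10.5, -15.0, -13.2])
ref_chosen = torch.tensor([-13.0, -10.0, -14.8, -11.5])
ref_rejected = torch.tensor([-13.5, -10.2, -15.2, -12.8])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```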

Evaluation and Benchmarks

CT-LLM was evaluated across multiple benchmarks, demonstrating strong ability in Chinese language processing as well as in multilingual tasks. A newly introduced benchmark, the Chinese Hard Case Benchmark (CHC-Bench), designed specifically to measure instruction understanding in Chinese, further confirmed the model's competence. Alignment with human preferences additionally improved the safety and helpfulness of the model's responses.
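As an illustration of how multiple-choice benchmarks of this kind are often scored, the sketch below ranks answer options by the sum of their token log-probabilities under a causal LM; the checkpoint name and question are placeholders, and this is not the paper's evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def option_logprob(model, tokenizer, prompt: str, option: str) -> float:
    """Sum the log-probabilities the model assigns to the option tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    option_ids = tokenizer(option, add_special_tokens=False, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(full_ids).logits                    # (1, seq, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for positions 1..seq-1
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[pos - 1, full_ids[0, pos]].item()
    return total

# "gpt2" is only a placeholder checkpoint; swap in any causal LM of interest.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "中国的首都是哪座城市？\nA. 上海\nB. 北京\nC. 广州\nD. 深圳\n答案："
options = ["A", "B", "C", "D"]
scores = [option_logprob(model, tokenizer, question, opt) for opt in options]
print(options[scores.index(max(scores))])
```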

Implications and Future Directions

By diverging from predominantly English-focused training methodologies, CT-LLM paves the way for more inclusive and versatile LLMs. Its strong performance in understanding and generating Chinese text underscores the potential of LLMs built around other languages. Moreover, the open-sourcing of CT-LLM's full training process, including the dataset and benchmarks, invites further exploration and innovation in the field, potentially advancing multilingual LLMs and their applications across diverse linguistic landscapes. Future research might explore the scalability of such models, the integration of even greater linguistic diversity, and refined methods for aligning LLMs with human preferences across cultural contexts.

Authors (14)
  1. Xinrun Du
  2. Zhouliang Yu
  3. Songyang Gao
  4. Ding Pan
  5. Yuyang Cheng
  6. Ziyang Ma
  7. Ruibin Yuan
  8. Xingwei Qu
  9. Jiaheng Liu
  10. Tianyu Zheng
  11. Xinchen Luo
  12. Guorui Zhou
  13. Wenhu Chen
  14. Ge Zhang