CT-LLM#1: Chinese Tiny LLM Overview

Updated 7 April 2026

CT-LLM#1 is a lightweight Chinese-centric large language model optimized for robust Chinese comprehension and emergent cross-lingual and code capabilities.
The model employs a transformer decoder-only architecture with innovations like universal RoPE, SwiGLU activations, and Soft-MoE to maximize efficiency.
Its training leverages a massive, filtered Chinese pretraining corpus with advanced SFT and DPO fine-tuning, ensuring high-quality, generalizable performance.

Chinese Tiny LLM (CT-LLM#1) denotes a class of lightweight, Chinese-centric LLMs engineered for robust Chinese linguistic and cultural understanding, with emergent cross-lingual and code capabilities. These models, occupying the 1–2 billion parameter regime, are trained from scratch using Chinese-predominant corpora, leveraging modern transformer architectures, sophisticated data curation, and multi-stage alignment. CT-LLM#1, as documented in publicly released resources, exemplifies a shift away from English-centric LLM construction, demonstrating that superior Chinese performance can be achieved without sacrificing multilingual generalization (Du et al., 2024, Gu et al., 10 Feb 2025).

1. Model Architecture and Technical Design

CT-LLM#1 is implemented as a transformer decoder-only architecture, with parameter counts targeting approximately 1B (Steel-LLM adaptation) and 2B (original CT-LLM) to maximize deployment and research accessibility. The principal architecture comprises:

2B-parameter variant: 32 layers, hidden dimension $d_{\text{model}}=2048$, 16 attention heads (each head size 128), feed-forward width 5,504, with shared input-output embeddings, RoPE positional encoding, RMSNorm normalization, SwiGLU activations, and "multi-write-head" optimizations for attention efficiency (Du et al., 2024).
1B-parameter variant (Steel-LLM adaptation): 18 layers, $d=1792$, 32 attention heads, Soft-MoE (6 experts, single slot per expert) in the FFN, Qwen1.5 BPE vocabulary (151,936 tokens), and extensive use of hardware-efficient training primitives (FlashAttention, FSDP) (Gu et al., 10 Feb 2025).
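The Soft-MoE layer listed for the 1B variant can be illustrated with a minimal pure-Python sketch. It follows the generic Soft-MoE formulation (soft dispatch of tokens into slots, one expert per slot, soft combination back to tokens) and is not taken from the Steel-LLM code. The names `Phi` and `experts` and the toy shapes are assumptions for illustration, and the sketch mixes over all tokens in the sequence without claiming how the cited implementation handles the autoregressive setting.

```python
import math

def softmax(vals):
    """Numerically stable softmax over a list of floats."""
    m = max(vals)
    exps = [math.exp(v - m) for v in vals]
    total = sum(exps)
    return [e / total for e in exps]

def soft_moe(X, Phi, experts):
    """Generic Soft-MoE with one slot per expert (illustrative only).

    X:       n tokens, each a list of d floats
    Phi:     d x s slot parameters (hypothetical name), s = number of experts
    experts: list of s callables mapping a d-vector to a d-vector
    """
    n, d, s = len(X), len(X[0]), len(experts)
    # logits[i][j] = dot product of token i with the parameters of slot j
    logits = [[sum(X[i][k] * Phi[k][j] for k in range(d)) for j in range(s)]
              for i in range(n)]
    # Dispatch: softmax over tokens, so each slot is a convex mix of tokens.
    dispatch = [softmax([logits[i][j] for i in range(n)]) for j in range(s)]
    slots = [[sum(dispatch[j][i] * X[i][k] for i in range(n)) for k in range(d)]
             for j in range(s)]
    # Each expert processes its own single slot.
    slot_out = [experts[j](slots[j]) for j in range(s)]
    # Combine: softmax over slots, so each token is a convex mix of slot outputs.
    out = []
    for i in range(n):
        combine = softmax(logits[i])
        out.append([sum(combine[j] * slot_out[j][k] for j in range(s))
                    for k in range(d)])
    return out
```

Because every token contributes to every slot with a nonzero weight, all experts receive gradient on every step, which is the property the "fully-trainable" description refers to.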

Novel architectural elements include universal RoPE (rotary positional encoding) across all layers, SwiGLU instead of ReLU/GELU activations for improved expressivity, pre-layer RMSNorm for stability, and fully-trainable expert diversity through Soft-MoE in small-model regimes. Attention computation follows the canonical scaled-dot-product structure:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right)V$
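As a rough illustration of the formula, the following pure-Python sketch computes single-head scaled dot-product attention. The causal mask (each token attends only to itself and earlier positions) is an assumption added here because the model is decoder-only; the formula above does not write it out.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of per-token vectors (one vector per position).
    """
    d_k = len(K[0])
    out = []
    for t, q in enumerate(Q):
        # Token t only sees positions 0..t (assumed decoder-only masking).
        scores = [sum(a * b for a, b in zip(q, K[s])) / math.sqrt(d_k)
                  for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * V[s][j] for s, w in enumerate(weights))
                    for j in range(len(V[0]))])
    return out
```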

Layer normalization is performed with RMSNorm:

$\operatorname{RMSNorm}(\mathbf{z}) = \frac{\mathbf{z}}{\sqrt{\frac{1}{d}\sum_{i=1}^d z_i^2 + \epsilon}}\gamma$
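A minimal sketch of this normalization for a single vector, assuming $\gamma$ is applied elementwise; the default `eps=1e-6` is an illustrative choice, not a value reported in the source.

```python
import math

def rms_norm(z, gamma, eps=1e-6):
    """Divide z by its root mean square, then scale elementwise by gamma."""
    d = len(z)
    rms = math.sqrt(sum(v * v for v in z) / d + eps)
    return [g * v / rms for v, g in zip(z, gamma)]
```

Unlike LayerNorm, no mean is subtracted and no bias is added, which saves a reduction per call.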

Feed-forward blocks utilize SwiGLU: $\operatorname{SwiGLU}(x, y) = x \odot \operatorname{SiLU}(y)$.
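The gating above can be sketched as follows. The surrounding feed-forward wiring (`W_up`, `W_gate`, `W_down`, all bias-free) is a common arrangement assumed here for illustration; the source only specifies the SwiGLU gate itself.

```python
import math

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(x, y):
    """Elementwise gate: x * SiLU(y), matching the definition above."""
    return [a * silu(b) for a, b in zip(x, y)]

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * v for w, v in zip(row, h)) for row in W]

def swiglu_ffn(h, W_up, W_gate, W_down):
    """Assumed FFN layout: project up twice, gate, then project back down."""
    return matvec(W_down, swiglu(matvec(W_up, h), matvec(W_gate, h)))
```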

2. Pretraining Corpus Construction and Data Processing

The foundation of CT-LLM#1 is the Massive Appropriate Pretraining Chinese Corpus (MAP-CC#1), assembled for scale and diversity:

Size and Language Distribution: 1,254.7B total tokens (2B variant), split as $\sim$67% Chinese (840.5B), $\sim$25% English (314.9B), and $\sim$8% code (99.3B) (Du et al., 2024). The 1B variant (Steel-LLM) uses $\sim$85% Chinese, 10% English, and 5% code (Gu et al., 10 Feb 2025).
Sources: CommonCrawl, encyclopedias, academic articles, books, open-source code repositories, as well as curated Chinese corpora such as SkyPile, Wanjuan1.0, BELLE, and MOSS.
Quality Control Pipeline: Text is filtered with multi-stage heuristics—format standardization, terminal punctuation checks, profanity detection, entropy and fastText scoring, length and repetition rules. Deduplication employs Bloom filters, MinHash LSH for high Jaccard similarity, and SimHash for Hamming distance in the 1B variant. Perplexity-based filtration and digit-ratio constraints ensure both linguistic authenticity and representational robustness.
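To make the near-duplicate step concrete, here is a small MinHash sketch that estimates Jaccard similarity between two documents. It uses character n-gram shingles (a reasonable choice for Chinese text, which has no whitespace word boundaries) and salted BLAKE2b hashes from the standard library. The shingle size, the number of hash functions, and the omission of LSH banding are simplifying assumptions; this is not the MAP-CC pipeline itself.

```python
import hashlib

def shingles(text, n=5):
    """Set of overlapping character n-grams."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(text, num_perm=64, n=5):
    """One minimum hash value per salted hash function."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")  # distinct salt = distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for g in grams))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a full pipeline, signatures would be split into bands and bucketed (LSH) so that only candidate pairs are compared, rather than scoring every pair of documents.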

Tokenization uses byte-pair encoding (BPE): the 2B variant adopts the baichuan2 SentencePiece tokenizer with a 125,696-token vocabulary, and the 1B variant adopts the Qwen1.5 BPE tokenizer (151,936 tokens). Context length is 4,096 tokens (2B) and 2,048 tokens (1B) (Du et al., 2024, Gu et al., 10 Feb 2025).

3. Training Objectives and Optimization Protocols

Pretraining is conducted exclusively via standard autoregressive causal language modeling:

$\mathcal{L}_{\text{CE}} = -\sum_{t} \log p(x_t \mid x_{<t})$
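A minimal sketch of this loss for one sequence, computed from raw logits. The input layout (one list of vocabulary logits per position, already conditioned on the preceding tokens) is an assumption for illustration.

```python
import math

def log_softmax(logits):
    """Stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def causal_lm_loss(logits_per_step, targets):
    """Sum over positions t of -log p(x_t | x_<t).

    logits_per_step[t] holds the model's logits for position t given x_<t;
    targets[t] is the index of the true token x_t.
    """
    return -sum(log_softmax(logits_per_step[t])[x]
                for t, x in enumerate(targets))
```

Training code usually averages this sum over tokens; the sum form is kept here to match the equation.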

No auxiliary losses (e.g., span corruption, contrastive) are applied. The optimizer is AdamW with weight decay and standard β parameters, and the learning rate follows a cosine annealing schedule after a fixed warmup (Du et al., 2024, Gu et al., 10 Feb 2025). Mixed-precision arithmetic (BF16/FP16) and FlashAttention are employed for memory and computational efficiency.
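The schedule described above can be sketched as a plain function of the step index. The linear shape of the warmup and the `min_lr` floor are assumptions, and the hyperparameter values in the usage note are placeholders rather than values reported for either model.

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr past the final step
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at(s, peak_lr=3e-4, warmup_steps=2000, total_steps=100000)` rises linearly for the first 2,000 steps and then decays along a half cosine.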

Batch size and global optimization hyperparameters are tuned in view of resource constraints, with the 2B CT-LLM reporting total compute of several hundred petaflop-days over the $\sim$1.25T-token pretraining corpus, and the 1B variant executed on 8 $\times$ A100/H800 80GB GPUs for 1.07M steps over 30 days (1T tokens) (Du et al., 2024, Gu et al., 10 Feb 2025).
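The figures quoted for the 1B run imply an average batch size and throughput, which the following back-of-envelope calculation makes explicit. It assumes the 1T tokens are spread evenly over the 1.07M steps, that every sequence fills the 2,048-token context, and that training ran uninterrupted for the full 30 days; none of these assumptions is confirmed by the source, so the results are only rough averages.

```python
# Figures quoted in the text for the 1B variant
tokens = 1.0e12        # total training tokens
steps = 1.07e6         # optimizer steps
days = 30              # wall-clock duration
gpus = 8               # number of GPUs
context_len = 2048     # context length of the 1B variant

tokens_per_step = tokens / steps                 # implied global batch, in tokens
seqs_per_step = tokens_per_step / context_len    # implied global batch, in sequences
tokens_per_sec = tokens / (days * 86400)         # implied aggregate throughput
tokens_per_sec_per_gpu = tokens_per_sec / gpus   # implied per-GPU throughput

print(f"{tokens_per_step:,.0f} tokens/step, {seqs_per_step:,.0f} sequences/step")
print(f"{tokens_per_sec:,.0f} tokens/s total, {tokens_per_sec_per_gpu:,.0f} tokens/s per GPU")
```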

Resource-constrained variants adopt PyTorch FSDP for sharding, operator fusion, and effective checkpoint serialization.

4. Alignment, Fine-Tuning, and Human Preference Learning

Alignment proceeds via sequential Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO):

SFT: Trained on instruction–response pairs, with datasets such as COIG-CQIA (Chinese), OL-CC, COIG-PC, and OpenHermesPreferences (English). Key SFT experiments tune Chinese:English mix ratios (1:1, 2:1, 4:1, 8:1, and monolingual) with the best aggregate performance at 2:1 (Du et al., 2024).
Data filtering: Only SFT pairs with a Qwen-7B perplexity below 3,000 are retained, which improves sample quality.
DPO (Direct Preference Optimization): Human and synthetic ranked response pairs ($\sim$183K Chinese, 46K English) are used to optimize for preference consistency. DPO hyperparameters: batch size 4, learning rate [value missing in source], β=0.5, trained for 2 epochs on H800 GPUs.
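The perplexity filter on SFT pairs can be sketched as below. The per-token log-probabilities would come from a scoring model (Qwen-7B in the source); here they are plain inputs, and the function names are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean per-token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def keep_pair(token_logprobs, threshold=3000.0):
    """Retain an SFT pair only if its perplexity is below the threshold."""
    return perplexity(token_logprobs) < threshold
```

A pair whose tokens average a log-probability of -1 has perplexity of about 2.7 and is kept, while one averaging -9 has perplexity above 8,000 and is dropped.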

The DPO objective follows the formulation:

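$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$

where $\pi_{\text{ref}}$ is the SFT-aligned policy, $y_w$ and $y_l$ are the preferred and rejected responses to prompt $x$, and $\sigma$ is the logistic sigmoid.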
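A minimal sketch of the standard DPO loss for a single preference pair, written in terms of sequence log-probabilities under the trained policy and the frozen reference (SFT) policy. The argument names are hypothetical; `beta=0.5` matches the value quoted above.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.5):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    *_w: log-probability of the preferred response, *_l: of the rejected one.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) = softplus(-m), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$; it falls as the policy raises the preferred response relative to the rejected one.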

5. Evaluation Benchmarks and Comparative Metrics

CT-LLM#1 performance is rigorously assessed using both standardized and Chinese-centric benchmarks:

Standard: BoolQ, COPA, HellaSwag, RTE, WiC, MMLU, PIQA, ARC, GSM8K, HumanEval, MBPP (multilingual/English).
Chinese-centric: C-Eval, CMMLU, and the Chinese Hard Case Benchmark (CHC-Bench#1) covering 214 tasks in writing, humanity/history, science, math, reading comprehension, role-play, hard-case Chinese, and coding. Open-ended responses are scored by GPT-4 using compositional criteria (helpfulness, relevance, accuracy, depth, creativity, detail); multiple-choice (MC) tasks are scored by perplexity.

Safety is evaluated with Cvalues responsibility benchmarks (MC and QA, GPT-4 scores).

Benchmark summary:

Model	C-Eval (%)	CMMLU (%)	Safety (Cvalues)	CHC-Bench Subtasks
CT-LLM#1 (2B, SFT-DPO)	Top-tier	Top-tier	2nd on Cvalues	1st: Writing, Math
Steel-LLM-Chat (1B)	41.9	36.1	—	—
CT-LLM-SFT-2B	41.5	41.5	—	—

On Chinese benchmarks, CT-LLM#1 leads 2B-class models for writing, role-play, math, and hard-case Chinese, while maintaining competitive English and code performance post-SFT and DPO alignment (Du et al., 2024, Gu et al., 10 Feb 2025).

6. Engineering Insights and Practical Lessons

Key findings reveal that:

Chinese-centric pretraining (67–85% Chinese) directly maximizes Chinese proficiency, in contrast to the traditional recipe of English-centric pretraining followed by cross-lingual adaptation (Du et al., 2024, Gu et al., 10 Feb 2025).
A small English fraction in pretraining, combined with subsequent SFT, is sufficient to trigger strong multilingual generalization.
Data quality, achieved through aggressive filtering and deduplication, notably improves robustness and reliability and outweighs gains from sheer data volume alone.
Optimal SFT data proportion is found at 2:1 Chinese:English, balancing bilingual generalization and avoiding overfitting to a single language.
Resource-aware architectural choices (Soft-MoE, mixed-precision, operator fusion) enable feasible small-model training without quality loss.
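The 2:1 Chinese:English SFT mix can be assembled with a small sampler such as the one below. Whether the reported ratio counts examples or tokens is not stated here, so this sketch assumes examples; the function name and seed are illustrative.

```python
import random

def mix_sft(chinese, english, ratio=(2, 1), seed=0):
    """Sample a shuffled SFT mixture with the given Chinese:English example ratio."""
    rng = random.Random(seed)
    zh_w, en_w = ratio
    # Largest mixture that honours the ratio without repeating any example.
    units = min(len(chinese) // zh_w, len(english) // en_w)
    mixed = rng.sample(chinese, units * zh_w) + rng.sample(english, units * en_w)
    rng.shuffle(mixed)
    return mixed
```

Changing `ratio` to `(1, 1)`, `(4, 1)`, or `(8, 1)` reproduces the other mixtures compared in the SFT experiments.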

Emergent scaling effects yield smooth, predictable improvements in reasoning, coding, and Chinese comprehension up to $\sim$800B tokens, with diminishing returns thereafter.

Recommended best practices for tiny Chinese LLM construction include upweighting high-quality Chinese data, using Soft-MoE rather than sparse MoE in small models, applying pre-norm RMSNorm for training stability, and mirroring the pretraining distribution in the fine-tuning data (Gu et al., 10 Feb 2025).

7. Positioning and Future Prospects

CT-LLM#1 redefines the standard for lightweight, open-source Chinese LLMs by systematically prioritizing primary-language pretraining and releasing a complete pipeline: open data (MAP-CC#1), evaluation benchmarks (CHC-Bench#1), and reproducible recipes. This approach catalyzes a paradigm shift toward non-English-centric, resource-efficient multilingual foundation models and provides a practical blueprint for academics and practitioners facing computational or data bottlenecks (Du et al., 2024, Gu et al., 10 Feb 2025).

Availability of source code and checkpoints (e.g., Steel-LLM: https://github.com/zhanshijinwat/Steel-LLM) lowers the barrier to further research and adoption. A plausible implication is that continued refinement of tiny Chinese LLMs, leveraging increasingly sophisticated filtering, fine-tuning heuristics, and model engineering, will further narrow the performance gap with larger-scale, English-centered models while retaining accessibility and deployment advantages.


References

Du et al. (2024). Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model.

Gu et al. (2025). Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM.
