Chinese CharacterDial Corpus
- Chinese CharacterDial Corpus is a family of curated datasets designed for training and evaluating Chinese language models and character-based dialogue systems.
- It employs advanced sinographic structure analysis, statistical language modeling, and rigorous LLM-based scoring to ensure high data quality and educational value.
- Empirical studies show that models trained on the corpus outperform baselines in coherence and engagement, and score higher on benchmarks such as C-Eval and Alignbench.
The Chinese CharacterDial Corpus refers to contemporary, high-quality Chinese language corpora explicitly engineered for training, fine-tuning, and evaluating LLMs and character-based dialogue systems. The corpus conceptually encompasses a family of datasets—such as Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese—each tailored to a distinct NLP objective, including representation learning, educational content acquisition, and conversational alignment in both statistical and neural architectures. The underlying research draws on advanced methodologies in data curation, sinographic structure exploitation, statistical language modeling, and large-scale neural network evaluation, establishing the corpus as a central resource for empirical and theoretical inquiry into Chinese language technologies.
1. Sinographic Composition and Statistical Rationale
The construction and analysis of the CharacterDial Corpus are grounded in recent advances in sinographic language processing, particularly the systematic decomposition of Chinese script into functional linguistic units. Chinese writing comprises three principal granularities: full characters, individual strokes, and so-called constructive parts—sub-character components more complex than strokes but simpler than whole characters. Statistical analyses across historical and contemporary corpora demonstrate that constructive parts closely mirror the “letter” distributions observed in alphabetic languages when rank-frequency statistics are measured via the Kolmogorov–Smirnov statistic, rescaled rank-frequency plots, and functional fits (e.g., Cocho/Beta equations, quadratic logarithms) (Chen et al., 2020). This consistency underscores the appropriateness of constructive parts as foundational analytic units and informs the corpus design paradigm, enabling structurally principled feature engineering for computational models.
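As a concrete illustration of this kind of comparison, the sketch below contrasts the rescaled rank-frequency distributions of two symbol inventories with a two-sample Kolmogorov–Smirnov test. The tokenization and placeholder data are illustrative assumptions, not the procedure of Chen et al. (2020):

```python
from collections import Counter

import numpy as np
from scipy.stats import ks_2samp

def rescaled_rank_freq(tokens):
    """Normalized frequencies ordered by rank, with ranks rescaled to (0, 1]."""
    counts = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    freqs = counts / counts.sum()
    ranks = np.arange(1, len(freqs) + 1) / len(freqs)
    return ranks, freqs

# Placeholder streams: constructive parts of a Chinese text vs. letters of an
# alphabetic text (real studies use large historical/contemporary corpora).
parts = list("constructive-parts-stream-goes-here")
letters = list("alphabetic-letter-stream-goes-here")

_, f_parts = rescaled_rank_freq(parts)
_, f_letters = rescaled_rank_freq(letters)

# Two-sample KS statistic between the two rank-frequency distributions.
stat, pvalue = ks_2samp(f_parts, f_letters)
print(f"KS statistic = {stat:.3f}, p = {pvalue:.3f}")
```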
2. Corpus Components and Dataset Characteristics
CharacterDial Corpus aggregates four distinct datasets curated for Chinese LLM development (Yu et al., 14 Jan 2025):
| Component | Primary Function | Curation Approach |
|---|---|---|
| Fineweb-edu-chinese | Pretraining, education-focused | Filtered, scored selection from open web corpora |
| Fineweb-edu-chinese-v2 | Enhanced pretraining, clarity | Expanded sources, stricter scoring/refinement |
| Cosmopedia-chinese | Synthetic textbook content | LLM-generated from curated seed lessons |
| Smoltalk-chinese | Multi-turn dialogue alignment | Automatic multi-turn chat generation & filtering |
- Fineweb-edu-chinese pools open-source Chinese datasets (e.g., Wudao, Map-CC), keeping samples that score ≥3 for “educational value” under LLM-based ratings to ensure high instructional density. Content is then deduplicated with MinHash (similarity threshold 0.7); a minimal deduplication sketch follows this list.
- Fineweb-edu-chinese-v2 introduces new sources (e.g., Michao, CCI3), employs Qwen2.5-14b-instruct for improved scoring, and further increases sample quality and diversity.
- Cosmopedia-chinese is a synthetic corpus of textbook-style discourse generated by long-context LLMs (glm4-9b-longwriter) using seed data from expert domains. Controlled generation settings (e.g., temperature 0.8) and deduplication yield a corpus (~15M samples) emphasizing factual density and coherence.
- Smoltalk-chinese produces multi-turn, stylistically diverse dialogue by prompting advanced LLMs (Deepseek-V2.5, Qwen2.5-72B-Instruct), with scoring-based filtering and embedding deduplication (similarity thresholds 0.8 for multi-turn, 0.7 for single-turn). The result is ~70,000 interaction-rich samples targeting instruction tuning and conversational alignment.
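A minimal sketch of near-duplicate filtering in the spirit of the MinHash step above, using the datasketch library (the library choice, character 3-gram shingling, and placeholder corpus are assumptions, not details from the source):

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """MinHash over character 3-gram shingles (shingle size is an assumption)."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 2, 1)):
        m.update(text[i:i + 3].encode("utf-8"))
    return m

corpus = ["示例教育文档一……", "示例教育文档二……"]  # placeholder documents

# Jaccard-similarity threshold 0.7, matching the reported setting.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
deduplicated = []
for idx, doc in enumerate(corpus):
    m = minhash_of(doc)
    if not lsh.query(m):              # no near-duplicate retained so far
        lsh.insert(f"doc-{idx}", m)
        deduplicated.append(doc)
```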
3. Methodologies for Data Curation and Quality Control
Data quality and diversity are maintained through a layered curation pipeline:
- LLM-Based Scoring: Samples are annotated for educational value and coherence on a 0–5 scale via automated inference with models such as Qwen2-7b-instruct and Qwen2.5-14b-instruct; only samples scoring ≥3 are preserved (a scoring sketch follows this list).
- Deduplication: MinHash deduplication ensures minimal content overlap (similarity threshold 0.7), reducing redundancy at massive scale.
- Automated Generation: Synthetic and dialogue data use generative models (glm4-9b-longwriter, Deepseek-V2.5) controlled via prompt templates, system prompts, and consistent randomization (e.g., temperature schedules).
- Reproducibility Controls: Entire pipelines employ reproducible designs: standardized prompt interfaces, batch seeding, and embedding-based similarity culling (using gte-zh-large for Smoltalk; an embedding-culling sketch also follows this list).
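A hedged sketch of the score-and-filter step. It assumes an OpenAI-compatible endpoint (e.g., a local vLLM server) hosting the scoring model; the prompt wording, endpoint URL, and model identifier are illustrative, not taken from the paper:

```python
from openai import OpenAI

# Hypothetical local endpoint serving the scoring model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SCORING_PROMPT = (
    "请评估以下中文文本的教育价值，给出 0（无）到 5（极高）的整数分数。"
    "只回复数字。\n\n{text}"
)

def educational_score(text: str) -> int:
    """Ask the LLM for a 0-5 educational-value rating of one sample."""
    resp = client.chat.completions.create(
        model="Qwen2.5-14B-Instruct",   # scorer named in the paper; served name may differ
        messages=[{"role": "user", "content": SCORING_PROMPT.format(text=text)}],
        temperature=0.0,                # deterministic scoring
    )
    return int(resp.choices[0].message.content.strip())

corpus = ["样例文本……"]  # placeholder documents
kept = [doc for doc in corpus if educational_score(doc) >= 3]
```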
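And a companion sketch of embedding-based similarity culling, assuming the sentence-transformers library and a GTE Chinese embedding model (the exact Hugging Face model identifier and greedy culling strategy are assumptions; the 0.7/0.8 thresholds come from the Smoltalk-chinese description above):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-large-zh")  # assumed HF id for gte-zh-large

def cull_near_duplicates(texts, threshold=0.7):
    """Greedily keep texts whose max cosine similarity to kept ones is < threshold."""
    embs = model.encode(texts, normalize_embeddings=True)  # unit vectors
    kept_idx = []
    for i, e in enumerate(embs):
        if all(float(np.dot(e, embs[j])) < threshold for j in kept_idx):
            kept_idx.append(i)
    return [texts[i] for i in kept_idx]

dialogues = ["用户：你好……", "用户：您好……"]  # placeholder single-turn samples
unique = cull_near_duplicates(dialogues, threshold=0.7)  # 0.8 for multi-turn
```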
4. Empirical Evaluations and Quantitative Results
Empirical studies with the corpus demonstrate its training efficacy for Chinese LLMs:
- Pretraining on Fineweb-edu-chinese: A 2B-parameter Llama model trained on this data significantly outperforms baselines randomly sampled from the same pools, exhibiting marked accuracy gains at milestone training steps (e.g., after 45k steps) on the C-Eval and CMMLU benchmarks (Yu et al., 14 Jan 2025).
- Cosmopedia-chinese Impact: Although only modest quantitative improvements are observed after short fine-tuning (a few epochs), human assessors judge the responses of LLMs pretrained on Cosmopedia-chinese to be notably more coherent and knowledge-rich.
- Smoltalk-chinese Alignment: Fine-tuning a pretrained LLM on Smoltalk-chinese yields superior downstream performance on Alignbench, surpassing fine-tunes on other instruction datasets such as Infinity-Instruct and Magpie-Qwen2-Pro.
This validation pipeline leverages both automated metrics (accuracy, passage-level loss) and human evaluation (quality, informativeness).
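For the automated side, passage-level loss can be computed as the mean per-token negative log-likelihood a causal LM assigns to a passage. The sketch below uses Hugging Face transformers; the model name is a placeholder, and this is an illustrative recipe rather than the paper's exact evaluation harness:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-pretrained-checkpoint"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def passage_loss(passage: str) -> float:
    """Mean per-token cross-entropy of the passage under the model."""
    enc = tok(passage, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # HF shifts labels internally
    return out.loss.item()

print(passage_loss("这是一个用于评估的示例段落。"))
```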
5. Character-Level and Structural Feature Exploitation
Parallel research leverages intrinsic sinographic features to further augment corpus-based modeling:
- Sinographic structure is harnessed using directed inclusion graphs over allographic classes, with edges weighted for semanticity and phoneticity (Haralambous, 2014).
- The “most semantic subcharacter chain” is computed recursively for each character by maximizing semantic weights along inclusion paths; the weights combine co-occurrence frequencies derived from WordNet synsets with a term encoding Kang Xi radical similarity (a schematic formulation follows this list).
- Incorporating these hierarchical features into unigram text models yielded classification accuracy improvements from 89.61% to 92.62% (Chinese Sogou corpus, linear SVM) with a reduction in support vectors, demonstrating enhanced efficiency even on already strong baselines.
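The original formula did not survive extraction; the following is a schematic reconstruction consistent with the description above, with all symbol names ($S$, $w$, $f$, $k$, $\alpha$, $\beta$, $\mathrm{parts}$) chosen for illustration rather than taken from Haralambous (2014):

$$
S(c) \;=\; \max_{c' \in \mathrm{parts}(c)} \bigl[\, w(c, c') + S(c') \,\bigr],
\qquad
w(c, c') \;=\; \alpha\, f(c, c') \;+\; \beta\, k(c, c'),
$$

where $f(c, c')$ is the WordNet-synset-based co-occurrence frequency of the pair, $k(c, c')$ encodes Kang Xi radical similarity, and the recursion bottoms out at atomic constructive parts. The chain itself is the arg-max path through the inclusion graph.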
This methodology has broader applicability, including rare-character semantic approximation, concept network construction, and improved segmentation or cross-lingual search.
6. Applications: Conversational Agents and Dialogue Modeling
The CharacterDial Corpus directly supports large-scale dialogue modeling and conversational agent development:
- CharacterGLM (Zhou et al., 2023), a model family derived from ChatGLM, is optimized using corpora with character-centric profiles for generating highly consistent, human-like, and engaging dialogues. Models (6B–66B parameters) are conditioned on natural language prompts encoding static and dynamic character attributes.
- The training objective is the standard conditional language-modeling loss over the response:

$$
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{|y|} \log p_\theta\!\left(y_t \mid c,\, d,\, y_{<t}\right),
$$

where $c$ is the character prompt, $d$ is the dialogue context, and $y = (y_1, \dots, y_{|y|})$ is the response (a loss-masking sketch follows this list).
- Evaluations indicate that CharacterGLM models fine-tuned on these corpora outperform closed-source baselines (including GPT-3.5/4) in consistency, engagement, and user-aligned conversational behaviors.
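A minimal sketch of how that objective is typically implemented with Hugging Face transformers: prompt and context tokens are masked out of the loss with the label value -100, so gradients flow only through the response. The model name, example strings, and variable names are placeholders, not CharacterGLM's actual training code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-base-llm"  # placeholder base checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

character_prompt = "你是一位名叫小梅的茶馆老板……"  # c: character attributes
dialogue_context = "顾客：今天有什么好茶？\n"       # d
response = "小梅：今天新到了一批龙井，香气很足。"    # y

# Tokenize prefix and full sequence (boundary effects of separate
# tokenization are ignored in this sketch).
prefix_ids = tok(character_prompt + dialogue_context, return_tensors="pt").input_ids
full_ids = tok(character_prompt + dialogue_context + response, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, : prefix_ids.shape[1]] = -100  # ignore prompt/context positions in the loss

out = model(input_ids=full_ids, labels=labels)
out.loss.backward()  # -log p(y | c, d), averaged over response tokens
```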
The release of models (e.g., the 6B-parameter version) and a subset of the training data ensures accessibility and reproducibility for research in personalized AI character generation, long-turn dialogue, and social-agent simulation.
7. Linguistic and Cross-Language Implications
The corpus’ statistical design and application illuminate several broader linguistic principles:
- Statistical regularity in human writing systems: Chinese constructive parts show rank-frequency distributions and mathematical fits congruent with those of alphabetic scripts, suggesting universal self-organizing language patterns (Chen et al., 2020).
- Analytical choices: Using constructive parts as “letters” in corpus design allows meaningful cross-orthographic, typological comparison and enables more granular, structurally aware computational methods.
- Educational and alignment-specific value: The inclusion of filtered educational content (Fineweb-edu) and synthetic expository writing (Cosmopedia) directly benefits LLMs’ reasoning capability, factual density, and context awareness.
- Future research trajectories include large-scale manual annotation of sinographic graphs, expansion to syntagmatic and higher-order linguistic constructs, and scalable, public dataset development pipelines for a wider range of Chinese NLP applications.
Summary
The Chinese CharacterDial Corpus, encompassing diverse and structurally motivated data resources, underpins state-of-the-art advances in Chinese LLM pretraining, evaluation, and conversational agent alignment. Its design integrates insights from statistical linguistics, sinographic structure analysis, and automated curation, and is supported by measurable improvements on established benchmarks and in downstream applications. The methodology and empirical findings reinforce the corpus' centrality to current and prospective research in sinographic NLP, dialogue modeling, and computational linguistics.