HSKBenchmark: Chinese SLA Evaluation Suite
- HSKBenchmark is a technical evaluation suite that models and assesses Chinese second language acquisition using a curriculum-tuning methodology across HSK Levels 3–6.
- It integrates a 6.76-million-token authentic textbook corpus, synthetic grammar prompts, and exam-style writing tasks to mirror authentic learning conditions.
- The benchmark employs an automated HSKAgent and detailed linguistic metrics to evaluate grammar coverage, lexical diversity, syntactic complexity, and holistic writing performance.
HSKBenchmark refers to a technical evaluation suite and framework designed for modeling, benchmarking, and advancing the understanding of Chinese second language acquisition (SLA) in LLMs through a curriculum-tuning methodology. It targets phase-wise training and assessment of LLM writing abilities on authentic, level-segmented data reflecting the Hanyu Shuiping Kaoshi (HSK) proficiency standards, specifically HSK Levels 3–6. HSKBenchmark includes curated corpora, linguistically grounded evaluation criteria, synthetic writing task datasets, and an automated assessment agent, enabling systematic and reproducible study of LLMs in Chinese SLA (Yang et al., 19 Nov 2025).
1. Data Composition and Corpus Structure
HSKBenchmark comprises three core components:
- Authentic Textbook Corpus: A 6.76-million-token dataset, extracted from 79 widely-used Chinese-as-L2 textbooks (e.g., HSK Standard Course, Boya Chinese) covering HSK Levels 3–6. All Pinyin, English glosses, and images are removed.
| Level | Tokens | Sentences | Avg tokens/snt |
|---|---|---|---|
| HSK 3 | 895,037 | 22,743 | 39.35 |
| HSK 4 | 1,473,516 | 34,637 | 42.66 |
| HSK 5 | 1,717,178 | 41,044 | 41.84 |
| HSK 6 | 2,678,621 | 63,650 | 42.08 |
| Total | 6,764,352 | 162,074 | 41.74 |
- Synthetic Instruction Samples: 16,462 prompt-completion pairs, generated for 591 distinct grammar points defined by the Chinese Proficiency Grading Standards, covering types including word, phrase, fixed format, sentence component, sentence type, and emphatic usage. Generation employed GPT-4.1-mini, DeepSeek-Chat-V3, and Gemini-2.5-Flash with in-context 2-shot prompting (a prompt-construction sketch follows this list). Manual verification (Fleiss’ κ = 0.91) ensured high validity (95%).
- Test Topics: 30 exam-style writing prompts, sampled from the HSK Dynamic Composition Corpus v2.0, spanning major genres (narrative, argumentative, descriptive, expository) and pragmatic functions. All test topics were held out from training data.
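A minimal sketch of how such 2-shot synthetic-sample generation could be set up is shown below. Only the model name (gpt-4.1-mini) and the 2-shot format come from the paper; the system prompt, exemplar pairs, and helper function are illustrative assumptions.

```python
# Hypothetical sketch: build a 2-shot chat prompt that asks a model to
# generate one prompt-completion pair targeting a single grammar point.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT = [  # two worked exemplars (illustrative contents, not the paper's)
    {"grammar_point": "把 (ba) construction",
     "prompt": "用“把”字句描述你整理房间的过程。",
     "completion": "我先把书放回书架，然后把衣服叠好放进柜子里。"},
    {"grammar_point": "虽然……但是……",
     "prompt": "用“虽然……但是……”写一句关于学习汉语的话。",
     "completion": "虽然汉语的声调很难，但是我每天都坚持练习。"},
]

def build_messages(grammar_point: str) -> list[dict]:
    """Assemble a 2-shot chat prompt for one target grammar point."""
    messages = [{"role": "system",
                 "content": "You create Chinese L2 writing exercises that "
                            "target one specified grammar point."}]
    for ex in FEW_SHOT:
        messages.append({"role": "user",
                         "content": f"Grammar point: {ex['grammar_point']}"})
        messages.append({"role": "assistant",
                         "content": f"Prompt: {ex['prompt']}\n"
                                    f"Completion: {ex['completion']}"})
    messages.append({"role": "user", "content": f"Grammar point: {grammar_point}"})
    return messages

resp = client.chat.completions.create(model="gpt-4.1-mini",
                                      messages=build_messages("越来越 + adjective"))
print(resp.choices[0].message.content)
```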
2. Curriculum-Tuning Methodology
Training in HSKBenchmark follows a staged curriculum closely mirroring human SLA. The methodology advances models through HSK Levels 3→4→5→6, simulating Krashen’s “i + 1” (comprehensible input) learning paradigm.
At each level $\ell \in \{3,4,5,6\}$, the process is:
- Causal LM Pretraining on the level-specific textbook set $\mathcal{D}_\ell^{\text{text}}$:
$$\mathcal{L}_\ell^{\text{LM}}(\theta) = -\sum_{x \in \mathcal{D}_\ell^{\text{text}}} \sum_{t} \log p_\theta(x_t \mid x_{<t})$$
- Instruction Tuning on the associated synthetic dataset $\mathcal{D}_\ell^{\text{inst}}$:
$$\mathcal{L}_\ell^{\text{inst}}(\theta) = -\sum_{(x,\,y) \in \mathcal{D}_\ell^{\text{inst}}} \log p_\theta(y \mid x)$$
Model parameters evolve as:
$$\theta_\ell = \arg\min_{\theta}\left[\mathcal{L}_\ell^{\text{LM}}(\theta) + \mathcal{L}_\ell^{\text{inst}}(\theta)\right], \qquad \theta \text{ initialized from } \theta_{\ell-1}$$
Optionally, aggregate objectives can employ curriculum loss weighting
$$\mathcal{L} = \sum_{\ell=3}^{6} \lambda_\ell\, \mathcal{L}_\ell,$$
where $\lambda_\ell$ captures stage importance.
Base models include LLaMA2-7B-Chat, Mistral-7B-Instruct, and Chinese-Alpaca-2-7B, fine-tuned via LoRA. Non-curriculum baselines include GPT-4.1-mini, DeepSeek-Chat-V3, and Gemini-2.5-Flash.
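The staged loop can be pictured with a short Hugging Face sketch, a minimal illustration rather than the paper’s training code: the dataset file paths, the `"text"` field, the LoRA targets, and all hyperparameters below are assumptions.

```python
# Sketch of curriculum tuning: per level, a causal-LM phase on textbook data,
# then an instruction phase, with parameters carried over between levels.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

base = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base),
    LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32,
               target_modules=["q_proj", "v_proj"]),
)
collator = DataCollatorForLanguageModeling(tok, mlm=False)

def phase(dataset_path: str, tag: str) -> None:
    """One fine-tuning phase; the model carries theta over from the prior phase."""
    ds = load_dataset("json", data_files=dataset_path)["train"]
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
                batched=True, remove_columns=ds.column_names)
    Trainer(model=model,
            args=TrainingArguments(output_dir=f"ckpt/{tag}", num_train_epochs=1,
                                   per_device_train_batch_size=4),
            train_dataset=ds, data_collator=collator).train()

for level in (3, 4, 5, 6):                          # strict HSK 3 -> 6 ordering
    phase(f"data/hsk{level}_textbook.json", f"hsk{level}_lm")      # causal-LM phase
    phase(f"data/hsk{level}_instructions.json", f"hsk{level}_it")  # instruction phase
    model.save_pretrained(f"ckpt/theta_hsk{level}")  # checkpoint theta_l per level
```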
3. Linguistically-Grounded Evaluation System
Evaluation in HSKBenchmark aims for multi-dimensional linguistic rigor using automated metrics and the HSKAgent:
- Grammar Coverage (Cov): Fraction of distinct grammar points used at each HSK level, computed by binary classification across all 591 points.
- Writing Errors: Quantified by the sum of character-level, lexical, syntactic, and discourse-level errors per output.
- Lexical Complexity: Includes type-token ratio (TTR), moving-average TTR (MATTR), and measure of textual lexical diversity (MTLD).
- Syntactic Complexity: Captures mean sentence length (MSL), mean dependency distance (MDD), subordination index (SI), and parse-tree depth (PTD); a toy implementation of MATTR and MDD appears after this list.
- Holistic Scoring: Mimics HSK criteria on a 0–100 scale, aggregating length, task fulfillment, cohesion, structural range, and accuracy. The automated rubric achieves quadratic weighted kappa (QWK) ≈ 0.80 against human gold-standard scores.
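For concreteness, here is a toy implementation of two of these metrics (MATTR and MDD), assuming pre-tokenized text and a dependency parse supplied as (head, dependent) index pairs; the tokenization and parsing backends are left unspecified.

```python
def ttr(tokens: list[str]) -> float:
    """Type-token ratio: distinct types over total tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-average TTR: mean TTR over fixed-size sliding windows."""
    if len(tokens) < window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

def mdd(arcs: list[tuple[int, int]]) -> float:
    """Mean dependency distance: average |head - dependent| over all arcs."""
    return sum(abs(h - d) for h, d in arcs) / len(arcs)

tokens = "我 把 书 放 回 书架 上".split()
arcs = [(3, 0), (3, 1), (1, 2), (3, 4), (4, 5), (5, 6)]  # toy 0-indexed parse
print(f"TTR={ttr(tokens):.3f}  MATTR={mattr(tokens, window=5):.3f}  MDD={mdd(arcs):.2f}")
```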
4. Automated Assessment: HSKAgent
HSKAgent is a fine-tuned evaluation agent based on Qwen3-8B, extended with LoRA for all core linguistic tasks:
- Grammar Detection: F1 = 0.97.
- Error Detection: Accuracy = 90%.
- Holistic Scoring: F1 = 0.81, QWK = 0.7969, Spearman = 0.8010, Pearson = 0.8023.
Datasets for HSKAgent include 16K positive/negative grammar instances and 10K human-annotated L2 essays with error tags and scores.
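The agreement statistics above can be reproduced in form with standard library calls; the paired score arrays below are toy values, not data from the paper.

```python
# Illustrative computation of QWK, Spearman, and Pearson agreement between
# agent scores and human gold scores (hypothetical values).
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr, pearsonr

human = [85, 70, 90, 60, 75, 80, 95, 65]   # hypothetical human gold scores
agent = [80, 70, 85, 65, 75, 85, 95, 60]   # hypothetical HSKAgent scores

# QWK expects discrete labels; bucket the 0-100 scores into 5-point bands.
bands_h = [s // 5 for s in human]
bands_a = [s // 5 for s in agent]

print("QWK:", cohen_kappa_score(bands_h, bands_a, weights="quadratic"))
print("Spearman:", spearmanr(human, agent)[0])
print("Pearson:", pearsonr(human, agent)[0])
```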
5. Experimental Results and Modeling Human-Like SLA Trajectories
Experimental results show that HSKBenchmark-trained LLMs reproduce human-like second language acquisition trajectories:
| Model / Level | Cov | Err | MATTR | MDD | Score |
|---|---|---|---|---|---|
| Native speakers | 0.341 | 1.40 | 0.806 | 2.977 | 88.33 |
| HSK learners (95*) | 0.356 | 2.87 | 0.817 | 2.839 | 85.00 |
| GPT-4.1-mini | 0.398 | 0.00 | 0.829 | 2.603 | 91.50 |
| LLaMA2 (base) | 0.484 | 0.90 | 0.686 | 2.425 | 70.00 |
| LLaMA2_HSK3 | 0.492↑ | 0.63↓ | 0.719↑ | 2.505↑ | 75.83↑ |
| LLaMA2_HSK6 (final) | 0.425 | 0.53↓ | 0.764↑ | 2.556↑ | 81.83↑ |
Key observations include:
- Consistent monotonic improvement in grammar coverage, lexical diversity, and syntactic complexity as the curriculum progresses.
- Decreased error rates following each instruction-tuning phase.
- Emergent usage and generalization behaviors at advanced levels.
- Ablation with shuffled data yields inferior late-stage (HSK 5–6) performance versus strict curriculum, empirically supporting the curriculum-based “i + 1” paradigm in LLMs.
- Curriculum-tuned LLMs maintain or slightly improve general Chinese and English benchmark scores, with no catastrophic forgetting.
6. Impact and Significance
HSKBenchmark establishes a systematic research platform for both dynamic modeling and assessment of Chinese SLA in LLMs. Its modular and phased design enables controlled evaluation of acquisition-phase sensitivity, emergent linguistic behaviors, and cross-task transfer in neural LLMs. The result is a benchmark suite that delivers both fine-grained, linguistically motivated metrics and reproducible assessment, facilitating future work in computational linguistics, interpretability, and educational technology for Chinese as a second language (Yang et al., 19 Nov 2025).