
HSKBenchmark: Chinese SLA Evaluation Suite

Updated 26 November 2025
  • HSKBenchmark is a technical evaluation suite that models and assesses Chinese second language acquisition using a curriculum-tuning methodology across HSK Levels 3–6.
  • It integrates a 6.76-million-token authentic textbook corpus, synthetic grammar prompts, and exam-style writing tasks to mirror authentic learning conditions.
  • The benchmark employs an automated HSKAgent and detailed linguistic metrics to evaluate grammar coverage, lexical diversity, syntactic complexity, and holistic writing performance.

HSKBenchmark refers to a technical evaluation suite and framework designed for modeling, benchmarking, and advancing the understanding of Chinese second language acquisition (SLA) in LLMs through a curriculum-tuning methodology. It targets phase-wise training and assessment of LLM writing abilities on authentic, level-segmented data reflecting the Hanyu Shuiping Kaoshi (HSK) proficiency standards, specifically HSK Levels 3–6. HSKBenchmark includes curated corpora, linguistically grounded evaluation criteria, synthetic writing task datasets, and an automated assessment agent, enabling systematic and reproducible study of LLMs in Chinese SLA (Yang et al., 19 Nov 2025).

1. Data Composition and Corpus Structure

HSKBenchmark comprises three core components:

  • Authentic Textbook Corpus: A 6.76-million-token dataset, extracted from 79 widely-used Chinese-as-L2 textbooks (e.g., HSK Standard Course, Boya Chinese) covering HSK Levels 3–6. All Pinyin, English glosses, and images are removed.
| Level | Tokens | Sentences | Avg. tokens/sentence |
|-------|--------|-----------|----------------------|
| HSK 3 | 895,037 | 22,743 | 39.35 |
| HSK 4 | 1,473,516 | 34,637 | 42.66 |
| HSK 5 | 1,717,178 | 41,044 | 41.84 |
| HSK 6 | 2,678,621 | 63,650 | 42.08 |
| Total | 6,764,352 | 162,074 | 41.74 |
  • Synthetic Instruction Samples: 16,462 prompt-completion pairs, generated for 591 distinct grammar points defined by the Chinese Proficiency Grading Standards, covering types including word, phrase, fixed format, sentence component, sentence type, and emphatic usage. Generation employed GPT-4.1-mini, DeepSeek-Chat-V3, and Gemini-2.5-Flash with in-context 2-shot prompting. Manual verification (Fleiss’ κ = 0.91) ensured high validity (95%).
  • Test Topics: 30 exam-style writing prompts, sampled from the HSK Dynamic Composition Corpus v2.0, span major genres (narrative, argumentative, descriptive, expository) and pragmatic functions. All test topics were held out from training data.
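As a quick sanity check, the aggregate row of the corpus table above follows directly from the per-level counts; a minimal sketch (numbers transcribed from the table, no corpus files involved):

```python
# Recompute totals and average sentence length from the per-level
# (tokens, sentences) counts in the corpus table above.
corpus_stats = {
    "HSK 3": (895_037, 22_743),
    "HSK 4": (1_473_516, 34_637),
    "HSK 5": (1_717_178, 41_044),
    "HSK 6": (2_678_621, 63_650),
}

total_tokens = sum(tokens for tokens, _ in corpus_stats.values())
total_sentences = sum(sents for _, sents in corpus_stats.values())
avg_tokens_per_sentence = total_tokens / total_sentences

for level, (tokens, sents) in corpus_stats.items():
    print(f"{level}: {tokens / sents:.2f} tokens/sentence")
print(f"Total: {total_tokens} tokens, {total_sentences} sentences, "
      f"{avg_tokens_per_sentence:.2f} tokens/sentence")
```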

2. Curriculum-Tuning Methodology

Training in HSKBenchmark follows a staged curriculum closely mirroring human SLA. The methodology advances models through HSK Levels 3→4→5→6, simulating the “i + 1” learning paradigm.

At each level ℓ, the process is:

  1. Causal LM Pretraining on the level-specific textbook set $\mathcal{T}^{(\ell)}$:

$$\mathcal{L}_{PT}^{(\ell)} = -\sum_{x \in \mathcal{T}^{(\ell)}} \sum_{t=1}^{|x|} \log P_{\theta^{(\ell-1)}}(x_t \mid x_{<t})$$

  2. Instruction Tuning on the associated synthetic dataset $\mathcal{D}^{(\ell)}$:

$$\mathcal{L}_{IT}^{(\ell)} = -\sum_{(p, y) \in \mathcal{D}^{(\ell)}} \sum_{t=1}^{|y|} \log P_{\theta_{PT}^{(\ell)}}(y_t \mid p, y_{<t})$$

Model parameters evolve as:

$$\theta_{PT}^{(\ell)} = \text{Pretrain}(\theta^{(\ell-1)}, \mathcal{T}^{(\ell)})$$

$$\theta_{IT}^{(\ell)} = \text{InstructTune}(\theta_{PT}^{(\ell)}, \mathcal{D}^{(\ell)})$$

Optionally, aggregate objectives can employ curriculum loss weighting:

$$\mathcal{L}_{\text{curric}} = \sum_{\ell=3}^{6} w_\ell \left( \mathcal{L}_{PT}^{(\ell)} + \mathcal{L}_{IT}^{(\ell)} \right)$$

where $w_\ell$ captures the importance of each stage.

Base models include LLaMA2-7B-Chat, Mistral-7B-Instruct, and Chinese-Alpaca-2-7B, fine-tuned via LoRA. GPT-4.1-mini, DeepSeek-Chat-V3, and Gemini-2.5-Flash serve as non-curriculum baselines.
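The staged level-by-level procedure above can be sketched as a simple control-flow loop; `pretrain` and `instruct_tune` are hypothetical placeholders for the causal-LM and supervised fine-tuning steps, reduced here to recording the stage order so the curriculum structure is visible:

```python
# Sketch of the HSK 3 -> 6 curriculum loop: each level's parameters
# seed the next level, alternating pretraining and instruction tuning.
# `pretrain` / `instruct_tune` are stand-ins, not a real training API.

def pretrain(params, textbook_corpus):
    # Would minimize L_PT on the level's textbook set; here we log the stage.
    return params + [("PT", textbook_corpus)]

def instruct_tune(params, instruction_data):
    # Would minimize L_IT on the level's synthetic grammar prompts.
    return params + [("IT", instruction_data)]

def curriculum_tune(base_params, levels=(3, 4, 5, 6)):
    theta = base_params
    for level in levels:
        theta = pretrain(theta, f"T_{level}")       # causal LM on level textbooks
        theta = instruct_tune(theta, f"D_{level}")  # SFT on level grammar prompts
    return theta

history = curriculum_tune([])
```

The key design point mirrored here is that $\theta^{(\ell-1)}$ is never reset: each stage fine-tunes the parameters produced by the previous level, simulating incremental "i + 1" exposure.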

3. Linguistically-Grounded Evaluation System

Evaluation in HSKBenchmark aims for multi-dimensional linguistic rigor using automated metrics and the HSKAgent:

  • Grammar Coverage ($C_g^{(\ell)}$): Fraction of distinct grammar points used at each HSK level, computed by binary classification across all 591 points.
  • Writing Errors: Quantified by the sum of character-level, lexical, syntactic, and discourse-level errors per output.
  • Lexical Complexity: Includes type-token ratio (TTR), moving-average TTR (MATTR), and measure of textual lexical diversity (MTLD).
  • Syntactic Complexity: Captures mean sentence length (MSL), mean dependency distance (MDD), subordination index (SI), and parse-tree depth (PTD).
  • Holistic Scoring: Mimics HSK criteria on a 0–100 scale, aggregating length, task fulfillment, cohesion, structural range, and accuracy. The automated rubric achieves quadratic weighted kappa (QWK) ≈ 0.80 against human gold-standard scores.
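As one concrete example from the lexical-complexity metrics above, moving-average TTR averages the type-token ratio over a sliding window; a minimal pure-Python sketch (the window size is an illustrative choice, not the paper's setting):

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all sliding windows.

    Falls back to plain TTR when the text is shorter than the window,
    which keeps the metric defined for very short essays.
    """
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i : i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)

# Toy usage: a lexically varied text scores higher than a repetitive one.
varied = list("abcdefghij") * 10   # period-10 cycle: every window is all-distinct
repetitive = ["a"] * 100
assert mattr(varied, window=10) > mattr(repetitive, window=10)
```

Unlike plain TTR, this windowed form does not shrink mechanically as essay length grows, which is why MATTR (and MTLD) accompany TTR in the benchmark's metric set.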

4. Automated Assessment: HSKAgent

HSKAgent is a fine-tuned evaluation agent based on Qwen3-8B, extended with LoRA for all core linguistic tasks:

  • Grammar Detection: F1 = 0.97.
  • Error Detection: Accuracy = 90%.
  • Holistic Scoring: F1 = 0.81, QWK = 0.7969, Spearman = 0.8010, Pearson = 0.8023.

Datasets for HSKAgent include 16K positive/negative grammar instances and 10K human-annotated L2 essays with error tags and scores.
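Quadratic weighted kappa, the agreement statistic reported above for holistic scoring, has a standard confusion-matrix formulation; a pure-Python sketch (a textbook definition, not the paper's implementation; scores are assumed pre-binned into integer categories):

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """QWK between two integer rating vectors with labels in range(n_categories)."""
    n = len(rater_a)
    # Observed joint distribution of the two raters' labels.
    observed = [[0.0] * n_categories for _ in range(n_categories)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1 / n
    # Marginals, and the expected joint distribution under independence.
    row = [sum(observed[i]) for i in range(n_categories)]
    col = [sum(observed[i][j] for i in range(n_categories)) for j in range(n_categories)]
    expected = [[row[i] * col[j] for j in range(n_categories)]
                for i in range(n_categories)]
    # Quadratic disagreement weights: distant ratings are penalized more.
    w = [[(i - j) ** 2 / (n_categories - 1) ** 2 for j in range(n_categories)]
         for i in range(n_categories)]
    num = sum(w[i][j] * observed[i][j]
              for i in range(n_categories) for j in range(n_categories))
    den = sum(w[i][j] * expected[i][j]
              for i in range(n_categories) for j in range(n_categories))
    return 1 - num / den

# Perfect agreement yields kappa = 1.
assert quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4) == 1.0
```

The quadratic weights are what make QWK suitable for ordinal essay scores: confusing adjacent score bands costs far less than confusing distant ones.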

5. Experimental Results and Modeling Human-Like SLA Trajectories

Experimental results show that HSKBenchmark-trained LLMs exhibit human-like second language acquisition patterns:

| Model / Level | $C_g(3)$ | Err | MATTR | MDD | Score |
|---------------|----------|-----|-------|-----|-------|
| Native speakers | 0.341 | 1.40 | 0.806 | 2.977 | 88.33 |
| HSK learners (95*) | 0.356 | 2.87 | 0.817 | 2.839 | 85.00 |
| GPT-4.1-mini | 0.398 | 0.00 | 0.829 | 2.603 | 91.50 |
| LLaMA2 (base) | 0.484 | 0.90 | 0.686 | 2.425 | 70.00 |
| LLaMA2_HSK3 | 0.492 ↑ | 0.63 ↓ | 0.719 ↑ | 2.505 ↑ | 75.83 ↑ |
| LLaMA2_HSK6 (final) | 0.425 | 0.53 ↓ | 0.764 ↑ | 2.556 ↑ | 81.83 ↑ |

Key observations include:

  • Consistent monotonic improvement in grammar coverage, lexical, and syntactic complexity with curriculum progression.
  • Decreased error rates following each instruction-tuning phase.
  • Emergent usage and generalization behaviors at advanced levels.
  • Ablation with shuffled data yields inferior late-stage (HSK 5–6) performance versus strict curriculum, empirically supporting the curriculum-based “i + 1” paradigm in LLMs.
  • Curriculum-tuned LLMs maintain or slightly improve general Chinese and English benchmark scores, with no catastrophic forgetting.

6. Impact and Significance

HSKBenchmark establishes a systematic research platform for both dynamic modeling and assessment of Chinese SLA in LLMs. Its modular and phased design enables controlled evaluation of acquisition-phase sensitivity, emergent linguistic behaviors, and cross-task transfer in neural LLMs. The result is a benchmark suite that delivers both fine-grained, linguistically motivated metrics and reproducible assessment, facilitating future work in computational linguistics, interpretability, and educational technology for Chinese as a second language (Yang et al., 19 Nov 2025).
