HSKBenchmark: Chinese SLA Evaluation Suite
- HSKBenchmark is a technical evaluation suite that models and assesses Chinese second language acquisition using a curriculum-tuning methodology across HSK Levels 3–6.
- It integrates a 6.76-million-token authentic textbook corpus, synthetic grammar prompts, and exam-style writing tasks to mirror authentic learning conditions.
- The benchmark employs an automated HSKAgent and detailed linguistic metrics to evaluate grammar coverage, lexical diversity, syntactic complexity, and holistic writing performance.
HSKBenchmark refers to a technical evaluation suite and framework designed for modeling, benchmarking, and advancing the understanding of Chinese second language acquisition (SLA) in LLMs through a curriculum-tuning methodology. It targets phase-wise training and assessment of LLM writing abilities on authentic, level-segmented data reflecting the Hanyu Shuiping Kaoshi (HSK) proficiency standards, specifically HSK Levels 3–6. HSKBenchmark includes curated corpora, linguistically grounded evaluation criteria, synthetic writing task datasets, and an automated assessment agent, enabling systematic and reproducible study of LLMs in Chinese SLA (Yang et al., 19 Nov 2025).
1. Data Composition and Corpus Structure
HSKBenchmark comprises three core components:
- Authentic Textbook Corpus: A 6.76-million-token dataset, extracted from 79 widely-used Chinese-as-L2 textbooks (e.g., HSK Standard Course, Boya Chinese) covering HSK Levels 3–6. All Pinyin, English glosses, and images are removed.
| Level | Tokens | Sentences | Avg tokens/snt |
|---|---|---|---|
| HSK 3 | 895,037 | 22,743 | 39.35 |
| HSK 4 | 1,473,516 | 34,637 | 42.66 |
| HSK 5 | 1,717,178 | 41,044 | 41.84 |
| HSK 6 | 2,678,621 | 63,650 | 42.08 |
| Total | 6,764,352 | 162,074 | 41.74 |
- Synthetic Instruction Samples: 16,462 prompt-completion pairs, generated for 591 distinct grammar points defined by the Chinese Proficiency Grading Standards, covering types including word, phrase, fixed format, sentence component, sentence type, and emphatic usage. Generation employed GPT-4.1-mini, DeepSeek-Chat-V3, and Gemini-2.5-Flash with in-context 2-shot prompting (a prompt-construction sketch follows this list). Manual verification (Fleiss’ κ = 0.91) ensured high validity (95%).
- Test Topics: 30 exam-style writing prompts, sampled from the HSK Dynamic Composition Corpus v2.0, spanning major genres (narrative, argumentative, descriptive, expository) and pragmatic functions. All test topics were held out from training data.
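A minimal sketch of how such 2-shot synthetic-sample generation could be set up is shown below. Only the model name (gpt-4.1-mini) and the 2-shot format come from the paper; the system prompt, exemplar pairs, and helper function are illustrative assumptions.

```python
# Hypothetical sketch: build a 2-shot chat prompt that asks a model to
# generate one prompt-completion pair targeting a single grammar point.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT = [  # two worked exemplars (illustrative contents, not the paper's)
    {"grammar_point": "把 (ba) construction",
     "prompt": "用“把”字句描述你整理房间的过程。",
     "completion": "我先把书放回书架，然后把衣服叠好放进柜子里。"},
    {"grammar_point": "虽然……但是……",
     "prompt": "用“虽然……但是……”写一句关于学习汉语的话。",
     "completion": "虽然汉语的声调很难，但是我每天都坚持练习。"},
]

def build_messages(grammar_point: str) -> list[dict]:
    """Assemble a 2-shot chat prompt for one target grammar point."""
    messages = [{"role": "system",
                 "content": "You create Chinese L2 writing exercises that "
                            "target one specified grammar point."}]
    for ex in FEW_SHOT:
        messages.append({"role": "user",
                         "content": f"Grammar point: {ex['grammar_point']}"})
        messages.append({"role": "assistant",
                         "content": f"Prompt: {ex['prompt']}\n"
                                    f"Completion: {ex['completion']}"})
    messages.append({"role": "user", "content": f"Grammar point: {grammar_point}"})
    return messages

resp = client.chat.completions.create(model="gpt-4.1-mini",
                                      messages=build_messages("越来越 + adjective"))
print(resp.choices[0].message.content)
```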
2. Curriculum-Tuning Methodology
Training in HSKBenchmark follows a staged curriculum closely mirroring human SLA. The methodology advances models through HSK Levels 3→4→5→6, simulating Krashen’s “i + 1” (comprehensible input) learning paradigm.
At each level $\ell \in \{3,4,5,6\}$, the process is:
- Causal LM Pretraining on the level-specific textbook set $\mathcal{D}_\ell^{\text{text}}$:
$$\mathcal{L}_\ell^{\text{LM}}(\theta) = -\sum_{x \in \mathcal{D}_\ell^{\text{text}}} \sum_{t} \log p_\theta(x_t \mid x_{<t})$$
- Instruction Tuning on the associated synthetic dataset $\mathcal{D}_\ell^{\text{inst}}$:
$$\mathcal{L}_\ell^{\text{inst}}(\theta) = -\sum_{(x,\,y) \in \mathcal{D}_\ell^{\text{inst}}} \log p_\theta(y \mid x)$$
Model parameters evolve as:
$$\theta_\ell = \arg\min_{\theta}\left[\mathcal{L}_\ell^{\text{LM}}(\theta) + \mathcal{L}_\ell^{\text{inst}}(\theta)\right], \qquad \theta \text{ initialized from } \theta_{\ell-1}$$
Optionally, aggregate objectives can employ curriculum loss weighting
$$\mathcal{L} = \sum_{\ell=3}^{6} \lambda_\ell\, \mathcal{L}_\ell,$$
where $\lambda_\ell$ captures stage importance.
Base models include LLaMA2-7B-Chat, Mistral-7B-Instruct, and Chinese-Alpaca-2-7B, fine-tuned via LoRA. Non-curriculum baselines include GPT-4.1-mini, DeepSeek-Chat-V3, and Gemini-2.5-Flash.
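The staged loop can be pictured with a short Hugging Face sketch, a minimal illustration rather than the paper’s training code: the dataset file paths, the `"text"` field, the LoRA targets, and all hyperparameters below are assumptions.

```python
# Sketch of curriculum tuning: per level, a causal-LM phase on textbook data,
# then an instruction phase, with parameters carried over between levels.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

base = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base),
    LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32,
               target_modules=["q_proj", "v_proj"]),
)
collator = DataCollatorForLanguageModeling(tok, mlm=False)

def phase(dataset_path: str, tag: str) -> None:
    """One fine-tuning phase; the model carries theta over from the prior phase."""
    ds = load_dataset("json", data_files=dataset_path)["train"]
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
                batched=True, remove_columns=ds.column_names)
    Trainer(model=model,
            args=TrainingArguments(output_dir=f"ckpt/{tag}", num_train_epochs=1,
                                   per_device_train_batch_size=4),
            train_dataset=ds, data_collator=collator).train()

for level in (3, 4, 5, 6):                          # strict HSK 3 -> 6 ordering
    phase(f"data/hsk{level}_textbook.json", f"hsk{level}_lm")      # causal-LM phase
    phase(f"data/hsk{level}_instructions.json", f"hsk{level}_it")  # instruction phase
    model.save_pretrained(f"ckpt/theta_hsk{level}")  # checkpoint theta_l per level
```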
3. Linguistically-Grounded Evaluation System
Evaluation in HSKBenchmark aims for multi-dimensional linguistic rigor using automated metrics and the HSKAgent:
- Grammar Coverage (Cov): Fraction of distinct grammar points used at each HSK level, computed by binary classification across all 591 points.
- Writing Errors: Quantified by the sum of character-level, lexical, syntactic, and discourse-level errors per output.
- Lexical Complexity: Includes type-token ratio (TTR), moving-average TTR (MATTR), and measure of textual lexical diversity (MTLD).
- Syntactic Complexity: Captures mean sentence length (MSL), mean dependency distance (MDD), subordination index (SI), and parse-tree depth (PTD); a toy implementation of MATTR and MDD appears after this list.
- Holistic Scoring: Mimics HSK criteria on a 0–100 scale, aggregating length, task fulfillment, cohesion, structural range, and accuracy. The automated rubric achieves quadratic weighted kappa (QWK) ≈ 0.80 against human gold-standard scores.
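For concreteness, here is a toy implementation of two of these metrics (MATTR and MDD), assuming pre-tokenized text and a dependency parse supplied as (head, dependent) index pairs; the tokenization and parsing backends are left unspecified.

```python
def ttr(tokens: list[str]) -> float:
    """Type-token ratio: distinct types over total tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-average TTR: mean TTR over fixed-size sliding windows."""
    if len(tokens) < window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

def mdd(arcs: list[tuple[int, int]]) -> float:
    """Mean dependency distance: average |head - dependent| over all arcs."""
    return sum(abs(h - d) for h, d in arcs) / len(arcs)

tokens = "我 把 书 放 回 书架 上".split()
arcs = [(3, 0), (3, 1), (1, 2), (3, 4), (4, 5), (5, 6)]  # toy 0-indexed parse
print(f"TTR={ttr(tokens):.3f}  MATTR={mattr(tokens, window=5):.3f}  MDD={mdd(arcs):.2f}")
```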
4. Automated Assessment: HSKAgent
HSKAgent is a fine-tuned evaluation agent based on Qwen3-8B, extended with LoRA for all core linguistic tasks:
- Grammar Detection: F1 = 0.97.
- Error Detection: Accuracy = 90%.
- Holistic Scoring: F1 = 0.81, QWK = 0.7969, Spearman = 0.8010, Pearson = 0.8023.
Datasets for HSKAgent include 16K positive/negative grammar instances and 10K human-annotated L2 essays with error tags and scores.
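The agreement statistics above can be reproduced in form with standard library calls; the paired score arrays below are toy values, not data from the paper.

```python
# Illustrative computation of QWK, Spearman, and Pearson agreement between
# agent scores and human gold scores (hypothetical values).
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr, pearsonr

human = [85, 70, 90, 60, 75, 80, 95, 65]   # hypothetical human gold scores
agent = [80, 70, 85, 65, 75, 85, 95, 60]   # hypothetical HSKAgent scores

# QWK expects discrete labels; bucket the 0-100 scores into 5-point bands.
bands_h = [s // 5 for s in human]
bands_a = [s // 5 for s in agent]

print("QWK:", cohen_kappa_score(bands_h, bands_a, weights="quadratic"))
print("Spearman:", spearmanr(human, agent)[0])
print("Pearson:", pearsonr(human, agent)[0])
```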
5. Experimental Results and Modeling Human-Like SLA Trajectories
Experimental results show that HSKBenchmark-trained LLMs reproduce human-like second language acquisition trajectories:
| Model / Level | Cov | Err | MATTR | MDD | Score |
|---|---|---|---|---|---|
| Native speakers | 0.341 | 1.40 | 0.806 | 2.977 | 88.33 |
| HSK learners (95*) | 0.356 | 2.87 | 0.817 | 2.839 | 85.00 |
| GPT-4.1-mini | 0.398 | 0.00 | 0.829 | 2.603 | 91.50 |
| LLaMA2 (base) | 0.484 | 0.90 | 0.686 | 2.425 | 70.00 |
| LLaMA2_HSK3 | 0.492↑ | 0.63↓ | 0.719↑ | 2.505↑ | 75.83↑ |
| LLaMA2_HSK6 (final) | 0.425 | 0.53↓ | 0.764↑ | 2.556↑ | 81.83↑ |
Key observations include:
- Consistent monotonic improvement in grammar coverage, lexical diversity, and syntactic complexity as the curriculum progresses.
- Decreased error rates following each instruction-tuning phase.
- Emergent usage and generalization behaviors at advanced levels.
- Ablation with shuffled data yields inferior late-stage (HSK 5–6) performance versus strict curriculum, empirically supporting the curriculum-based “i + 1” paradigm in LLMs.
- Curriculum-tuned LLMs maintain or slightly improve general Chinese and English benchmark scores, with no catastrophic forgetting.
6. Impact and Significance
HSKBenchmark establishes a systematic research platform for both dynamic modeling and assessment of Chinese SLA in LLMs. Its modular and phased design enables controlled evaluation of acquisition-phase sensitivity, emergent linguistic behaviors, and cross-task transfer in neural LLMs. The result is a benchmark suite that delivers both fine-grained, linguistically motivated metrics and reproducible assessment, facilitating future work in computational linguistics, interpretability, and educational technology for Chinese as a second language (Yang et al., 19 Nov 2025).