
HSKAgent: Chinese L2 Writing Evaluator

Updated 26 November 2025
  • HSKAgent is an automated evaluator for Chinese L2 writing that uses a Qwen3-8B transformer with LoRA adapters to perform grammar-item detection, error detection, and holistic scoring.
  • It leverages both synthetic instruction data and manually annotated learner essays to achieve high agreement with human ratings and strong performance across all evaluation tasks.
  • Its multi-task evaluation framework supports detailed SLA research by enabling dynamic assessment and tracking of language model progression under curriculum tuning.

HSKAgent is an automated evaluator designed for linguistically grounded assessment of Chinese second language (L2) writing, developed as part of the HSKBenchmark initiative for evaluating and modeling Chinese Second Language Acquisition (SLA) in LLMs through curriculum tuning. Built on the Qwen3-8B transformer architecture and fine-tuned via parameter-efficient LoRA adapters, HSKAgent performs multi-task evaluation on learner compositions, encompassing grammar-item detection, writing error detection, and holistic essay scoring. Its training leverages both synthetic instruction data covering comprehensive HSK grammar items and a large corpus of manually annotated compositions, enabling dynamic, reproducible, and high-fidelity measurement of Chinese L2 writing variables (Yang et al., 19 Nov 2025).

1. Model Architecture and Fine-Tuning

HSKAgent is instantiated from Qwen3-8B, an 8-billion-parameter transformer model with robust Chinese and multilingual capabilities, ranked highly in the SuperCLUE benchmark. Architecturally, Qwen3-8B remains unchanged; LoRA (Low-Rank Adaptation) adapters are applied for parameter-efficient fine-tuning across all phases.

The evaluation is decomposed into three primary tasks:

  • Grammar-item detection: Binary classification for presence/absence of target grammar items.
  • Error detection: Token-level classification distinguishing among five categories: character, lexical, syntactic, discourse errors, and correct tokens.
  • Holistic scoring: Regression-based assignment of human-like overall writing scores on a 0–100 scale.

Fine-tuning employs a multi-task loss function:

$$\mathcal{L}_{\mathrm{HSKAgent}} = \mathcal{L}_{\mathrm{gram}} + \mathcal{L}_{\mathrm{err}} + \mathcal{L}_{\mathrm{score}}$$

with each component defined as follows:

$$\mathcal{L}_{\mathrm{gram}} = -\frac{1}{N} \sum_{i=1}^N \bigl[y_i \log p_\theta(1 \mid x_i) + (1-y_i)\log p_\theta(0 \mid x_i)\bigr]$$

$$\mathcal{L}_{\mathrm{err}} = -\frac{1}{T} \sum_{t=1}^T \sum_{k=1}^K \mathbf{1}(y_t=k) \log p_\theta(k \mid \mathbf{x}, t)$$

$$\mathcal{L}_{\mathrm{score}} = \frac{1}{M} \sum_{j=1}^M (\hat{s}_j - s_j)^2$$

where $N$ denotes the number of grammar-detection samples, $T$ the total number of tokens for error detection, $K = 5$ the number of error categories, $M$ the number of scoring samples, and $\hat{s}_j$ and $s_j$ the predicted and reference scores, respectively.
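
This combined objective maps directly onto standard PyTorch loss functions. The following is a minimal sketch, assuming two-class logits for grammar detection, token-level logits over the five error categories, and a scalar regression output; the function and argument names are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def hskagent_loss(gram_logits, gram_labels,   # (N, 2) logits, (N,) 0/1 labels
                  err_logits, err_labels,     # (T, 5) logits, (T,) category ids
                  score_pred, score_ref):     # (M,) predicted and reference scores
    """Sum of the three task losses defined in Section 1 (illustrative shapes)."""
    # Binary cross-entropy over grammar-item presence/absence
    l_gram = F.cross_entropy(gram_logits, gram_labels)
    # Token-level cross-entropy over the five error categories
    l_err = F.cross_entropy(err_logits, err_labels)
    # Mean squared error for holistic 0-100 scoring
    l_score = F.mse_loss(score_pred, score_ref)
    return l_gram + l_err + l_score
```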

2. Training Data and Procedures

HSKAgent’s training process integrates both synthesized and authentic learner data.

Synthetic Instruction Data for Grammar Detection

  • Source: 591 official HSK grammar items (Levels 3–6 and advanced), spanning lexical, phrasal, sentential, and emphatic structures.
  • Generation: 16,462 samples produced via two-shot prompting with GPT-4.1-mini, DeepSeek-Chat-V3, and Gemini-2.5-Flash, at approximately 10 examples per grammar item.
  • Verification: Triply annotated, exhibiting Fleiss' $\kappa = 0.91$ and 95% validity for label correctness.
  • Label Construction: Balanced positive/negative sampling, with negative examples drawn from different grammar items (see the construction sketch after this list).
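
A hedged sketch of how such balanced positive/negative grammar-detection pairs could be assembled from the generated examples; the `examples_by_item` structure and field layout are assumptions for illustration, not the paper's released pipeline.

```python
import random

def build_grammar_detection_samples(examples_by_item, seed=0):
    """examples_by_item: dict mapping grammar-item id -> list of generated sentences.
    Returns (sentence, item, label) triples with a 1:1 positive/negative balance,
    where each negative pairs a sentence with a *different* grammar item.
    """
    rng = random.Random(seed)
    items = list(examples_by_item)
    samples = []
    for item, sentences in examples_by_item.items():
        for sent in sentences:
            samples.append((sent, item, 1))                      # positive: item is present
            other = rng.choice([i for i in items if i != item])  # negative: grammar-different item
            samples.append((sent, other, 0))
    rng.shuffle(samples)
    return samples
```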

Human Learner Compositions

  • Source: HSK Dynamic Composition Corpus v2.0, 10,000 manually labeled Chinese L2 essays (4 million characters) annotated for error spans and 0–100 holistic scores.
  • Split: Five performance strata (60*, 70*, 80*, 90*, 95*), with 30 essay topics held out for final evaluation.

No additional data augmentation is used for error or holistic scoring tasks beyond label conversion; for grammar detection, negative samples are chosen to maximize grammatical distinction.

3. Evaluation Metrics and Linguistic Assessments

HSKAgent supports computation of five linguistic metrics (a computation sketch for several of them follows the list):

  1. Coverage of Grammar Items ($\mathrm{Cov}_l$)

$$\mathrm{Cov}_l = \frac{|\{\text{grammar items at level } l\}|}{\sum_k |\{\text{all items}\}|}$$

Measures the proportion of HSK grammar points at level $l$ manifested in a text.

  2. Writing Error Rate ($\mathrm{Err}$)

$$\mathrm{Err} = \frac{\#\,\mathrm{errors}}{\#\,\mathrm{characters}} \times 100$$

Aggregates four error categories per 100 characters.

  3. Lexical Complexity (MATTR-50)

$$\mathrm{MATTR} = \frac{1}{N-w+1} \sum_{i=1}^{N-w+1} \frac{|\mathrm{Types}(i,\dots,i+w-1)|}{w}$$

Computes the moving-average type–token ratio with window $w = 50$.

  4. Syntactic Complexity (MDD)

$$\mathrm{MDD} = \frac{1}{L} \sum_{j=1}^L |j - \mathrm{head}(j)|$$

Captures mean dependency distance over sentence parses.

  5. Holistic Scoring

Predicted as a real-valued score in $[0, 100]$.

Estimated by the regression head of HSKAgent.
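
Several of these metrics follow directly from tokenized and dependency-parsed text. The sketch below is illustrative only; the tokenization, parsing, and root-handling conventions are assumptions, not the exact procedure used by HSKBenchmark.

```python
from typing import List, Set

def mattr(tokens: List[str], w: int = 50) -> float:
    """Moving-average type-token ratio with window size w (MATTR-50 by default)."""
    if not tokens:
        return 0.0
    if len(tokens) < w:
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + w])) / w for i in range(len(tokens) - w + 1)]
    return sum(ratios) / len(ratios)

def mdd(heads: List[int]) -> float:
    """Mean dependency distance; heads[j] is the 1-based head index of token j+1.
    A head of 0 marks the root, which is skipped (an assumed convention)."""
    dists = [abs((j + 1) - h) for j, h in enumerate(heads) if h != 0]
    return sum(dists) / len(dists)

def error_rate(num_errors: int, num_characters: int) -> float:
    """Writing errors per 100 characters, aggregated over the four error categories."""
    return num_errors / num_characters * 100

def grammar_coverage(items_found_at_level: Set[str], all_items: Set[str]) -> float:
    """Share of HSK grammar items manifested in the text at a given level."""
    return len(items_found_at_level) / len(all_items)
```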

4. Empirical Results and Human-Like Acquisition Characteristics

HSKAgent demonstrates high levels of concordance with human annotation and robust discriminative capacity across all evaluation components. Performance metrics are summarized as follows:

| Task | Metric | HSKAgent |
| --- | --- | --- |
| Grammar detection | F1-score | 0.97 |
| Error detection | Accuracy | 0.90 |
| Holistic scoring | F1-score | 0.81 |
| Scoring correlation | QWK / Spearman / Pearson | 0.797 / 0.801 / 0.802 |
  • Human agreement: Token-level error detection reaches 90% accuracy against human raters. Holistic essay scoring correlates strongly, with Quadratic Weighted Kappa = 0.7969, Spearman = 0.8010, and Pearson = 0.8023 (a computation sketch follows this list).
  • Acquisition curve analysis: While HSKAgent itself is a fixed (non-learning) evaluator, it enables precise tracking of model progression under curriculum tuning. For illustration, LLaMA2-7B-Chat’s holistic score increases from 70 (base) to 81.83 after the HSK levels 3–6 curriculum, closely matching human learner score distributions under the official rubric. Shuffled curricular ordering produces slower early progress and eventual plateaus, consistent with Krashen’s i+1 hypothesis from second language acquisition theory. This suggests that HSKAgent is not only an effective benchmark evaluator but also a tool for revealing interpretable acquisition dynamics in LLMs.
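
The reported agreement statistics correspond to standard implementations in scikit-learn and SciPy. A minimal sketch, assuming predicted and reference scores are rounded to integer bins for the quadratic weighted kappa:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import cohen_kappa_score

def scoring_agreement(predicted, reference):
    """Agreement between predicted and human holistic scores on the 0-100 scale."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # QWK operates on discrete labels; rounding to integer scores is one common choice.
    qwk = cohen_kappa_score(np.rint(predicted).astype(int),
                            np.rint(reference).astype(int),
                            weights="quadratic")
    rho, _ = spearmanr(predicted, reference)
    r, _ = pearsonr(predicted, reference)
    return {"qwk": qwk, "spearman": rho, "pearson": r}
```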

5. Implementation Details and Reproducibility

  • Framework: PyTorch 2.6.0; LoRA adapters (rank 4) for parameter-efficient fine-tuning; bf16 numerical precision.
  • Hardware: Three NVIDIA RTX 3090 GPUs (24 GB each).
  • Hyperparameters: Learning rate $5 \times 10^{-5}$; three epochs for all fine-tuning phases; batch size 32 for each task (see the configuration sketch after this list).
  • Runtime: Approximately 4 hours per evaluation task per GPU configuration.
  • Accessibility: All code, data, and model checkpoints (including HSKAgent_Qwen3-8B) are available at https://github.com/CharlesYang030/HSKB.
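
A hedged sketch of how the reported hyperparameters could map onto a standard peft/transformers setup; the LoRA alpha, target modules, and other unspecified settings are assumptions, not values from the paper.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Base model is left architecturally unchanged; LoRA adapters are attached on top.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)

lora = LoraConfig(
    r=4,                                  # rank 4, as reported
    lora_alpha=8,                         # assumption: alpha is not stated in the source
    target_modules=["q_proj", "v_proj"],  # assumption: typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)

args = TrainingArguments(
    output_dir="hskagent_lora",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=32,
    bf16=True,
)
```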

6. Significance and Applications in SLA Research

HSKAgent provides a reproducible, high-agreement automated evaluation suite for Chinese L2 writing assessment, with metrics and constructs paralleling human rubrics and error annotation standards. It enables detailed phase-wise modeling and benchmarking of LLMs for second language acquisition research, supports dynamic assessment during curriculum tuning, and standardizes experimental evaluation. Its integration into the HSKBenchmark suite facilitates empirical investigation of both interpretability and human-like learning dynamics in LLMs, laying a foundation for future advances in automated language education research at scale (Yang et al., 19 Nov 2025).
