HSKAgent: Chinese L2 Writing Evaluator
- HSKAgent is an automated evaluator for Chinese L2 writing that uses a Qwen3-8B transformer with LoRA adapters to perform grammar-item detection, error detection, and holistic scoring.
- It leverages both synthetic instruction data and manually annotated learner essays to achieve high agreement with human ratings and robust performance metrics.
- Its multi-task evaluation framework supports detailed SLA research by enabling dynamic assessment and tracking of language model progression under curriculum tuning.
HSKAgent is an automated evaluator designed for linguistically grounded assessment of Chinese second language (L2) writing, developed as part of the HSKBenchmark initiative for evaluating and modeling Chinese Second Language Acquisition (SLA) in LLMs through curriculum tuning. Built on the Qwen3-8B transformer architecture and fine-tuned via parameter-efficient LoRA adapters, HSKAgent performs multi-task evaluation on learner compositions, encompassing grammar-item detection, writing error detection, and holistic essay scoring. Its training leverages both synthetic instruction data covering comprehensive HSK grammar items and a large corpus of manually annotated compositions, enabling dynamic, reproducible, and high-fidelity measurement of Chinese L2 writing variables (Yang et al., 19 Nov 2025).
1. Model Architecture and Fine-Tuning
HSKAgent is instantiated from Qwen3-8B, an 8-billion-parameter transformer model with robust Chinese and multilingual capabilities, ranked highly in the SuperCLUE benchmark. Architecturally, Qwen3-8B remains unchanged; LoRA (Low-Rank Adaptation) adapters are applied for parameter-efficient fine-tuning across all phases.
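A minimal sketch of this setup using the Hugging Face `transformers` and `peft` libraries is shown below; the rank-4 configuration is taken from the implementation details in Section 5, while the target module names, scaling factor, and dropout are assumptions rather than the released training code.

```python
# Sketch: attaching LoRA adapters to Qwen3-8B for parameter-efficient fine-tuning.
# Assumes Hugging Face transformers + peft; adapter hyperparameters other than the
# rank are illustrative, not the released configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

BASE_MODEL = "Qwen/Qwen3-8B"  # base checkpoint; the architecture itself is left unchanged

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# Rank-4 LoRA adapters (Section 5); target modules are an assumption for Qwen-style attention blocks.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=4,
    lora_alpha=16,       # placeholder scaling factor, not reported in the source
    lora_dropout=0.05,   # placeholder
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are updated
```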
The evaluation is decomposed into three primary tasks:
- Grammar-item detection: Binary classification for presence/absence of target grammar items.
- Error detection: Token-level classification distinguishing among five categories: character, lexical, syntactic, discourse errors, and correct tokens.
- Holistic scoring: Regression-based assignment of human-like overall writing scores on a 0–100 scale.
Fine-tuning employs a multi-task loss function:

$$\mathcal{L} = \mathcal{L}_{\text{gram}} + \mathcal{L}_{\text{err}} + \mathcal{L}_{\text{score}},$$

with each component defined as follows:

$$\mathcal{L}_{\text{gram}} = -\frac{1}{N_g}\sum_{i=1}^{N_g}\bigl[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\bigr], \qquad
\mathcal{L}_{\text{err}} = -\frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C} y_{t,c}\,\log \hat{y}_{t,c}, \qquad
\mathcal{L}_{\text{score}} = \frac{1}{N_s}\sum_{i=1}^{N_s}\bigl(\hat{s}_i - s_i\bigr)^2,$$

where $N_g$ denotes the number of grammar detection samples, $T$ the total tokens for error detection, $C$ the number of error categories, $N_s$ the number of scoring samples, and $\hat{s}_i$, $s_i$ the predicted and reference scores.
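A compact PyTorch rendering of these three components is sketched below; it assumes the grammar head emits one logit per sample, the error head per-token logits over the five categories, and the scoring head a scalar regression output, which matches the task definitions above but is not the released training code.

```python
# Sketch: multi-task loss combining grammar detection (BCE), error detection (CE),
# and holistic scoring (MSE), mirroring the equations above. Illustrative only.
import torch
import torch.nn.functional as F

def multitask_loss(grammar_logits, grammar_labels,
                   error_logits, error_labels,
                   score_preds, score_targets):
    # Grammar-item detection: binary cross-entropy over N_g samples.
    l_gram = F.binary_cross_entropy_with_logits(grammar_logits, grammar_labels.float())

    # Error detection: cross-entropy over T tokens and C = 5 categories
    # (character, lexical, syntactic, discourse, correct).
    l_err = F.cross_entropy(error_logits.view(-1, error_logits.size(-1)),
                            error_labels.view(-1))

    # Holistic scoring: mean squared error between predicted and reference scores.
    l_score = F.mse_loss(score_preds, score_targets)

    return l_gram + l_err + l_score
```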
2. Training Data and Procedures
HSKAgent’s training process integrates both synthesized and authentic learner data.
Synthetic Instruction Data for Grammar Detection
- Source: 591 official HSK grammar items (Levels 3–6 and advanced), spanning lexical, phrasal, sentential, and emphatic structures.
- Generation: 16,462 samples produced via two-shot prompting with GPT-4.1-mini, DeepSeek-Chat-V3, and Gemini-2.5-Flash, at approximately 10 examples per grammar item.
- Verification: Triply annotated, with inter-annotator agreement measured by Fleiss' κ and 95% validity for label correctness.
- Label Construction: Balanced positive/negative sampling, where negative examples are drawn from sentences generated for different grammar items (see the sketch below).
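The balanced label construction referenced above can be sketched as follows; the sample fields and pairing strategy are hypothetical, since the paper's exact instruction schema is not reproduced here.

```python
# Sketch: building balanced positive/negative grammar-detection instructions.
# Positives pair a grammar item with a sentence that uses it; negatives pair it
# with a sentence generated for a *different* grammar item. Field names are hypothetical.
import random

def build_detection_samples(item_to_sentences, seed=0):
    """item_to_sentences: dict mapping a grammar item to sentences that instantiate it."""
    rng = random.Random(seed)
    items = list(item_to_sentences)
    samples = []
    for item, sentences in item_to_sentences.items():
        for sent in sentences:
            # Positive: the target grammar item is present.
            samples.append({"item": item, "text": sent, "label": 1})
            # Negative: a sentence drawn from a different grammar item.
            other = rng.choice([i for i in items if i != item])
            samples.append({"item": item,
                            "text": rng.choice(item_to_sentences[other]),
                            "label": 0})
    rng.shuffle(samples)
    return samples
```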
Human Learner Compositions
- Source: HSK Dynamic Composition Corpus v2.0, 10,000 manually labeled Chinese L2 essays (4 million characters) annotated for error spans and 0–100 holistic scores.
- Split: Five performance strata (60*, 70*, 80*, 90*, 95*), with 30 essay topics held out for final evaluation.
No additional data augmentation is used for error or holistic scoring tasks beyond label conversion; for grammar detection, negative samples are chosen to maximize grammatical distinction.
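The label conversion for error detection can be sketched as below: annotated error spans are mapped onto per-character labels, with unannotated characters treated as correct. The five-way label set follows the task definition above; the span format is an assumption.

```python
# Sketch: converting annotated error spans into per-character labels for
# token-level error detection. The span format (start, end, category) is hypothetical.
LABELS = ["correct", "character", "lexical", "syntactic", "discourse"]

def spans_to_labels(text, error_spans):
    """error_spans: iterable of (start, end, category) over character offsets."""
    labels = ["correct"] * len(text)
    for start, end, category in error_spans:
        for i in range(start, min(end, len(text))):
            labels[i] = category
    return [LABELS.index(l) for l in labels]

# Example: a lexical error covering characters 2-3 of a short essay fragment.
print(spans_to_labels("我喜欢学汉语", [(2, 4, "lexical")]))  # -> [0, 0, 2, 2, 0, 0]
```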
3. Evaluation Metrics and Linguistic Assessments
HSKAgent supports multi-faceted computation across five linguistic metrics:
- Coverage of Grammar Items: the proportion of HSK grammar points at a given level that are manifested in a text.
- Writing Error Rate: the aggregate count of the four error categories per 100 characters.
- Lexical Complexity (MATTR-50): the moving-average type–token ratio computed with a 50-token window (see the sketch after this list).
- Syntactic Complexity (MDD): the mean dependency distance over sentence parses.
- Holistic Scoring: estimated by the regression head of HSKAgent.
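A minimal sketch of the two complexity metrics follows: MATTR with a 50-token window and mean dependency distance from a parsed sentence. The tokenization and the parse representation (a list of head indices) are assumptions; the paper's toolchain is not specified here.

```python
# Sketch: lexical complexity (MATTR-50) and syntactic complexity (MDD).
# Tokenization and dependency-parse representation are assumptions.

def mattr(tokens, window=50):
    """Moving-average type-token ratio over fixed-size sliding windows."""
    if len(tokens) < window:
        return len(set(tokens)) / max(len(tokens), 1)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

def mean_dependency_distance(heads):
    """heads[i] is the 1-based index of token i+1's head; 0 marks the root."""
    distances = [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
    return sum(distances) / len(distances) if distances else 0.0

# Example with a hypothetical 4-token sentence whose parse heads are [2, 0, 4, 2].
print(mean_dependency_distance([2, 0, 4, 2]))  # -> 1.33...
```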
4. Empirical Results and Human-Like Acquisition Characteristics
HSKAgent demonstrates high levels of concordance with human annotation and robust discriminative capacity across all evaluation components. Performance metrics are summarized as follows:
| Task | Metric | HSKAgent |
|---|---|---|
| Grammar Detection | F1-score | 0.97 |
| Error Detection | Accuracy | 0.90 |
| Holistic Scoring | F1-score | 0.81 |
| Scoring Correlation | QWK / Spearman / Pearson | 0.797 / 0.801 / 0.802 |
- Human agreement: Token-level error detection reaches 90% accuracy against human annotations. Essay holistic scoring correlates strongly with human ratings: Quadratic Weighted Kappa = 0.7969, Spearman = 0.8010, Pearson = 0.8023 (an agreement-computation sketch follows this list).
- Acquisition curve analysis: While HSKAgent is itself a fixed (non-learning) evaluator, it enables precise tracking of model progression under curriculum tuning. For illustration, LLaMA2-7B-Chat's holistic score increases from 70 (base) to 81.83 after curriculum tuning on HSK levels 3–6, closely matching human learner score distributions under the official rubric. Shuffled curricular ordering produces slower early progress and an eventual plateau, consistent with Krashen's i+1 input hypothesis from second language acquisition theory. This suggests HSKAgent is not only an effective benchmark evaluator but also reveals interpretable acquisition dynamics in LLMs.
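The agreement statistics above can be reproduced from paired human and model scores with standard libraries; the sketch below assumes scikit-learn and SciPy and rounds scores to integer bins for the quadratic weighted kappa, a common but not source-confirmed choice.

```python
# Sketch: computing QWK, Spearman, and Pearson agreement between human and model scores.
# Integer binning for QWK is an assumption, not a detail confirmed by the source.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr, pearsonr

def scoring_agreement(human_scores, model_scores):
    human = np.asarray(human_scores, dtype=float)
    model = np.asarray(model_scores, dtype=float)
    qwk = cohen_kappa_score(np.rint(human).astype(int),
                            np.rint(model).astype(int),
                            weights="quadratic")
    rho, _ = spearmanr(human, model)
    r, _ = pearsonr(human, model)
    return {"QWK": qwk, "Spearman": rho, "Pearson": r}
```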
5. Implementation Details and Reproducibility
- Framework: PyTorch 2.6.0; LoRA adapters (rank 4) for parameter-efficient fine-tuning; bf16 numerical precision.
- Hardware: Three NVIDIA RTX 3090 GPUs (24 GB each).
- Hyperparameters: Learning rate ; three epochs for all fine-tuning phases; batch size 32 for each task.
- Runtime: Approximately 4 hours per evaluation task per GPU configuration.
- Accessibility: All code, data, and model checkpoints (including HSKAgent_Qwen3-8B) are available at https://github.com/CharlesYang030/HSKB.
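A hedged sketch of how the reported setup might map onto a Hugging Face `TrainingArguments` configuration is given below; the learning rate is a placeholder (the exact value is not reproduced here), and the trainer wiring is an assumption rather than the released scripts.

```python
# Sketch: mapping the reported training setup (bf16, 3 epochs, batch size 32,
# rank-4 LoRA) onto Hugging Face TrainingArguments. The learning rate is a
# placeholder and the output directory is hypothetical.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hskagent_qwen3_8b_lora",   # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=1e-4,                     # placeholder; exact value not reproduced here
    bf16=True,
    logging_steps=50,
    save_strategy="epoch",
)
```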
6. Significance and Applications in SLA Research
HSKAgent provides a reproducible, high-agreement automated evaluation suite for Chinese L2 writing assessment, with metrics and constructs paralleling human rubrics and error annotation standards. It enables detailed phase-wise modeling and benchmarking of LLMs for second language acquisition research, supports dynamic assessment during curriculum tuning, and standardizes experimental evaluation. Its integration into the HSKBenchmark suite facilitates empirical investigation of both interpretability and human-like learning dynamics in LLMs, laying a foundation for future advances in automated language education research at scale (Yang et al., 19 Nov 2025).