SpeechHSK: Chinese Listening Comprehension Benchmark
- SpeechHSK is a hierarchical benchmark designed to evaluate Chinese listening comprehension using authentic HSK exam recordings and standardized audio options.
- It provides a multi-tier diagnostic framework across HSK levels 1–6, enabling precise accuracy comparisons and error analysis in audio models.
- Benchmark results reveal near-perfect performance by top models on lower levels and highlight the challenges that advanced tiers pose for multilingual systems.
SpeechHSK is a hierarchical benchmark for evaluating Chinese-language listening comprehension in audio foundation models. Developed within the UltraEval-Audio framework, SpeechHSK provides a rigorous, multi-level resource tailored to the diagnostic assessment of end-to-end Chinese listening capabilities—addressing the notable absence of high-fidelity, language-specific evaluation protocols in speech AI. Drawing on the Hanyu Shuiping Kaoshi (HSK), China’s official proficiency exam, SpeechHSK features authentic exam recordings and controlled, standardized option utterances, enabling precise accuracy-based comparison across six progressively difficult proficiency tiers (Shi et al., 4 Jan 2026).
1. Motivation and Scope
SpeechHSK was motivated by the disproportionate focus on English in existing speech and multimodal benchmarks such as SpeechTriviaQA, SpeechWebQuestions, and SpeechAlpacaEval. This English-centric bias has resulted in thin diagnostic coverage of audio models’ performance on Chinese listening-comprehension tasks. The benchmark leverages the multi-level organizational structure of the HSK system (Levels 1–6), directly incorporating hierarchically tiered exam recordings and options to mirror real-world language proficiency assessments.
Key objectives include:
- Providing a Chinese-exclusive evaluation resource to test models beyond English-centric benchmarks.
- Facilitating diagnostic error analysis through preservation of the official HSK level structure.
- Maximizing ecological validity via naturalistic stimulus selection: authentic exam recordings for questions and native Mandarin re-recordings for answer options.
2. Data Organization and Recording Protocol
SpeechHSK comprises 170 multiple-choice listening comprehension items. The number of samples per proficiency level increases with difficulty, in accordance with the HSK exam’s distribution: lower levels contain approximately 15–25 items each, while advanced levels reach 30–40. Each sample pairs one original exam audio question with four option utterances, re-recorded in studio conditions to strict quality specifications.
Key dataset features:
- Utterance types: Each item consists of “question” audio and four “option” audios.
- Speaker demographics: Options are voiced by eight native Putonghua speakers, balanced for gender, with no regional dialects.
- Recording environment: acoustically isolated space with ambient noise below 30 dB(A); Neumann TLM 103 microphones (44.1 kHz/16-bit), Universal Audio LA-610 pre-amplification.
- Quality control: Two-stage review: (1) an automatic transcription check requiring an ASR character error rate (CER) of 0% against the script, and (2) a manual listening pass checking prosody, naturalness, and recording artifacts (a minimal CER gate is sketched after the table below).
| Level Band | Items per Level (Approx.) | Audio Sources |
|---|---|---|
| HSK 1-2 | 15–25 | Official exam + native-speaker re-recordings |
| HSK 3-4 | 25–30 | Official exam + native-speaker re-recordings |
| HSK 5-6 | 30–40 | Official exam + native-speaker re-recordings |
Item counts are distributed to match official HSK specifications; audio for options is strictly standard Putonghua.
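As an illustration of the first quality-control stage, the sketch below implements a zero-CER acceptance gate using the jiwer library; the `transcribe` callable and file paths are hypothetical stand-ins, since the source does not specify which ASR system performed the check.

```python
# Minimal sketch of the automatic transcription gate (stage 1 of QC).
# Assumes an ASR callable `transcribe(path)` returning a hanzi string;
# jiwer provides the character error rate computation.
import jiwer

def passes_transcription_check(audio_path: str, reference_script: str,
                               transcribe) -> bool:
    """Return True only if the ASR transcript matches the official script
    exactly (CER == 0), mirroring the benchmark's acceptance rule."""
    hypothesis = transcribe(audio_path)  # any Mandarin ASR model
    return jiwer.cer(reference_script, hypothesis) == 0.0

# Usage (hypothetical path and script):
# ok = passes_transcription_check("opt_A.wav", "他明天去北京。", my_asr)
```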
3. Annotation Schema and Labeling Protocols
Each sample in SpeechHSK is paired with comprehensive metadata:
- Transcriptions: Official HSK script for question and options, included for reference and downstream analysis.
- Proficiency labels: Item-specific HSK level identifier (1–6), enabling stratified error analysis.
- Ground-truth answers: Derived directly from the official HSK answer key; all items feature deterministic correct choices.
- Fluency ratings: Not required due to professional recording controls and direct script usage.
- Inter-annotator agreement: Not applicable; accuracy is referenced to the exam’s deterministic answer key.
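To make the schema concrete, the following is a hypothetical metadata record in the shape these fields imply; the key names and file layout are illustrative assumptions, not the released format.

```python
# Hypothetical SpeechHSK metadata record; field names are illustrative only.
sample = {
    "item_id": "hsk5_listening_012",
    "hsk_level": 5,                       # proficiency label, 1-6
    "question_audio": "hsk5/012_q.wav",   # original exam recording
    "option_audio": {                     # studio re-recordings, one per choice
        "A": "hsk5/012_a.wav",
        "B": "hsk5/012_b.wav",
        "C": "hsk5/012_c.wav",
        "D": "hsk5/012_d.wav",
    },
    "transcripts": {                      # official HSK scripts, for reference
        "question": "……",
        "options": {"A": "……", "B": "……", "C": "……", "D": "……"},
    },
    "answer": "B",                        # deterministic key from the exam
}
```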
4. Benchmark Task Structure and Evaluation Metric
The benchmark task is structured as multiple-choice listening comprehension:
- Input: Audio of the question stem plus audio for each of the four re-recorded answer options.
- Output: Model selects one option (A/B/C/D).
Performance is quantified via unweighted accuracy:

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right],$$

where $N = 170$ is the total number of items, $\hat{y}_i$ is the model's selected option for item $i$, and $y_i$ is the ground-truth answer. There is no composite scoring or differential weighting; all items are evaluated on equal terms.
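As a minimal sketch, the metric can be computed as follows, assuming predictions and gold answers are stored as parallel lists of option letters:

```python
# Unweighted accuracy over all 170 items, exactly as defined above.
def accuracy(predictions: list[str], gold: list[str]) -> float:
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# e.g. accuracy(["A", "C", "B"], ["A", "C", "D"]) -> 0.666...
```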
5. Model Performance and Baseline Results
SpeechHSK provides baseline comparisons for six state-of-the-art audio foundation models, all evaluated with the identical prompt “Listen and choose A/B/C/D.” Spoken choices are post-processed to text using Paraformer-zh.
| Model Name | SpeechHSK Accuracy (%) |
|---|---|
| GPT-4o-Realtime | 98.69 |
| Kimi-Audio-7B-Instruct | 97.42 |
| Qwen2.5-Omni | 95.65 |
| MiniCPM-o 2.6 | 80.68 |
| GLM-4-Voice | 71.06 |
| Qwen3-Omni-30B-A3B-Instruct | 40.27 |
Top performers (GPT-4o-Realtime, Kimi-Audio-7B-Instruct, and Qwen2.5-Omni) all score above 95%, indicating high efficacy in end-to-end Chinese listening comprehension. In contrast, certain large “multilingual” models underperform substantially (e.g., Qwen3-Omni-30B-A3B-Instruct at 40.27%), suggesting inadequate Chinese-specific pretraining or an architecture ill-suited to this language.
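A rough reconstruction of the answer-extraction step is sketched below: the model's spoken reply is transcribed with Paraformer-zh (assumed here to be loaded via FunASR's `AutoModel` interface) and the choice letter is pulled out with a simple pattern match. `query_model` and the regex are hypothetical; the source specifies only the prompt and the use of Paraformer-zh.

```python
# Sketch of post-processing spoken choices, assuming FunASR's packaging of
# Paraformer-zh; `query_model` is a hypothetical stand-in for each audio
# foundation model's API, returning a path to the reply waveform.
import re
from funasr import AutoModel

asr = AutoModel(model="paraformer-zh")  # Mandarin ASR for post-processing

PROMPT = "Listen and choose A/B/C/D."

def extract_choice(reply_wav: str) -> str | None:
    """Transcribe a spoken reply and return the first A/B/C/D token, if any."""
    text = asr.generate(input=reply_wav)[0]["text"]  # FunASR result format
    match = re.search(r"[ABCD]", text.upper())
    return match.group(0) if match else None

# for item in speechhsk:
#     reply = query_model(PROMPT, item["question_audio"], item["option_audio"])
#     predictions.append(extract_choice(reply))
```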
A notable task property is the diagnostic stratification enabled by multi-level proficiency: most models’ error rates increase markedly at HSK Levels 5–6, facilitating granular difficulty analysis.
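Since every item carries its HSK level label, per-level accuracy follows directly from the flat results; a minimal sketch, assuming results are stored as `(level, was_correct)` pairs:

```python
# Per-level accuracy for difficulty stratification across HSK 1-6.
from collections import defaultdict

def accuracy_by_level(results: list[tuple[int, bool]]) -> dict[int, float]:
    """results: (hsk_level, was_correct) pairs, one per evaluated item."""
    totals, hits = defaultdict(int), defaultdict(int)
    for level, correct in results:
        totals[level] += 1
        hits[level] += int(correct)
    return {lvl: hits[lvl] / totals[lvl] for lvl in sorted(totals)}
```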
6. Significance and Recommendations
SpeechHSK delivers a compact, multi-level framework for reliable and fine-grained evaluation of Chinese listening comprehension in AI models. Its key advantages include authentic stimuli, deterministic ground-truth, comprehensive baseline coverage, and a clear, reproducible protocol.
Recommendations for future development include:
- Expansion of sample quantity per proficiency level to improve measurement stability.
- Provision of detailed speaker metadata (age, gender) to facilitate in-depth analysis of speaker-model mismatches.
- Extension to open-ended and spoken-answer formats for wider coverage of spoken language abilities.
SpeechHSK supports transparent, efficient, and fair benchmarking for researchers seeking to advance Chinese language proficiency in audio foundation models. Its direct public availability—including audio, transcripts, and leaderboards—facilitates immediate integration into model development and evaluation workflows (Shi et al., 4 Jan 2026).