SpeechCMMLU: Chinese Audio QA Benchmark
- The paper introduces SpeechCMMLU, an evaluation benchmark for Chinese spoken question answering that leverages TTS-based synthesis and zero-error ASR filtering for high-quality audio prompts.
- It converts text-based CMMLU questions into compound spoken prompts containing both the question stem and the answer options, enabling rigorous assessment of audio foundation models across diverse subject areas.
- Baseline results reveal varied model performance, highlighting the benchmark's potential to drive advances in Chinese audio processing and reasoning.
SpeechCMMLU is an evaluation benchmark introduced to fill the critical gap in large-scale, high-level Chinese-language spoken question answering resources, paralleling the established CMMLU (Chinese Massive Multi-task Language Understanding) benchmark in the speech modality. Developed as part of the UltraEval-Audio suite, SpeechCMMLU facilitates rigorous assessment of audio foundation models' Chinese world-knowledge comprehension and reasoning on spoken multiple-choice questions, enabling reproducible and efficient cross-model comparisons in audio-based QA for the Chinese language (Shi et al., 4 Jan 2026).
1. Motivation and Benchmark Design
SpeechCMMLU was developed to address three primary deficits in the landscape of audio model evaluation. First, existing QA-style speech benchmarks, such as SpeechTriviaQA, SpeechWebQuestions, and SpeechAlpacaEval, are exclusively English-based, precluding comprehensive evaluation of Chinese language understanding in audio models. Second, no established large-scale benchmark existed for high-level spoken QA tasks in Chinese. Third, the reproducibility and quality control of audio-based benchmarks required a robust, standardized construction methodology.
The benchmark distinguishes itself in several ways:
- Input Modality: Each input is provided as a raw speech audio prompt rather than as text.
- Language: All questions and options are in Chinese.
- Task Format: Each item is a multiple-choice question with four answer options (A–D), all rendered in a single audio prompt. Models must generate their answer as either a spoken letter or its textual representation.
- Reproducibility: All samples are synthesized using TTS and undergo strict automated validation to minimize errors and variability.
Compared to the original text-based CMMLU, SpeechCMMLU evaluates the full auditory processing pipeline, mapping from speech comprehension to knowledge reasoning and spoken/textual answer generation.
2. Dataset Composition and Coverage
SpeechCMMLU comprises 3,519 four-option, multiple-choice spoken QA items. These were selected from the original 11,583 CMMLU questions following a two-stage process: TTS-based audio construction and stringent quality filtering.
- Domain Coverage: The benchmark covers the same diverse range of academic and professional subjects as CMMLU, including mathematics, computer science, history, law, medicine, chemistry, physics, art, philosophy, economics, politics, geography, among others. The precise number of categories and their individual distributions are not specified.
- Sample Distribution: All retained samples satisfy a zero Character Error Rate (CER) filter on ASR transcription; no further label balancing or post hoc stratification is applied.
- Data Split: SpeechCMMLU is designed solely as an evaluation set; no separate training or development splits are provided.
- Item Structure: Each sample consists of a compound audio prompt embedding both the question stem and the four answer options, synthesized as a contiguous waveform (see the sketch below).
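To make the item format concrete, the following minimal Python sketch models one SpeechCMMLU sample as described in this section. The field names are illustrative only and do not reflect the actual UltraEval-Audio data schema.

```python
# Minimal sketch of one SpeechCMMLU item as described above.
# Field names are illustrative; they are not the UltraEval-Audio schema.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SpeechCMMLUItem:
    audio_path: str                          # compound TTS prompt: question stem + options A-D
    question_text: str                       # source CMMLU question text (also used for the CER check)
    options: Dict[str, str] = field(default_factory=dict)  # {"A": ..., "B": ..., "C": ..., "D": ...}
    answer: str = "A"                        # gold label inherited from CMMLU, one of "A"-"D"
    subject: str = ""                        # CMMLU subject category, e.g. "chemistry"


# Example usage (hypothetical file path and subject label):
item = SpeechCMMLUItem(
    audio_path="speechcmmlu/chemistry_0001.wav",
    question_text="哪一种元素的原子序数最大？",
    options={"A": "碳", "B": "氧", "C": "氮", "D": "氖"},
    answer="D",
    subject="chemistry",
)
```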
3. Audio Generation and Preprocessing Pipeline
Audio prompts are constructed through the following workflow:
- Each question-plus-options prompt is synthesized using CosyVoice2, a high-fidelity Chinese TTS engine, ensuring proper pronunciation and fidelity for specialized terminology.
- The resulting waveform is transcribed with a robust Mandarin ASR model (Paraformer-zh).
- Only those samples with a Character Error Rate of 0%, where the ASR transcript exactly matches the intended text, are retained. Any samples with transcription discrepancies are discarded.
This strict pipeline ensures that the final corpus contains only audio prompts with no TTS or pronunciation anomalies. While the standard audio parameters (e.g., 16 kHz WAV) are presumed, explicit details on format and duration are not documented.
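The Python sketch below illustrates this construct-then-filter loop. The CosyVoice2 and Paraformer-zh interfaces are abstracted behind placeholder functions, since the source does not document their exact APIs, and jiwer is used here only as one plausible way to compute CER.

```python
# Sketch of the construction pipeline described above: synthesize each
# question-plus-options prompt, transcribe it back with ASR, and keep only
# samples whose transcript matches the source text exactly (CER == 0).
from jiwer import cer  # one possible CER implementation (assumption)


def synthesize_with_cosyvoice2(text: str, wav_path: str) -> None:
    """Placeholder: render `text` to a waveform at `wav_path` via CosyVoice2."""
    raise NotImplementedError


def transcribe_with_paraformer_zh(wav_path: str) -> str:
    """Placeholder: return the Paraformer-zh transcript of `wav_path`."""
    raise NotImplementedError


def build_speechcmmlu(cmmlu_items):
    """Return only the items whose synthesized audio round-trips with CER == 0."""
    retained = []
    for idx, item in enumerate(cmmlu_items):
        # Compound prompt: question stem followed by the four options.
        prompt_text = item["question"] + "，".join(
            f"{label}. {opt}" for label, opt in sorted(item["options"].items())
        )
        wav_path = f"speechcmmlu_{idx:05d}.wav"
        synthesize_with_cosyvoice2(prompt_text, wav_path)
        transcript = transcribe_with_paraformer_zh(wav_path)
        # CER of 0 means an exact match. In practice both strings would be
        # normalized (e.g., punctuation stripped) before comparison; that
        # detail is not specified in the source.
        if cer(prompt_text, transcript) == 0.0:
            retained.append({**item, "audio_path": wav_path})
    return retained
```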
4. Annotation, Ground Truth, and Quality Assurance
Ground truth correctness for each item is inherited directly from the CMMLU text annotations, with each sample specifying one correct answer (A, B, C, or D).
- Annotation Source: Correct answers are not re-annotated or revised for the speech version.
- Quality Control: The CER=0% criterion acts as the sole quality filter, ensuring that all audio prompts exactly correspond to the intended question and answer options. No further checks (such as human annotation consistency or perceptual validation) are specified.
- Assumption: The benchmark assumes the gold status of the original CMMLU annotations and relies on automated processes for audio validation.
5. Evaluation Criteria and Scoring
The evaluation metric for SpeechCMMLU is classification accuracy, defined as the proportion of items for which the model's predicted option matches the ground-truth label:
$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}$$

where $N_{\text{correct}}$ is the number of correct predictions and $N_{\text{total}}$ is the total benchmark size (3,519 items). No composite, weighted, or per-domain measures are applied in this benchmark.
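A direct implementation of this metric is straightforward. The short Python sketch below assumes each model prediction has already been reduced to a single option letter (see the answer-extraction sketch in Section 7).

```python
# Accuracy as defined above: the fraction of items whose predicted option
# letter matches the gold label.
def speechcmmlu_accuracy(predictions, gold_labels):
    assert len(predictions) == len(gold_labels), "one prediction per item"
    n_correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return n_correct / len(gold_labels)


# e.g. speechcmmlu_accuracy(["B", "D", "A"], ["B", "C", "A"]) -> 0.666...
```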
6. Baseline Results and Cross-Model Comparisons
Table 6 of (Shi et al., 4 Jan 2026) reports SpeechCMMLU performance across six representative audio foundation models, reflecting varying approaches and degrees of openness. The following results are provided in terms of raw accuracy:
| Model | SpeechCMMLU Acc. (↑) |
|---|---|
| GPT-4o-Realtime | 70.05% |
| Qwen3-Omni-30B-A3B-Instruct | 47.83% |
| Qwen2.5-Omni | 73.72% |
| MiniCPM-o 2.6 | 51.37% |
| Kimi-Audio-7B-Instruct | 71.25% |
| GLM-4-Voice | 52.61% |
The highest-performing models (Qwen2.5-Omni, Kimi-Audio-7B-Instruct, GPT-4o-Realtime) achieve accuracies in the 70–74% range, while the remaining models show more modest outcomes (roughly 48–53%). The spread in accuracy indicates continued headroom for improvement in Chinese spoken QA. The authors observe that open-source models have narrowed the gap with proprietary systems for Chinese world-knowledge reasoning, though accuracy remains below human-level performance.
7. Task Design and Sample Illustration
A typical SpeechCMMLU sample consists of a single audio file with both the question stem and four options read aloud in sequence. The model receives this audio prompt and produces its answer, the letter corresponding to the chosen option, as either speech or text.
Adapted examples (not verbatim from the benchmark, but constructed following its principles) include:
- History Example:
Audio Prompt: "选择下列各项中，第一次鸦片战争爆发的时间最接近哪一年？A. 1836年，B. 1839年，C. 1842年，D. 1856年。" (English gloss: "Which of the following years is closest to the outbreak of the First Opium War? A. 1836, B. 1839, C. 1842, D. 1856.") Correct answer: B
- Chemistry Example:
Audio Prompt: "已知下列元素的原子序数：碳为6，氧为8，氮为7，氖为10。哪一种元素的原子序数最大？A. 碳，B. 氧，C. 氮，D. 氖。" (English gloss: "Given the atomic numbers carbon 6, oxygen 8, nitrogen 7, and neon 10, which element has the largest atomic number? A. Carbon, B. Oxygen, C. Nitrogen, D. Neon.") Correct answer: D
During evaluation, the model's response is compared to the annotated answer, and aggregate accuracy is reported.
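Because models may answer with a spoken letter or a longer textual phrase, evaluation requires mapping the raw response (or an ASR transcript of a spoken reply) to an option letter before comparison. The source does not specify UltraEval-Audio's exact matching rule; the regex-based heuristic below is an illustrative assumption only.

```python
import re

# Illustrative heuristic for mapping a free-form model reply to an option
# letter. This is NOT the matching rule documented for UltraEval-Audio.
_LETTER_PATTERN = re.compile(r"\b([ABCD])\b")


def extract_choice(response: str) -> str | None:
    """Return the first standalone option letter found in `response`, if any."""
    match = _LETTER_PATTERN.search(response.upper())
    return match.group(1) if match else None


# e.g. extract_choice("答案是 B，即1839年。") -> "B"
```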
8. Relevance, Significance, and Availability
SpeechCMMLU establishes a reproducible and comprehensive reference for benchmarking Chinese spoken knowledge QA and reasoning. Its rigorous TTS and ASR-based filtering pipeline ensures the absence of spurious speech errors, enabling transparent, fair, and efficient evaluation of audio models' high-level language understanding in Chinese. SpeechCMMLU is integrated into the UltraEval-Audio benchmarking framework, offering one-command evaluation, public leaderboards, and code access at https://github.com/OpenBMB/UltraEval-Audio (Shi et al., 4 Jan 2026). Its release is intended to facilitate sustained progress in the evaluation and development of both open- and closed-source audio foundation models in Chinese.