
SpeechCMMLU: Chinese Audio QA Benchmark

Updated 7 January 2026
  • The paper introduces SpeechCMMLU, an evaluation benchmark for Chinese spoken question answering that leverages TTS-based synthesis and zero-error ASR filtering for high-quality audio prompts.
  • It converts text-based CMMLU questions into single spoken prompts containing both the question stem and its answer options, enabling rigorous assessment of audio foundation models across diverse subject areas.
  • Baseline results reveal varied model performance, highlighting the benchmark’s potential to drive advances in Chinese audio processing and reasoning.

SpeechCMMLU is an evaluation benchmark introduced to fill the critical gap in large-scale, high-level Chinese-language spoken question answering resources, paralleling the established CMMLU (Chinese Massive Multi-task Language Understanding) benchmark in the speech modality. Developed as part of the UltraEval-Audio suite, SpeechCMMLU facilitates rigorous assessment of audio foundation models’ Chinese world-knowledge comprehension and reasoning on spoken multiple-choice questions, enabling reproducible and efficient cross-model comparisons in audio-based QA for the Chinese language (Shi et al., 4 Jan 2026).

1. Motivation and Benchmark Design

SpeechCMMLU was developed to address three primary deficits in the landscape of audio model evaluation. First, existing QA-style speech benchmarks, such as SpeechTriviaQA, SpeechWebQuestions, and SpeechAlpacaEval, are exclusively English-based, precluding comprehensive evaluation of Chinese language understanding in audio models. Second, no established large-scale benchmark existed for high-level spoken QA in Chinese. Third, the reproducibility and quality control of audio-based benchmarks required a robust, standardized construction methodology.

The benchmark distinguishes itself in several ways:

  • Input Modality: Each input is provided as a raw speech audio prompt rather than as text.
  • Language: All questions and options are in Chinese.
  • Task Format: Each item is a multiple-choice question with four answer options (A–D), all rendered in a single audio prompt. Models must generate their answer as either a spoken letter or its textual representation.
  • Reproducibility: All samples are synthesized using TTS and undergo strict automated validation to minimize errors and variability.

Compared to the original text-based CMMLU, SpeechCMMLU evaluates the full auditory processing pipeline, mapping from speech comprehension to knowledge reasoning and spoken/textual answer generation.

2. Dataset Composition and Coverage

SpeechCMMLU comprises 3,519 four-option, multiple-choice spoken QA items. These were selected from the original 11,583 CMMLU questions following a two-stage process: TTS-based audio construction and stringent quality filtering.

  • Domain Coverage: The benchmark covers the same diverse range of academic and professional subjects as CMMLU, including mathematics, computer science, history, law, medicine, chemistry, physics, art, philosophy, economics, politics, and geography, among others. The precise number of categories and their individual distributions are not specified.
  • Sample Distribution: All retained samples satisfy a zero Character Error Rate (CER) filter on ASR transcription; no further label balancing or post hoc stratification is applied.
  • Data Split: SpeechCMMLU is designed solely as an evaluation set—no training, development, or test splits are provided.
  • Item Structure: Each sample consists of a compound audio prompt embedding both the question stem and the four answer options, synthesized as a contiguous waveform.
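To make the item structure concrete, the following is a minimal sketch of what a single evaluation record could look like. The field names and file layout are illustrative assumptions for exposition, not the benchmark's actual schema; the reference text reuses the adapted history example shown in Section 7.

```python
# Illustrative sketch of one SpeechCMMLU-style evaluation record.
# Field names and file layout are assumptions, not the benchmark's real schema.
from dataclasses import dataclass

@dataclass
class SpokenMCQItem:
    item_id: str          # identifier carried over from the source CMMLU question
    subject: str          # CMMLU subject area, e.g. "history" or "chemistry"
    audio_path: str       # single waveform containing question stem + options A-D
    reference_text: str   # exact text sent to TTS (used for the CER = 0 check)
    answer: str           # gold label inherited from CMMLU: "A", "B", "C", or "D"

example = SpokenMCQItem(
    item_id="cmmlu-history-0001",
    subject="history",
    audio_path="audio/cmmlu-history-0001.wav",
    reference_text="选择下列四项中，第一次鸦片战争爆发的时间最接近哪一年？A. 1836年；B. 1839年；C. 1842年；D. 1856年。",
    answer="B",
)
```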

3. Audio Generation and Preprocessing Pipeline

Audio prompts are constructed through the following workflow:

  1. Each question-plus-options prompt is synthesized using CosyVoice2, a high-fidelity Chinese TTS engine, ensuring proper pronunciation and fidelity for specialized terminology.
  2. The resulting waveform is transcribed with a robust Mandarin ASR model (Paraformer-zh).
  3. Only those samples with Character Error Rate equal to 0%—where the ASR transcript exactly matches the intended text—are retained. Any samples with transcription discrepancies are discarded.

This strict pipeline ensures that the final corpus contains only audio prompts with no TTS or pronunciation anomalies. While the standard audio parameters (e.g., 16 kHz WAV) are presumed, explicit details on format and duration are not documented.
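A minimal sketch of this filtering logic appears below. The `synthesize` and `transcribe` callables are placeholders standing in for CosyVoice2 (TTS) and Paraformer-zh (ASR), whose actual APIs are not reproduced here; `item.reference_text` follows the illustrative record sketch in Section 2. Only the CER = 0% retention rule is taken from the description above.

```python
# Sketch of the construction pipeline: synthesize each prompt, re-transcribe it,
# and retain the sample only if the character error rate (CER) is exactly zero.
# `synthesize` and `transcribe` are placeholder callables, not real TTS/ASR APIs.

def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance, normalized by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (ref_ch != hyp_ch),  # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1] / len(reference)


def build_corpus(items, synthesize, transcribe):
    """Keep only items whose ASR transcript exactly matches the intended text."""
    retained = []
    for item in items:
        wav = synthesize(item.reference_text)   # CosyVoice2 in the paper
        transcript = transcribe(wav)            # Paraformer-zh in the paper
        if char_error_rate(item.reference_text, transcript) == 0.0:
            retained.append((item, wav))        # CER = 0% -> retain
        # any transcription discrepancy -> the sample is discarded
    return retained
```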

4. Annotation, Ground Truth, and Quality Assurance

Ground truth correctness for each item is inherited directly from the CMMLU text annotations, with each sample specifying one correct answer (A, B, C, or D).

  • Annotation Source: Correct answers are not re-annotated or revised for the speech version.
  • Quality Control: The CER=0% criterion acts as the sole quality filter, ensuring that all audio prompts exactly correspond to the intended question and answer options. No further checks (such as human annotation consistency or perceptual validation) are specified.
  • Assumption: The benchmark assumes the gold status of the original CMMLU annotations and relies on automated processes for audio validation.

5. Evaluation Criteria and Scoring

The evaluation metric for SpeechCMMLU is classification accuracy, defined as the proportion of items for which the model's predicted option matches the ground-truth label:

\mathrm{Accuracy} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}}

where $N_{\mathrm{correct}}$ is the number of correct predictions and $N_{\mathrm{total}} = 3519$ is the total benchmark size. No composite, weighted, or per-domain measures are applied in this benchmark.
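Because the metric is plain accuracy over option letters, scoring reduces to a few lines. The snippet below assumes predictions and gold labels are aligned lists of letters A–D, one entry per benchmark item.

```python
# Plain accuracy over aligned lists of option letters ("A"-"D"); the benchmark
# defines no weighting or per-domain aggregation.
def accuracy(predictions, gold_labels):
    assert len(predictions) == len(gold_labels), "one prediction per benchmark item"
    n_correct = sum(pred == gold for pred, gold in zip(predictions, gold_labels))
    return n_correct / len(gold_labels)

# Example: accuracy(["A", "B", "D"], ["A", "C", "D"]) -> 0.666...
```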

6. Baseline Results and Cross-Model Comparisons

Table 6 of (Shi et al., 4 Jan 2026) reports SpeechCMMLU performance across six representative audio foundation models, reflecting varying approaches and degrees of openness. The following results are provided in terms of raw accuracy:

Model                          SpeechCMMLU Acc. (↑)
GPT-4o-Realtime                70.05 %
Qwen3-Omni-30B-A3B-Instruct    47.83 %
Qwen2.5-Omni                   73.72 %
MiniCPM-o 2.6                  51.37 %
Kimi-Audio-7B-Instruct         71.25 %
GLM-4-Voice                    52.61 %

The highest-performing models (GPT-4o-Realtime, Qwen2.5-Omni, Kimi-Audio-7B-Instruct) achieve accuracy rates in the 70–74% range, while the remaining models show more modest outcomes (≈48–53%). The spread in accuracy indicates continued headroom for improvement in Chinese spoken QA. The authors observe that open-source models have narrowed the gap with proprietary systems for Chinese world-knowledge reasoning, though accuracy remains below human-level performance.

7. Task Design and Sample Illustration

A typical SpeechCMMLU sample consists of a single audio file with both the question stem and four options read aloud in sequence. The model receives this audio prompt and produces its answer—the letter corresponding to the chosen option—as either speech or text.

Adapted examples (not verbatim from the benchmark, but constructed following its principles) include:

  • History Example:

Audio Prompt: "选择下列四项中，第一次鸦片战争爆发的时间最接近哪一年？A. 1836年；B. 1839年；C. 1842年；D. 1856年。" (English: "Of the following four options, which year is closest to the outbreak of the First Opium War? A. 1836; B. 1839; C. 1842; D. 1856.") Correct answer: B

  • Chemistry Example:

Audio Prompt: "已知下列元素的原子序数：碳为6，氧为8，氮为7，氖为10。哪一种元素的原子序数最大？A. 碳；B. 氧；C. 氮；D. 氖。" (English: "Given the atomic numbers of the following elements: carbon is 6, oxygen is 8, nitrogen is 7, neon is 10. Which element has the largest atomic number? A. Carbon; B. Oxygen; C. Nitrogen; D. Neon.") Correct answer: D

During evaluation, the model's response is compared to the annotated answer, and aggregate accuracy is reported.
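How a model's free-form reply is normalized to one of the four letters is not detailed in this summary, so the heuristic below is only an assumed illustration of the comparison step: it scans the reply (or an ASR transcript of spoken output) for a standalone option letter before applying the accuracy metric from Section 5.

```python
# Assumed normalization heuristic: map a free-form reply onto an option letter.
# The exact normalization used by UltraEval-Audio is not specified here.
import re

def extract_option(response: str):
    """Return the first standalone option letter A-D in the reply, or None."""
    match = re.search(r"\b([ABCD])\b", response.upper())
    return match.group(1) if match else None

# extract_option("我选择 B，1839年。")  -> "B"
# extract_option("The answer is C.")    -> "C"
```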

8. Relevance, Significance, and Availability

SpeechCMMLU establishes a reproducible and comprehensive reference for benchmarking Chinese spoken knowledge QA and reasoning. Its rigorous TTS and ASR-based filtering pipeline ensures the absence of spurious speech errors, enabling transparent, fair, and efficient evaluation of audio models' high-level language understanding in Chinese. SpeechCMMLU is integrated into the UltraEval-Audio benchmarking framework, offering one-command evaluation, public leaderboards, and code access at https://github.com/OpenBMB/UltraEval-Audio (Shi et al., 4 Jan 2026). Its release is intended to facilitate sustained progress in the evaluation and development of both open- and closed-source audio foundation models in Chinese.
