SpeechEval Dataset Benchmark
- SpeechEval is a multilingual dataset comprising 32,207 speech clips with 128,754 annotations spanning quality assessment, pairwise comparison, improvement suggestions, and deepfake detection.
- The dataset employs a structured human and LLM review process using detailed 5-point perceptual scales and rigorous statistical agreement metrics to ensure reliable evaluations.
- Its comprehensive and balanced design across four languages makes it a crucial resource for training and benchmarking speech-quality-aware large language models.
SpeechEval is a large-scale, multilingual dataset designed to provide a unified, fine-grained, and explainable benchmark for evaluating the perceptual quality and authenticity of speech. Developed as part of the "SpeechLLM-as-Judges" paradigm, SpeechEval enables both algorithmic and human-in-the-loop assessment for four key tasks: speech quality assessment, pairwise comparison, automatic improvement suggestion, and synthetic (deepfake) speech detection. Its comprehensive construction—spanning 32,207 utterances and 128,754 annotations across English, Mandarin Chinese, Japanese, and French—makes SpeechEval a critical resource for training and benchmarking speech-quality-aware LLMs (Wang et al., 16 Oct 2025).
1. Dataset Composition and Multilingual Coverage
SpeechEval comprises 32,207 distinct speech clips, balanced in distribution across four major languages: English (~40%), Chinese (~35%), Japanese (~12.5%), and French (~12.5%). Each utterance is subject to multiple annotation tasks, yielding a total of 128,754 labels. Both real and synthetic speech are represented in roughly equal proportion, with real utterances sourced from large public corpora and synthetic audio generated using diverse open-source and commercial text-to-speech (TTS) engines. Attention is paid to balancing male/female speakers, affective content (approximately 70% neutral emotion, 30% emotional), speaker characteristics, and recording conditions, including various forms of ambient noise or distortion.
Key sources include Common Voice, LibriSpeech, Voice Conversion Challenge data, Blizzard, BVCC, AISHELL-3, JSUT, JVS, SIWIS, DataBaker, and synthetic utterances generated using Transformer TTS, VITS, Tacotron2, FastSpeech2, open- and closed-source neural codec LMs, and major commercial providers. Metadata fields annotate gender, emotion type, distortion type, and open-ended perceptual comments for each sample.
2. Annotation Protocols and Task Taxonomy
SpeechEval supports four annotation tasks, each targeted at a distinct aspect of speech quality and authenticity:
- Speech Quality Assessment (SQA): Each utterance is rated on eight perceptual scales (Overall Quality, Intelligibility, Distortion, Speech Rate, Dynamic Range, Emotional Impact, Artistic Expression, Subjective Experience), all on standardized 5-point ranges; additional metadata and free-form perceptual descriptions are provided.
- Speech Quality Comparison (SQC): Human annotators compare pairs of utterances across all eight dimensions and overall quality, recording their preference and producing justifying explanations.
- Speech Quality Improvement Suggestion (SQI): Given a single utterance, annotators generate concrete, actionable improvement directives (e.g., suggesting denoising or emphasizing certain phonemes) to enhance perceived quality.
- Deepfake Speech Detection (DSD): Utterances are labeled as "Real" or "Fake" (synthetic), based on source provenance.
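The four task types above can be summarized in a single per-annotation record. The following is a minimal sketch of such a record; the field names, language codes, and validation rule are illustrative assumptions, not the released dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# The eight SQA perceptual dimensions described above.
SQA_DIMENSIONS = [
    "overall_quality", "intelligibility", "distortion", "speech_rate",
    "dynamic_range", "emotional_impact", "artistic_expression",
    "subjective_experience",
]

@dataclass
class SpeechEvalRecord:
    """Hypothetical layout for one SpeechEval annotation (names are illustrative)."""
    clip_id: str
    language: str                               # e.g. "en", "zh", "ja", "fr"
    task: str                                   # "SQA", "SQC", "SQI", or "DSD"
    scores: dict = field(default_factory=dict)  # SQA: dimension -> 1..5 rating
    preference: Optional[str] = None            # SQC: "A", "B", or "tie"
    suggestion: Optional[str] = None            # SQI: free-form improvement directive
    label: Optional[str] = None                 # DSD: "Real" or "Fake"
    comment: str = ""                           # open-ended perceptual description

    def validate_sqa(self) -> bool:
        """Check that SQA ratings use the documented 5-point scales."""
        return all(
            dim in SQA_DIMENSIONS and 1 <= v <= 5
            for dim, v in self.scores.items()
        )
```

A record for an SQA item would carry only `scores` and `comment`; a DSD item only `label`. Keeping all tasks in one schema simplifies joint training of a single judge model across tasks.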
The annotation pipeline combines human raters with LLM-generated drafts and iterative revision to increase annotation efficiency and reliability. Structured criteria for each perceptual dimension are documented, ensuring consistent interpretation.
3. Data Splits, Statistical Properties, and Agreement Metrics
Each task in SpeechEval is divided into standardized train, validation, and test splits, as shown below:
| Task | Train | Val | Test | Total |
|---|---|---|---|---|
| SQA | 23,769 | 5,392 | 5,505 | 34,666 |
| SQC | 19,260 | 4,069 | 4,297 | 27,626 |
| SQI | 23,494 | 5,316 | 5,445 | 34,255 |
| DSD | 6,600 | 5,724 | 19,883 | 32,207 |
| Sum | 73,123 | 20,501 | 35,130 | 128,754 |
Ratings in SQA span the full 1–5 range, with near-uniform distributions for most dimensions and a moderate skew for Speech Rate. Demographic and emotion-attribute distributions indicate balanced gender and a predominance of the "neutral" affect label.
Inter-annotator agreement was measured on a 1,000-utterance SQA subset using both categorical and ordinal statistics:
- Cohen’s κ: ≈0.78 (Intelligibility), ≈0.72 (Distortion)
- Krippendorff’s α: ≈0.81 (eight dimensions, ordinal average)
- Kendall’s W: ≈0.84 (SQC rank concordance)
These values indicate substantial or better agreement for all major dimensions.
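The paper reports these agreement statistics without spelling out their computation; as a reference point, Cohen's κ for two raters over categorical labels follows the standard formulation below (a plain-Python sketch, not the authors' code):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items categorically."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1.0, while agreement no better than chance yields κ ≈ 0; values around 0.7–0.8, as reported above, are conventionally read as substantial agreement. Krippendorff's α additionally weights disagreements by ordinal distance, which is why it is the appropriate statistic for the 1–5 rating scales.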
4. Evaluation Metrics and Benchmarking
Task-specific evaluation metrics in SpeechEval are rigorously defined:
- Regression (SQA): Mean Squared Error (MSE) and Pearson correlation between predicted and reference ratings.
- Classification (DSD): Accuracy, Precision, Recall, F1-score, Equal Error Rate (EER), and minDCF (following ASVspoof conventions).
- Pairwise Comparison (SQC): Pairwise accuracy.
- Text Generation (SQA/SQC/SQI descriptions, suggestions, and justifications): BLEU-4, METEOR, ROUGE-L, CIDEr-D, SBERT similarity, FENSE, and an "LLM Score" based on a weighted sum of human-aligned criteria.
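The regression and EER metrics can be made concrete with a short sketch. The EER implementation below uses a simple threshold sweep that returns the minimum of max(FRR, FAR); the exact ASVspoof convention may differ in tie-breaking, so treat this as an approximation rather than the benchmark's reference scorer:

```python
import math

def mse(pred, gold):
    """Mean squared error between predicted and reference ratings."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(pred)

def pearson(pred, gold):
    """Pearson correlation between predicted and reference ratings."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sg = math.sqrt(sum((g - mg) ** 2 for g in gold))
    return cov / (sp * sg)

def eer(scores, labels):
    """Approximate Equal Error Rate for DSD.

    scores: higher means 'more likely Real'; labels: 1 = Real, 0 = Fake.
    Sweeps the threshold past each score and tracks where the false-reject
    rate (FRR) and false-accept rate (FAR) cross.
    """
    pairs = sorted(zip(scores, labels))
    n_real = sum(labels)
    n_fake = len(labels) - n_real
    fr, fa = 0, n_fake        # start with threshold below all scores
    best = 1.0
    for _, label in pairs:    # raise threshold past one score at a time
        if label == 1:
            fr += 1           # this real clip now falls below threshold
        else:
            fa -= 1           # this fake clip is no longer accepted
        best = min(best, max(fr / n_real, fa / n_fake))
    return best
```

For a perfectly separating detector the EER is 0; a detector that cannot separate Real from Fake at all approaches 0.5.

```python
assert eer([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]) == 0.0
```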
Strict usage protocols require models to evaluate only on held-out test splits and report metrics according to the provided normalization and scale conventions.
5. Construction Pipeline and Quality Control
SpeechEval annotation is conducted via a structured questionnaire, automated by LLMs and further curated by expert raters. For DSD, authenticity labels are algorithmically determined based on known origin. For SQA and SQC, annotators follow documented 5-point protocols for each perceptual axis and provide open-ended textual feedback.
Quality control involves:
- Pairwise revision (LLM + human-in-the-loop)
- Cross-rater double annotation for agreement estimation
- Consistency checks using control samples and predefined guidelines for each dimension and task
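A consistency check against control samples can be sketched as follows: raters whose labels on control items (items with trusted gold labels) deviate too often are flagged for review. The data layout, function name, and 20% threshold are illustrative assumptions, not the paper's documented procedure:

```python
def flag_inconsistent_raters(annotations, gold, max_error_rate=0.2):
    """Flag raters whose control-sample error rate exceeds a threshold.

    annotations: {rater_id: {clip_id: label}}
    gold:        {clip_id: trusted_label} for the control samples only
    """
    flagged = []
    for rater, labels in annotations.items():
        controls = [cid for cid in labels if cid in gold]
        if not controls:
            continue  # this rater saw no control samples
        errors = sum(labels[cid] != gold[cid] for cid in controls)
        if errors / len(controls) > max_error_rate:
            flagged.append(rater)
    return flagged
```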
Metadata and open comment fields are included to support granular error analysis and the development of interpretable LLM-based judge models.
6. Access, Licensing, and Usage Constraints
SpeechEval is released under a CC-BY-4.0 license, with the exception of data segments originating from third-party corpora (e.g., Common Voice, LibriSpeech), which retain their original licenses. Users must provide attribution to the authors—"H. Wang et al., ‘SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation,’ Proc. ACL 2025"—and comply with original data license terms for all incorporated subsets. The full dataset, annotation scripts, and associated code are accessible via GitHub (https://github.com/Microsoft/SpeechEval) and HuggingFace Datasets (Wang et al., 16 Oct 2025).
7. Significance and Applications
SpeechEval establishes a comprehensive, high-resolution benchmark for evaluating models that capture nuanced, human-aligned judgments of speech quality and authenticity. It directly supports research on LLM-as-judge paradigms, robust perceptual model training, cross-language evaluation, and speech deepfake detection, and sets a new standard for interpretable, explainable benchmarking in both synthesized and human-generated speech. The dataset fills a previously unsatisfied need for multilingual, task-diverse, and highly annotated perceptual benchmarks in speech technology research (Wang et al., 16 Oct 2025).