SpeechEval Dataset Benchmark
- SpeechEval is a multilingual dataset comprising 32,207 speech clips with 128,754 annotations spanning quality assessment, pairwise comparison, improvement suggestions, and deepfake detection.
- The dataset employs a structured human and LLM review process using detailed 5-point perceptual scales and rigorous statistical agreement metrics to ensure reliable evaluations.
- Its comprehensive and balanced design across four languages makes it a crucial resource for training and benchmarking speech-quality-aware large language models.
SpeechEval is a large-scale, multilingual dataset designed to provide a unified, fine-grained, and explainable benchmark for evaluating the perceptual quality and authenticity of speech. Developed as part of the "SpeechLLM-as-Judges" paradigm, SpeechEval enables both algorithmic and human-in-the-loop assessment for four key tasks: speech quality assessment, pairwise comparison, automatic improvement suggestion, and synthetic (deepfake) speech detection. Its comprehensive construction—spanning 32,207 utterances and 128,754 annotations across English, Mandarin Chinese, Japanese, and French—makes SpeechEval a critical resource for training and benchmarking speech-quality-aware LLMs (Wang et al., 16 Oct 2025).
1. Dataset Composition and Multilingual Coverage
SpeechEval comprises 32,207 distinct speech clips, balanced in distribution across four major languages: English (~40%), Chinese (~35%), Japanese (~12.5%), and French (~12.5%). Each utterance is subject to multiple annotation tasks, yielding a total of 128,754 labels. Both real and synthetic speech are represented in roughly equal proportion, with real utterances sourced from large public corpora and synthetic audio generated using diverse open-source and commercial text-to-speech (TTS) engines. Attention is paid to balancing male/female speakers, affective content (approximately 70% neutral emotion, 30% emotional), speaker characteristics, and recording conditions, including various forms of ambient noise or distortion.
Key sources include Common Voice, LibriSpeech, Voice Conversion Challenge data, Blizzard, BVCC, AISHELL-3, JSUT, JVS, SIWIS, DataBaker, and synthetic utterances generated using Transformer TTS, VITS, Tacotron2, FastSpeech2, open- and closed-source neural codec LMs, and major commercial providers. Metadata fields annotate gender, emotion type, distortion type, and open-ended perceptual comments for each sample.
2. Annotation Protocols and Task Taxonomy
SpeechEval supports four annotation tasks, each targeted at a distinct aspect of speech quality and authenticity:
- Speech Quality Assessment (SQA): Each utterance is rated on eight perceptual scales (Overall Quality, Intelligibility, Distortion, Speech Rate, Dynamic Range, Emotional Impact, Artistic Expression, Subjective Experience), all on standardized 5-point ranges; additional metadata and free-form perceptual descriptions are provided.
- Speech Quality Comparison (SQC): Human annotators compare pairs of utterances across all eight dimensions and overall quality, recording their preference and producing justifying explanations.
- Speech Quality Improvement Suggestion (SQI): Given a single utterance, annotators generate concrete, actionable improvement directives (e.g., suggesting denoising or emphasizing certain phonemes) to enhance perceived quality.
- Deepfake Speech Detection (DSD): Utterances are labeled as "Real" or "Fake" (synthetic), based on source provenance.
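The four task types above can be summarized in a single per-annotation record. The following is a minimal sketch of such a record; the field names, language codes, and validation rule are illustrative assumptions, not the released dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# The eight SQA perceptual dimensions described above.
SQA_DIMENSIONS = [
    "overall_quality", "intelligibility", "distortion", "speech_rate",
    "dynamic_range", "emotional_impact", "artistic_expression",
    "subjective_experience",
]

@dataclass
class SpeechEvalRecord:
    """Hypothetical layout for one SpeechEval annotation (names are illustrative)."""
    clip_id: str
    language: str                               # e.g. "en", "zh", "ja", "fr"
    task: str                                   # "SQA", "SQC", "SQI", or "DSD"
    scores: dict = field(default_factory=dict)  # SQA: dimension -> 1..5 rating
    preference: Optional[str] = None            # SQC: "A", "B", or "tie"
    suggestion: Optional[str] = None            # SQI: free-form improvement directive
    label: Optional[str] = None                 # DSD: "Real" or "Fake"
    comment: str = ""                           # open-ended perceptual description

    def validate_sqa(self) -> bool:
        """Check that SQA ratings use the documented 5-point scales."""
        return all(
            dim in SQA_DIMENSIONS and 1 <= v <= 5
            for dim, v in self.scores.items()
        )
```

A record for an SQA item would carry only `scores` and `comment`; a DSD item only `label`. Keeping all tasks in one schema simplifies joint training of a single judge model across tasks.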
The annotation pipeline combines human raters with LLM-generated drafts and iterative revision to increase annotation efficiency and reliability. Structured criteria for each perceptual dimension are documented, ensuring consistent interpretation.
3. Data Splits, Statistical Properties, and Agreement Metrics
Each task in SpeechEval is divided into standardized train, validation, and test splits, as shown below:
| Task | Train | Val | Test | Total |
|---|---|---|---|---|
| SQA | 23,769 | 5,392 | 5,505 | 34,666 |
| SQC | 19,260 | 4,069 | 4,297 | 27,626 |
| SQI | 23,494 | 5,316 | 5,445 | 34,255 |
| DSD | 6,600 | 5,724 | 19,883 | 32,207 |
| Sum | 73,123 | 20,501 | 35,130 | 128,754 |
Ratings in SQA span the full 1–5 range, with near-uniform distributions for most dimensions and a moderate skew for Speech Rate. Demographic and emotion-attribute distributions indicate balanced gender and a predominance of the "neutral" affect label.
Inter-annotator agreement was measured on a 1,000-utterance SQA subset using both categorical and ordinal statistics:
- Cohen’s κ: ≈0.78 (Intelligibility), ≈0.72 (Distortion)
- Krippendorff’s α: ≈0.81 (eight dimensions, ordinal average)
- Kendall’s W: ≈0.84 (SQC rank concordance)
These values indicate substantial or better agreement for all major dimensions.
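The paper reports these agreement statistics without spelling out their computation; as a reference point, Cohen's κ for two raters over categorical labels follows the standard formulation below (a plain-Python sketch, not the authors' code):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items categorically."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1.0, while agreement no better than chance yields κ ≈ 0; values around 0.7–0.8, as reported above, are conventionally read as substantial agreement. Krippendorff's α additionally weights disagreements by ordinal distance, which is why it is the appropriate statistic for the 1–5 rating scales.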
4. Evaluation Metrics and Benchmarking
Task-specific evaluation metrics in SpeechEval are rigorously defined:
- Regression (SQA): Mean Squared Error (MSE) and Pearson correlation between predicted and reference ratings.
- Classification (DSD): Accuracy, Precision, Recall, F1-score, Equal Error Rate (EER), and minDCF (following ASVspoof conventions).
- Pairwise Comparison (SQC): Pairwise accuracy.
- Text Generation (SQA/SQC/SQI descriptions, suggestions, and justifications): BLEU-4, METEOR, ROUGE-L, CIDEr-D, SBERT similarity, FENSE, and an "LLM Score" based on a weighted sum of human-aligned criteria.
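The regression and EER metrics can be made concrete with a short sketch. The EER implementation below uses a simple threshold sweep that returns the minimum of max(FRR, FAR); the exact ASVspoof convention may differ in tie-breaking, so treat this as an approximation rather than the benchmark's reference scorer:

```python
import math

def mse(pred, gold):
    """Mean squared error between predicted and reference ratings."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(pred)

def pearson(pred, gold):
    """Pearson correlation between predicted and reference ratings."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sg = math.sqrt(sum((g - mg) ** 2 for g in gold))
    return cov / (sp * sg)

def eer(scores, labels):
    """Approximate Equal Error Rate for DSD.

    scores: higher means 'more likely Real'; labels: 1 = Real, 0 = Fake.
    Sweeps the threshold past each score and tracks where the false-reject
    rate (FRR) and false-accept rate (FAR) cross.
    """
    pairs = sorted(zip(scores, labels))
    n_real = sum(labels)
    n_fake = len(labels) - n_real
    fr, fa = 0, n_fake        # start with threshold below all scores
    best = 1.0
    for _, label in pairs:    # raise threshold past one score at a time
        if label == 1:
            fr += 1           # this real clip now falls below threshold
        else:
            fa -= 1           # this fake clip is no longer accepted
        best = min(best, max(fr / n_real, fa / n_fake))
    return best
```

For a perfectly separating detector the EER is 0; a detector that cannot separate Real from Fake at all approaches 0.5.

```python
assert eer([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]) == 0.0
```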
Strict usage protocols require models to evaluate only on held-out test splits and report metrics according to the provided normalization and scale conventions.
5. Construction Pipeline and Quality Control
SpeechEval annotation is conducted via a structured questionnaire, automated by LLMs and further curated by expert raters. For DSD, authenticity labels are algorithmically determined based on known origin. For SQA and SQC, annotators follow documented 5-point protocols for each perceptual axis and provide open-ended textual feedback.
Quality control involves:
- Pairwise revision (LLM + human-in-the-loop)
- Cross-rater double annotation for agreement estimation
- Consistency checks using control samples and predefined guidelines for each dimension and task
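A consistency check against control samples can be sketched as follows: raters whose labels on control items (items with trusted gold labels) deviate too often are flagged for review. The data layout, function name, and 20% threshold are illustrative assumptions, not the paper's documented procedure:

```python
def flag_inconsistent_raters(annotations, gold, max_error_rate=0.2):
    """Flag raters whose control-sample error rate exceeds a threshold.

    annotations: {rater_id: {clip_id: label}}
    gold:        {clip_id: trusted_label} for the control samples only
    """
    flagged = []
    for rater, labels in annotations.items():
        controls = [cid for cid in labels if cid in gold]
        if not controls:
            continue  # this rater saw no control samples
        errors = sum(labels[cid] != gold[cid] for cid in controls)
        if errors / len(controls) > max_error_rate:
            flagged.append(rater)
    return flagged
```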
Metadata and open comment fields are included to support granular error analysis and the development of interpretable LLM-based judge models.
6. Access, Licensing, and Usage Constraints
SpeechEval is released under a CC-BY-4.0 license, with the exception of data segments originating from third-party corpora (e.g., Common Voice, LibriSpeech), which retain their original licenses. Users must provide attribution to the authors—"H. Wang et al., ‘SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation,’ Proc. ACL 2025"—and comply with original data license terms for all incorporated subsets. The full dataset, annotation scripts, and associated code are accessible via GitHub (https://github.com/Microsoft/SpeechEval) and HuggingFace Datasets (Wang et al., 16 Oct 2025).
7. Significance and Applications
SpeechEval establishes a comprehensive, high-resolution benchmark for evaluating models that capture nuanced, human-aligned judgments of speech quality and authenticity. It directly supports research on LLM-as-judge paradigms, robust perceptual model training, cross-language evaluation, and speech deepfake detection, and sets a new standard for interpretable, explainable benchmarking in both synthesized and human-generated speech. The dataset fills a previously unsatisfied need for multilingual, task-diverse, and highly annotated perceptual benchmarks in speech technology research (Wang et al., 16 Oct 2025).