SpeechLLM-as-Judges: LLM Speech Quality Evaluation
- SpeechLLM-as-Judges is a unified framework that leverages LLMs to provide multidimensional, explanation-rich evaluations of synthetic speech quality.
- It integrates a speech encoder with an LLM-based decoder, using chain-of-thought reasoning to output detailed scores and natural language justifications.
- This approach improves generalization and interpretability compared to traditional metrics like MOS, evidenced by superior correlation and detection performance.
The SpeechLLM-as-Judges paradigm defines a unified and explanation-based framework for the perceptual evaluation of synthetic speech, leveraging LLMs trained to output structured and interpretable judgments. This approach departs from traditional scalar or binary scoring methods by incorporating dimension-rich, rationale-driven feedback, facilitating generalization across languages and tasks. Enabled by the large-scale, multilingual SpeechEval dataset and realized through the speech-quality-aware LLM architecture SQ-LLM, this methodology advances both the granularity and adaptability of speech quality assessment in contemporary generative speech systems (Wang et al., 16 Oct 2025).
1. Motivation and Conceptual Foundation
Generative speech technologies—spanning text-to-speech synthesis, speech-to-speech translation, and dialogue generation—require robust evaluation of perceptual quality for both development and deployment. Conventional metrics such as Mean Opinion Score (MOS), AB tests, and automatic measures (e.g., Mel Cepstral Distortion) yield scalar or binary outcomes that lack explanatory depth and generalize poorly across languages and attributes.
The SpeechLLM-as-Judges paradigm reframes evaluation as an instruction-following, multidimensional, and explanation-producing task. LLMs enriched with chain-of-thought (CoT) reasoning are trained to deliver not only quantitative scores but also rationales spanning high-level quality dimensions and granular perceptual factors. This approach is designed to overcome the interpretability deficit and generalization bottlenecks of standard methods, extending applicability across multilingually diverse real-world scenarios.
2. SpeechEval Dataset
The SpeechEval dataset underpins the training and benchmarking of SpeechLLM-as-Judges:
- Comprises 32,207 unique speech utterances and 128,754 total annotations.
- Supports four languages: Chinese, English, Japanese, and French.
- Each sample is annotated via structured fields and natural language explanations, capturing dimensions such as Overall Quality, Intelligibility, Distortion (including timbre and artifact categorization), Speech Rate, Dynamic Range, Emotion, Artistic Expression, and Subjective Experience.
- Categorical metadata includes distortion type, emotion label, and gender, supplemented with free-form descriptions.
SpeechEval spans four principal human annotation tasks:
- Speech Quality Assessment: Raters provide both numerical quality judgments and natural language justifications at global and dimension-specific levels.
- Pairwise Comparison: Annotators compare utterances and justify their preference, identifying key perceptual differences.
- Improvement Suggestion: Raters generate actionable feedback for suboptimal utterances.
- Deepfake Detection: Each utterance is labeled as bona fide or manipulated, with supporting rationale.
This dataset allows for comprehensive supervision and benchmarking of the unified model across objective signal attributes and subjective perceptual properties.
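The released serialization format of SpeechEval is not specified here, but a single record plausibly bundles the categorical metadata, per-dimension scores, and task-level labels described above. The following is a minimal sketch of such a record; every field name and value is hypothetical and chosen only to mirror the listed dimensions and tasks.

```python
# Hypothetical SpeechEval-style annotation record. Field names and values are
# illustrative placeholders, not the official schema.
example_record = {
    "utterance_id": "spe_000001",
    "language": "en",                       # one of: zh, en, ja, fr
    "audio_path": "audio/spe_000001.wav",
    "metadata": {
        "gender": "female",
        "emotion_label": "neutral",
        "distortion_type": "background_noise",
    },
    "dimension_scores": {                   # 1-5 scale assumed for illustration
        "overall_quality": 3,
        "intelligibility": 4,
        "distortion": 2,
        "speech_rate": 4,
        "dynamic_range": 3,
        "emotion": 3,
        "artistic_expression": 2,
        "subjective_experience": 3,
    },
    "explanation": "Intelligible speech, but a persistent hiss masks quieter syllables.",
    "tasks": {
        "pairwise_comparison": {
            "compared_with": "spe_000417",
            "preferred": "spe_000417",
            "rationale": "Cleaner timbre and fewer artifacts.",
        },
        "improvement_suggestion": "Denoise the opening 0.5 s and raise overall loudness slightly.",
        "deepfake_detection": {
            "label": "bona fide",
            "rationale": "Natural breathing and micro-prosodic variation.",
        },
    },
}
```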
3. SQ-LLM: Model Architecture and Training
Architecture:
- The SQ-LLM model integrates a speech encoder (for acoustic feature extraction) with a speech-aware language decoder, built atop an omni-modal LLM backbone (e.g., Qwen2.5-Omni).
- The model handles task instructions alongside either single utterances or paired samples, outputting structured responses that include per-dimension scores and textual rationales, as illustrated in the sketch below.
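How such a judge can be wired together is sketched below in PyTorch-style code. The module names and the Hugging Face-style decoder calls are assumptions made for illustration; this is not the released SQ-LLM implementation, only a schematic of the encoder-to-decoder projection and instruction-conditioned generation described above.

```python
import torch
import torch.nn as nn


class SQLLMJudge(nn.Module):
    """Schematic speech-encoder + LLM-decoder judge (names and shapes assumed)."""

    def __init__(self, speech_encoder: nn.Module, llm: nn.Module,
                 enc_dim: int, llm_dim: int):
        super().__init__()
        self.speech_encoder = speech_encoder        # waveform -> acoustic features
        self.proj = nn.Linear(enc_dim, llm_dim)     # project into the LLM embedding space
        self.llm = llm                              # speech-aware language decoder

    def forward(self, waveforms: torch.Tensor, instruction_ids: torch.Tensor):
        # Encode a single utterance (or a concatenated pair for comparison tasks).
        acoustic = self.speech_encoder(waveforms)            # (B, T_a, enc_dim)
        speech_tokens = self.proj(acoustic)                  # (B, T_a, llm_dim)

        # Embed the task instruction, prepend the projected speech tokens, and let
        # the decoder produce per-dimension scores plus a textual rationale.
        # `get_input_embeddings` / `inputs_embeds` follow Hugging Face conventions
        # and are assumptions about the backbone's interface.
        text_embeds = self.llm.get_input_embeddings()(instruction_ids)
        inputs_embeds = torch.cat([speech_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```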
Training regime:
- Stage I: Instruction Tuning with Chain-of-Thought Reasoning. The model is trained to output stepwise judgments across perceptual dimensions, followed by a holistic natural language answer. Formally, the joint training loss is
$$\mathcal{L} = \sum_{d=1}^{D} \mathcal{L}_{\mathrm{CE}}^{(d)} + \mathcal{L}_{\mathrm{exp}},$$
where $\mathcal{L}_{\mathrm{CE}}^{(d)}$ is the cross-entropy loss for dimension $d$ and $\mathcal{L}_{\mathrm{exp}}$ is the cross-entropy loss over the explanation tokens.
- Stage II: Reward Optimization via Group Relative Policy Optimization (GRPO). The trained model is further optimized with reward signals defined for key aspects (Helpfulness, Relevance, Accuracy, Detail) across task targets. Per-dimension rewards are structured as
$$r_d = \sum_{a \in \{\mathrm{Help},\, \mathrm{Rel},\, \mathrm{Acc},\, \mathrm{Det}\}} \lambda_a\, s_{a,d},$$
and the total reward is a weighted sum over all dimensions, $R = \sum_{d} w_d\, r_d$.
The approach enables the model to generate human-like, interpretable rationales consistent with technically grounded speech perception.
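A compact sketch of the two training signals is given below, assuming the Stage I objective sums a classification cross-entropy per perceptual dimension with a token-level cross-entropy over the explanation, and the Stage II reward aggregates aspect scores per dimension before a weighted sum over dimensions. Function names, weightings, and tensor layouts are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def stage1_loss(dim_logits: dict, dim_targets: dict,
                expl_logits: torch.Tensor, expl_targets: torch.Tensor) -> torch.Tensor:
    """Joint instruction-tuning loss: per-dimension CE plus explanation CE (sketch)."""
    # Per-dimension score heads: logits (B, C) vs. class targets (B,).
    loss = sum(F.cross_entropy(dim_logits[d], dim_targets[d]) for d in dim_targets)
    # Explanation tokens: logits (B, T, V) vs. token ids (B, T); -100 masks padding.
    loss = loss + F.cross_entropy(expl_logits.transpose(1, 2), expl_targets,
                                  ignore_index=-100)
    return loss


# Equal aspect weights are an assumption; the paper defines rewards over these aspects.
ASPECT_WEIGHTS = {"helpfulness": 0.25, "relevance": 0.25, "accuracy": 0.25, "detail": 0.25}


def total_reward(aspect_scores: dict, dim_weights: dict) -> float:
    """Stage II reward: aspect-weighted score per dimension, summed over dimensions."""
    reward = 0.0
    for dim, w_d in dim_weights.items():
        r_d = sum(ASPECT_WEIGHTS[a] * aspect_scores[dim][a] for a in ASPECT_WEIGHTS)
        reward += w_d * r_d
    return reward
```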
4. Evaluation Tasks and Experimental Results
The SpeechLLM-as-Judges framework directly supports four evaluation task types:
A. Speech Quality Assessment
- SQ-LLM outputs dimension-wise scores and explanations for single utterances.
- Achieves an average Pearson correlation coefficient (PCC) of 0.476 with human ratings across eight perceptual dimensions, outperforming baseline and non-task-specific models; a minimal computation sketch follows this list.
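The dimension-wise agreement behind this figure is ordinary Pearson correlation between model scores and human ratings, averaged over dimensions. The snippet below shows the computation on synthetic values; the numbers are made up and do not reproduce the reported 0.476.

```python
import numpy as np
from scipy.stats import pearsonr

# Toy model scores vs. human ratings for two of the eight dimensions (illustrative values).
predicted = {"overall_quality": np.array([3.1, 4.0, 2.2, 3.8]),
             "intelligibility": np.array([4.5, 3.9, 2.8, 4.1])}
human = {"overall_quality": np.array([3.0, 4.5, 2.0, 3.5]),
         "intelligibility": np.array([4.0, 4.0, 3.0, 4.5])}

per_dim_pcc = {}
for dim in predicted:
    r, _ = pearsonr(predicted[dim], human[dim])   # correlation per dimension
    per_dim_pcc[dim] = r

average_pcc = float(np.mean(list(per_dim_pcc.values())))
print(per_dim_pcc, average_pcc)
```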
B. Speech Quality Comparison
- The model produces relative and dimension-specific rationales to select a preferred utterance between two samples.
- Delivers an average comparison accuracy of 67.2%, surpassing general-purpose multimodal LLMs that lack speech-specific instruction tuning.
C. Speech Quality Improvement Suggestion
- SQ-LLM provides actionable suggestions (e.g., “reduce background noise in the first 0–0.5s segment”, “increase speaking rate”) to optimize utterances.
- On reference-based text-similarity metrics such as BLEU-4 (0.255), ROUGE-L, and CIDEr-D, as well as LLM-based response-quality ratings, SQ-LLM consistently outperforms baseline generative models; an illustrative metric computation follows this list.
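For the reference-based side of this evaluation, a minimal BLEU-4 computation for one generated suggestion is shown below using NLTK; the sentences are toy strings, and ROUGE-L and CIDEr-D would be computed analogously with their respective packages.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Annotator-written reference suggestion vs. a model-generated one (toy strings).
reference = "reduce the background noise in the opening segment and slow the speaking rate".split()
hypothesis = "reduce background noise at the start and speak a bit more slowly".split()

# Default weights (0.25, 0.25, 0.25, 0.25) give BLEU-4; smoothing avoids zero n-gram counts.
bleu4 = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```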
D. Deepfake Speech Detection
- The model classifies utterances as bona fide or synthetic/manipulated using both acoustic and textual cues.
- Attains an Equal Error Rate (EER) of 6.25%, a minDCF of 0.142, and 89% accuracy, outperforming untuned LLMs and handcrafted feature-based baselines; an EER computation sketch follows this list.
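The equal error rate is the operating point at which the false-acceptance and false-rejection rates coincide. A standard way to estimate it from detection scores, here with scikit-learn's ROC utilities and synthetic scores purely for illustration, is:

```python
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """Equal error rate: the ROC operating point where FPR equals FNR (= 1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return float((fpr[idx] + fnr[idx]) / 2.0)


# Toy detection scores: label 1 = synthetic/manipulated, 0 = bona fide (illustrative values).
labels = np.array([0, 0, 0, 1, 1, 1, 0, 1])
scores = np.array([0.10, 0.30, 0.20, 0.80, 0.60, 0.90, 0.40, 0.70])
print(f"EER: {compute_eer(labels, scores):.3f}")
```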
The open instruction-following setup supports robust generalization, with cross-language and cross-task consistency confirmed in ablation studies and held-out test splits.
5. Interpretability and Generalization
SpeechLLM-as-Judges centers interpretability by exposing intermediate per-dimension judgments, rationales, and explicit error analysis in the evaluation trace. This stands in contrast to traditional systems that only return abstract scores, enabling in-depth understanding of which acoustic-perceptual factors (e.g., intelligibility, distortion, emotion) drive the final assessment.
Generalization is ensured through:
- Unified, cross-task instruction format.
- Multi-stage training with chain-of-thought reasoning, supporting explanation-rich outputs across languages and prompt types.
- Empirical evidence that training on the SpeechEval dataset enables robust performance for Chinese, English, Japanese, and French (four representative languages) and across technical and perceptually subjective tasks.
6. Open-Source Resources and Future Prospects
Relevant resources are scheduled for open release:
- SpeechEval: 32,207 audio clips, 128,754 annotations, full evaluation protocols and metadata.
- SQ-LLM implementation: instruction tuning and reward optimization code, chain-of-thought prompting schemas, and evaluation pipeline scripts.
- Prompt templates: Structured instruction sets for general and task-specific queries; a hypothetical example is sketched below.
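Pending the official release, the shape of a task-specific instruction can only be illustrated; a hypothetical quality-assessment prompt in the spirit of the framework might look like the following (wording and placeholders are invented, not the published templates).

```python
# Hypothetical instruction template for the quality-assessment task
# (wording and placeholders are illustrative, not the released prompts).
QUALITY_ASSESSMENT_PROMPT = """You are a speech quality judge.
Listen to the provided {language} utterance and reason step by step about:
overall quality, intelligibility, distortion, speech rate, dynamic range,
emotion, artistic expression, and subjective experience.
For each dimension, give a score from 1 to 5 with a one-sentence justification,
then provide an overall verdict and a short free-form explanation."""

prompt = QUALITY_ASSESSMENT_PROMPT.format(language="English")
```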
Release venues are anticipated to include GitHub and/or dedicated project portals, offering the community a reproducible and extensible evaluation ecosystem.
Future directions may involve:
- Extending the language and attribute diversity of the SpeechEval dataset.
- Refining dimension coverage and reward signaling for even finer-grained quality assessment.
- Adapting the paradigm to new modalities (e.g., expressive audiovisual synthesis) or cross-domain judgment tasks.
7. Significance and Limitations
SpeechLLM-as-Judges establishes a new methodological direction in speech assessment by integrating LLMs as explanation-based evaluators that combine objective signal analysis with subjective reasoning in a structured, transparent manner. The framework addresses current limitations of interpretability and generalization in speech quality scoring, demonstrating measurable performance gains across diverse tasks and languages. However, the ultimate utility of such models depends on open-access training resources and continuous validation in high-variance, real-user scenarios.
This paradigm provides a foundation for speech quality evaluation pipelines that require detailed feedback, scalable annotation, and alignment with human perceptual judgments, reflecting a significant advancement in both methodology and resource availability for speech technology research (Wang et al., 16 Oct 2025).