SpeechJudge-Eval: Speech Naturalness Benchmark
- SpeechJudge-Eval is a benchmark that assesses speech naturalness through pairwise comparisons of synthesized outputs using high-confidence human labels.
- It leverages a diverse annotated corpus from multiple TTS systems to ensure robust evaluation across regular and expressive speech.
- The benchmark serves as a foundation for aligning automated metrics and training reward models to better reflect human judgments of naturalness.
SpeechJudge-Eval is the benchmark component of SpeechJudge, designed for pairwise speech naturalness judgment: given a target text and two synthesized speech outputs, the evaluator must decide which output is more natural or human-like. It was introduced to address a gap in speech synthesis evaluation: existing speech quality reporting is fragmented across papers, and proxy measures such as WER, FAD, MOS predictors, and deepfake detectors do not necessarily align with human preference for naturalness. The benchmark is built from high-agreement human labels derived from a larger human-feedback corpus. Its initial evaluation shows that existing automatic metrics and AudioLLMs remain far from reproducing human judgments: the strongest baseline, Gemini-2.5-Flash, reaches only 69.1% agreement with the human labels (Zhang et al., 11 Nov 2025).
1. Scope and motivation
SpeechJudge-Eval is centered on naturalness, which the underlying paper treats as one of the most fundamental subjective criteria in text-to-speech (TTS). Its specific task is not MOS prediction, deepfake detection, or transcript fidelity estimation; rather, it is a comparative preference judgment over two speech samples conditioned on the same target text. This framing places the benchmark close to preference-modeling paradigms used in alignment, but adapted to speech rather than text (Zhang et al., 11 Nov 2025).
The benchmark was created because the speech field lacked a large-scale human preference corpus focused specifically on naturalness. The paper argues that automatic evaluators and strong AudioLLMs still do not reliably match human judgments of which speech sounds more natural. SpeechJudge-Eval is therefore intended as a controlled evaluation substrate for testing whether a model can reproduce human comparative judgments on speech naturalness, rather than merely optimize surface proxies such as intelligibility or lexical correctness (Zhang et al., 11 Nov 2025).
A central design choice is the use of high-agreement human labels. The benchmark is intentionally filtered to retain only confident naturalness comparisons, so that disagreement in automated systems can be more plausibly attributed to evaluator limitations than to annotation ambiguity. This makes SpeechJudge-Eval not only a benchmark for model comparison, but also a reference point for studying the gap between human and automated judgment in speech evaluation (Zhang et al., 11 Nov 2025).
2. SpeechJudge-Data as the benchmark substrate
SpeechJudge-Eval is derived from SpeechJudge-Data, a human-feedback corpus of 99K annotated speech pairs with target texts, synthesized from multiple strong zero-shot TTS systems and annotated for both intelligibility and naturalness. The corpus uses six TTS models spanning three architecture families: AR-based (ARS, CosyVoice2, CosyVoice2-INTP, Ints-INTP), FM-based (F5-TTS), and MGM-based (MaskGCT). This construction is explicitly intended to create diverse speech quality profiles and reduce overfitting to a single generator family (Zhang et al., 11 Nov 2025).
Reference speech includes both regular speech, sampled from Emilia-Large, and expressive speech, sourced from ParaSpeechCaps, L2-Arctic, KeSpeech, in-house whisper samples, and Genshin Impact character voices. Target texts cover Chinese (zh), English (en), and Chinese-English code-switching (mixed), with monolingual settings (en2en, zh2zh) and cross-lingual settings (zh2en, en2zh, zh2mixed, en2mixed). Pair formation includes both intra-model pairs and inter-model pairs, broadening the space of naturalness contrasts represented in the data (Zhang et al., 11 Nov 2025).
Each sample receives two annotation types. Intelligibility is a binary judgment about whether each audio correctly reads the target text without insertion, omission, or mispronunciation. Naturalness uses a five-level Comparative Mean Opinion Score (CMOS) pairwise preference scale: A +2, A +1, Tie, B +1, and B +2. The annotation guidelines specify that naturalness should reflect human-likeness, prosody, pacing, clarity, and stress, and that minor isolated pronunciation errors should not dominate the naturalness decision unless they seriously disrupt listening (Zhang et al., 11 Nov 2025).
The annotation process involved 69 annotators over 2 months, yielding 99K raw pairs, an average of 2.49 annotations per pair, and a reported annotation value of over 500K RMB. For agreement analysis, the five-level CMOS labels are collapsed to A / B / Tie, and four agreement levels are defined: FA (full agreement), WA (weak agreement), WD (weak disagreement), and FD (full disagreement). The paper reports that about 70% of the data is in FA or WA, and that the expressive subset is harder, with lower agreement than the regular subset (Zhang et al., 11 Nov 2025).
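The label collapse used for agreement analysis can be illustrated with a minimal Python sketch. The string encoding of the CMOS labels (`"A+2"`, `"Tie"`, and so on) and the helper names are hypothetical. The rule that full agreement means every annotator gives the same collapsed label is an assumption; the text above names the four agreement levels but does not state their exact definitions, so WA, WD, and FD are not modeled here.

```python
# Five-level CMOS labels as listed in the text, mapped to a coarse preference.
# The string encoding of the labels is an assumption made for illustration.
CMOS_TO_COARSE = {
    "A+2": "A",
    "A+1": "A",
    "Tie": "Tie",
    "B+1": "B",
    "B+2": "B",
}


def collapse(labels):
    """Collapse a list of five-level CMOS labels to A / B / Tie."""
    return [CMOS_TO_COARSE[label] for label in labels]


def is_full_agreement(labels):
    """Assumed rule: all annotators give the same collapsed label."""
    coarse = collapse(labels)
    return len(coarse) > 0 and len(set(coarse)) == 1
```

Under this assumed rule, two annotators who choose `"A+2"` and `"A+1"` still count as agreeing, because the strength of the preference is discarded by the collapse.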
3. Benchmark construction and task protocol
SpeechJudge-Eval formalizes speech naturalness judgment as a binary pairwise comparison: given a target text and two audio samples, the model predicts which audio is more natural. The paper defines benchmark accuracy as the fraction of examples for which model and human preferences match. This metric directly operationalizes alignment with comparative human judgment, rather than agreement with a scalar quality proxy (Zhang et al., 11 Nov 2025).
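The accuracy metric described above amounts to a simple match rate. The following Python sketch shows the computation; the function name and the `"A"`/`"B"` label encoding are illustrative assumptions, not taken from the paper.

```python
def pairwise_accuracy(model_prefs, human_prefs):
    """Fraction of pairs where the model's choice matches the human choice.

    Both arguments are equal-length sequences of "A" or "B".
    """
    if len(model_prefs) != len(human_prefs):
        raise ValueError("prediction and label lists must have equal length")
    if not human_prefs:
        raise ValueError("at least one pair is required")
    matches = sum(m == h for m, h in zip(model_prefs, human_prefs))
    return matches / len(human_prefs)


# Toy usage: three of four predictions match the human labels.
accuracy = pairwise_accuracy(["A", "B", "A", "B"], ["A", "B", "B", "B"])
```

Because the task is a binary choice, a judge that guesses at random is expected to score about 50% on this metric.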
The benchmark is constructed from SpeechJudge-Data (hq) through a strict filtering pipeline. First, all Tie samples are removed. Second, only Full Agreement (FA) samples are retained. Third, the resulting pool is stratified over regular vs expressive subsets and over target languages zh / en / mixed. The final benchmark contains 1,000 pairs, designed to preserve both label confidence and coverage across speech styles and language conditions (Zhang et al., 11 Nov 2025).
| Benchmark component | Composition | Size |
|---|---|---|
| Regular / Emilia-Large | 200 en + 200 zh | 400 |
| Expressive / ParaSpeechCaps, L2-Arctic, KeSpeech, Genshin, etc. | 200 en + 200 zh + 200 mixed | 600 |
| Total | FA-only, no ties, stratified sampling | 1,000 |
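The filtering and stratification steps can be sketched as follows. This is an illustration under stated assumptions: the record keys (`label`, `agreement`, `subset`, `lang`) are hypothetical, and uniform random sampling within each stratum is assumed because the text does not say how pairs were drawn. The per-stratum quotas are read from the composition table.

```python
import random
from collections import defaultdict

# Per-stratum quotas from the composition table: (subset, language) -> size.
QUOTAS = {
    ("regular", "en"): 200,
    ("regular", "zh"): 200,
    ("expressive", "en"): 200,
    ("expressive", "zh"): 200,
    ("expressive", "mixed"): 200,
}


def build_eval_set(pairs, quotas=QUOTAS, seed=0):
    """Drop ties, keep full-agreement pairs, then sample each stratum.

    `pairs` is an iterable of dicts with hypothetical keys:
    'label' (A/B/Tie), 'agreement' (FA/WA/WD/FD), 'subset', 'lang'.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for pair in pairs:
        if pair["label"] == "Tie":      # step 1: remove ties
            continue
        if pair["agreement"] != "FA":   # step 2: keep full agreement only
            continue
        pools[(pair["subset"], pair["lang"])].append(pair)

    selected = []
    for stratum, quota in quotas.items():  # step 3: stratified sampling
        pool = pools[stratum]
        if len(pool) < quota:
            raise ValueError(f"not enough pairs for stratum {stratum}")
        selected.extend(rng.sample(pool, quota))
    return selected
```

Fixing the seed keeps the selection reproducible, which matters when a benchmark is meant to be shared.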
This construction has two methodological consequences. First, it turns SpeechJudge-Eval into a high-confidence benchmark for relative naturalness rather than absolute scoring. Second, it makes the benchmark sensitive to phenomena that are often underrepresented in transcript-centric evaluation, especially expressive variation and multilingual or code-switched synthesis. The paper’s repeated contrast between regular and expressive speech indicates that benchmark difficulty is not uniform across subsets (Zhang et al., 11 Nov 2025).
4. Baselines and empirical behavior
The paper evaluates four broad classes of baselines on SpeechJudge-Eval: objective metrics, MOS predictors, deepfake detectors, and AudioLLMs. The overall pattern is that none of these families reaches human-level agreement, and several are only near chance for pairwise naturalness judgment (Zhang et al., 11 Nov 2025).
Among objective metrics, the reported total accuracies are WER: 57.9%, SIM: 44.5%, and FAD: 48.6%. For a binary choice the chance level is 50%, so SIM and FAD fall below chance and only WER is modestly above it. This supports the paper’s claim that naturalness is not reducible to transcript correctness, speaker similarity, or distributional audio distance. Among MOS predictors, the paper reports DNSMOS: 57.9%, UTMOS: 53.7%, and Audiobox-aesthetics variants with CE: 60.8%, CU: 57.3%, PC: 44.9%, and PQ: 57.1%. The best result in this group is CE at 60.8%, still well below the stronger pairwise evaluators (Zhang et al., 11 Nov 2025).
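A scalar metric can only act as a pairwise judge once its two scores are compared. The sketch below shows one plausible conversion. It is an assumption, since the text does not describe how the paper turned scores into preferences or how it handled equal scores; the function name and the `higher_is_better` flag are hypothetical.

```python
def prefer_by_score(score_a, score_b, higher_is_better=True):
    """Turn two scalar scores into a pairwise preference.

    Returns "A" or "B", or "Tie" when the scores are equal.
    Set higher_is_better=False for error-style metrics such as WER.
    """
    if score_a == score_b:
        return "Tie"
    if higher_is_better:
        a_wins = score_a > score_b
    else:
        a_wins = score_a < score_b
    return "A" if a_wins else "B"


# WER is an error rate, so the lower value is preferred.
wer_choice = prefer_by_score(0.05, 0.12, higher_is_better=False)
# A MOS-style predictor is treated as higher-is-better.
mos_choice = prefer_by_score(3.1, 3.8)
```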
Deepfake detectors perform poorly: AASIST reaches 46.7%, and ADV reaches 38.3%. This is consistent with the paper’s observation that real-vs-fake discrimination is not the same problem as ranking two synthetic outputs by naturalness. The detector objective does not encode the comparative and fine-grained perceptual criteria that human annotators use in SpeechJudge-Eval (Zhang et al., 11 Nov 2025).
AudioLLMs fare better but still fall well short of human-level agreement. Reported open-source results include Phi-4-Multimodal: 57.0%, Qwen2.5-Omni-7B: 60.6%, Kimi-Audio-7B-Instruct: 67.0%, Gemma-3n-E4B-it: 48.2%, Voxtral-Mini-3B-2507: 56.0%, MiDashengLM: 61.6%, and MiMo-Audio-7B-Instruct: 54.1%. Closed-source systems include Gemini-2.5-Flash: 69.1%, Gemini-2.5-Pro: 66.5%, GPT-4o mini Audio: 50.5%, and GPT-4o Audio: 67.4%. The headline result is that Gemini-2.5-Flash is the strongest baseline, yet it remains below 70% agreement (Zhang et al., 11 Nov 2025).
The paper also reports prompt-sensitivity for some judge models. Chain-of-Thought prompting improves Gemini-2.5-Flash from 69.1% to 70.5%, while Kimi-Audio-7B-Instruct declines from 67.0% to 66.5%. This indicates that reasoning-style prompting does not uniformly help, and that performance depends on whether the underlying multimodal model already has strong instruction-following and evaluative capacity. The analysis further notes that expressive speech is harder than regular speech, reinforcing that benchmark difficulty is not explained by a single global score (Zhang et al., 11 Nov 2025).
5. SpeechJudge-GRM and benchmark-driven alignment
SpeechJudge-Eval is not only an evaluation set; within the SpeechJudge framework it also serves as the target benchmark for a dedicated generative reward model, SpeechJudge-GRM. This model is built on Qwen2.5-Omni-7B (Thinker). The paper argues for a generative reward model rather than a classic scalar Bradley-Terry reward model because a GRM can produce Chain-of-Thought rationales and supports test-time scaling / majority voting (Zhang et al., 11 Nov 2025).
Training proceeds in two stages. In Stage 1, the model is given a cold start via Supervised Fine-Tuning (SFT) on Chain-of-Thought rationales generated by Gemini-2.5-Flash. For each training sample, the teacher produces a rationale and a preference label, and only samples where the teacher label matches the human label are retained. This stage uses 25K teacher-agreeing samples, LoRA rank 128, and learning rate . In Stage 2, the model is optimized with GRPO on the hard cases where the teacher disagrees with humans; the appendix states that DAPO, an enhanced GRPO variant, is the algorithm actually used. This stage uses 17K teacher-disagreeing samples, LoRA rank 64, 8 rollouts per prompt, batch size 32, and learning rate . The reward is verifiable against human preference:
$$
r = \begin{cases} 1, & \text{if } y_{\mathcal{M}} = y_{\mathcal{H}} \\ -1, & \text{otherwise} \end{cases}
$$
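The reward rule is simple enough to state in code. The sketch below implements it and, as an illustration only, adds the group-relative normalization commonly associated with GRPO, in which each rollout's reward is standardized against the other rollouts for the same prompt. The normalization step, the population standard deviation, and the `eps` constant are assumptions; the DAPO variant mentioned above differs in details that this text does not cover.

```python
from statistics import mean, pstdev


def preference_reward(model_choice, human_choice):
    """Reward of +1 when the model's choice matches the human label, else -1."""
    return 1.0 if model_choice == human_choice else -1.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: 8 rollouts for one prompt, 6 of which match the human label "A".
rollouts = ["A"] * 6 + ["B"] * 2
rewards = [preference_reward(choice, "A") for choice in rollouts]
advantages = group_relative_advantages(rewards)
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is one reason hard, teacher-disagreeing cases are useful at this stage.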
The benchmark results show a staged improvement:
| Model | Total accuracy |
|---|---|
| Qwen2.5-Omni-7B | 60.6% |
| Gemini-2.5-Flash | 69.1% |
| SpeechJudge-BTRM | 72.7% |
| SpeechJudge-GRM (SFT) | 75.3% |
| SpeechJudge-GRM (SFT + RL) | 77.2% |
| SpeechJudge-GRM (SFT) w/ Voting@10 | 77.6% |
| SpeechJudge-GRM (SFT + RL) w/ Voting@10 | 79.4% |
These numbers establish two points. First, the dedicated benchmark supports meaningful separation between generic AudioLLMs and reward models trained specifically for speech naturalness. Second, the gain from SFT (60.6% → 75.3%) followed by RL (75.3% → 77.2%) suggests that benchmark-aligned post-training contributes substantially beyond the base multimodal model (Zhang et al., 11 Nov 2025).
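The Voting@10 rows in the table rely on sampling several judgments for the same pair and taking the majority. A minimal sketch of that aggregation is shown below. The function name and the tie-breaking rule are assumptions: with an even number of samples a 5-5 split is possible, and the text does not say how the paper resolves it.

```python
from collections import Counter


def majority_vote(judgments, tie_break="A"):
    """Aggregate sampled "A"/"B" judgments for one pair by majority.

    `tie_break` is returned on an exact split; this rule is an assumption.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    counts = Counter(judgments)
    votes_a, votes_b = counts["A"], counts["B"]
    if votes_a == votes_b:
        return tie_break
    return "A" if votes_a > votes_b else "B"


# Toy usage: ten sampled judgments, six of which prefer A.
final_choice = majority_vote(["A"] * 6 + ["B"] * 4)
```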
The paper further reports that SpeechJudge-GRM can be used as a reward function during post-training of speech generation models. Two practical uses are described: Best-of-N sample selection and post-training TTS models with DPO/online alignment. According to the paper, these uses improve both intelligibility and naturalness, while naturalness gains do not materially harm speaker similarity, though similarity alignment remains imperfect and is identified as future work (Zhang et al., 11 Nov 2025).
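Best-of-N selection with a pairwise judge can be sketched as a sequential knockout: the current best candidate is compared against each challenger in turn. This is one possible scheme, not necessarily the procedure used in the paper. The `judge(text, a, b)` callable is hypothetical and stands in for a reward model that returns `"A"` or `"B"`.

```python
def best_of_n(text, candidates, judge):
    """Pick one candidate using N-1 pairwise comparisons.

    `judge(text, a, b)` is a hypothetical callable that returns "A" when
    `a` sounds more natural and "B" when `b` does.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    best = candidates[0]
    for challenger in candidates[1:]:
        if judge(text, best, challenger) == "B":
            best = challenger
    return best


def toy_judge(text, a, b):
    """Stand-in judge that compares a made-up 'quality' field."""
    return "A" if a["quality"] >= b["quality"] else "B"


samples = [
    {"id": 0, "quality": 0.2},
    {"id": 1, "quality": 0.9},
    {"id": 2, "quality": 0.5},
]
winner = best_of_n("hello world", samples, toy_judge)
```

The knockout needs only N-1 judge calls, but it assumes the judge's preferences are roughly transitive; a round-robin over all pairs would be more robust at a higher cost.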
6. Relation to adjacent speech and judge-evaluation research
Within speech evaluation, SpeechJudge-Eval occupies a narrower but more specialized role than benchmarks such as SD-Eval. SD-Eval evaluates spoken dialogue understanding beyond words in a single-turn speech-to-text dialogue setting, with four subsets—emotion, accent, age, and background sound / environment—and a dataset of 7,303 utterances totaling 8.76 hours. Its central question is whether a model can generate an appropriate response conditioned on paralinguistic and environmental cues, whereas SpeechJudge-Eval asks which of two speech outputs is more natural. SD-Eval also reports that GPT-4o-based LLM metrics correlate more strongly with human evaluation than BLEU/ROUGE-style metrics, which reinforces the broader claim that open-ended speech evaluation often requires richer judgment mechanisms than lexical overlap (Ao et al., 2024).
More generally, recent work on judge models indicates that evaluator quality is multidimensional rather than reducible to a single agreement number. Research on hidden shortcuts shows that LLM judges can change verdicts under irrelevant metadata perturbations while failing to acknowledge those cues in their rationales, yielding an explanation gap (Marioriyad et al., 8 Feb 2026). Work on evaluative fingerprints reports low inter-judge agreement but high judge-specific stability, arguing that judges are stable yet non-interchangeable measurement devices (Nasser, 8 Jan 2026). A psychometric Judge Datasheet protocol further decomposes judge behavior into dark current, stable cross-sensitivity, positional false preference, target sensitivity, and criterion shifts, explicitly treating judges as instruments rather than scalar classifiers (Usami et al., 14 Jun 2026).
Related literature also highlights the role of pairwise robustness and judge assignment. J4R argues that reasoning-heavy pairwise evaluation is strongly affected by positional bias, and introduces training over equivalent response-order states to improve judge consistency (Xu et al., 19 May 2025). CyclicJudge shows that judge identity can contribute substantial variance in LLM-as-judge pipelines and that round-robin judge assignment can cancel judge bias without increasing per-item cost (Zhu et al., 2 Mar 2026). Capability-oriented benchmarking work such as M-JudgeBench argues that reliable judge evaluation should separately probe pairwise comparison, length bias avoidance, and process error detection (Chen et al., 28 Feb 2026).
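Positional bias in a pairwise judge can be probed by presenting each pair in both orders and checking whether the verdict refers to the same item. The sketch below is a generic consistency check, not the training procedure of J4R or any other cited method. The `judge(text, a, b)` callable and the function names are hypothetical.

```python
def position_consistent(judge, text, audio_a, audio_b):
    """Return True when the judge picks the same audio in both orders."""
    first = judge(text, audio_a, audio_b)    # "A" refers to audio_a here
    second = judge(text, audio_b, audio_a)   # "A" refers to audio_b here
    # Map the swapped verdict back to the original ordering.
    second_in_original_order = "A" if second == "B" else "B"
    return first == second_in_original_order


def consistency_rate(judge, items):
    """Share of (text, audio_a, audio_b) items judged consistently."""
    if not items:
        raise ValueError("at least one item is required")
    hits = sum(position_consistent(judge, t, a, b) for t, a, b in items)
    return hits / len(items)


def always_first(text, a, b):
    """Toy judge with maximal positional bias: always prefers slot A."""
    return "A"


def by_content(text, a, b):
    """Toy judge that compares the inputs themselves (here, plain numbers)."""
    return "A" if a > b else "B"
```

A judge that always favors the first slot scores zero on this check, while a judge that depends only on the content scores one.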
This broader literature suggests that SpeechJudge-Eval should be understood as measuring a specific evaluative construct—speech naturalness under pairwise human preference—rather than generic judge competence. Its contribution is therefore twofold: it supplies a high-confidence benchmark for one foundational speech criterion, and it provides a substrate on which broader judge-reliability questions in multimodal evaluation can be posed with unusually direct human supervision (Zhang et al., 11 Nov 2025).