TTSDS2: Robust TTS Quality Metric
- TTSDS2 is an objective, distributional metric that assesses TTS quality by comparing perceptual feature distributions from synthetic, real, and noise speech.
- It overcomes limitations of traditional signal-based and MOS-prediction methods, enabling scalable, reference-free evaluation across multiple languages and domains.
- TTSDS2 employs the 2-Wasserstein distance on features like SSL embeddings, speaker ID, prosody, and ASR activations, showing strong correlation with human judgments.
Text-to-Speech Distribution Score 2 (TTSDS2) is an objective, non-intrusive, distributional metric for robustly quantifying the human-perceived quality of Text-to-Speech (TTS) systems across multiple domains and languages. Developed to address limitations in prior metrics, such as the inadequacy of signal-based or WER/CER metrics for modern neural TTS and the in-domain bias or collapse of MOS-prediction networks, TTSDS2 models the similarity between the empirical distributions of multiple perceptual features extracted from synthetic speech and from genuine speech, normalized against various forms of noise. The resulting scalar metric demonstrates high correlation with human judgments (MOS, CMOS, SMOS) across diverse settings and supports a multilingual, continually refreshed open benchmark (Minixhofer et al., 24 Jun 2025).
1. Motivation and Conceptual Basis
TTSDS2 was introduced to overcome two key limitations observed in the evaluation of recent, highly naturalistic TTS outputs. First, classic signal-matching or character/word error rate (WER/CER) measures fail when outputs are highly natural, since synthesized speech may exhibit "oversmoothing", speaker mismatch, or no direct alignment with a ground-truth reference. Second, learned MOS/CMOS predictors (e.g., MOSNet, UTMOS) are commonly trained only in-domain and show strong drops in correlation on out-of-domain data. TTSDS2's distributional design directly compares the statistical behavior of multiple perceptual feature spaces (generic SSL embeddings, speaker ID, prosodic characteristics, and ASR-derived intelligibility) between synthetic, real, and noise utterance corpora. This framework is reference-free for content, scalable to new domains and languages, and non-intrusive in the sense that it does not require a matched reference recording.
2. Formal Mathematical Framework
TTSDS2 proceeds by quantifying, for each perceptual feature $f$, how much the distribution of $f$ in TTS-generated utterances resembles that of real speech, as opposed to distractor (noise) distributions. For each $f$, the empirical distributions over real, synthetic, and noise datasets are $P^{f}_{\text{real}}$, $P^{f}_{\text{syn}}$, and $P^{f}_{\text{noise}}$, respectively.

Similarity is assessed via the 2-Wasserstein distance $W_2$. For multivariate Gaussian approximations,

$$W_2^2\big(\mathcal{N}(\mu_1,\Sigma_1),\,\mathcal{N}(\mu_2,\Sigma_2)\big) = \lVert \mu_1 - \mu_2 \rVert_2^2 + \operatorname{Tr}\!\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2}\Sigma_2\,\Sigma_1^{1/2}\big)^{1/2}\Big),$$

where $\mu_i, \Sigma_i$ are the mean and covariance of each distribution. For 1-D features, the distance reduces to a quantile match over sorted samples $x_{(1)} \le \dots \le x_{(n)}$ and $y_{(1)} \le \dots \le y_{(n)}$:

$$W_2^2 = \frac{1}{n}\sum_{i=1}^{n}\big(x_{(i)} - y_{(i)}\big)^2.$$

For a feature $f$, the normalized similarity score is

$$S(f) = \frac{W_2\big(P^{f}_{\text{syn}},\,P^{f}_{\text{noise}}\big)}{W_2\big(P^{f}_{\text{syn}},\,P^{f}_{\text{real}}\big) + W_2\big(P^{f}_{\text{syn}},\,P^{f}_{\text{noise}}\big)},$$

with the noise distance taken to the closest (minimum-distance) noise dataset. This yields $S(f) = 1$ when synthetic matches real, $S(f) = 0$ when synthetic matches noise, and $S(f) > 0.5$ indicates "closer to real" than to noise. Features are grouped by factor: Generic (SSL embeddings), Speaker (ID embeddings), Prosody (F0/rate), and Intelligibility (ASR activations). Each factor's score $S_{\text{factor}}$ is the unweighted average over its group, and TTSDS2 is the global average of the factor scores, reported on a 0–100 scale:

$$\text{TTSDS2} = \frac{1}{4}\sum_{\text{factor}} S_{\text{factor}}.$$
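A minimal sketch of this per-feature computation, assuming a Gaussian approximation of each empirical distribution (variable names and the use of scipy's matrix square root are illustrative, not the reference implementation):

```python
import numpy as np
from scipy.linalg import sqrtm


def w2_gaussian(x: np.ndarray, y: np.ndarray) -> float:
    """2-Wasserstein distance between Gaussian fits of two sample sets (n_samples, n_dims)."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1, s2 = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    root_s1 = sqrtm(s1).real
    covmean = sqrtm(root_s1 @ s2 @ root_s1).real
    w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean)
    return float(np.sqrt(max(w2_sq, 0.0)))


def feature_score(syn: np.ndarray, real: np.ndarray, noises: list[np.ndarray]) -> float:
    """Normalized score S(f): 1 = matches real, 0 = matches noise, >0.5 = closer to real."""
    d_real = w2_gaussian(syn, real)
    d_noise = min(w2_gaussian(syn, n) for n in noises)  # closest noise distribution
    return d_noise / (d_real + d_noise)
```

Factor scores and the overall TTSDS2 then follow by unweighted averaging of these per-feature scores.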
3. Feature Extraction and Computation Pipeline
Computation of TTSDS2 for a given TTS system involves five high-level steps:
- Feature Extraction:
- Generic: WavLM, HuBERT/mHuBERT-147, wav2vec 2.0, XLSR-53 embeddings
- Speaker: d-Vector, WeSpeaker embeddings
- Prosody: PyWORLD F0 contours, HuBERT/Allosaurus speaking-rate, learned prosody embeddings
- Intelligibility: ASR activations (wav2vec 2.0, Whisper)
- Reference & Noise Statistics:
- Compute means/covariances (or 1-D sorted vectors) for each feature over large real speech corpora (e.g., YouTube, LDC) and for multiple noise datasets (uniform, Gaussian white noise, all 0s, all 1s)
- 2-Wasserstein Distance Computation:
- Calculate $W_2(P^{f}_{\text{syn}}, P^{f}_{\text{real}})$ and $W_2(P^{f}_{\text{syn}}, P^{f}_{\text{noise}})$ for each feature using the closed-form expressions above.
- Normalization:
- Map distances to similarity scores via the formula above.
- Aggregation:
- Average the per-feature scores $S(f)$ to obtain factor-level scores and the final TTSDS2 score.
Notably, the pipeline does not require matched transcripts and uses non-overlapping real and noise corpora. Complete implementations are publicly available [https://github.com/ttsds/ttsds].
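As an illustration of the Generic-factor extraction step, the following sketch obtains utterance-level WavLM embeddings with Hugging Face transformers; the checkpoint choice and mean pooling are assumptions for demonstration, not necessarily the exact TTSDS2 configuration:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Assumed checkpoint for illustration; TTSDS2 combines several SSL models.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()


@torch.no_grad()
def wavlm_embedding(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Mean-pooled hidden states for one 16 kHz mono utterance."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()


# Stacking embeddings over a corpus yields the empirical distribution used for W2.
# syn_feats = np.stack([wavlm_embedding(w) for w in synthetic_waveforms])
```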
4. Large-Scale Subjective Listening Test Set
Validation of TTSDS2 is based on a unified listening test dataset comprising approximately 11,000 synthetic-utterance ratings over four English domains:
| Domain | Source | Description |
|---|---|---|
| Clean | LibriTTS test split | Clean, read audiobook speech |
| Noisy | Recent LibriVox (2025) | Unfiltered noisy audiobook speech |
| Wild | YouTube (2025) | Scraped “talk show”, “podcast”, etc. |
| Kids | My Science Tutor | Children with a virtual tutor |
- 200 native English annotators (UK/US, headphone-checked, \$10 each), 50 per domain.
- MOS: 6 pages × 5 samples, each page contains 1 ground-truth plus 4 synthetic (1–5 scale).
- CMOS/SMOS: 18 pairwise comparisons to ground truth (–3…+3 for CMOS, 1–5 for SMOS).
- Public release: 11,282 synthetic-utterance ratings (excluding GT) [https://huggingface.co/datasets/ttsds/listening_test].
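A minimal sketch for inspecting the released ratings with the datasets library; the split and column layout are assumptions to be checked against the dataset card:

```python
from datasets import load_dataset

# Dataset path from the public release; configuration/split names may differ.
ratings = load_dataset("ttsds/listening_test", split="train")
print(ratings)     # features and number of rating rows
print(ratings[0])  # a single synthetic-utterance rating record
```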
5. Multilingual, Automated Test-Set Recreation
To support robust benchmarking and prevent data leakage from TTS training corpora, TTSDS2 uses an automated, quarterly-updated, 14-language pipeline:
- YouTube Scraping:
- Translate 10 English keywords (e.g. "interview", "sports commentary") per language.
- Search 250 long (≥20 min), popular videos per language.
- Whisper diarization and FastText language ID for clean target-language segments.
- Utterance Extraction and Filtering:
- Extract up to 16 single-speaker utterances from each video.
- XNLI controversial-content filter, Pyannote for crosstalk, Demucs for music removal.
- Pair Selection:
- Randomly select 50 distinct speakers × 2 utterances per language (one as reference/sample, one as input for synthesis).
- Synthesis & Scoring:
- Synthesize all reference–text pairs for each of the 20+ open-weight TTS systems via Replicate.com.
- Evaluate all samples with TTSDS2.
- Update public leaderboard quarterly [https://ttsdsbenchmark.com].
This procedure ensures continual refresh and expansion as new TTS models and languages become viable.
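For instance, the language-ID filtering step could look like the following sketch, which applies fastText's lid.176 model to Whisper transcripts; the confidence threshold and the `whisper_segments` helper are illustrative assumptions, not the benchmark's exact code:

```python
import fasttext

# Pretrained language-ID model (lid.176.bin must be downloaded from fastText separately).
lid = fasttext.load_model("lid.176.bin")


def keep_segment(transcript: str, target_lang: str, min_conf: float = 0.8) -> bool:
    """Keep a transcribed segment only if fastText agrees on the target language."""
    labels, probs = lid.predict(transcript.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return lang == target_lang and probs[0] >= min_conf


# whisper_segments: assumed list of dicts with a "text" field from a prior transcription pass.
segments = [s for s in whisper_segments if keep_segment(s["text"], "fr")]
```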
6. Comparative Spearman-Correlation Performance
TTSDS2 was benchmarked against 15 other objective evaluation metrics across four English domains and three subjective targets (MOS, CMOS, SMOS), using system-level Spearman's $\rho$. The summary findings are:
| Metric | Clean MOS | Noisy MOS | Wild MOS | Kids MOS | Average |
|---|---|---|---|---|---|
| TTSDS2 | 0.75 | 0.59 | 0.75 | 0.61 | 0.67 |
| RawNet3 Speaker | 0.36 | 0.44 | 0.85 | 0.73 | 0.60 |
| SQUIM-MOS | 0.68 | 0.48 | 0.62 | 0.57 | 0.57 |
| UTMOSv2 | 0.39 | 0.34 | 0.16 | 0.05 | 0.19 |
| FAD (CLAP) | –0.22 | 0.45 | –0.03 | 0.12 | 0.15 |
| STOI (intrusive) | –0.11 | –0.06 | 0.07 | –0.32 | 0.10 |
- TTSDS2 is the only metric with $\rho > 0.5$ in every domain × measure combination.
- Speaker-embedding metrics excel on SMOS but less so on naturalness (MOS/CMOS), especially in clean speech.
- MOS-prediction deep nets collapse on out-of-domain samples.
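System-level agreement of this kind can be reproduced with scipy, as in the following sketch (the per-system scores below are illustrative placeholders, not values from the paper):

```python
from scipy.stats import spearmanr

# Hypothetical per-system scores: objective metric vs. mean subjective MOS.
ttsds2_scores = [92.1, 88.4, 85.0, 79.3, 73.5]  # one value per TTS system
human_mos = [4.3, 4.1, 3.9, 3.4, 3.1]

rho, p_value = spearmanr(ttsds2_scores, human_mos)
print(f"System-level Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```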
7. Quarterly Updated Multilingual Benchmark
A continually updated benchmark, covering 14 languages for open-weight TTS systems (e.g., Bark, E2-TTS, F5-TTS, StyleTTS2, ParlerTTS, FishSpeech, XTTS), is maintained publicly. The "Wild" test set for each language is rebuilt quarterly following the pipeline above. Median TTSDS2 results:
| Language | Median TTSDS2 |
|---|---|
| English | 95 |
| Spanish | 92 |
| Mandarin | 89 |
| French | 90 |
| ... | ... |
Evaluation uses style-matched reference utterances and their transcripts, on single-turn utterances of 3–30s with no manual filtering beyond the pipeline. Full results and system rankings are public [https://ttsdsbenchmark.com].
TTSDS2 thus establishes a distributional, standardized, and robust approach to quantifying TTS quality. Its normalized scale, strong agreement with human scores across languages and domains, and compatibility with fully automated, leakage-resistant benchmarking make it well suited for evaluation and public leaderboard publication (Minixhofer et al., 24 Jun 2025).