
TTSDS2: Robust TTS Quality Metric

Updated 30 November 2025
  • TTSDS2 is an objective, distributional metric that assesses TTS quality by comparing perceptual feature distributions from synthetic, real, and noise speech.
  • It overcomes limitations of traditional signal-based and MOS-prediction methods, enabling scalable, reference-free evaluation across multiple languages and domains.
  • TTSDS2 employs the 2-Wasserstein distance on features like SSL embeddings, speaker ID, prosody, and ASR activations, showing strong correlation with human judgments.

Text-to-Speech Distribution Score 2 (TTSDS2) is an objective, non-intrusive, distributional metric for robustly quantifying the human-perceived quality of Text-to-Speech (TTS) systems across multiple domains and languages. Developed to address limitations in prior metrics, such as the inadequacy of signal-based or WER/CER metrics for modern neural TTS and the in-domain bias or collapse of MOS-prediction networks, TTSDS2 models the similarity between the empirical distributions of multiple perceptual features extracted from synthetic speech and from genuine speech, normalized against various forms of noise. The resulting scalar score in $[0,100]$ correlates strongly with human judgments (MOS, CMOS, SMOS) across diverse settings and supports a multilingual, continually refreshed open benchmark (Minixhofer et al., 24 Jun 2025).

1. Motivation and Conceptual Basis

TTSDS2 was introduced to overcome two key limitations observed in the evaluation of recent, highly naturalistic TTS outputs. First, classic signal-matching and character/word error rate (WER/CER) measures fail when outputs are highly natural, since synthetic speech may exhibit "oversmoothing" or speaker mismatch, or may not align directly with a ground-truth reference. Second, learned MOS/CMOS predictors (e.g., MOSNet, UTMOS) are typically trained in-domain and show sharp drops in correlation on out-of-domain data. TTSDS2's distributional design instead compares the statistical behavior of multiple perceptual feature spaces, covering generic SSL embeddings, speaker ID, prosodic characteristics, and ASR-derived intelligibility, between synthetic, real, and noise utterance corpora. This framework is reference-free for content, scalable to new domains and languages, and non-intrusive in the sense that it does not require parallel reference text.

2. Formal Mathematical Framework

TTSDS2 proceeds by quantifying, for each perceptual feature $X$, how much the distribution of $X$ in TTS-generated utterances ($D_{\mathrm{syn}}$) resembles that of real speech ($D_{\mathrm{real}}$) as opposed to distractor (noise) distributions ($\{D_{\mathrm{noise}}^k\}_k$). For each $X$, the empirical distributions over the real, synthetic, and noise datasets are $\hat P_{\mathrm{real}}(X)$, $\hat P_{\mathrm{syn}}(X)$, and $\hat P_{\mathrm{noise}}^k(X)$, respectively.

Similarity is assessed via the 2-Wasserstein distance $W_2(\hat P_1, \hat P_2)$. For multivariate Gaussian cases, the squared distance has the closed form:

$$W_2^2(\hat P_1, \hat P_2) = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\!\left(\Sigma_1 + \Sigma_2 - 2\,(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}\right),$$

where $\mu_i, \Sigma_i$ are the mean and covariance of $\hat P_i$. For 1-D empirical distributions with sorted samples $x_1 \le \dots \le x_n$ and $y_1 \le \dots \le y_n$:

$$W_2(\hat P_1, \hat P_2) = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - y_i)^2}.$$
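
A minimal NumPy/SciPy sketch of both distance computations (function names are illustrative; this is a sketch, not the reference implementation). Note that the 1-D form assumes equal-size samples; unequal sizes would require quantile interpolation.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(mu1, cov1, mu2, cov2):
    """2-Wasserstein distance between two Gaussians, via the closed form above."""
    cov2_half = np.real(sqrtm(cov2))
    # sqrtm can return tiny imaginary parts from numerical error; discard them.
    cross = np.real(sqrtm(cov2_half @ cov1 @ cov2_half))
    w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross)
    return float(np.sqrt(max(w2_sq, 0.0)))

def w2_empirical_1d(x, y):
    """Empirical 2-Wasserstein distance between equal-size 1-D samples."""
    x, y = np.sort(x), np.sort(y)  # optimal coupling pairs sorted order statistics
    return float(np.sqrt(np.mean((x - y) ** 2)))
```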

For a feature $X$:

$$W_{\mathrm{real}}(X) = W_2(\hat P_{\mathrm{real}}(X), \hat P_{\mathrm{syn}}(X)), \qquad W_{\mathrm{noise}}(X) = \min_k W_2(\hat P_{\mathrm{noise}}^k(X), \hat P_{\mathrm{syn}}(X)).$$

The normalized similarity score is

$$S(X) = 100 \cdot \frac{W_{\mathrm{noise}}(X)}{W_{\mathrm{real}}(X) + W_{\mathrm{noise}}(X)}.$$

This yields $S(X) = 100$ when synthetic matches real, $S(X) = 0$ when synthetic matches noise, and $S(X) > 50$ when synthetic is closer to real than to noise. Features are grouped by factor: Generic (SSL embeddings), Speaker (ID embeddings), Prosody (F0/rate), and Intelligibility (ASR activations). Each factor's score $S_f$ is the unweighted average over its group, and the TTSDS2 score is the global average:

$$\mathrm{TTSDS2} = \frac{1}{4} \sum_{f \in \{\mathrm{Generic},\, \mathrm{Speaker},\, \mathrm{Prosody},\, \mathrm{Intelligibility}\}} S_f.$$
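
The normalization and aggregation steps are simple enough to state directly in code. The sketch below uses hypothetical per-feature scores to show how factor scores and the final value are combined:

```python
import numpy as np

def similarity_score(w_real: float, w_noise: float) -> float:
    """Normalized similarity S(X) in [0, 100]."""
    return 100.0 * w_noise / (w_real + w_noise)

def ttsds2(scores_by_factor: dict) -> float:
    """Average per-feature scores within each factor, then across factors."""
    return float(np.mean([np.mean(s) for s in scores_by_factor.values()]))

# Hypothetical per-feature similarity scores, grouped by factor:
scores = {
    "Generic": [91.0, 88.5, 90.2],
    "Speaker": [93.2, 89.7],
    "Prosody": [84.0, 86.1, 79.9],
    "Intelligibility": [90.4, 92.7],
}
print(f"TTSDS2 = {ttsds2(scores):.1f}")
```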

3. Feature Extraction and Computation Pipeline

Computation of TTSDS2 for a given TTS system involves five high-level steps:

  1. Feature Extraction:
    • Generic: WavLM, HuBERT/mHuBERT-147, wav2vec 2.0, XLSR-53 embeddings
    • Speaker: d-Vector, WeSpeaker embeddings
    • Prosody: PyWORLD F0 contours, HuBERT/Allosaurus speaking-rate, learned prosody embeddings
    • Intelligibility: ASR activations (wav2vec 2.0, Whisper)
  2. Reference & Noise Statistics:
    • Compute means/covariances (or 1-D sorted vectors) for each feature over large real speech corpora (e.g., YouTube, LDC) and for multiple noise datasets (uniform, Gaussian white noise, all 0s, all 1s)
  3. 2-Wasserstein Distance Computation:
    • Calculate $W_{\mathrm{real}}$ and $W_{\mathrm{noise}}$ for each feature using the closed-form expressions.
  4. Normalization:
    • Map distances to similarity scores via the formula above.
  5. Aggregation:
    • Average to obtain factor-level scores and the final TTSDS2 score in $[0,100]$.

Notably, the pipeline does not require matched transcripts and uses non-overlapping real and noise corpora. Complete implementations are publicly available [https://github.com/ttsds/ttsds].
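
As an illustration of the feature-extraction and statistics steps, the following sketch mean-pools WavLM embeddings (one of the generic SSL features listed above) and fits Gaussian summary statistics over a corpus. The mean-pooling and checkpoint choice are illustrative assumptions, not the pipeline's documented configuration:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

@torch.no_grad()
def embed(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Mean-pooled WavLM embedding for one mono 16 kHz waveform."""
    inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state   # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def gaussian_stats(wavs):
    """Mean and covariance of the embedding distribution over a corpus."""
    embs = np.stack([embed(w) for w in wavs])
    return embs.mean(axis=0), np.cov(embs, rowvar=False)
```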

4. Large-Scale Subjective Listening Test Set

Validation of TTSDS2 is based on a unified listening test dataset comprising approximately 11,000 synthetic-utterance ratings over four English domains:

Domain | Source                 | Description
Clean  | LibriTTS test split    | Clean, read audiobook speech
Noisy  | Recent LibriVox (2025) | Unfiltered, noisy audiobook speech
Wild   | YouTube (2025)         | Scraped “talk show”, “podcast”, etc.
Kids   | My Science Tutor       | Children interacting with a virtual tutor
  • 200 native English annotators (UK/US, headphone-checked, paid $10 each), 50 per domain.
  • MOS: 6 pages × 5 samples, each page contains 1 ground-truth plus 4 synthetic (1–5 scale).
  • CMOS/SMOS: 18 pairwise comparisons to ground truth (–3…+3 for CMOS, 1–5 for SMOS).
  • Public release: 11,282 synthetic-utterance ratings (excluding ground truth) [https://huggingface.co/datasets/ttsds/listening_test]; see the loading snippet below.
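
The released ratings can be pulled with the Hugging Face datasets library. The dataset ID comes from the URL above; split and column names would need to be checked against the published schema:

```python
from datasets import load_dataset

ratings = load_dataset("ttsds/listening_test")
print(ratings)  # inspect available splits and columns before use
```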

5. Multilingual, Automated Test-Set Recreation

To support robust benchmarking and prevent data leakage from TTS training corpora, TTSDS2 uses an automated, quarterly-updated, 14-language pipeline:

  1. YouTube Scraping:
    • Translate 10 English search keywords (e.g., "interview", "sports commentary") into each target language.
    • Retrieve 250 long (≥20 min), popular videos per language.
    • Apply Whisper diarization and FastText language ID to retain clean target-language segments.
  2. Utterance Extraction and Filtering:
    • Extract up to 16 single-speaker utterances from each video.
    • Apply an XNLI controversial-content filter, Pyannote to detect crosstalk, and Demucs to remove music.
  3. Pair Selection:
    • Randomly select 50 distinct speakers × 2 utterances per language (one as reference/sample, one as input for synthesis).
  4. Synthesis & Scoring:
    • Synthesize all reference–text pairs with each of the 20+ open-weight TTS systems via Replicate.com.
    • Evaluate all samples with TTSDS2.
    • Update public leaderboard quarterly [https://ttsdsbenchmark.com].

This procedure ensures continual refresh and expansion as new TTS models and languages become viable.
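
As one concrete example of the filtering stages, the fastText language-ID check in step 1 might look like the sketch below. The confidence threshold and the lid.176.bin model file are assumptions, not documented settings of the pipeline:

```python
import fasttext

# fastText's public 176-language ID model, downloaded separately.
lid = fasttext.load_model("lid.176.bin")

def keep_segment(text: str, target_lang: str, threshold: float = 0.8) -> bool:
    """Keep a transcript segment only if it is confidently in the target language."""
    labels, probs = lid.predict(text.replace("\n", " "))
    lang = labels[0].removeprefix("__label__")
    return lang == target_lang and probs[0] >= threshold

print(keep_segment("Bonjour tout le monde, bienvenue dans l'émission.", "fr"))
```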

6. Comparative Spearman-Correlation Performance

TTSDS2 was benchmarked against 15 other objective evaluation metrics across four English domains and three subjective targets (MOS, CMOS, SMOS), using system-level Spearman's $\rho$ (a computation sketch follows the table and bullets below). The summary findings are:

Metric           | Clean MOS | Noisy MOS | Wild MOS | Kids MOS | Average
TTSDS2           | 0.75      | 0.59      | 0.75     | 0.61     | 0.67
RawNet3 Speaker  | 0.36      | 0.44      | 0.85     | 0.73     | 0.60
SQUIM-MOS        | 0.68      | 0.48      | 0.62     | 0.57     | 0.57
UTMOSv2          | 0.39      | 0.34      | 0.16     | 0.05     | 0.19
FAD (CLAP)       | –0.22     | 0.45      | –0.03    | 0.12     | 0.15
STOI (intrusive) | –0.11     | –0.06     | 0.07     | –0.32    | 0.10
  • TTSDS2 is the only metric with $\rho > 0.5$ in every domain × measure combination.
  • Speaker-embedding metrics excel on SMOS but less so on naturalness (MOS/CMOS), especially in clean speech.
  • MOS-prediction deep nets collapse on out-of-domain samples.
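
System-level Spearman correlation, as used throughout this comparison, reduces to ranking one score per system against mean human ratings. A minimal sketch with hypothetical numbers:

```python
import numpy as np
from scipy.stats import spearmanr

# One entry per TTS system (hypothetical values).
metric_scores = np.array([82.1, 74.5, 91.0, 68.3, 88.7])  # e.g., TTSDS2 per system
human_mos     = np.array([3.9,  3.5,  4.2,  3.2,  4.4 ])  # mean MOS per system

rho, p = spearmanr(metric_scores, human_mos)
print(f"system-level Spearman rho = {rho:.2f} (p = {p:.3f})")
```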

7. Quarterly Updated Multilingual Benchmark

A continually updated benchmark, covering 14 languages for open-weight TTS systems (e.g., Bark, E2-TTS, F5-TTS, StyleTTS2, ParlerTTS, FishSpeech, XTTS), is maintained publicly. The "Wild" test set for each language is rebuilt quarterly following the pipeline above. Median TTSDS2 results:

Language | Median TTSDS2
English  | 95
Spanish  | 92
Mandarin | 89
French   | 90
...      | ...

Evaluation uses style-matched reference utterances and their transcripts, on single-turn utterances of 3–30s with no manual filtering beyond the pipeline. Full results and system rankings are public [https://ttsdsbenchmark.com].


TTSDS2 thus establishes a distributional, standardized, and robust approach to quantifying TTS quality. Its normalized $[0,100]$ scale, strong agreement with human scores across languages and domains, and compatibility with fully automated, leakage-resistant benchmarking make it well suited to evaluation and leaderboard publication across the field (Minixhofer et al., 24 Jun 2025).

References

Minixhofer et al. (24 Jun 2025). Text-to-Speech Distribution Score 2 (TTSDS2).
