EmergentTTS-Eval Benchmark

Updated 5 November 2025
  • EmergentTTS-Eval is a systematic benchmark that assesses TTS systems on nuanced, complex linguistic tasks using automated LLM and LALM evaluation protocols.
  • It employs a two-stage prompt generation and model-as-a-judge framework to deliver fine-grained diagnostics across six evaluation axes including emotions, syntax, and code-switching.
  • The open-source framework offers scalable, reproducible, and scenario-specific analysis, aligning with responsible TTS evaluation practices.

EmergentTTS-Eval is a systematic benchmark and model-based evaluation framework designed to rigorously assess the performance of modern Text-to-Speech (TTS) systems on challenging, nuanced, and semantically complex scenarios. Developed to address the limitations of traditional intelligibility- and naturalness-centered evaluation, EmergentTTS-Eval provides a scalable, extensible, and discriminative methodology for benchmarking TTS performance in real-world, emergent use cases. The framework leverages automated prompt generation via LLMs and audio evaluation via Large Audio LLMs (LALMs) acting as model-as-a-judge, creating a reproducible standard for fine-grained TTS assessment.

1. Motivation and Rationale

Conventional TTS benchmarks and scoring protocols, such as human Mean Opinion Score (MOS), predominantly focus on clear, well-formed text and offer limited diagnostic power for emergent phenomena: nuanced prosody, emotion, syntactic complexity, code-switching, complex pronunciation, paralinguistics, and question intonation. Current MOS-based systems are known to saturate at high performance, suffer from limited cross-paper transferability, and are resource-intensive to administer at scale (Manku et al., 29 May 2025). EmergentTTS-Eval was created to fill this gap, enabling assessment on the domains where human listeners and users perceive critical performance differences between TTS systems—particularly at or near human parity.

2. Benchmark Design: Categories and Dataset Construction

EmergentTTS-Eval comprises 1,645 test cases systematically categorized along six axes:

  1. Emotions: Rendering of psychologically and contextually appropriate prosodic patterns associated with joy, sadness, surprise, anger, as well as dynamic emotional switching and intensification.
  2. Paralinguistics: Handling of onomatopoeia, interjections, emphasis (e.g., capitalization, elongation), stuttering, pacing cues (e.g., ellipsis), and other non-lexical vocal behaviors.
  3. Foreign Words: Code-switching performance and pronunciation accuracy for phrases from 15 languages interleaved in English sentences (using Latin script).
  4. Syntactic Complexity: Prosodic management and disambiguation of garden-path sentences, heavily nested clauses, phrase-level punctuation, and homographs.
  5. Complex Pronunciation: Ability to correctly produce URLs, email addresses, phone numbers, mathematical and scientific formulae, tongue-twisters, acronyms, and initialisms.
  6. Questions: Proper marking of pitch and prosody in overt, embedded, or sequential interrogative constructions.

Prompt generation employs a two-stage LLM-based pipeline: breadth expansion (increasing coverage across linguistic and structural variants) and depth expansion (iterative creation of more syntactically and prosodically demanding cases). Initial seeds (derived from BASE-TTS) are extended via GPT-4, Gemini 2.5 Pro, Claude 3.7 Sonnet, and similar models with explicit instructions to target structural, prosodic, and semantic challenges relevant to each category. Further LLM passes ensure grammaticality and plausibility of difficult cases. This process yields a deeply layered and diverse challenge suite.
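
A minimal Python sketch of this breadth-then-depth expansion is given below; the call_llm helper, prompt wording, variant counts, and seed sentence are illustrative assumptions rather than the released pipeline.

```python
# Sketch of the two-stage prompt-generation pipeline (breadth expansion
# followed by depth expansion). Everything here is a hypothetical stand-in
# for the actual implementation.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around any instruction-following LLM API."""
    raise NotImplementedError

def breadth_expand(seed: str, category: str, n_variants: int = 5) -> list[str]:
    """Ask the LLM for structurally diverse variants of a seed prompt."""
    response = call_llm(
        f"Category: {category}\nSeed: {seed}\n"
        f"Write {n_variants} new test sentences covering different linguistic "
        f"and structural variants of this challenge, one per line."
    )
    return [line.strip() for line in response.splitlines() if line.strip()]

def depth_expand(text: str, category: str, rounds: int = 2) -> str:
    """Iteratively rewrite a prompt to be more prosodically and syntactically demanding."""
    for _ in range(rounds):
        text = call_llm(
            f"Category: {category}\n"
            "Rewrite the following test sentence so it is harder for a TTS "
            "system (more demanding prosody, structure, or semantics) while "
            f"staying grammatical and plausible:\n{text}"
        )
    return text

# Illustrative seed (not taken from the benchmark itself).
seeds = {"Paralinguistics": ["He sighed... 'FINE. I'll do it myself.'"]}
suite = []
for category, category_seeds in seeds.items():
    for seed in category_seeds:
        for variant in breadth_expand(seed, category):
            suite.append({"category": category,
                          "text": depth_expand(variant, category)})
```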

Category                 Number of Prompts    Depth/Breadth Structure Included?
Emotions                 280                  Yes
Paralinguistics          280                  Yes
Foreign Words            280                  Yes
Syntactic Complexity     280                  Yes
Complex Pronunciation    240 (+ repeats)      Yes
Questions                280                  Yes

3. Evaluation Methodology: Model-as-a-Judge Protocol

Assessment within EmergentTTS-Eval is performed via a model-as-a-judge paradigm using LALMs. The process comprises the following steps (a minimal code sketch of the judging loop follows the list):

  1. System Synthesis: Each TTS system is tasked with synthesizing all prompts. For benchmarking, candidate systems are evaluated pairwise against a strong baseline (e.g., GPT-4o-mini-tts with Alloy voice).
  2. Random Assignment: System outputs are randomly mapped to evaluation slots ($T_1$, $T_2$) to avoid positional bias in the judge's decisions.
  3. Evaluation Prompting: The LALM, generally Gemini 2.5 Pro due to its domain performance, receives the reference text, category label, and detailed evaluation rubric tailored to the scenario (e.g., emotion appropriateness, prosodic shape for syntax, language accuracy, etc.), along with both synthesized audios.
  4. LALM Decision: The model provides:
    • Justification and timestamped, chain-of-thought analysis per system
    • Comparative analysis distinguishing between subtle and major differences
    • Graded category scores (0–3 scale) and a final winner label (0: tie, 1: system 1, 2: system 2)
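
The following sketch shows one pairwise judging step, with random slot assignment and mapping of the verdict back to the candidate system; judge_lalm stands in for the actual LALM call (e.g., Gemini 2.5 Pro with audio input), and its name, signature, and instruction wording are assumptions.

```python
# Sketch of a single pairwise comparison under the model-as-a-judge protocol.
# judge_lalm() and the instruction text are hypothetical placeholders.
import random

def judge_lalm(instructions: str, audio_t1: bytes, audio_t2: bytes) -> dict:
    """Hypothetical wrapper around an audio-capable judge model."""
    raise NotImplementedError

def judge_pair(text: str, category: str, rubric: str,
               candidate_audio: bytes, baseline_audio: bytes) -> float:
    # Randomly map candidate/baseline to slots T1/T2 to avoid positional bias.
    if random.random() < 0.5:
        slot_1, slot_2, candidate_slot = candidate_audio, baseline_audio, 1
    else:
        slot_1, slot_2, candidate_slot = baseline_audio, candidate_audio, 2

    verdict = judge_lalm(
        instructions=(
            f"Reference text: {text}\nCategory: {category}\nRubric: {rubric}\n"
            "Provide timestamped chain-of-thought analysis for each system, "
            "0-3 category scores, and a winner label (0: tie, 1: T1, 2: T2). "
            "Ignore acoustic artifacts unrelated to this category."
        ),
        audio_t1=slot_1,
        audio_t2=slot_2,
    )

    # Map the slot-level winner back to a candidate-vs-baseline outcome.
    if verdict["winner"] == 0:
        return 0.5  # tie
    return 1.0 if verdict["winner"] == candidate_slot else 0.0
```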

Win Rate Metric

System performance is summarized by the win rate $W(T_i)$ of system $T_i$ versus the baseline $T_j$:

$$W(T_i) = \frac{\sum (\text{winner} = \text{index}_i) + 0.5 \cdot \sum (\text{winner} = 0)}{n}$$

where $n$ is the number of pairwise comparisons. $W = 0.5$ is on par with the baseline; $W > 0.5$ indicates superiority and $W < 0.5$ inferiority.
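
In code, aggregating the per-prompt outcomes (1 for a candidate win, 0 for a loss, 0.5 for a tie, matching the formula above) reduces to a mean:

```python
def win_rate(outcomes: list[float]) -> float:
    """Win rate over n pairwise comparisons.

    Each outcome is 1.0 (candidate wins), 0.0 (baseline wins), or 0.5 (tie),
    so the mean equals (wins + 0.5 * ties) / n.
    """
    return sum(outcomes) / len(outcomes)

# Example: 3 wins, 1 tie, 1 loss over 5 prompts -> 0.7, i.e., above baseline.
print(win_rate([1.0, 1.0, 1.0, 0.5, 0.0]))
```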

Rubrics are strictly scenario-specific to ensure interpretability and to direct the model's attention to relevant competencies (e.g., accuracy of prosodic breaks for complex syntax, fluency and native-like accent for foreign words, or pitch contouring in questions). Acoustic artifacts unrelated to the core category are explicitly required to be ignored.

4. Human Alignment, Validation, and Analytic Insights

EmergentTTS-Eval includes a human evaluation phase for validation. Raters rank audio pairs synthesized by candidate systems, and agreement is quantitatively assessed (a short correlation-check sketch follows the list):

  • Spearman's $\rho \sim 0.9$ between LALM win rates and human preferences, supporting high alignment.
  • High inter-model judge reliability: Kendall's $W = 0.97$ across distinct LALMs.
  • Moderate inter-human agreement: Krippendorff's $\alpha = 0.5073$ reflects the subjective variance inherent in human listening tests.
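
For reference, the system-level Spearman correlation between LALM win rates and human preferences can be computed with SciPy; the scores below are placeholder values, not figures from the paper.

```python
from scipy.stats import spearmanr

# Placeholder per-system scores (one entry per TTS system); not the paper's data.
lalm_win_rates   = [0.09, 0.31, 0.44, 0.52, 0.61, 0.65]
human_preference = [0.12, 0.33, 0.28, 0.50, 0.58, 0.66]

rho, p_value = spearmanr(lalm_win_rates, human_preference)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```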

Findings reveal that LALM-based evaluation is both more scalable and consistent than exhaustive subjective listening. Cost, latency, and rater bias issues of classical MOS are substantially mitigated.

Diagnostic Power

The framework exposes several systematic failure modes:

  • Open-source models demonstrate flat intonation, error-prone code-switching (anglicized pronunciations), and limited expressive range.
  • Commercial models occasionally falter on long or heavily code-switched prompts, and tongue-twisters remain a pervasive challenge across all systems.

Performance sensitivity to synthesis voice selection was observed in emotional and paralinguistic categories, implying the need for scenario-tailored training or fine-tuning.

5. Automation, Extensibility, and Open-Source Tooling

All prompt generation, evaluation, and scoring pipelines are modular and scriptable, supporting rapid extension to new languages, categories, or challenge types. Both the dataset and evaluation codebase are open-sourced (Hugging Face dataset, GitHub code). Prompt templates, rubrics, and guidance for both generation and LALM-based evaluation are provided for expert customization and reproducibility.

Resource            Description                     Link
Benchmark dataset   1,645 multi-scenario prompts    https://huggingface.co/datasets/bosonai/EmergentTTS-Eval
Evaluation code     Full pipeline scripts           https://github.com/boson-ai/EmergentTTS-Eval-public
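
For reference, the dataset listed above can presumably be loaded directly from the Hugging Face Hub with the datasets library; the split and column names here are assumptions and should be checked against the dataset card.

```python
from datasets import load_dataset

# Repository ID taken from the link above; the split name is an assumption.
ds = load_dataset("bosonai/EmergentTTS-Eval", split="train")

print(len(ds))   # expected on the order of 1,645 test cases
print(ds[0])     # inspect one test case (text, category, etc.; field names may differ)
```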

Automation supports large-scale, reliable evaluation without the scaling issues of human listening panels, and enables continual adaptation to emergent capabilities or linguistic phenomena.

6. Performance Results and Research Implications

EmergentTTS-Eval revealed win rates ranging from approximately 9% (weakest open-source) to 65% (best proprietary models). Notably, best-in-class models (e.g., OpenAI GPT-4o-audio Ballad) achieve 88.84% win rate for Emotions and 82.14% for Paralinguistics, with robust performance in other categories, whereas open-source models trail on foreign words and complex pronunciation tasks.

Prompt-specific improvements, such as LLM-guided text normalization, yield significant gains in challenging domains (e.g., complex technical or non-standard entities), confirming the sensitivity of TTS models to input preprocessing. Notably, MOS ratings do not consistently track win rates on highly emergent challenges, underscoring the need for scenario- and category-aware evaluation.
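
As an illustration of such LLM-guided text normalization, a generic preprocessing pass might expand non-standard entities into speakable form before synthesis; the call_llm helper and prompt wording are hypothetical, not the paper's recipe.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around any instruction-following LLM API."""
    raise NotImplementedError

def normalize_for_tts(text: str) -> str:
    """Expand URLs, numbers, formulae, and acronyms into speakable words."""
    return call_llm(
        "Rewrite the following text so a TTS system can read it aloud "
        "naturally: spell out URLs, email addresses, phone numbers, units, "
        "and formulae as words, and expand uncommon acronyms. Preserve the "
        "meaning and order of the content.\n\n" + text
    )

# Illustrative input (not a benchmark prompt):
# normalize_for_tts("Visit https://example.com or call +1-555-0100 about the H2O results.")
```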

A plausible implication is that the combination of scenario-based rubrics and model-judge methods will facilitate research into targeted model improvements (e.g., code-switching robustness, expressive prosody, or complex entity reading), as well as the rapid adoption of new evaluation scenarios within industrial and open-source TTS development.

7. Context in the Responsible TTS Evaluation Landscape

EmergentTTS-Eval aligns with and directly implements several principles highlighted in responsible evaluation frameworks (Yang et al., 8 Oct 2025): multi-level, interpretable, and scenario-specific assessment; model-based judge protocols for transferability and cross-model comparison; full transparency in test definition and score computation; and category-specific analysis to support accountability and performance diagnostics. Its approach addresses both the technical and practical requirements of next-generation TTS benchmarking, as well as ethical mandates for robustness, comparability, and inclusivity in evaluation.

Summary Table: EmergentTTS-Eval Key Features

Feature                Description
Categories             Six (Emotions, Paralinguistics, Foreign Words, Syntactic Complexity, Complex Pronunciation, Questions)
Dataset Size           1,645 multi-depth prompts
Test Case Generation   Automated LLM-based (breadth & depth expansion)
Evaluation Protocol    LALM model-as-a-judge, scenario-specific rubrics, win rates
Human Alignment        High (Spearman's $\rho \sim 0.9$ with human judgments)
Open Source            Dataset and code published, modular extensibility

EmergentTTS-Eval establishes a reproducible, extensible benchmark paradigm for TTS, with fine-grained analytics for complex, real-world speech synthesis scenarios, scalable automation for research and deployment, and empirical grounding through systematic human correlation and cross-model transferability (Manku et al., 29 May 2025).
