
ProsodyEval: Prosody Evaluation Datasets

Updated 18 January 2026
  • ProsodyEval is a collection of publicly available datasets and evaluation protocols designed to assess prosodic prominence, diversity, and transfer in speech synthesis and TTS applications.
  • The prominence benchmark employs a CWT-based extraction method achieving 85.3% word-level accuracy, while the diversity benchmark introduces DS-WED to robustly measure prosody diversity in alignment with human PMOS ratings.
  • The protocols support cross-system comparisons, facilitate benchmarking of neural models including BERT-based classifiers, and advance research on expressiveness in synthetic speech.

ProsodyEval refers to multiple publicly available datasets designed to support rigorous evaluation and modeling of prosody in text and speech. These resources provide deep learning benchmarks for prosodic prominence prediction from text, objective assessment of prosody diversity in synthetic speech, and, in some literature, comprehensive prosody evaluation protocols for which "ProsodyEval" serves as an alias. The term "ProsodyEval" specifically designates (1) a large-scale, word-level prominence benchmark (Talman et al., 2019), (2) a human-judgment-centered prosody diversity benchmark (Yang et al., 24 Sep 2025), and (3) an evaluation protocol used within ADEPT ("ADEPT: A Dataset for Evaluating Prosody Transfer", Torresquintero et al., 2021). Each dataset targets a different research focus, including TTS prosody transfer, prosodic prominence detection, and the objective measurement of prosodic variation. This article systematically reviews the construction, annotation, benchmarks, metrics, and research applications of these datasets.

1. Dataset Design and Composition

The ProsodyEval prominence benchmark (Talman et al., 2019) is a large-scale dataset for predicting prosodic prominence from text. The corpus is derived from the LibriTTS "clean" subset and comprises 262.5 hours of English read speech from 1,230 speakers, spanning 159,850 sentences and 2,836,144 word tokens. Each token is automatically annotated with continuous and discrete prominence labels, enabling both binary and ternary classification settings.

Each data split is stored in CoNLL-style tab-delimited plain text, where each row corresponds to a word token with the following fields: utterance ID, word index, token, real-valued continuous prominence (from Continuous Wavelet Transform), label_3way (0/1/2), and label_2way (0/1). Sentences are separated by blank lines.
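For concreteness, a minimal reader for this file layout might look like the sketch below; the field order follows the description above, while the file path and class names are illustrative placeholders rather than an official loader.

```python
# Minimal sketch of a reader for the CoNLL-style prominence files described above.
# Field order (utterance ID, word index, token, continuous prominence, 3-way label,
# 2-way label) follows the text; the file path is a placeholder.
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class ProminenceToken:
    utt_id: str
    word_index: int
    token: str
    prominence: float   # continuous CWT-based prominence value
    label_3way: int     # 0 / 1 / 2
    label_2way: int     # 0 / 1

def read_sentences(path: str) -> Iterator[List[ProminenceToken]]:
    """Yield one sentence (a list of tokens) per blank-line-separated block."""
    sentence: List[ProminenceToken] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line terminates a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            utt_id, idx, tok, prom, l3, l2 = line.split("\t")
            sentence.append(ProminenceToken(utt_id, int(idx), tok,
                                            float(prom), int(l3), int(l2)))
    if sentence:
        yield sentence

# Example usage (path is hypothetical):
# for sent in read_sentences("train_360.txt"):
#     print([t.token for t in sent], [t.label_2way for t in sent])
```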

The ProsodyEval dataset for zero-shot TTS prosody diversity measurement (Yang et al., 24 Sep 2025) consists of 1,000 synthetic speech samples generated by seven modern TTS systems, each conditioned on texts from LibriSpeech test-clean and Seed-TTS test-en. For every text prompt, five samples are synthesized per system, yielding systematic variation via random seeds while keeping the speaker embedding and model parameters constant. Metadata for each sample includes: system identifier, corpus origin, input text, prompt speaker, seed, group ID (for rating), and file path.

The dataset is structured as pairs of samples for which human listeners rate perceived prosodic difference (PMOS) on a 1–5 scale. Associated .csv metadata and ratings files support joint analysis of objective metrics and subjective judgments.
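Joint analysis then reduces to joining sample metadata with ratings on the rating-group identifier. The sketch below assumes hypothetical column names (group_id, pmos, system); the released .csv files define the actual schema.

```python
# Sketch of joining sample metadata with PMOS ratings for joint analysis.
# Column names (group_id, pmos, system) are assumptions for illustration only.
import pandas as pd

meta = pd.read_csv("metadata.csv")      # one row per synthetic sample
ratings = pd.read_csv("ratings.csv")    # one row per rated pair and rater

# Average PMOS per rating group, then attach system information from the metadata.
group_pmos = (ratings.groupby("group_id")["pmos"].mean()
              .rename("mean_pmos").reset_index())
joined = meta.drop_duplicates("group_id").merge(group_pmos, on="group_id")

# Mean perceived prosodic difference per TTS system.
print(joined.groupby("system")["mean_pmos"].mean().sort_values())
```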

Within the ADEPT corpus, "ProsodyEval" refers to a rigorously scripted evaluation protocol for prosody transfer in TTS, using a hand-curated set of 100–105 utterances (per speaker, per class) with systematically controlled prosodic variations. These utterances are annotated with class, subcategory, and speaker identifiers, designed for discriminative listening experiments.

2. Annotation Schemes and Human Ratings

Prominence labels are generated fully automatically using a Continuous Wavelet Transform (CWT)–based acoustic prominence extraction. For each word, pitch (F₀), energy, and duration are extracted, smoothed, z-normalized, and combined into a composite signal. The CWT is then applied to obtain a multi-scale prominence curve. Discrete labels are assigned: binary (non-prominent vs. prominent) and three-way (non-prominent, somewhat prominent, very prominent), with thresholds established via the Boston University Radio News corpus.
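A rough sketch of this pipeline is shown below; the Ricker wavelet, the scale set, and the label thresholds are illustrative assumptions, not the values calibrated on the Boston corpus.

```python
# Hedged sketch of CWT-based prominence extraction: z-normalize frame-level F0,
# energy, and duration tracks, sum them into a composite signal, apply a
# Ricker-wavelet CWT across several scales, and reduce to one value per word.
import numpy as np

def zscore(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def ricker(points, a):
    """Unnormalized Mexican-hat (Ricker) wavelet sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2.0
    return (1 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt(signal, scales):
    """Continuous wavelet transform via convolution with Ricker wavelets."""
    return np.stack([np.convolve(signal, ricker(min(10 * s, len(signal)), s),
                                 mode="same") for s in scales])

def word_prominence(f0, energy, duration, word_spans, scales=(2, 4, 8, 16)):
    """f0/energy/duration: frame-level arrays; word_spans: (start, end) frame indices."""
    composite = zscore(f0) + zscore(energy) + zscore(duration)
    curve = cwt(composite, scales).mean(axis=0)   # multi-scale prominence curve
    return [float(curve[start:end].max()) for start, end in word_spans]

def to_labels(prominences, t1=0.5, t2=1.5):
    """Map continuous prominence to 3-way (0/1/2) and 2-way (0/1) labels.
    Thresholds here are placeholders; the dataset's were set on the Boston corpus."""
    three = [0 if p < t1 else (1 if p < t2 else 2) for p in prominences]
    two = [int(p >= t1) for p in prominences]
    return three, two
```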

Unlike most prosody resources, no manual prosody labeling step occurs. The automatic labels are instead validated against expert annotation: the CWT method achieves 85.3% word-level accent detection accuracy on the Boston University Radio News corpus.

For each pair of synthetic utterances with identical text and speaker style, 20 graduate TTS researchers rate the prosodic difference (PMOS) on a scale from 1 ("nearly identical") to 5 ("clearly distinct prosodic styles"). Each group of five sibling samples generates 10 unique pairs, for a total of 2,000 PMOS ratings across the corpus. Rater training and controlled listening environments are used, and groups with any alignment or synthesis error are excluded.

ADEPT leverages manual pretesting to admit only utterance classes and subcategories for which naïve listeners achieve at least 60% correct recognition in forced-choice tasks. Statistical validation uses binomial tests at 99% confidence.
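In practice this validation amounts to a one-sided binomial test against the task's chance level, roughly as sketched below; the response counts and the two-alternative chance level are placeholders, not ADEPT's actual figures.

```python
# Illustrative admission check for one prosodic subcategory: recognition must
# reach 60% and exceed chance per a binomial test at the 99% confidence level.
# Counts and the chance level are hypothetical placeholders.
from scipy.stats import binomtest

correct, total = 78, 120        # hypothetical pretest responses
chance = 0.5                    # assumed chance level for a two-alternative task

result = binomtest(correct, total, p=chance, alternative="greater")
accuracy = correct / total

admitted = accuracy >= 0.60 and result.pvalue < 0.01
print(f"accuracy={accuracy:.2f}, p={result.pvalue:.4f}, admitted={admitted}")
```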

3. Objective Metrics and Evaluation Protocols

Evaluation Benchmarks: Classification and Diversity

ProsodyEval (Talman et al., 2019) is used to benchmark text-based prominence prediction models. Tasks include binary (label_2way) and ternary (label_3way) classification; metrics are accuracy and macro-averaged F1. Models range from SVM+GloVe and CRF baselines to BiLSTM and BERT-base (fine-tuned). On the primary train-360 split (2.0M tokens), BERT achieves 83.2% (2-way) and 68.6% (3-way) accuracy. Cross-corpus validation supports comparability with expert-labeled corpora.
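The evaluation itself reduces to standard token-level classification metrics computed over the flattened word sequence; a minimal sketch, with placeholder predictions standing in for any model's output, is shown below.

```python
# Accuracy and macro-averaged F1 over flattened word-level predictions, as used
# for the 2-way and 3-way prominence tasks. The label lists are placeholders.
from sklearn.metrics import accuracy_score, f1_score

gold_3way = [0, 2, 1, 0, 0, 2]        # reference label_3way values, flattened
pred_3way = [0, 2, 0, 0, 1, 2]        # model predictions, same order

acc = accuracy_score(gold_3way, pred_3way)
macro_f1 = f1_score(gold_3way, pred_3way, average="macro")
print(f"3-way accuracy={acc:.3f}, macro-F1={macro_f1:.3f}")
```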

ProsodyEval (Yang et al., 24 Sep 2025) introduces the Discretized Speech Weighted Edit Distance (DS-WED): a weighted Levenshtein distance over semantic tokens derived from HuBERT or WavLM encoders and k-means clustering. Given two utterances' discrete token sequences $c_1, c_2$:

$$\mathrm{DS\text{-}WED}(c_1, c_2) = \min_{\pi \in \mathcal{A}(c_1, c_2)} \sum_{(i, j, o) \in \pi} w_o \, c_o\bigl(c_{1,i}, c_{2,j}\bigr)$$

where $\mathcal{A}(c_1, c_2)$ denotes the set of alignments between the two sequences and $o$ indexes the edit operation. With a higher weight on substitutions ($w_\mathrm{sub} = 1.2$) than on insertions and deletions ($w_\mathrm{ins} = w_\mathrm{del} = 1.0$), this measure demonstrates stronger correlation with PMOS than conventional log-$F_0$ RMSE or MCD.
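Operationally, DS-WED can be computed with a standard weighted edit-distance dynamic program. The sketch below assumes unit costs for insertions, deletions, and mismatching substitutions, scaled by the operation weights quoted above; it is an illustrative reading of the formula, not the reference implementation.

```python
# Weighted Levenshtein distance over discrete token sequences (e.g. k-means
# cluster indices of HuBERT/WavLM features), with a heavier substitution weight.
from typing import Sequence

def ds_wed(c1: Sequence[int], c2: Sequence[int],
           w_sub: float = 1.2, w_ins: float = 1.0, w_del: float = 1.0) -> float:
    n, m = len(c1), len(c2)
    # dp[i][j] = minimal weighted cost of aligning c1[:i] with c2[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * w_del
    for j in range(1, m + 1):
        dp[0][j] = j * w_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if c1[i - 1] == c2[j - 1] else w_sub
            dp[i][j] = min(dp[i - 1][j - 1] + sub,    # match / substitution
                           dp[i - 1][j] + w_del,      # delete from c1
                           dp[i][j - 1] + w_ins)      # insert into c1
    return dp[n][m]
```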

Recommended protocols for new TTS systems: synthesize five renditions per prompt, compute DS-WED on the 10 sample pairs, and report micro-average DS-WED and supporting acoustic metrics.
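Under those assumptions, the reporting step is a micro-average of ds_wed (as sketched above) over the C(5,2) = 10 pairs of renditions for every prompt:

```python
# Micro-averaged DS-WED over all rendition pairs of all prompts, following the
# recommended protocol; ds_wed is the sketch defined above.
from itertools import combinations

def micro_average_ds_wed(renditions_per_prompt):
    """renditions_per_prompt: list of prompts, each a list of five token sequences."""
    scores = [ds_wed(a, b)
              for renditions in renditions_per_prompt
              for a, b in combinations(renditions, 2)]
    return sum(scores) / len(scores)
```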

4. Benchmark Results and Model Comparison

Full-data BERT models significantly outperform classical baselines (SVM, CRF, BiLSTM) on both 2-way and 3-way classification tasks. Notably, BERT approaches its full performance with only 10–20% of the training data, indicating a strong inductive bias for the task. Cross-corpus validation on the Boston University Radio News corpus shows results comparable to expert annotation.

DS-WED exhibits substantially improved alignment with human PMOS ratings compared to acoustic metrics. Systematic benchmarking reveals that non-AR flow and MGM models show heavier tails for prosodic diversity. Additional analysis suggests generative paradigm (autoregressive, flow, masked generation), speaker duration control, and RL tuning all influence attainable diversity, while state-of-the-art large audio LLMs exhibit limited prosodic variation.

The ADEPT/ProsodyEval evaluation protocol establishes natural-speech benchmarks for class-specific recognition accuracy. Systems such as Ctrl-P (supervised phone-level contour modeling) outperform Tacotron-Ref (unsupervised reference encoder) in most classes, especially in syntactic phrasing and topical emphasis, approaching the recognition levels of the reference natural speech.

5. Data Access, Licensing, and Usage

Dataset/Protocol | Download/Access | License/Terms
ProsodyEval (Talman et al., 2019) | https://github.com/Helsinki-NLP/prosody | Open, citation required
ProsodyEval (Yang et al., 24 Sep 2025) | https://prosodyeval.github.io | On release, as announced
ProsodyEval protocol (ADEPT; Torresquintero et al., 2021) | https://zenodo.org/record/5117102 | CC BY 4.0

All datasets are available for academic and non-commercial research with open or CC BY licensing. Best practices include using provided protocols, reporting official metrics, and contributing implementation details for stimulus presentation and statistical analysis to the community.

6. Limitations and Future Directions

The ProsodyEval prominence benchmark (Talman et al., 2019) is limited by potential alignment noise, speaker variability, and the archaic nature of some source material (LibriTTS audiobooks read from pre-1923 public-domain texts). The evaluation is currently limited to sentence-level context: cross-sentence and dialogue effects are not represented. Planned improvements include a manually corrected held-out test subset, speaker-adaptive modeling, and the inclusion of further prosodic boundary labels (phrasing, pausing).

The ProsodyEval diversity benchmark (Yang et al., 24 Sep 2025) does not report explicit inter-rater agreement statistics, instead demonstrating reliability indirectly via human-metric correlations. The corpus does not distinguish between types of prosodic diversity (emotion, phrasing, prominence), nor does it include manual ToBI-style labeling. A plausible implication is that as zero-shot TTS progresses, future versions of ProsodyEval may integrate more explicit multi-dimensional prosody annotations and introduce task-specific sub-benchmarks.

ADEPT’s ProsodyEval protocol deliberately prunes weakly perceptible prosodic subcategories, focusing only on those with listener recognition above 60%. This restricts the evaluation set to reliably discriminable categories, potentially omitting subtle prosodic effects.

7. Research Applications and Significance

ProsodyEval datasets and protocols collectively enable standardized, reproducible evaluation for prosody-predictive models, prosody transfer in TTS, and prosody diversity in speech synthesis. They serve as training and assessment resources for neural architectures (including BERT-based sequence classifiers), facilitate objective comparisons across TTS paradigms, and ground new metrics such as DS-WED in both objective and subjective criteria. As symbolic prosody prediction remains a key bottleneck in expressive speech applications, these resources have become foundational for both ASR/NLP and TTS/SSMT communities, supporting rapid methodological progress and robust cross-lab benchmarking (Talman et al., 2019, Torresquintero et al., 2021, Yang et al., 24 Sep 2025).
