Expressive Speech Corpora Overview
- Expressive speech corpora are curated datasets that capture rich prosody, emotional nuances, and nonverbal vocalizations for detailed speech analysis.
- They employ diverse methodologies including precise acoustic, textual, and paralinguistic annotations to support advanced TTS and ASR applications.
- These resources are essential for benchmarking speech synthesis and recognition systems using both objective metrics and subjective listening evaluations.
Expressive speech corpora are datasets explicitly constructed or curated to capture the rich variability, prosody, emotional coloring, nonverbal vocalizations, and contextual dynamism of human speech beyond its lexical content. Unlike conventional read or neutral speech corpora, expressive corpora systematically incorporate phenomena such as spontaneity, disfluencies, dialogue, emphasis, paralinguistic cues, emotion, character mimicry, and detailed textual/acoustic context. These resources form the empirical foundation for state-of-the-art research in Text-to-Speech (TTS), Automatic Speech Recognition (ASR), prosody modeling, emotion and style transfer, and multimodal dialogue systems.
1. Defining Expressive Speech Corpora
Expressive speech corpora are distinguished by their focus on capturing prosodic richness (pitch, energy, rhythm variability), emotional state (happy, sad, angry, sarcastic, etc.), spontaneous stylistic shifts, nonverbal vocalizations (NVs; e.g., laughter, sighs, stuttering), and context-dependent delivery. They may target:
- Narrative storytelling (e.g., EMNS (Noriy et al., 2023))
- Multispeaker conversational interaction (e.g., Expresso (Nguyen et al., 2023), CORAA (Junior et al., 2021))
- Expressive character utterances for zero-shot synthesis (e.g., LibriQuote (Michel et al., 4 Sep 2025))
- Native and non-native pronunciation across expressive contexts (e.g., speechocean762 (Zhang et al., 2021))
- Paralinguistic annotation and spontaneous phenomena (e.g., DisfluencySpeech (Wang et al., 13 Jun 2024), NonverbalTTS (Borisov et al., 17 Jul 2025))
Key properties:
- Rich metadata (speaker demographics, emotion/politeness labels, context, scene descriptions)
- Multi-level transcriptions—sometimes with aligned nonverbal annotation
- High acoustic variability (e.g., pitch standard deviation in StoryTTS (Liu et al., 23 Apr 2024), diversity in creative/imagined content in ÌròyìnSpeech (Ogunremi et al., 2023)); a sketch of one such statistic follows this list
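Where pitch-variability statistics are reported (as for StoryTTS above), they can be computed directly from the audio. Below is a minimal sketch assuming librosa is available; the file path and F0 bounds are illustrative assumptions, not taken from any corpus.

```python
import numpy as np
import librosa

def log_f0_std(wav_path: str, fmin: float = 65.0, fmax: float = 600.0) -> float:
    """Standard deviation of log F0 over voiced frames (higher = more variable)."""
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    f0 = f0[voiced & ~np.isnan(f0)]  # keep voiced frames with a defined F0
    return float(np.std(np.log(f0))) if f0.size else 0.0

# Expressive utterances should score markedly higher than read/neutral speech.
print(log_f0_std("utterance_0001.wav"))  # path is illustrative
```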
2. Methodologies for Construction and Annotation
Construction methodologies typically involve participatory or crowd-sourced recording, in-studio high-fidelity sessions, or automated extraction/annotation from existing content. Annotation frameworks vary in granularity and semantic richness:
Annotation Levels:
- Acoustic expressiveness: Detailed labeling of pitch, energy, rhythm, and prosodic events (as in StoryTTS (Liu et al., 23 Apr 2024)—log F0 RMSE, Mel Cepstral Distortion).
- Textual expressiveness: Dimensions such as rhetorical devices, sentence patterns, scene, character imitation, and emotional color (StoryTTS), or pseudo-linguistic cues like verbs/adverbs (LibriQuote).
- Nonverbal events: Explicit annotation of NVs (breath, laughter, cough, sigh; NonverbalTTS), paralinguistic content (DisfluencySpeech), and spontaneous speech phenomena (Lahjoita puhetta—“.laugh”, “.cough”, as well as filled pauses and truncations). An illustrative record combining these layers is sketched after this list.
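To make the layering concrete, here is a hypothetical multi-level annotation record combining acoustic, textual, and nonverbal layers; every field name and value is invented for exposition and does not reproduce the schema of any specific corpus.

```python
# Hypothetical multi-level annotation record (illustrative only).
utterance = {
    "audio": "story_ep12_0387.wav",
    "text": "Well... I never expected that!",
    "nonverbal_events": [{"type": "laughter", "start_s": 0.62, "end_s": 1.10}],
    "disfluencies": [{"type": "filled_pause", "token": "Well", "index": 0}],
    "acoustic": {"log_f0_std": 0.31, "energy_db_range": 24.5},
    "textual": {
        "emotion": "surprise",
        "rhetorical_device": "exclamation",
        "character_imitation": False,
    },
    "speaker": {"id": "spk_042", "gender": "F", "age_band": "30-39"},
}
```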
Automated and Human-Machine Pipelines:
- ASR for transcript generation from raw audio (VoxCeleb subset in NonverbalTTS, Whisper ASR in SPIRE-SIES (Singh et al., 2023))
- Deep models for NV and emotion detection/classification (BEATs, emotion2vec+, Canary)
- Merge and majority-voting fusion algorithms to synthesize multiple annotation streams (NonverbalTTS); a minimal fusion sketch follows this list
- LLMs for batch textual expressiveness annotation (StoryTTS: GPT-4, Claude2, few-shot learning guidance)
- Dedicated web apps and collection tools with remote annotation capability and session tracking (EMNS, SPIRE-SIES, EXPRESSO)
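A minimal sketch of the majority-voting fusion idea, assuming label streams from several annotators/models are already aligned and of equal length; the tag set and tie-breaking rule are assumptions for illustration, not the published NonverbalTTS algorithm.

```python
from collections import Counter

def fuse_labels(streams: list[list[str]]) -> list[str]:
    """Fuse aligned per-token labels from several annotation streams.

    Majority vote per position; ties fall back to the first stream's label.
    """
    fused = []
    for labels in zip(*streams):
        top, count = Counter(labels).most_common(1)[0]
        fused.append(top if count > 1 else labels[0])  # tie fallback
    return fused

asr      = ["speech", "speech",   "laughter", "speech"]
nv_model = ["speech", "laughter", "laughter", "speech"]
human    = ["speech", "speech",   "laughter", "breath"]
print(fuse_labels([asr, nv_model, human]))
# ['speech', 'speech', 'laughter', 'speech']
```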
| Corpus | Speech Type | Annotation Modalities |
|---|---|---|
| LibriQuote (Michel et al., 4 Sep 2025) | Audiobook narrative + character utterances | Speaker context, pseudo-labels |
| EXPRESSO (Nguyen et al., 2023) | Read + spontaneous expressive dialogues | 26 style classes, NVs |
| StoryTTS (Liu et al., 23 Apr 2024) | Highly prosodic Mandarin storytelling | 5-dim textual expressiveness |
| NonverbalTTS (Borisov et al., 17 Jul 2025) | Multi-source English with NVs/emotions | 10 NV types, 8 emotions |
3. Benchmarking and Evaluation Protocols
Expressive speech corpora facilitate rigorous benchmarking using objective and subjective metrics:
- Objective measures: Mel Cepstral Distortion (MCD), Word/Character Error Rate (WER/CER), log F0 RMSE, ABX discrimination (phonetic contrasts), speaker similarity (SIM-o), emotion similarity (EMO-SIM), F0 Frame Error (FFE), and Jaccard distances for NV fidelity; two of these are sketched after this list.
- Subjective measures: Mean Opinion Score (MOS) for naturalness/expressiveness, Comparative MOS (CMOS), expressive style classification accuracy (as in EMNS listening tests), and human preference testing against baseline systems (NonverbalTTS vs CosyVoice2).
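Two of the objective measures above can be sketched in a few lines. The snippet assumes the reference and synthesized feature sequences are already time-aligned (e.g., via DTW); the MCD constant follows the standard dB convention.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref: np.ndarray, mc_syn: np.ndarray) -> float:
    """MCD in dB between aligned mel-cepstra (frames x coeffs), excluding c0."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def log_f0_rmse(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
    """RMSE of log F0 over frames voiced in both reference and synthesis."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    err = np.log(f0_ref[voiced]) - np.log(f0_syn[voiced])
    return float(np.sqrt(np.mean(err ** 2)))
```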
Experimental setups frequently implement cross-domain or cross-sentence paradigms (such as LibriQuote’s “neutral reference for timbre, expressive target generation” paradigm), ablation studies (NV and emotion removed, NonverbalTTS), or multi-level transcript evaluation (DisfluencySpeech’s A/B/C transcript variants).
4. Technical Architectures and Modeling Advances
Modern expressive corpus research is tightly coupled with advances in TTS, ASR, and representation learning:
- Seq2seq and self-attention architectures: For prosody mining, layer-wise aggregation, and manipulation of sentential context (e.g., SAN-based context extractor in (Yang et al., 2020), weighted aggregation via multi-head attention).
- VQ-based and textless synthesis: Discrete speech unit resynthesis using SSL models (HuBERT, Encodec, PNMI quantification (Nguyen et al., 2023)), enabling control over expressivity beyond the text domain; a unit-extraction sketch follows this list.
- Diffusion models for prosody diversity: Sampling multiple contextually plausible prosody embeddings from conversational history (DiffCSS (Wu et al., 27 Feb 2025)); alignment between prosody, speaker, and context is refined via cross-attention and iterative denoising.
- Hybrid modeling: Fairseq S² Transformer-autoregressive TTS trained on expressive transcript levels (DisfluencySpeech), vocoder integration (HiFi-GAN, WORLD vocoder), and end-to-end modeling (VQTTS with expressiveness encoder—StoryTTS).
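To illustrate the discrete-unit idea behind textless resynthesis, here is a minimal sketch: self-supervised HuBERT features (via torchaudio) are quantized with k-means into pseudo-phonetic units. The layer index, cluster count, and toy file list are illustrative assumptions; production systems such as those evaluated on Expresso train far larger codebooks on much more data.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def hubert_features(wav_path: str, layer: int = 6) -> torch.Tensor:
    """Frame-level SSL features (frames x dim) from the chosen HuBERT layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(0, keepdim=True)  # downmix to mono
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.inference_mode():
        feats, _ = model.extract_features(wav, num_layers=layer)
    return feats[-1].squeeze(0)

# Fit a small codebook on a few utterances (real systems use far more data).
train = torch.cat([hubert_features(p) for p in ["a.wav", "b.wav"]]).numpy()
kmeans = KMeans(n_clusters=100, n_init="auto").fit(train)

# Discrete unit sequence: the input representation for unit-based resynthesis.
units = kmeans.predict(hubert_features("expressive.wav").numpy())
print(units[:20])
```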
5. Diversity, Domains, and Cross-Linguistic Coverage
Expressive corpora now span a broad linguistic, demographic, and content range:
- Languages: English (LibriQuote, EXPRESSO, NonverbalTTS, DisfluencySpeech), Yoruba (ÌròyìnSpeech), Finnish (Lahjoita puhetta), Mandarin (StoryTTS), Japanese (CALLS).
- Speaker Diversity: Ranging from single-speaker curated TTS corpora (EMNS, DisfluencySpeech) to thousands of speakers for large-scale ASR/expressive analysis (Lahjoita puhetta, CORAA, SPIRE-SIES).
- Domain and style coverage: News, creative writing, spontaneous phone dialogue and empathetic customer-center speech (CALLS), narrative storytelling (StoryTTS, LibriQuote), gaming/media (EMNS), and informal conversational style (DisfluencySpeech).
- Non-native and dialectal variation: speechocean762 (Mandarin L1, English L2 pronunciation), SPIRE-SIES (12 Indian nativities), dialect metadata in Lahjoita puhetta.
6. Applications and Future Directions
Expressive speech corpora advance the state-of-the-art in TTS, ASR, voice assistants, conversational agents, and multimodal frameworks:
- Expressive TTS: Synthesis models trained on expressive corpora can generate speech mimicking nuanced prosody, emotion, nonverbal cues, style, and contextual adaptation (StoryTTS: MOS = 4.09 with all labels; NonverbalTTS matches CosyVoice2 in speaker/NV fidelity).
- Emotion and style transfer: Detailed annotations enable precise control of expressive features for style/emotion transfer tasks (pseudo-labels in LibriQuote, multidimensional textual expressiveness in StoryTTS).
- Conversational AI: Multimodal synthetic agents leverage expressive datasets for contextually coherent, variable dialogue delivery (DiffCSS: MOS 3.602 for expressiveness).
- Pronunciation and language learning: Multilevel annotations (sentence/word/phoneme) and native/non-native contrast (speechocean762) support computer-assisted training (CAPT).
- Bias and diversity mitigation: Rich metadata analysis (gender/age/dialect in Lahjoita puhetta) informs strategies to improve model fairness and accuracy on underrepresented groups.
- Technical inclusivity: Participatory approaches in ÌròyìnSpeech and open-access datasets (NonverbalTTS, People’s Speech) broaden linguistic coverage and democratize expressive corpus construction.
- Novel modeling research: Future directions include scaling annotation frameworks with LLMs, advances in prosody disentanglement, enhanced adaptation techniques, and integration with cross-modal stimuli (image-to-speech with SPIRE-SIES).
7. Limitations and Challenges
Expressive corpora face intrinsic challenges:
- Annotation reliability and semantic granularity, requiring human validation pipelines and fusion algorithms (e.g., for NV/emotion labels in NonverbalTTS).
- Balancing domain-specific expressiveness with data scale (multi-domain adaptation in CALLS/STUDIES).
- Complexity of capturing spontaneous and paralinguistic variation, and of managing tradeoffs among synthesis quality, fidelity, and controllability (Expresso unit-based resynthesis, DiffCSS diversity–coherence balance).
- Underrepresentation of low-resource languages, a critical bottleneck addressed by participatory models and open licensing (ÌròyìnSpeech, Lahjoita puhetta).
Conclusion
Expressive speech corpora constitute a pivotal resource for research and development in speech technologies seeking to transcend neutral, read, and mechanistic utterance generation. As illustrated across diverse datasets—LibriQuote, Expresso, NonverbalTTS, StoryTTS, DisfluencySpeech, ÌròyìnSpeech, and others—these corpora are characterized by multi-layered annotation, domain breadth, emotional and stylistic nuance, advanced modeling pipelines, and rigorous benchmarking. Ongoing research is directed toward expanding linguistic and expressive diversity, refining annotation and synthesis techniques, and enabling widespread access for academic and industrial progress.