LibriQuote: Expressive Speech Dataset
- LibriQuote is a large-scale dataset featuring over 18,000 hours of English speech split into neutral narration and expressive character quotations.
- It includes contextual annotations and LLM-derived pseudo-labels that capture prosodic cues, supporting the modeling of emotion, accent, and naturalness.
- The dataset supports robust TTS benchmarking through both objective metrics like WER and MCD and subjective evaluations such as MOS for expressive speech synthesis.
LibriQuote is a large-scale, English speech dataset specifically structured for advancing expressive, zero-shot text-to-speech (TTS) synthesis. Derived from LibriVox audiobook recordings, LibriQuote isolates character quotations—a source of expressive, emotional prosody—from non-expressive narration, and further enriches these samples with contextual and linguistic annotations. The resource is designed to both fine-tune and benchmark TTS systems on expressive tasks, emphasizing the modeling of emotion, accent, and naturalness in synthesized speech (Michel et al., 4 Sep 2025).
1. Dataset Organization and Composition
LibriQuote is partitioned into a training set containing 12,723 hours of non-expressive narration and 5,359 hours of expressive speech, totaling over 18,000 hours. Non-expressive narration consists primarily of neutral reading, while expressive utterances are sourced from fictional character dialogue within the audiobooks. Each quotation is meticulously aligned with its original textual and audio context, distinguishing LibriQuote from standard audiobook datasets.
| Subset | Speech Type | Hours |
|---|---|---|
| Neutral | Non-expressive | 12,723 |
| Expressive | Character quotations | 5,359 |
This division enables explicit modeling of expressive and non-expressive speech phenomena and allows quantitative analysis of their impact on TTS output quality.
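For concreteness, a single expressive sample could be represented as a record like the one below. The field names are hypothetical and merely illustrate the kind of metadata described in this and the following section, not the dataset's actual schema.

```python
# Hypothetical record layout for one expressive LibriQuote sample.
# Field names are illustrative, not the dataset's actual schema.
sample = {
    "book_id": "librivox_00123",            # source LibriVox recording
    "speaker_id": "spk_042",                # reader identity
    "quotation_text": "Not another step!",  # the expressive character quote
    "context": "...",                       # ~100 words surrounding the quote
    "audio_path": "audio/spk_042/utt_7.flac",
    "speech_verb": "shouted",               # LLM pseudo-label (may be empty)
    "adverb": "angrily",                    # LLM pseudo-label (may be empty)
    "is_expressive": True,                  # quotation vs. neutral narration
}
```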
2. Expressive Speech Annotation and Contextualization
Expressive utterances in LibriQuote are extensively annotated. Notably, each expressive quotation is accompanied by a contextual window—typically ∼100 words drawn from the original text surrounding the quotation, bounded by paragraph limits. This provides both prosodic and semantic context for downstream modeling.
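A minimal sketch of such windowing, assuming the book text uses blank lines as paragraph breaks and the quotation's character span is known; the exact rules used for LibriQuote may differ:

```python
def context_window(text: str, quote_start: int, quote_end: int,
                   max_words: int = 100):
    """Collect up to ~max_words of text around a quotation, never crossing
    the enclosing paragraph's boundaries (blank-line breaks)."""
    para_start = text.rfind("\n\n", 0, quote_start)
    para_start = 0 if para_start == -1 else para_start + 2
    para_end = text.find("\n\n", quote_end)
    para_end = len(text) if para_end == -1 else para_end

    half = max_words // 2
    left = text[para_start:quote_start].split()[-half:]   # words before quote
    right = text[quote_end:para_end].split()[:half]       # words after quote
    return " ".join(left), " ".join(right)
```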
Pseudo-labels are generated for each expressive utterance using LLMs (e.g., Phi-4 and a filtered variant Phi-4₍conf₎). These labels capture speech verbs and adverbs from narrative cues, encoding prosodic intent (for example, “he whispered softly” yields “whispered”—a speech verb—and “softly”—an adverb). Such annotations function as an interpretable expressivity signal, guiding both supervised and self-supervised modeling of expressive prosody.
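The paper derives these labels with LLMs; the rule-based sketch below only approximates the same extraction on a narrative cue, with an illustrative stand-in for the curated speech-verb list:

```python
import re

# Illustrative subset; the actual labels come from Phi-4 prompts.
SPEECH_VERBS = {"said", "whispered", "shouted", "muttered", "cried", "asked"}

def pseudo_label(narrative_cue: str):
    """Extract (speech_verb, adverb) from a cue like 'he whispered softly'.
    Empty strings mean no cue word was found."""
    tokens = re.findall(r"[a-z']+", narrative_cue.lower())
    verb = next((t for t in tokens if t in SPEECH_VERBS), "")
    adverb = next((t for t in tokens if t.endswith("ly")), "")
    return verb, adverb

print(pseudo_label("he whispered softly"))  # ('whispered', 'softly')
```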
3. Test Set Design and Benchmarking Protocol
The LibriQuote test set comprises 7.5 hours of expressive quotations from 15 speakers, all strictly unseen in standard training corpora such as LibriSpeech. The primary evaluation protocol is a “cross-sentence” challenge: a neutral reference utterance from the same speaker, drawn from outside the expressive set, serves as the prompt, and the system must synthesize a target quotation in an expressive style while preserving the reference timbre.
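In pseudocode, the protocol looks like the following; `tts.synthesize` is a placeholder for whatever zero-shot TTS interface is under evaluation, not an API from the paper:

```python
def cross_sentence_eval(tts, test_items, neutral_prompts):
    """Cross-sentence protocol: clone timbre from a neutral reference of the
    same speaker, then synthesize the expressive target quotation."""
    outputs = []
    for item in test_items:
        # Neutral reference drawn from outside the expressive set.
        ref_audio = neutral_prompts[item["speaker_id"]]
        wav = tts.synthesize(text=item["quotation_text"], reference=ref_audio)
        outputs.append((item, wav))
    return outputs  # scored with the metrics in Section 4
```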
Qualitative t-SNE analysis of latent emotion and accent representations in the test set demonstrates coverage of diverse emotional states and a broad spectrum of accented speech. This ensures the dataset functions as a robust benchmark for expressive speech synthesis, expanding the evaluation beyond intelligibility into fine-grained expressivity and timbral fidelity.
4. Objective and Subjective Evaluation Metrics
LibriQuote supports comprehensive evaluation of TTS models using both objective and subjective metrics. Objective metrics include the following (two of them are sketched in code after the list):
- Word Error Rate (WER): Measures speech intelligibility.
- Speaker Similarity (SIM-O): Cosine-based metric for timbral preservation.
- Mel Cepstral Distortion (MCD): Quantifies spectral accuracy.
- F0 Pearson Correlation (FPC): Captures prosodic alignment.
- Accent and Emotion Similarity: Assessed via latent embedding comparisons.
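As a concrete reference, SIM-O and FPC reduce to a cosine similarity and a Pearson correlation, respectively. The sketch below assumes speaker embeddings and time-aligned F0 contours have already been extracted; the extractors themselves are out of scope:

```python
import numpy as np

def sim_o(emb_ref: np.ndarray, emb_syn: np.ndarray) -> float:
    """Speaker similarity (SIM-O): cosine similarity between the speaker
    embeddings of the reference and the synthesized utterance."""
    return float(np.dot(emb_ref, emb_syn)
                 / (np.linalg.norm(emb_ref) * np.linalg.norm(emb_syn)))

def fpc(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
    """F0 Pearson Correlation (FPC) over frames voiced in both signals."""
    voiced = (f0_ref > 0) & (f0_syn > 0)  # unvoiced frames carry F0 = 0
    return float(np.corrcoef(f0_ref[voiced], f0_syn[voiced])[0, 1])
```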
Subjective evaluations comprise Mean Opinion Score (MOS) for naturalness and Comparative MOS (CMOS) tests for expressivity. Fine-tuning baseline TTS systems on LibriQuote yields lower WER, higher FPC, and lower MCD, indicating more intelligible and prosodically accurate synthesis. However, subjective ratings reveal that model outputs still lag behind ground-truth quotations in perceived expressivity and naturalness.
5. Data Efficiency via Expressive Subset Filtering
Experiments with a filtered expressive subset (denoted Q_f), containing only quotations with non-empty adverb pseudo-labels or speech verbs from a curated list, demonstrate comparable improvements in intelligibility and expressivity using only ~10% of the expressive fine-tuning data. This suggests that targeted, label-rich expressive samples can yield high data efficiency for model development—a notable result for scaling expressive TTS in resource-constrained scenarios.
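The Q_f membership test itself is simple to express. A sketch, assuming each sample carries the pseudo-labels described in Section 2 (CURATED_VERBS stands in for the paper's curated list):

```python
# Illustrative stand-in for the paper's curated speech-verb list.
CURATED_VERBS = {"whispered", "shouted", "cried", "murmured", "exclaimed"}

def in_filtered_subset(sample: dict) -> bool:
    """Q_f membership: a non-empty adverb pseudo-label, or a speech verb
    drawn from the curated list."""
    return bool(sample["adverb"]) or sample["speech_verb"] in CURATED_VERBS

# expressive_qf = [s for s in expressive_train if in_filtered_subset(s)]
```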
6. Alignment Methodology and Technical Resources
Text-to-audio alignment is critical for leveraging LibriQuote in TTS research. The alignment pipeline first matches chains of word-index pairs between automatic speech recognition (ASR) transcriptions and the original book text, followed by concatenated Levenshtein alignment for increased granularity. These alignment tokens enable precise mapping from text to expressive utterance, facilitating time-indexed prosody modeling.
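A word-level approximation of this alignment can be written with Python's difflib, a longest-matching-blocks matcher rather than the paper's concatenated Levenshtein procedure:

```python
import re
from difflib import SequenceMatcher

def align_words(asr_words: list, book_words: list):
    """Return (asr_index, book_index) pairs for words matched between an
    ASR transcript and the original book text."""
    matcher = SequenceMatcher(a=asr_words, b=book_words, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        pairs.extend((block.a + k, block.b + k) for k in range(block.size))
    return pairs

asr = "not another step he whispered softly".split()
book = re.findall(r"[a-z']+", "“Not another step!” he whispered, softly.".lower())
print(align_words(asr, book))  # [(0, 0), (1, 1), ..., (5, 5)]
```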
Researchers are provided with train/dev/test splits and full evaluation code for reproducibility. Audio samples, annotations, and further technical details are available at https://libriquote.github.io/ (Michel et al., 4 Sep 2025). The dataset’s design, particularly its separation of expressive and non-expressive speech, detailed contextual metadata, and word-level timing, supports advanced modeling for expressive, zero-shot, and cross-speaker TTS.
7. Implications and Applications
LibriQuote enables both fine-tuning and benchmarking of TTS systems on expressive and emotional speech synthesis tasks. Its scale and annotation depth surpass prior expressive corpora, which have generally been limited to small benchmarking sets. Applications include:
- Improving expressive dialogue in audiobook and virtual agent speech synthesis.
- Benchmarking zero-shot TTS systems on tasks requiring both expressivity and timbral continuity.
- Research in emotion, accent, and style-transfer modeling for speech synthesis.
A plausible implication is that explicit context and prosodic labels, combined with large-scale expressive data, may facilitate the development of models capable of synthesizing speech closer to human expressivity—a gap persisting in current state-of-the-art systems, as revealed by comparative MOS and intelligibility assessments in the LibriQuote benchmark.
In summary, LibriQuote constitutes a critical resource for expressive speech modeling, offering robust data partitioning, annotated context, challenging test benchmarks, and rigorous evaluation paradigms for the advancement of expressive, zero-shot TTS technologies (Michel et al., 4 Sep 2025).