Daisy-TTS: Emotional Text-to-Speech System
- The paper introduces Daisy-TTS, an emotional TTS system that leverages a prosody encoder to learn emotionally-separable embeddings for controlled speech synthesis.
- It employs a diffusion-based decoder with PCA-driven manipulation, enabling flexible control over primary, mixed, intensity-scaled, and polarity-inverted emotions.
- Empirical evaluations using MOS and emotion accuracy metrics demonstrate that Daisy-TTS outperforms prior methods in naturalness and emotion recognizability.
Daisy-TTS is an emotional text-to-speech (TTS) system designed to simulate a wide spectrum of emotions, including primary, secondary (mixed), intensity-scaled, and polarity-inverted emotions, grounded in the structural model of emotions. Central to Daisy-TTS is a prosody encoder that learns emotionally-separable prosody embeddings, enabling direct and principled manipulation of emotional expression in synthesized speech. The system introduces new mathematical and architectural approaches to emotion representation and control, yielding flexible and expressive synthesis with demonstrated empirical advantages over prior methods (Chevi et al., 2024).
1. System Overview and Architectural Components
Daisy-TTS employs a multi-component neural architecture integrating the following modules:
- Prosody Encoder: Ingests the non-lexical features (mel-spectrogram, pitch contour, and energy contour) of an exemplar utterance and aggregates them temporally into a fixed-size vector $e$ representing the utterance's prosodic and emotional content.
- Text Encoder: Based on the Grad-TTS text encoder, it processes phoneme inputs into hidden features, which are modulated via Feature-wise Linear Modulation (FiLM) conditioning on $e$, thereby injecting prosodic and emotional control into the linguistic representation (see the FiLM sketch after this list).
- Diffusion-Based Decoder: Utilizes the Grad-TTS U-Net decoder, which iteratively denoises Gaussian noise into the target mel-spectrogram, guided by the output of the FiLM-conditioned text encoder.
During both training and synthesis, the prosody encoder and the Grad-TTS backbone operate end-to-end. At inference, $e$ can be manipulated directly via its learned decomposition rather than extracted from an exemplar utterance.
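To make the FiLM conditioning concrete, the following PyTorch sketch shows one way the prosody embedding $e$ could modulate the text encoder's hidden features. This is a minimal sketch: the module name, dimensions, and wiring are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts text-encoder
    hidden features using parameters predicted from a prosody embedding.
    Dimensions are illustrative assumptions, not taken from the paper."""

    def __init__(self, prosody_dim: int = 128, hidden_dim: int = 192):
        super().__init__()
        # Predict a per-channel scale (gamma) and shift (beta) from e.
        self.to_gamma_beta = nn.Linear(prosody_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden_dim) text-encoder hidden features
        # e: (batch, prosody_dim) fixed-size prosody embedding
        gamma, beta = self.to_gamma_beta(e).chunk(2, dim=-1)
        # Broadcast over the time axis: h' = gamma * h + beta
        return gamma.unsqueeze(1) * h + beta.unsqueeze(1)
```

The same embedding $e$ thus conditions every time step of the linguistic representation, which is what lets a single prosody vector steer the emotional rendering of the whole utterance.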
2. Prosody Embedding Learning and Emotional Separability
The prosody encoder is explicitly optimized to produce embeddings that are emotionally distinguishable while discarding irrelevant information such as speaker identity and lexical content. This is achieved via an auxiliary classifier, attached after the encoder's final pooling layer, which predicts the utterance's primary emotion label from $e$. The classifier's cross-entropy loss is back-propagated through a gradient-reversal layer, driving the encoder to filter out non-emotional cues; a minimal sketch of the mechanism follows.
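The gradient-reversal layer itself is the standard construction (identity in the forward pass, negated gradient in the backward pass); the sketch below pairs it with a linear classifier head over $e$. The head architecture, dimensions, and reversal strength are illustrative assumptions.

```python
import torch
from torch import nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies incoming gradients
    by -lambda in the backward pass (standard gradient reversal)."""

    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the encoder.
        return -ctx.lambd * grad_output, None

class AuxiliaryEmotionClassifier(nn.Module):
    """Linear head over the pooled prosody embedding e; its cross-entropy
    loss reaches the encoder only through the gradient-reversal layer.
    Dimensions and lambda are illustrative assumptions."""

    def __init__(self, embed_dim: int = 128, n_emotions: int = 5, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Linear(embed_dim, n_emotions)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        return self.head(GradReverse.apply(e, self.lambd))
```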
Empirical analysis using t-SNE visualization reveals that the resulting set of embeddings clusters by primary emotion, confirming the efficacy of the separability constraint. This emotionally-structured latent space forms the basis for controlled synthesis (Chevi et al., 2024).
3. Mathematical Framework for Emotion Manipulation
Daisy-TTS frames prosody embedding manipulation through principal component analysis (PCA) of the latent space. Let $E = \{e_1, \dots, e_N\}$ denote the set of all prosody embeddings from the training data, with
- $\mu$: empirical mean of $E$,
- $V = [v_1, \dots, v_k]$: top-$k$ eigenvectors of the covariance of $E$,
- $\lambda_1, \dots, \lambda_k$: corresponding eigenvalues.

A generic embedding is parameterized as

$$e = \mu + Vw,$$

where $w \in \mathbb{R}^k$ is the coordinate vector in the eigenbasis, obtained for an observed embedding as $w = V^\top(e - \mu)$. This morphable representation supports four forms of emotion control (a code sketch follows the list):
- Primary Emotion Synthesis: For a primary emotion $i$, use PCA-space coordinates at the cluster mean $w^{(i)}$ of that emotion's embeddings; thus $e^{(i)} = \mu + V w^{(i)}$.
- Secondary (Mixed) Emotions: For mixing two primary emotions $i$ and $j$:

$$e^{(i,j)} = \mu + V \cdot \tfrac{1}{2}\big(w^{(i)} + w^{(j)}\big).$$

This yields embeddings for emotions such as “bittersweet” or “delight.”
- Intensity Scaling: For a primary emotion $i$, intensity is modulated by scaling the deviation from the mean by $\alpha$: $e^{(i)}_\alpha = \mu + \alpha V w^{(i)}$. Large $\alpha$ increases, and small $\alpha$ attenuates, the perceived intensity.
- Polarity Inversion: Negating the coordinates $w^{(i)}$ generates the opposite emotion: $e^{(\neg i)} = \mu - V w^{(i)}$.
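All four manipulations reduce to vector algebra in the PCA basis. The NumPy sketch below demonstrates them end-to-end on stand-in embeddings; the matrix shapes, $k = 16$, the label mapping, and the equal-weight mixture are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the (N, d) matrix of training prosody embeddings;
# in practice these come from the trained prosody encoder.
P = rng.normal(size=(2000, 128))
labels = rng.integers(0, 4, size=2000)  # 4 primary emotions (illustrative)

mu = P.mean(axis=0)                      # empirical mean
cov = np.cov(P - mu, rowvar=False)       # covariance of the embeddings
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalue order
order = np.argsort(eigvals)[::-1][:16]   # keep top-k; k = 16 is illustrative
V = eigvecs[:, order]                    # (d, k) eigenvector basis

def to_weights(e: np.ndarray) -> np.ndarray:
    """PCA coordinates of an embedding: w = V^T (e - mu)."""
    return V.T @ (e - mu)

def from_weights(w: np.ndarray) -> np.ndarray:
    """Reconstruct an embedding from coordinates: e = mu + V w."""
    return mu + V @ w

# Cluster-mean coordinates per primary emotion.
w_emotion = {c: to_weights(P[labels == c].mean(axis=0)) for c in range(4)}
w_joy, w_sad = w_emotion[0], w_emotion[1]  # illustrative label mapping

e_primary  = from_weights(w_joy)                  # primary emotion
e_mixed    = from_weights(0.5 * (w_joy + w_sad))  # secondary/mixed emotion
e_intense  = from_weights(1.75 * w_joy)           # intensity scaling, alpha = 1.75
e_opposite = from_weights(-w_joy)                 # polarity inversion
```

Because $\mu$ and $V$ stay fixed, every control is an edit of the coordinate vector $w$ alone, which is what makes the manipulation algebraic and interpretable.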
4. Training Configuration and Optimization
Daisy-TTS was trained on the Emotional Speech Dataset (ESD), which includes 10 English speakers (5M/5F), with 350 utterances per speaker for each of 4 primary emotions plus neutral speech. Speech inputs were converted to 80-band mel spectrograms at 22,050 Hz (window=1024, hop=256), with extracted pitch and energy contours fed into the prosody encoder.
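As a sketch of the front-end feature extraction under the stated settings (22,050 Hz, 80 mel bands, window 1024, hop 256) using librosa; the choice of pYIN for pitch and RMS for energy is an assumption, since the paper's exact extractors are not specified here.

```python
import librosa
import numpy as np

# Load and resample an utterance to the target rate.
y, sr = librosa.load("utterance.wav", sr=22050)  # hypothetical file

# 80-band mel spectrogram: window = n_fft = 1024, hop = 256.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, win_length=1024, n_mels=80
)
log_mel = np.log(np.clip(mel, 1e-5, None))  # log compression (common choice)

# Pitch contour via pYIN (one assumed choice of F0 extractor).
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
    sr=sr, frame_length=1024, hop_length=256
)

# Energy contour: per-frame RMS aligned with the mel frames (same hop).
energy = librosa.feature.rms(y=y, frame_length=1024, hop_length=256)[0]
```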
The training loss combined the Grad-TTS diffusion loss with the cross-entropy loss of the auxiliary emotion classifier (back-propagated with gradient reversal). The system, containing approximately 14.8M parameters, was trained for 310,000 steps with a batch size of 32 using AdamW (β₁=0.9, β₂=0.99, learning rate 1e-4). Final waveforms were synthesized with a pretrained HiFi-GAN vocoder fine-tuned on ESD.
5. Evaluation Methodology and Empirical Results
Evaluation involved crowdsourced perceptual studies on Amazon Mechanical Turk (workers with HIT approval rate ≥99%), employing:
- Speech Naturalness: Measured by Mean Opinion Score (MOS) on a 5-point Likert scale.
- Emotion Perceivability: Annotators selected one or two perceived emotions from a fixed list (forced-choice); accuracy measures how often listeners identified the intended emotion.
Testing covered primary emotions, secondary emotions (for which no ground-truth samples exist), intensity modulation (scaling α ∈ {0.25, 1.0, 1.75}), and polarity inversion. The main baseline was the open-source mixed-emotion TTS model of Zhou et al., retrained on a comparable speaker.
Key empirical findings include:
| | Daisy-TTS (Primary) | Baseline (Primary) | Daisy-TTS (Secondary) | Baseline (Secondary) |
|---|---|---|---|---|
| MOS range | 3.69–3.84 | 2.82–3.36 | ≈3.66–3.91 | ≈3.17–3.48 |
| Emotion accuracy | 0.30–0.53 | 0.14–0.53 | ≈0.10–0.36 | ≈0.13–0.31 |
- Daisy-TTS sustained naturalness under intensity scaling with only a minimal MOS drop, while the baseline degraded by up to 0.69 points; emotion perceivability improved monotonically with α for Daisy-TTS.
- Polarity inversion (negating the emotion's coordinates) produced perceivable opposite emotions (e.g., “polar sadness” obtained by inverting joy reached accuracy 0.34, versus 0.33 for joy itself).
- Confusion matrices indicate Daisy-TTS’s error patterns more closely resemble ground truth, reflecting more realistic emotional ambiguities.
6. Extensions Over Prior Emotional TTS Systems
Previous emotional TTS methods primarily relied on discrete emotion labels, which preclude interpolation and continuous intensity control, or on low-dimensional valence-arousal representations that lack the structural relationships present in the emotion spectrum. Other systems embedded emotions via external speech-emotion recognizers without end-to-end modeling.
Daisy-TTS is the first system to explicitly ground emotional TTS in the structural model of emotions, through the discovery and manipulation of emotionally-separable prosody embeddings decomposed via PCA. This enables:
- Synthesis of learned primary emotion clusters.
- Generation of secondary/mixed emotions as statistical mixtures.
- Continuous control of emotional intensity.
- Systematic polarity inversion through vector negation.
Empirically, Daisy-TTS demonstrates elevated naturalness and emotion recognizability across these capabilities compared to open-source baselines, providing a unified and principled framework for expressive, controllable emotional speech synthesis (Chevi et al., 2024).
7. Significance and Future Research Directions
Daisy-TTS establishes a scalable paradigm for emotional speech synthesis rooted in the structural relationships among emotions, emergent from learned prosodic representations. By supporting interpretable, algebraic control over the emotion space, it bridges the gap between statistical emotion models and parametric TTS architectures.
A plausible implication is that the approach could generalize to broader emotion taxonomies or other forms of expressive speech control, contingent on the availability of data and on the emotional separability of the latent prosody embeddings. Further extensions to end-to-end models with broader speaker and language coverage, multi-modal synthesis, or the integration of other paralinguistic features present substantive directions for future work.