Daisy-TTS: Emotional Text-to-Speech System

Updated 28 November 2025
  • The paper introduces Daisy-TTS, an emotional TTS system that leverages a prosody encoder to learn emotionally-separable embeddings for controlled speech synthesis.
  • It employs a diffusion-based decoder with PCA-driven manipulation, enabling flexible control over primary, mixed, intensity-scaled, and polarity-inverted emotions.
  • Empirical evaluations using MOS and emotion accuracy metrics demonstrate that Daisy-TTS outperforms prior methods in naturalness and emotion recognizability.

Daisy-TTS is an emotional text-to-speech (TTS) system designed to simulate a wide spectrum of emotions—including primary, secondary (mixed), varying intensity, and polarity—grounded on the structural model of emotions. Central to Daisy-TTS is a prosody encoder that learns emotionally-separable prosody embeddings, enabling direct and principled manipulation of emotional expression in synthesized speech. The system introduces new mathematical and architectural approaches for emotion representation and control, yielding flexible and expressive synthesis capabilities with demonstrated empirical advantages over prior methods (Chevi et al., 2024).

1. System Overview and Architectural Components

Daisy-TTS employs a multi-component neural architecture integrating the following modules:

  • Prosody Encoder $G(s)$: Ingests non-lexical features—mel-spectrogram, pitch contour, and energy contour—of an exemplar utterance. The encoder aggregates these temporally into a fixed-size vector $u \in \mathbb{R}^d$ representing the utterance's prosodic and emotional content.
  • Text Encoder: Based on the Grad-TTS text encoder, it processes phoneme inputs $x$ into hidden features, which are further modulated via Feature-wise Linear Modulation (FiLM) conditioning on $u$, thereby injecting prosodic and emotional control into the linguistic representation.
  • Diffusion-Based Decoder: Utilizes the Grad-TTS U-Net decoder, which iteratively denoises Gaussian noise toward the target mel-spectrogram $y$, guided by the text encoder's output $\mu_{x,u}$.

During both training and synthesis, $G(s)$ and the Grad-TTS backbone $F(x \mid u)$ are integrated end-to-end. At inference, $u$ can be manipulated directly via its learned decomposition rather than requiring an exemplar utterance.
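As a concrete illustration of the FiLM conditioning step, a minimal PyTorch sketch is given below; the module name, embedding dimensions, and tensor shapes are illustrative assumptions rather than values from the released implementation.

```python
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    """Feature-wise Linear Modulation: scale and shift text-encoder features
    using the utterance-level prosody embedding u."""

    def __init__(self, text_dim: int = 192, prosody_dim: int = 128):
        super().__init__()
        # Predict a per-channel scale (gamma) and shift (beta) from u.
        self.to_gamma_beta = nn.Linear(prosody_dim, 2 * text_dim)

    def forward(self, text_feats: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
        # text_feats: [batch, time, text_dim]; u: [batch, prosody_dim]
        gamma, beta = self.to_gamma_beta(u).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * text_feats + beta.unsqueeze(1)

# Usage: condition phoneme hidden states on a prosody embedding from G(s).
film = FiLMConditioning()
text_feats = torch.randn(2, 50, 192)    # phoneme features from the text encoder
u = torch.randn(2, 128)                 # prosody embedding
mu_x_u = film(text_feats, u)            # conditioned features guiding the diffusion decoder
```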

2. Prosody Embedding Learning and Emotional Separability

The prosody encoder $G(s)$ is explicitly optimized to produce embeddings $u$ that are emotionally distinguishable while discarding irrelevant information such as speaker identity and lexical content. This is achieved via an auxiliary classifier, attached after $G(s)$'s final pooling layer, which predicts the utterance's primary emotion label $\epsilon \in \{\text{joy, sadness, anger, surprise}\}$ from $u$. The cross-entropy loss from this classifier is back-propagated through a gradient-reversal layer, forcing $G(s)$ to filter out non-emotional cues.
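Gradient reversal is a standard construction; the sketch below shows a minimal PyTorch version of the layer together with the classifier head described above (the embedding dimension and reversal weight are illustrative assumptions).

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lambda on the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AuxiliaryEmotionClassifier(nn.Module):
    """Classifier head attached after the prosody encoder's pooling layer;
    predicts one of four primary emotion labels."""

    def __init__(self, embed_dim: int = 128, num_emotions: int = 4, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Linear(embed_dim, num_emotions)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        u = GradReverse.apply(u, self.lambd)   # reverse gradients flowing back into G(s)
        return self.head(u)

# Usage: cross-entropy on these logits is added to the TTS training loss.
logits = AuxiliaryEmotionClassifier()(torch.randn(8, 128))
```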

Empirical analysis using t-SNE visualization reveals that the resulting set of embeddings $\{u_i\}$ clusters by primary emotion, confirming the efficacy of the separability constraint. This emotionally-structured latent space forms the basis for controlled synthesis (Chevi et al., 2024).
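Such a qualitative check can be reproduced with scikit-learn; the sketch below uses randomly generated stand-ins for the embeddings and labels, which in practice would come from running $G(s)$ over the training utterances.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for prosody embeddings and their primary-emotion labels.
rng = np.random.default_rng(0)
U = rng.normal(size=(800, 128))
labels = rng.integers(0, 4, size=800)
emotion_names = ["joy", "sadness", "anger", "surprise"]

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(U)
for idx, name in enumerate(emotion_names):
    mask = labels == idx
    plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=name)
plt.legend()
plt.title("t-SNE of prosody embeddings, colored by primary emotion")
plt.show()
```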

3. Mathematical Framework for Emotion Manipulation

Daisy-TTS frames prosody embedding manipulation through principal component analysis (PCA) of the latent space. Let $U \in \mathbb{R}^{m \times d}$ denote the matrix of all prosody embeddings from the training data, with

  • $\bar{u}$: the empirical mean of $U$,
  • $\{v_i\}_{i=1}^{n}$: the top-$n$ eigenvectors of the covariance of $U$,
  • $\sigma_i^2$: the corresponding eigenvalues.

A generic embedding is parameterized as

$$u(w) = \bar{u} + \sum_{i=1}^{n} w_i v_i,$$

where $w \sim \mathcal{N}(0, \Sigma)$ with $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$. This morphable representation supports four forms of emotion control (a minimal code sketch of these operations follows the list below):

  • Primary Emotion Synthesis: For emotion $\epsilon$, use PCA-space coordinates near the cluster mean $\mu^{\epsilon}$; thus $w^{\epsilon} \sim \mathcal{N}(\mu^{\epsilon}, \Sigma)$.
  • Secondary (Mixed) Emotions: For mixing two primary emotions $\epsilon_1$ and $\epsilon_2$:

$$\Sigma^{\epsilon_s} = \left(2\Sigma^{-1}\right)^{-1}, \qquad \mu^{\epsilon_s} = \Sigma^{\epsilon_s}\left[\Sigma^{-1}\mu^{\epsilon_1} + \Sigma^{-1}\mu^{\epsilon_2}\right], \qquad w^{\epsilon_s} \sim \mathcal{N}(\mu^{\epsilon_s}, \Sigma^{\epsilon_s}).$$

This yields embeddings for emotions such as “bittersweet” or “delight.”

  • Intensity Scaling: For $w$ as above, intensity modulation is achieved by scaling the deviation from the mean by $\alpha$: $u(w;\alpha) = \bar{u} + \alpha \sum_{i=1}^{n} w_i v_i$. Large $\alpha$ amplifies the perceived intensity; small $\alpha$ attenuates it.
  • Polarity Inversion: Negating $w$ generates the opposite emotion: $u_{\text{neg}} = \bar{u} - (u(w) - \bar{u})$.
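The decomposition and the four manipulations can be sketched in NumPy as follows (the embedding dimension, number of retained components, and cluster means are illustrative placeholders, not values from the paper).

```python
import numpy as np

def fit_pca(U: np.ndarray, n: int = 16):
    """Fit the decomposition: mean u_bar, top-n eigenvectors v_i, eigenvalues sigma_i^2."""
    u_bar = U.mean(axis=0)
    cov = np.cov(U, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n]      # keep the top-n components
    return u_bar, eigvecs[:, order], eigvals[order]

def to_embedding(w, u_bar, V, alpha: float = 1.0):
    """u(w; alpha) = u_bar + alpha * sum_i w_i v_i  (alpha = 1 recovers u(w))."""
    return u_bar + alpha * V @ w

def sample_primary(mu_eps, sigma2, rng):
    """Primary emotion: w^eps ~ N(mu^eps, Sigma), Sigma = diag(sigma_i^2)."""
    return rng.normal(mu_eps, np.sqrt(sigma2))

def sample_mixed(mu_1, mu_2, sigma2, rng):
    """Secondary emotion: precision-weighted product of the two primary Gaussians."""
    prec = 1.0 / sigma2                        # Sigma^{-1} (diagonal)
    sigma2_mix = 1.0 / (2.0 * prec)            # (2 Sigma^{-1})^{-1}
    mu_mix = sigma2_mix * (prec * mu_1 + prec * mu_2)
    return rng.normal(mu_mix, np.sqrt(sigma2_mix))

rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 128))               # stand-in for real prosody embeddings
u_bar, V, sigma2 = fit_pca(U)

mu_joy = np.zeros(16)                          # hypothetical cluster means in PCA space
mu_sadness = np.ones(16)

w_joy = sample_primary(mu_joy, sigma2, rng)
u_joy = to_embedding(w_joy, u_bar, V)                               # primary emotion
u_intense_joy = to_embedding(w_joy, u_bar, V, alpha=1.75)           # intensity-scaled
u_bittersweet = to_embedding(sample_mixed(mu_joy, mu_sadness, sigma2, rng), u_bar, V)
u_polar_joy = u_bar - (u_joy - u_bar)                               # polarity inversion
```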

4. Training Configuration and Optimization

Daisy-TTS was trained on the Emotional Speech Dataset (ESD), which includes 10 English speakers (5M/5F), with 350 utterances per speaker for each of 4 primary emotions plus neutral speech. Speech was converted to 80-band mel spectrograms at 22,050 Hz (window=1024, hop=256); these, together with extracted pitch and energy contours, were fed into the prosody encoder.
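Feature extraction at these settings can be sketched with librosa; the pitch and energy extractors below (pYIN and frame-level RMS) are plausible choices, not necessarily the ones used in the paper, and the input signal is a stand-in for an ESD exemplar utterance.

```python
import librosa
import numpy as np

SAMPLE_RATE, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80

# Stand-in signal; in practice this would be an exemplar utterance from ESD.
wav = librosa.tone(220.0, sr=SAMPLE_RATE, duration=2.0)

# 80-band log-mel spectrogram with the window/hop sizes reported above.
mel = librosa.feature.melspectrogram(
    y=wav, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
log_mel = np.log(np.clip(mel, 1e-5, None))

# Pitch (F0) contour via pYIN; unvoiced frames are set to zero.
f0, _, _ = librosa.pyin(
    wav, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=SAMPLE_RATE, frame_length=N_FFT, hop_length=HOP)
f0 = np.nan_to_num(f0)

# Energy contour as per-frame RMS.
energy = librosa.feature.rms(y=wav, frame_length=N_FFT, hop_length=HOP)[0]

print(log_mel.shape, f0.shape, energy.shape)   # frame counts should align
```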

The training loss comprised both the Grad-TTS diffusion loss and the cross-entropy loss of the auxiliary emotion classifier (back-propagated with gradient reversal). The system, containing approximately 14.8M parameters, was trained for 310,000 steps with a batch size of 32 using AdamW (β₁=0.9, β₂=0.99, learning rate 1e-4). The final waveforms were synthesized with a pretrained HiFi-GAN vocoder fine-tuned on ESD.
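A minimal sketch of how the two loss terms and the reported optimizer settings fit together is shown below; the stub modules stand in for the prosody encoder, Grad-TTS backbone, and auxiliary classifier, and the equal weighting of the two losses is an assumption.

```python
import torch
import torch.nn as nn

# Stub modules standing in for the prosody encoder, diffusion decoder, and classifier head;
# only the loss/optimizer wiring is illustrated, not the actual architectures.
prosody_encoder = nn.GRU(input_size=82, hidden_size=128, batch_first=True)  # 80 mel + pitch + energy
decoder_stub = nn.Linear(128, 80)      # placeholder for the diffusion decoder
aux_classifier = nn.Linear(128, 4)     # placeholder for the emotion classifier head

params = (list(prosody_encoder.parameters())
          + list(decoder_stub.parameters())
          + list(aux_classifier.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-4, betas=(0.9, 0.99))  # settings reported above

features = torch.randn(32, 200, 82)            # batch of frame-level prosodic features
target_mel = torch.randn(32, 80)               # toy regression target
emotion = torch.randint(0, 4, (32,))           # primary-emotion labels

_, h = prosody_encoder(features)
u = h[-1]                                      # pooled prosody embedding u
reconstruction_loss = nn.functional.mse_loss(decoder_stub(u), target_mel)  # stands in for the diffusion loss
classifier_loss = nn.functional.cross_entropy(aux_classifier(u), emotion)
loss = reconstruction_loss + classifier_loss   # equal weighting assumed

optimizer.zero_grad()
loss.backward()
optimizer.step()
```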

5. Evaluation Methodology and Empirical Results

Evaluation involved crowdsourced perceptual studies on Amazon Mechanical Turk (workers with a HIT approval rate of at least 99%), employing:

  • Speech Naturalness: Measured by Mean Opinion Score (MOS) on a 5-point Likert scale.
  • Emotion Perceivability: Annotators selected one or two perceived emotions in a forced-choice setting; accuracy measures how often listeners identified the intended emotion (a minimal sketch of both metrics follows this list).
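The sketch below illustrates both metrics under one plausible reading of the forced-choice protocol; the function names and toy data are illustrative.

```python
from typing import List, Set

def emotion_accuracy(intended: List[str], perceived: List[Set[str]]) -> float:
    """Fraction of utterances whose intended emotion is among the emotions selected by the listener."""
    hits = sum(1 for gold, chosen in zip(intended, perceived) if gold in chosen)
    return hits / len(intended)

def mean_opinion_score(ratings: List[int]) -> float:
    """MOS: mean of 1-5 Likert naturalness ratings."""
    return sum(ratings) / len(ratings)

# Toy example with two rated utterances.
print(emotion_accuracy(["joy", "anger"], [{"joy", "surprise"}, {"sadness"}]))  # 0.5
print(mean_opinion_score([4, 4, 3, 5]))                                        # 4.0
```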

Testing covered primary emotions, secondary emotions (for which no ground-truth samples exist), intensity modulation (scaling α ∈ {0.25, 1.0, 1.75}), and polarity inversion. The main baseline was the open-source mixed-emotion TTS model of Zhou et al., retrained and refined on a comparable speaker.

Key empirical findings include:

  • MOS range: Daisy-TTS (primary) 3.69–3.84 vs. baseline 2.82–3.36; Daisy-TTS (secondary) ≈3.66–3.91 vs. baseline ≈3.17–3.48.
  • Emotion accuracy: Daisy-TTS (primary) 0.30–0.53 vs. baseline 0.14–0.53; Daisy-TTS (secondary) ≈0.10–0.36 vs. baseline ≈0.13–0.31.
  • Daisy-TTS sustained naturalness under intensity scaling (at most a 0.26-point MOS drop), while the baseline degraded by up to 0.69 points; emotion perceivability improved monotonically with $\alpha$ for Daisy-TTS.
  • Polarity inversion (negating emotion embedding) produced perceivable opposite emotions (e.g., “polar sadness” by inverting joy: accuracy 0.34 vs. joy at 0.33).
  • Confusion matrices indicate Daisy-TTS’s error patterns more closely resemble ground truth, reflecting more realistic emotional ambiguities.

6. Extensions Over Prior Emotional TTS Systems

Previous emotional TTS methods primarily relied on discrete emotion labels, which preclude interpolation and continuous intensity control, or on low-dimensional valence-arousal representations that lack the structural relationships present in the emotion spectrum. Some systems embedded emotions via external speech-emotion recognizers without end-to-end modeling.

Daisy-TTS is the first system to explicitly ground emotional TTS in the structural model of emotions via the discovery and manipulation of an emotionally-separable prosody embedding mapped through PCA decomposition. This enables:

  • Synthesis of learned primary emotion clusters.
  • Generation of secondary/mixed emotions as statistical mixtures.
  • Continuous control of emotional intensity.
  • Systematic polarity inversion through vector negation.

Empirically, Daisy-TTS demonstrates elevated naturalness and emotion recognizability across these capabilities compared to open-source baselines, providing a unified and principled framework for expressive, controllable emotional speech synthesis (Chevi et al., 2024).

7. Significance and Future Research Directions

Daisy-TTS establishes a scalable paradigm for emotional speech synthesis rooted in the structural relationships among emotions, emergent from learned prosodic representations. By supporting interpretable, algebraic control over the emotion space, it bridges the gap between statistical emotion models and parametric TTS architectures.

A plausible implication is that the approach could generalize to broader emotion taxonomies or other forms of expressive speech control, contingent on the availability of data and the emotional separability of the latent prosody embeddings. Further extension to end-to-end models with broader speaker and language coverage, multi-modal symmetry, or integration of other paralinguistic features presents substantive directions for future work.

References

Chevi, R., & Aji, A. F. (2024). Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition.
