
Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition

Published 22 Feb 2024 in cs.CL, cs.SD, and eess.AS | arXiv:2402.14523v2

Abstract: We often verbally express emotions in a multifaceted manner: they may vary in intensity and may be expressed not as a single emotion but as a mixture of emotions. This wide spectrum of emotions is well studied in the structural model of emotions, which represents the variety of emotions as derivative products of primary emotions with varying degrees of intensity. In this paper, we propose an emotional text-to-speech design that simulates a wider spectrum of emotions grounded in the structural model. Our proposed design, Daisy-TTS, incorporates a prosody encoder to learn emotionally separable prosody embeddings as a proxy for emotion. This emotion representation allows the model to simulate: (1) primary emotions, as learned from the training samples; (2) secondary emotions, as mixtures of primary emotions; (3) intensity levels, by scaling the emotion embedding; and (4) emotion polarity, by negating the emotion embedding. Through a series of perceptual evaluations, Daisy-TTS demonstrated overall higher emotional speech naturalness and emotion perceivability compared to the baseline.


Summary

  • The paper presents an innovative TTS framework that decomposes prosody embeddings to simulate a wide range of primary and secondary emotions.
  • It employs a specialized prosody encoder trained jointly with an emotion classifier, yielding distinct emotion clusters (visualized via t-SNE) and enabling intensity control through embedding scaling.
  • Evaluation shows Daisy-TTS outperforms baselines in naturalness and emotion perceivability, marking advances in expressive TTS synthesis.

Daisy-TTS: Simulating a Wider Spectrum of Emotions via Prosody Embedding Decomposition

Introduction

Daisy-TTS introduces a framework for emotional text-to-speech synthesis grounded in the structural model of emotions, addressing the limitations of traditional TTS systems by enabling the simulation of a wide range of emotional expressions. The design relies on the decomposition of prosody embeddings, which are treated as proxies for emotional states, allowing the generation of both primary and secondary emotions along with variations in intensity and polarity.

The structural model offers a comprehensive approach to emotion representation by integrating primary emotional states and their mixtures. Its flower-like visual arrangement illustrates this succinctly: petals represent primary emotions, and their intersections represent secondary emotions (Figure 1).

Figure 1: Visual Representation of the Structural Model of Emotions.

Learning Emotionally-Separable Prosody Embeddings

Daisy-TTS employs a prosody encoder that extracts emotionally separable prosody embeddings from non-lexical speech features such as the mel-spectrogram, pitch, and energy contours. This encoder, trained jointly with an emotion classifier, ensures that emotions are distinct and separable in the latent space. The system overview shows how these embeddings condition the TTS backbone model to simulate varied emotional expressions (Figure 2).

Figure 2: System overview of Daisy-TTS. Emotionally separable prosody embeddings are learned from a set of speech features and used to condition a TTS backbone model. To simulate a wider range of emotion characteristics, such as intensity, polarity, and mixtures of emotions, an embedding decomposition is applied to the learned embeddings.
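The conditioning step in the overview above can be sketched in a few lines. This sketch uses FiLM-style (feature-wise scale-and-shift) conditioning as one plausible mechanism; the layer shapes, weight names, and the choice of FiLM itself are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_condition(hidden, emotion_emb, W_gamma, W_beta):
    """FiLM-style conditioning: scale and shift decoder hidden states
    with parameters predicted from the emotion embedding.
    (Illustrative sketch; not the paper's exact conditioning layer.)"""
    gamma = emotion_emb @ W_gamma  # per-channel scale, shape (d_hidden,)
    beta = emotion_emb @ W_beta    # per-channel shift, shape (d_hidden,)
    return gamma * hidden + beta   # broadcasts over the time axis

d_emb, d_hidden, T = 8, 16, 20
hidden = rng.normal(size=(T, d_hidden))   # decoder states over T frames
emotion_emb = rng.normal(size=d_emb)      # learned prosody/emotion embedding
W_gamma = rng.normal(size=(d_emb, d_hidden))
W_beta = rng.normal(size=(d_emb, d_hidden))

conditioned = film_condition(hidden, emotion_emb, W_gamma, W_beta)
print(conditioned.shape)  # (20, 16)
```

Because the scale and shift are applied per channel, the same embedding modulates every frame of the utterance uniformly, which matches the intuition of a global emotional "color" over the speech.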

Emotion Simulation through Prosody Embedding

Daisy-TTS's emotion simulation leverages the separability of the prosody embeddings. Projecting the learned embeddings into two dimensions with t-SNE reveals distinct clusters for the primary emotions; this clear separability is pivotal for realistic emotion synthesis (Figure 3).

Figure 3: Emotionally-separable prosody embeddings learned from our proposed model, Daisy-TTS. Emotions bordered in black denote primary emotions, while ones bordered in white denote secondary emotions derived from the mixture of primary ones.
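The notion of "emotionally separable" can be made concrete with a toy measure: the ratio of the distance between two emotion clusters' centroids to their average internal spread. The synthetic data and the metric below are illustrative assumptions, not the paper's learned embeddings or evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
# Two hypothetical primary-emotion clusters in embedding space: tight
# Gaussian blobs around well-separated centroids (illustrative data only).
c_joy = rng.normal(size=d)
c_anger = c_joy + 5.0
joy_embs = c_joy + 0.1 * rng.normal(size=(50, d))
anger_embs = c_anger + 0.1 * rng.normal(size=(50, d))

def separability(a, b):
    """Ratio of inter-centroid distance to mean intra-cluster spread."""
    inter = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    intra = 0.5 * (np.linalg.norm(a - a.mean(axis=0), axis=1).mean()
                   + np.linalg.norm(b - b.mean(axis=0), axis=1).mean())
    return inter / intra

score = separability(joy_embs, anger_embs)
print(score > 1.0)  # well-separated clusters score far above 1
```

A ratio well above 1 corresponds to the visually distinct clusters seen in the t-SNE projection.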

Primary and Secondary Emotions

The decomposition of prosody embeddings permits the synthesis of both primary and secondary emotions. Primary emotions are simulated by focusing on emotion clusters within the embedding space, while secondary emotions are represented as mixtures of primary emotions, allowing the model to generate complex emotional states such as envy or pride.
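As a sketch of how a secondary emotion might be composed, the snippet below averages two unit-norm primary embeddings and renormalizes; in Plutchik's structural model, for instance, joy + trust yields love. The vectors here are random stand-ins, not learned embeddings, and the equal-weight mixture is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def unit(v):
    return v / np.linalg.norm(v)

# Random stand-ins for learned primary-emotion embeddings.
joy = unit(rng.normal(size=8))
trust = unit(rng.normal(size=8))

# Secondary emotion as a mixture of primaries (Plutchik: joy + trust = love),
# renormalized so its magnitude stays comparable to the primaries.
love = unit(0.5 * joy + 0.5 * trust)

# The mixture stays positively correlated with both constituents.
print(float(love @ joy) > 0, float(love @ trust) > 0)
```

Renormalizing after mixing keeps the mixture's intensity comparable to the primaries, so the mixing and intensity controls stay independent.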

Intensity and Polarity of Emotions

The intensity of emotion is manipulated by scaling the prosody embeddings, which affects the perceived strength of the emotional expression. Polarity is achieved by negating the embeddings, enabling the synthesis of emotions opposite to those presented, such as flipping joy to convey sadness.
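Both controls reduce to scalar multiplication of the embedding vector. The sketch below, using a random stand-in embedding, shows that scaling preserves the emotion's direction (cosine similarity 1) while negation exactly reverses it (cosine similarity -1).

```python
import numpy as np

rng = np.random.default_rng(2)
emb = rng.normal(size=8)  # random stand-in for a learned emotion embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

strong = 2.0 * emb    # higher intensity: same direction, larger magnitude
mild = 0.5 * emb      # lower intensity
opposite = -emb       # polarity flip, e.g. joy -> sadness

print(cosine(strong, emb), cosine(opposite, emb))  # 1.0 and -1.0 (up to fp error)
```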

Evaluation and Results

Daisy-TTS was subjected to a series of perceptual evaluations measuring speech naturalness (mean opinion score, MOS) and emotion perceivability, in which it consistently outperformed baseline models across primary and secondary emotions (Figure 4).

Figure 4: Results of the emotion perception test for different intensity levels of primary emotions.

This evaluation highlights Daisy-TTS's competency in maintaining speech naturalness and accuracy in emotion conveyance. The system exhibited robustness in expressing emotions at different intensity levels, showcasing flexibility and adaptability, which are crucial for real-world applications.

Discussion

The introduction of Daisy-TTS opens new avenues in emotional TTS modeling by addressing the complexity and richness of human emotional expression, as reflected in the structural model of emotions. The method shows promise for applications requiring subtle and varied emotional nuances, such as virtual assistants and interactive media.

Conclusion

Daisy-TTS marks a significant advancement in TTS systems by successfully expanding the expressive capabilities of synthetic speech through prosody embedding decomposition. By establishing a robust framework for simulating a wide spectrum of emotions, it sets the stage for future developments that could further refine and extend the model's capabilities to encompass even broader emotional and cultural dynamics.

Authors (2)
