
Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition (2402.14523v2)

Published 22 Feb 2024 in cs.CL, cs.SD, and eess.AS

Abstract: We often verbally express emotions in a multifaceted manner: they may vary in intensity and may be expressed not as a single emotion but as a mixture of emotions. This wide spectrum of emotions is well studied in the structural model of emotions, which represents the variety of emotions as derivative products of primary emotions with varying degrees of intensity. In this paper, we propose an emotional text-to-speech design that simulates a wider spectrum of emotions grounded in the structural model. Our proposed design, Daisy-TTS, incorporates a prosody encoder to learn an emotionally separable prosody embedding as a proxy for emotion. This emotion representation allows the model to simulate: (1) primary emotions, as learned from the training samples; (2) secondary emotions, as mixtures of primary emotions; (3) intensity levels, by scaling the emotion embedding; and (4) emotion polarity, by negating the emotion embedding. Through a series of perceptual evaluations, Daisy-TTS demonstrated overall higher emotional speech naturalness and emotion perceivability compared to the baseline.


Summary

  • The paper presents Daisy-TTS, a model that leverages prosody embedding decomposition to achieve rich, nuanced emotional TTS synthesis.
  • It simulates primary and secondary emotions, and modulates their intensity and polarity, achieving superior MOS scores in perceptual evaluations.
  • The findings highlight the model’s potential to enhance human-computer interaction through more natural and varied emotional speech output.

Overview of Daisy-TTS: Simulating a Wider Spectrum of Emotions via Prosody Embedding Decomposition

The paper "Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition" presents an innovative approach to emotional text-to-speech (TTS) synthesis by leveraging prosody embedding decomposition based on the structural model of emotions. This technique offers a novel method for embedding prosody to achieve nuanced emotional expression in synthetic voices. By integrating prosody embeddings as proxies for emotions, the researchers aim to provide a richer and more varied emotional user experience than traditional TTS systems.

Key Contributions

  1. Prosody Embedding as Emotion Proxy: The paper introduces Daisy-TTS, a model that integrates a prosody encoder to learn embeddings that are emotionally separable. These embeddings allow the synthesis of speech with emotionally nuanced and perceptually distinguishable characteristics.
  2. Comprehensive Emotion Simulation: The model covers four modes of emotional expression (see the sketch after this list):
    • Primary emotions, learned directly from training data.
    • Secondary emotions, formed by mixing primary emotions.
    • Intensity variations, produced by scaling emotion embeddings.
    • Polarity changes, produced by negating emotion embeddings.
  3. Perceptual Evaluation Results: Daisy-TTS demonstrates superior performance over baseline models in both the naturalness and perceivability of emotional speech. This suggests that the incorporation of a structurally informed emotional model markedly enhances the quality and authenticity of emotional TTS outputs.
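
The embedding manipulations listed under contribution 2 reduce to simple vector arithmetic. The sketch below is a minimal illustration, assuming primary-emotion embeddings have already been extracted by the prosody encoder; the variable names, the random stand-in vectors, and the equal-weight mixture are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 128

# Stand-ins for primary-emotion embeddings a trained prosody encoder
# would produce; real embeddings would come from the model, not a RNG.
e_joy = rng.standard_normal(embed_dim)
e_sadness = rng.standard_normal(embed_dim)

# (2) Secondary emotion as a mixture of primaries, e.g. "bittersweetness"
#     as a blend of joy and sadness.
e_bittersweet = 0.5 * e_joy + 0.5 * e_sadness

# (3) Intensity by scaling: alpha < 1 attenuates, alpha > 1 exaggerates.
e_mild_joy = 0.5 * e_joy
e_intense_joy = 1.5 * e_joy

# (4) Polarity by negation: flips an emotion toward its opposite pole.
e_opposite_of_joy = -e_joy
```

Each resulting vector would then condition the synthesizer in place of a learned primary-emotion embedding, which is what lets a single model cover mixtures, intensities, and opposites without additional training data for them.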

Numerical Results

The evaluation includes Mean Opinion Score (MOS) tests and emotion-perception tests, which show that Daisy-TTS consistently outperforms a baseline model in simulating both primary and secondary emotions. In particular, for secondary emotions such as ‘bittersweetness’ and ‘outrage’, Daisy-TTS achieved higher scores in both naturalness and emotion perceivability. These results underscore Daisy-TTS's capacity to synthesize complex, mixed emotional states from text input, a capability not fully realized in prior work.
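
For context, a MOS test asks listeners to rate each sample on a 1-to-5 scale and reports the mean, typically with a 95% confidence interval. The snippet below shows that standard computation on made-up ratings; the paper's actual scores are not reproduced here.

```python
import numpy as np

def mos_with_ci(ratings: np.ndarray, z: float = 1.96) -> tuple[float, float]:
    """Mean opinion score and the half-width of its ~95% confidence
    interval under a normal approximation."""
    mean = ratings.mean()
    half_width = z * ratings.std(ddof=1) / np.sqrt(len(ratings))
    return float(mean), float(half_width)

# Hypothetical listener ratings (1-5 scale); not the paper's data.
ratings = np.array([4, 5, 4, 3, 4, 5, 4, 4, 3, 5], dtype=float)
mean, ci = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {ci:.2f}")
```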

Implications and Future Directions

The implications of this research are twofold:

  1. Practical Applications: By enabling nuanced expression of emotions, Daisy-TTS could enhance virtual assistants, accessibility tools, and other human-computer interfaces, particularly in contexts where emotional intonation significantly affects communication efficacy and user satisfaction.
  2. Theoretical Insight: The success of prosody-based emotional embeddings suggests that non-lexical features hold substantial potential for emotion modeling, enriching our understanding of speech emotion synthesis and furthering research in this domain.

Looking forward, future work could adapt the model to other languages and to less-studied emotional dynamics, which may require new datasets or modified structural models to capture culture-specific emotional nuances.

Conclusion

In conclusion, the "Daisy-TTS" paper provides a significant contribution to the field of emotional TTS by presenting a robust method for simulating a wide spectrum of emotions. The model’s ability to integrate complex emotional characteristics such as intensity, polarity, and secondary emotions positions it as a valuable tool for both industrial applications and academic inquiries into the future of AI-driven speech technologies.