
Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition

Published 22 Feb 2024 in cs.CL, cs.SD, and eess.AS | arXiv:2402.14523v2

Abstract: We often verbally express emotions in a multifaceted manner: they may vary in intensity and may be expressed not as a single emotion but as a mixture of emotions. This wide spectrum of emotions is well studied in the structural model of emotions, which represents the variety of emotions as derivative products of primary emotions with varying degrees of intensity. In this paper, we propose an emotional text-to-speech design that simulates a wider spectrum of emotions grounded in the structural model. Our proposed design, Daisy-TTS, incorporates a prosody encoder to learn emotionally separable prosody embeddings as a proxy for emotion. This emotion representation allows the model to simulate: (1) primary emotions, as learned from the training samples; (2) secondary emotions, as mixtures of primary emotions; (3) intensity levels, by scaling the emotion embedding; and (4) emotion polarity, by negating the emotion embedding. Through a series of perceptual evaluations, Daisy-TTS demonstrated overall higher emotional speech naturalness and emotion perceivability compared to the baseline.


Summary

  • The paper presents an innovative TTS framework that decomposes prosody embeddings to simulate a wide range of primary and secondary emotions.
  • It employs a specialized prosody encoder trained jointly with an emotion classifier, yielding distinct emotion clusters (visualized via t-SNE) and enabling intensity control through embedding scaling.
  • Evaluation shows Daisy-TTS outperforms baselines in naturalness and emotion perceivability, marking advances in expressive TTS synthesis.

Daisy-TTS: Simulating a Wider Spectrum of Emotions via Prosody Embedding Decomposition

Introduction

Daisy-TTS introduces a framework for emotional text-to-speech synthesis grounded in the structural model of emotions, addressing the limitations of traditional TTS systems by enabling the simulation of a wide range of emotional expressions. The design relies on the decomposition of prosody embeddings, which are treated as proxies for emotional states, allowing the generation of both primary and secondary emotions along with variations in intensity and polarity.

The structural model offers a comprehensive approach to emotion representation by integrating primary emotional states and their mixtures. Its flower-like visual arrangement illustrates this succinctly: petals represent primary emotions, and their intersections represent secondary emotions (Figure 1).

Figure 1: Visual Representation of the Structural Model of Emotions.

Learning Emotionally-Separable Prosody Embeddings

Daisy-TTS employs a prosody encoder that extracts emotionally separable prosody embeddings from non-lexical speech features such as the mel-spectrogram, pitch, and energy contours. This encoder, trained jointly with an emotion classifier, ensures that emotions are distinct and separable in the latent space. The system overview shows how these embeddings condition the TTS backbone model to simulate varied emotional expressions (Figure 2).

Figure 2: System overview of Daisy-TTS. Emotionally separable prosody embeddings are learned from a set of speech features and used to condition a TTS backbone model. To simulate a wider range of emotion characteristics, such as intensity, polarity, and mixtures of emotions, an embedding decomposition is applied to the learned embeddings.
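The conditioning step in the overview above can be sketched in a few lines. This sketch uses FiLM-style (feature-wise scale-and-shift) conditioning as one plausible mechanism; the layer shapes, weight names, and the choice of FiLM itself are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_condition(hidden, emotion_emb, W_gamma, W_beta):
    """FiLM-style conditioning: scale and shift decoder hidden states
    with parameters predicted from the emotion embedding.
    (Illustrative sketch; not the paper's exact conditioning layer.)"""
    gamma = emotion_emb @ W_gamma  # per-channel scale, shape (d_hidden,)
    beta = emotion_emb @ W_beta    # per-channel shift, shape (d_hidden,)
    return gamma * hidden + beta   # broadcasts over the time axis

d_emb, d_hidden, T = 8, 16, 20
hidden = rng.normal(size=(T, d_hidden))   # decoder states over T frames
emotion_emb = rng.normal(size=d_emb)      # learned prosody/emotion embedding
W_gamma = rng.normal(size=(d_emb, d_hidden))
W_beta = rng.normal(size=(d_emb, d_hidden))

conditioned = film_condition(hidden, emotion_emb, W_gamma, W_beta)
print(conditioned.shape)  # (20, 16)
```

Because the scale and shift are applied per channel, the same embedding modulates every frame of the utterance uniformly, which matches the intuition of a global emotional "color" over the speech.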

Emotion Simulation through Prosody Embedding

Daisy-TTS's emotion simulation leverages the separability of the prosody embeddings. Projecting the learned embeddings into two dimensions with t-SNE reveals distinct clusters for the primary emotions; this clear separability is pivotal for realistic emotion synthesis (Figure 3).

Figure 3: Emotionally-separable prosody embeddings learned from our proposed model, Daisy-TTS. Emotions bordered in black denote primary emotions, while ones bordered in white denote secondary emotions derived from the mixture of primary ones.
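The notion of "emotionally separable" can be made concrete with a toy measure: the ratio of the distance between two emotion clusters' centroids to their average internal spread. The synthetic data and the metric below are illustrative assumptions, not the paper's learned embeddings or evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
# Two hypothetical primary-emotion clusters in embedding space: tight
# Gaussian blobs around well-separated centroids (illustrative data only).
c_joy = rng.normal(size=d)
c_anger = c_joy + 5.0
joy_embs = c_joy + 0.1 * rng.normal(size=(50, d))
anger_embs = c_anger + 0.1 * rng.normal(size=(50, d))

def separability(a, b):
    """Ratio of inter-centroid distance to mean intra-cluster spread."""
    inter = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    intra = 0.5 * (np.linalg.norm(a - a.mean(axis=0), axis=1).mean()
                   + np.linalg.norm(b - b.mean(axis=0), axis=1).mean())
    return inter / intra

score = separability(joy_embs, anger_embs)
print(score > 1.0)  # well-separated clusters score far above 1
```

A ratio well above 1 corresponds to the visually distinct clusters seen in the t-SNE projection.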

Primary and Secondary Emotions

The decomposition of prosody embeddings permits the synthesis of both primary and secondary emotions. Primary emotions are simulated by focusing on emotion clusters within the embedding space, while secondary emotions are represented as mixtures of primary emotions, allowing the model to generate complex emotional states such as envy or pride.
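As a sketch of how a secondary emotion might be composed, the snippet below averages two unit-norm primary embeddings and renormalizes; in Plutchik's structural model, for instance, joy + trust yields love. The vectors here are random stand-ins, not learned embeddings, and the equal-weight mixture is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def unit(v):
    return v / np.linalg.norm(v)

# Random stand-ins for learned primary-emotion embeddings.
joy = unit(rng.normal(size=8))
trust = unit(rng.normal(size=8))

# Secondary emotion as a mixture of primaries (Plutchik: joy + trust = love),
# renormalized so its magnitude stays comparable to the primaries.
love = unit(0.5 * joy + 0.5 * trust)

# The mixture stays positively correlated with both constituents.
print(float(love @ joy) > 0, float(love @ trust) > 0)
```

Renormalizing after mixing keeps the mixture's intensity comparable to the primaries, so the mixing and intensity controls stay independent.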

Intensity and Polarity of Emotions

The intensity of emotion is manipulated by scaling the prosody embeddings, which affects the perceived strength of the emotional expression. Polarity is achieved by negating the embeddings, enabling the synthesis of emotions opposite to those presented, such as flipping joy to convey sadness.
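Both controls reduce to scalar multiplication of the embedding vector. The sketch below, using a random stand-in embedding, shows that scaling preserves the emotion's direction (cosine similarity 1) while negation exactly reverses it (cosine similarity -1).

```python
import numpy as np

rng = np.random.default_rng(2)
emb = rng.normal(size=8)  # random stand-in for a learned emotion embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

strong = 2.0 * emb    # higher intensity: same direction, larger magnitude
mild = 0.5 * emb      # lower intensity
opposite = -emb       # polarity flip, e.g. joy -> sadness

print(cosine(strong, emb), cosine(opposite, emb))  # 1.0 and -1.0 (up to fp error)
```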

Evaluation and Results

Daisy-TTS was subjected to a series of perceptual evaluations measuring speech naturalness (mean opinion score, MOS) and emotion perceivability, in which it consistently outperformed baseline models across primary and secondary emotions (Figure 4).

Figure 4: Results of the emotion perception test for different intensity levels of primary emotions.

This evaluation highlights Daisy-TTS's competency in maintaining speech naturalness and accuracy in emotion conveyance. The system exhibited robustness in expressing emotions at different intensity levels, showcasing flexibility and adaptability, which are crucial for real-world applications.

Discussion

The introduction of Daisy-TTS opens new avenues in emotional TTS modeling by addressing the complexity and richness of human emotional expression, as reflected in the structural model of emotions. The method shows promise for applications requiring subtle and varied emotional nuances, such as virtual assistants and interactive media.

Conclusion

Daisy-TTS marks a significant advancement in TTS systems by successfully expanding the expressive capabilities of synthetic speech through prosody embedding decomposition. By establishing a robust framework for simulating a wide spectrum of emotions, it sets the stage for future developments that could further refine and extend the model's capabilities to encompass even broader emotional and cultural dynamics.

Authors (2)
