Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance
The paper "Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance Through Contrastive Learning and Diffusion Models" introduces ParaEVITS, a new framework for emotional text-to-speech (TTS) synthesis that leverages contrastive learning and diffusion models to refine the emotional expressiveness and controllability of synthesized speech. The authors target a critical challenge in emotional TTS—that of achieving fine-grained control over emotional rendering—by moving away from traditional fixed-set approaches and instead adopting a multimodal strategy integrating natural language prompts and computational paralinguistics (CP) features.
Technical Contributions
The framework introduces several methodological advancements:
- Compositional Approach: ParaEVITS employs natural language to guide the synthesis process, circumventing the limitations of using manually annotated captions and fixed emotion label sets. Instead, it utilizes emotional style descriptions derived from text-audio pairs, enhancing the granularity of control over synthesized emotions.
- Diffusion and Contrastive Learning Models: The framework trains a diffusion model on the outputs of the text and audio encoders, guided by prompts generated via computational paralinguistic tasks. This combination strengthens the portrayal of non-verbal vocal cues, effectively bridging low-level acoustic attributes and high-level emotional descriptions (both components are sketched after this list).
- Integration of CP Task Features: By incorporating CP features, such as pitch, loudness, and emotion labels, the framework offers comprehensive control over the emotional attributes of synthesized speech.
- Paradigm Shift in TTS Control: The use of CP-based prompts unlocks a broader spectrum of expressivity, aligning the synthesized speech with a wide array of paralinguistic phenomena, such as sincerity, personality, and nuanced emotions, going beyond the confines of traditional approaches.
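To make the contrastive component concrete, the following minimal PyTorch sketch shows a CLAP-style setup in which a text-description embedding and an audio/CP-feature embedding are projected into a shared space and trained with a symmetric InfoNCE loss. The projection heads, feature dimensions, and temperature are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal, illustrative sketch of CLAP-style contrastive alignment between
# natural-language style descriptions and audio/paralinguistic embeddings.
# Encoder shapes, feature dimensions, and training details are assumptions
# for illustration; the paper's actual architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Projects a pre-extracted feature vector into a shared embedding space."""

    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings


def contrastive_loss(text_emb: torch.Tensor,
                     audio_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matched text/audio pairs are pulled together,
    mismatched pairs within the batch are pushed apart."""
    logits = text_emb @ audio_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    batch = 8
    # Placeholder inputs: e.g. a sentence embedding of a style caption (dim 768)
    # and a vector of CP descriptors such as pitch, loudness, and emotion
    # posteriors (dim 64). Both dimensions are assumptions.
    text_feats = torch.randn(batch, 768)
    audio_feats = torch.randn(batch, 64)

    text_enc = ProjectionHead(768)
    audio_enc = ProjectionHead(64)

    loss = contrastive_loss(text_enc(text_feats), audio_enc(audio_feats))
    print(f"contrastive loss: {loss.item():.4f}")
```

In a ParaEVITS-style setup, the audio branch would consume features produced by the CP tasks (pitch, loudness, emotion labels), so the learned space ties low-level acoustics to natural-language style descriptions.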
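The diffusion component can be understood as a prior over style embeddings: given a text-prompt embedding, it learns to generate a matching audio-style embedding by iterative denoising. The sketch below shows one DDPM-style training step on embedding vectors; the noise schedule, timestep handling, and network shapes are simplified assumptions rather than the paper's specification.

```python
# An illustrative sketch of a diffusion "prior" that learns to produce
# audio-style embeddings conditioned on text-prompt embeddings, in the spirit
# of the diffusion model the paper trains on text-audio encoder outputs.
# All dimensions, the noise schedule, and the network are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 256          # shared embedding dimension (assumed)
T = 1000                 # number of diffusion steps (assumed)

# Linear noise schedule and the cumulative products used by DDPM.
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


class Denoiser(nn.Module):
    """Predicts the noise added to an audio-style embedding, conditioned on
    the timestep and the paired text-prompt embedding."""

    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, 512),  # noisy audio emb + text emb + timestep
            nn.SiLU(),
            nn.Linear(512, dim),
        )

    def forward(self, noisy_audio, text_emb, t):
        t_scaled = t.float().unsqueeze(-1) / T  # crude scalar timestep encoding
        return self.net(torch.cat([noisy_audio, text_emb, t_scaled], dim=-1))


def training_step(model, audio_emb, text_emb):
    """One DDPM training step: add noise at a random timestep and regress it."""
    b = audio_emb.size(0)
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(audio_emb)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * audio_emb + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(noisy, text_emb, t), noise)


if __name__ == "__main__":
    model = Denoiser()
    audio_emb = torch.randn(8, EMBED_DIM)   # targets from the audio encoder
    text_emb = torch.randn(8, EMBED_DIM)    # conditioning from the text encoder
    print(f"diffusion loss: {training_step(model, audio_emb, text_emb).item():.4f}")
```

At inference time, such a prior would be sampled from pure noise conditioned on the user's natural-language prompt embedding, and the resulting style embedding would condition the TTS decoder.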
Experimental Highlights
- The paper evaluates the efficacy of ParaEVITS on publicly available datasets like MSP-Podcast and the Emotional Speech Database (ESD).
- Quantitative and qualitative assessments show that the system improves the controllability of emotional speech without detracting from intelligibility and quality.
- Listening tests using the Quality Mean Opinion Score (MOS) and Emotion Similarity MOS (MOS-S) indicate that ParaEVITS is promising, consistently outperforming baseline systems such as MixedEmotion.
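For context on these metrics, MOS-style scores are per-condition means over listener ratings, typically reported with a confidence interval. The short snippet below illustrates that aggregation with placeholder ratings; the values are hypothetical and not results from the paper.

```python
# A small, illustrative helper for aggregating listening-test scores such as
# quality MOS and emotion-similarity MOS (MOS-S). The ratings below are
# placeholder values, not data from the paper.
from statistics import mean, stdev
from math import sqrt


def mos_with_ci(ratings, z: float = 1.96):
    """Return the mean opinion score and a 95% confidence-interval half-width."""
    m = mean(ratings)
    ci = z * stdev(ratings) / sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return m, ci


if __name__ == "__main__":
    # Hypothetical 1-5 listener ratings for one system/emotion condition.
    quality_ratings = [4, 4, 5, 3, 4, 4, 5, 4]
    emotion_similarity_ratings = [3, 4, 4, 4, 5, 3, 4, 4]

    for name, scores in [("Quality MOS", quality_ratings),
                         ("MOS-S", emotion_similarity_ratings)]:
        m, ci = mos_with_ci(scores)
        print(f"{name}: {m:.2f} ± {ci:.2f}")
```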
Implications and Speculations
ParaEVITS has both practical and theoretical implications. Practically, the approach opens the door to more sophisticated virtual assistants and interactive systems that can adapt more dynamically to users' affective cues. Theoretically, the exploration of diffusion models within TTS deepens our understanding of how multimodal learning can drive innovation in affective AI.
Future developments could involve expanding the framework to more diverse datasets, including multi-speaker settings, to determine its adaptability across a range of emotional contexts. Integrating state-of-the-art transformer-based text encoders may further improve the precision of emotion rendering and address the limitations currently observed in certain emotional categories, such as 'happy' speech.
Overall, ParaEVITS marks a significant step toward finely tuned, emotion-aware TTS systems and is likely to motivate further studies in both computational paralinguistics and affective computing.