Controlling Time-Varying Emotional States in Zero-Shot Text-to-Speech
The paper "Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech" introduces EmoCtrl-TTS, a zero-shot text-to-speech (TTS) system designed to generate speech with rich emotional content and non-verbal vocalizations (NVs) such as laughter and crying. The system conditions generation on frame-level arousal and valence values alongside laughter embeddings, enabling nuanced, time-varying control over the emotional states of the generated speech, a capability not extensively addressed by previous TTS systems.
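To make the idea of time-varying control concrete, the sketch below constructs hypothetical per-frame control signals for a "laugh now, cry later" utterance. The frame rate, value ranges, and the use of a simple binary laughter flag in place of a learned laughter embedding are illustrative assumptions, not the paper's exact interface.

```python
import numpy as np

# Hypothetical frame-level controls for a 3-second utterance at 100 frames/s:
# begin laughing (high arousal, positive valence), end crying (negative
# valence). A learned laughter embedding would replace the binary flag.
T = 300
arousal = np.linspace(0.9, 0.6, T)                     # stays fairly energetic
valence = np.linspace(0.8, -0.8, T)                    # positive -> negative
laughter = (np.arange(T) < T // 3).astype(np.float32)  # laugh in first second
controls = np.stack([arousal, valence, laughter], axis=-1)  # shape (T, 3)
```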
Methodological Framework
EmoCtrl-TTS builds on a flow-matching-based zero-shot TTS framework by conditioning it on emotion and NV embeddings. The model is trained on over 27,000 hours of expressive real-world speech curated through pseudo-labeling, overcoming a limitation of previous models, which relied on smaller, staged datasets. Frame-level arousal and valence values give granular control over the emotional content, while laughter embeddings support the generation of NVs beyond laughter, such as crying.
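A minimal sketch of how such conditioning might look, assuming the audio model predicts a flow-matching vector field over mel-spectrogram frames and that the per-frame emotion and NV signals are simply concatenated to its input. All module names, dimensions, and the concatenation scheme here are assumptions for illustration, not the paper's published architecture.

```python
import torch
import torch.nn as nn

class EmotionConditionedVectorField(nn.Module):
    """Toy flow-matching backbone whose per-frame input is augmented with
    arousal-valence values and an NV (e.g. laughter) embedding."""

    def __init__(self, mel_dim=80, av_dim=2, nv_dim=32, hidden=512):
        super().__init__()
        # Input: [noisy mel ; masked context mel ; arousal-valence ; NV emb]
        self.in_proj = nn.Linear(2 * mel_dim + av_dim + nv_dim, hidden)
        self.t_proj = nn.Linear(1, hidden)  # embed the scalar flow time t
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out_proj = nn.Linear(hidden, mel_dim)

    def forward(self, x_t, x_ctx, aro_val, nv_emb, t):
        # x_t, x_ctx: (B, T, mel_dim); aro_val: (B, T, 2);
        # nv_emb: (B, T, nv_dim); t: (B,) flow time in [0, 1]
        h = self.in_proj(torch.cat([x_t, x_ctx, aro_val, nv_emb], dim=-1))
        h = h + self.t_proj(t[:, None])[:, None, :]  # broadcast over frames
        return self.out_proj(self.backbone(h))       # predicted vector field
```

At inference, an ODE solver would integrate this field from noise toward speech while the control trajectories vary frame by frame.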
Evaluations and Results
The model's performance was evaluated on several test sets, including a Japanese-to-English speech-to-speech translation (S2ST) scenario and datasets probing fine-grained emotional transitions and the reproduction of real laughter and crying. EmoCtrl-TTS significantly outperformed baselines such as Voicebox and ELaTE on most metrics. Objective metrics such as AutoPCP and arousal-valence similarity (Aro-Val SIM) indicated that EmoCtrl-TTS better mimics the emotional transitions of the source audio, and subjective evaluations supported these findings, with EmoCtrl-TTS achieving higher naturalness and emotion-similarity scores. However, a moderate degradation in word error rate (WER) was observed in some scenarios, suggesting room to improve intelligibility alongside emotion control.
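As one plausible reading of how an arousal-valence similarity metric could be computed (the paper's exact definition of Aro-Val SIM may differ), the snippet below compares the arousal-valence trajectories of source and generated speech, assuming a speech-emotion-recognition model has already produced per-segment (arousal, valence) estimates for each.

```python
import numpy as np

def aro_val_sim(src_av: np.ndarray, gen_av: np.ndarray) -> float:
    """Cosine similarity between two arousal-valence trajectories.

    src_av, gen_av: arrays of shape (N, 2) holding per-segment
    (arousal, valence) estimates from an SER model (extractor assumed).
    """
    a, b = src_av.ravel(), gen_av.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Example: a generated trajectory that tracks the source scores near 1.0.
src = np.column_stack([np.linspace(0.9, 0.6, 10), np.linspace(0.8, -0.8, 10)])
gen = src + 0.05 * np.random.default_rng(0).standard_normal(src.shape)
print(f"Aro-Val similarity: {aro_val_sim(src, gen):.3f}")
```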
Implications and Future Work
By enabling more expressive and emotionally rich speech synthesis, EmoCtrl-TTS has notable implications for applications requiring nuanced emotional content, such as assistive technologies, entertainment, and advanced human-computer interaction. The methodology highlights the importance of large-scale, real-world training data and of emotion representations such as the arousal-valence space for expressive TTS. Future work could focus on reducing the WER degradation and on exploring additional emotional dimensions, such as dominance, for even finer control. Research could also extend toward adaptive mechanisms in which the system dynamically adjusts its emotional output based on contextual cues and feedback.
Ultimately, EmoCtrl-TTS represents an important advance in zero-shot TTS, pushing the boundaries of how synthetic speech can convey human-like emotional depth.