Emotional Style Transfer for Voice Conversion Using Deep Features
The paper "Seen and Unseen Emotional Style Transfer for Voice Conversion with a New Emotional Speech Dataset" introduces a novel framework for emotional voice conversion that incorporates both seen and unseen emotions without the need for parallel training data. This endeavor aligns with the growing interest in enhancing expressiveness in speech synthesis systems and developing advanced applications in conversational agents and expressive TTS.
Methodology and Framework
The proposed framework is built on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN) and leverages a pre-trained Speech Emotion Recognition (SER) model. This configuration enables emotional style transfer during both training and inference, transforming emotional prosody across diverse emotional states while preserving linguistic content and speaker identity. The system, named DeepEST (Deep Emotional Style Transfer), performs one-to-many conversion and supports both seen and unseen emotional styles.
The paper details a three-stage process:
- Emotion Descriptor Training: The authors use an SER model to extract deep emotional features from the input utterances. The architecture comprises a 3-D CNN, a BLSTM, an attention layer, and a fully connected (FC) layer, producing discriminative utterance-level features for emotion prediction (a sketch of such a feature extractor follows this list).
- Encoder-Decoder Training with VAW-GAN: The encoder learns an emotion-independent latent representation that retains phonetic and speaker information. The decoder recomposes the emotional style by conditioning on the deep emotional features and F0 contour, generating spectral features for emotional voice conversion (see the encoder-decoder sketch after this list).
- Run-time Conversion: DeepEST extracts deep emotional features from reference utterances in the target emotion via the SER model and conditions the decoder on them, enabling conversion to both seen and unseen target emotions (see the run-time conversion sketch after this list).
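To make the emotion descriptor stage concrete, the following is a minimal PyTorch sketch of an utterance-level feature extractor in the spirit of the described 3-D CNN, BLSTM, attention, and FC pipeline. All layer sizes, the input feature layout (static, delta, and delta-delta mel-spectrogram stacks), and module names are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DeepEmotionFeatureExtractor(nn.Module):
    """Illustrative SER-style extractor: 3-D CNN -> BLSTM -> attention -> FC.
    Dimensions and the input layout are assumptions, not the paper's exact setup."""

    def __init__(self, n_mels=40, n_deltas=3, hidden=128, emb_dim=64, n_emotions=5):
        super().__init__()
        # 3-D convolution over (delta channel, time, mel) volumes.
        self.cnn = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(n_deltas, 5, 5), padding=(0, 2, 2)),
            nn.BatchNorm3d(16),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.blstm = nn.LSTM(16 * (n_mels // 2), hidden,
                             batch_first=True, bidirectional=True)
        # Additive attention pools frame-level states into one utterance vector.
        self.attn = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, emb_dim)        # deep emotional feature
        self.classifier = nn.Linear(emb_dim, n_emotions)

    def forward(self, x):
        # x: (batch, 1, n_deltas, time, n_mels) -- static + delta + delta-delta stack
        h = self.cnn(x)                                  # (B, 16, 1, T', n_mels // 2)
        b, c, d, t, m = h.shape
        h = h.permute(0, 3, 1, 2, 4).reshape(b, t, c * d * m)
        h, _ = self.blstm(h)                             # (B, T', 2 * hidden)
        w = torch.softmax(self.attn(h), dim=1)           # attention weights over time
        utt = (w * h).sum(dim=1)                         # utterance-level summary
        emb = torch.tanh(self.fc(utt))                   # deep emotional feature
        return emb, self.classifier(emb)                 # feature and emotion logits
```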
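The encoder-decoder stage can be sketched as follows. The encoder produces a latent code intended to carry only content and speaker information, and the decoder reconstructs spectral frames conditioned on that code, the deep emotional feature, and an F0 contour. The conditioning scheme, dimensions, and layer choices are assumptions for illustration; the Wasserstein discriminator and the VAW-GAN training losses are omitted for brevity.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps spectral frames to an emotion-independent latent code (illustrative)."""
    def __init__(self, n_spec=36, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_spec, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)

    def forward(self, x):                                # x: (B, T, n_spec)
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar

class Decoder(nn.Module):
    """Recomposes spectral frames from the latent code, conditioned on the
    deep emotional feature and F0 (illustrative conditioning)."""
    def __init__(self, n_spec=36, z_dim=64, emo_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + emo_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, n_spec),
        )

    def forward(self, z, emo_emb, f0):
        # z: (B, T, z_dim); emo_emb: (B, emo_dim); f0: (B, T, 1)
        emo = emo_emb.unsqueeze(1).expand(-1, z.size(1), -1)  # broadcast over frames
        return self.net(torch.cat([z, emo, f0], dim=-1))
```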
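At run time, the three components are chained: the SER extracts a deep emotional feature from a reference utterance of the target emotion, the encoder strips emotional style from the source, and the decoder resynthesizes spectral features under the reference style. The wiring below is a hypothetical composition of the sketches above; vocoder analysis, waveform synthesis, and any F0 transformation are outside its scope.

```python
import torch

@torch.no_grad()
def convert(source_spec, source_f0, reference_mel3d, ser, encoder, decoder):
    """Convert source spectral features toward the emotion of a reference utterance.

    source_spec:     (1, T, n_spec)           source spectral features (vocoder analysis)
    source_f0:       (1, T, 1)                source F0 contour
    reference_mel3d: (1, 1, 3, T_r, n_mels)   reference utterance in the target emotion
    """
    emo_emb, _ = ser(reference_mel3d)       # deep emotional feature of the target style
    z, _, _ = encoder(source_spec)          # emotion-independent content representation
    # NOTE: passing the source F0 unchanged is a simplifying assumption here.
    converted_spec = decoder(z, emo_emb, source_f0)
    return converted_spec                   # feed to a vocoder to synthesize the waveform
```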
Dataset Contribution
In addition to the framework, the paper announces the release of the Emotional Speech Dataset (ESD), which contains parallel utterances across multiple speakers and languages, providing a valuable resource for the voice conversion research community. The dataset offers an essential foundation for training and evaluating emotional voice conversion models under varied conditions.
Experimental Results
Objective evaluations using Mel-cepstral distortion (MCD) show that DeepEST consistently outperforms the VAW-GAN-EVC baseline in seen emotion transfer, and that it achieves comparable results for unseen emotion transfer, highlighting its robustness and flexibility. Subjective evaluations, including mean opinion score (MOS) and AB preference tests, corroborate the favorable performance of DeepEST in terms of speech quality and emotion similarity.
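For reference, Mel-cepstral distortion measures the spectral distance between converted and target utterances, with lower values indicating better spectral conversion. A common per-frame formulation, assuming time-aligned Mel-cepstral coefficient sequences with the 0th (energy) coefficient excluded, is sketched below; frame alignment (e.g., via dynamic time warping) is assumed to have been done beforehand.

```python
import numpy as np

def mel_cepstral_distortion(mc_converted, mc_target):
    """MCD in dB between two aligned Mel-cepstral sequences of shape (T, D).
    Assumes the 0th coefficient is dropped and frames are already aligned."""
    diff = mc_converted - mc_target
    # Standard constant 10 * sqrt(2) / ln(10) applied to per-frame Euclidean distance.
    const = 10.0 * np.sqrt(2.0) / np.log(10.0)
    return float(np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1))))
```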
Implications and Future Directions
This research offers promising advancements for emotional voice conversion, with potential impact on expressive speech synthesis and human-machine interaction. The findings motivate further work on improving SER performance across emotional states and on exploring multi-lingual and cross-lingual applications built on the ESD.
The framework's ability to generalize to unseen emotional styles without parallel data marks a significant step in emotional voice conversion and points to future research on synthesis systems with a broader emotional palette. Such work could move the field toward more natural, emotionally aware artificial speech, benefiting applications in storytelling, virtual assistants, and immersive digital experiences.
Overall, the paper makes a significant contribution to emotional voice conversion and lays the groundwork for further exploration of nuanced speech synthesis scenarios.