Emotional Style Transfer for Voice Conversion Using Deep Features
The paper "Seen and Unseen Emotional Style Transfer for Voice Conversion with a New Emotional Speech Dataset" introduces a novel framework for emotional voice conversion that incorporates both seen and unseen emotions without the need for parallel training data. This endeavor aligns with the growing interest in enhancing expressiveness in speech synthesis systems and developing advanced applications in conversational agents and expressive TTS.
Methodology and Framework
The proposed framework is built on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN) and leverages a pre-trained Speech Emotion Recognition (SER) model. This configuration enables emotional style transfer during both training and inference, transforming emotional prosody across diverse emotional states while preserving linguistic content and speaker identity. The system, named DeepEST (Deep Emotional Style Transfer), performs one-to-many conversion and supports both seen and unseen emotional styles.
The paper details a three-stage process:
- Emotion Descriptor Training: The authors use an SER model to extract deep emotional features from the input utterances. The architecture comprises a 3-D CNN, a BLSTM, an attention layer, and a fully connected (FC) layer, producing discriminative utterance-level features for emotion prediction (a sketch of such a feature extractor follows this list).
- Encoder-Decoder Training with VAW-GAN: The encoder learns an emotion-independent latent representation that retains phonetic and speaker information. The decoder recomposes the emotional style by conditioning on the deep emotional features and F0 contour, generating spectral features for emotional voice conversion (see the encoder-decoder sketch after this list).
- Run-time Conversion: DeepEST extracts deep emotional features from reference utterances in the target emotion via the SER model and conditions the decoder on them, enabling conversion to both seen and unseen target emotions (see the run-time conversion sketch after this list).
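To make the emotion descriptor stage concrete, the following is a minimal PyTorch sketch of an utterance-level feature extractor in the spirit of the described 3-D CNN, BLSTM, attention, and FC pipeline. All layer sizes, the input feature layout (static, delta, and delta-delta mel-spectrogram stacks), and module names are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DeepEmotionFeatureExtractor(nn.Module):
    """Illustrative SER-style extractor: 3-D CNN -> BLSTM -> attention -> FC.
    Dimensions and the input layout are assumptions, not the paper's exact setup."""

    def __init__(self, n_mels=40, n_deltas=3, hidden=128, emb_dim=64, n_emotions=5):
        super().__init__()
        # 3-D convolution over (delta channel, time, mel) volumes.
        self.cnn = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(n_deltas, 5, 5), padding=(0, 2, 2)),
            nn.BatchNorm3d(16),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.blstm = nn.LSTM(16 * (n_mels // 2), hidden,
                             batch_first=True, bidirectional=True)
        # Additive attention pools frame-level states into one utterance vector.
        self.attn = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, emb_dim)        # deep emotional feature
        self.classifier = nn.Linear(emb_dim, n_emotions)

    def forward(self, x):
        # x: (batch, 1, n_deltas, time, n_mels) -- static + delta + delta-delta stack
        h = self.cnn(x)                                  # (B, 16, 1, T', n_mels // 2)
        b, c, d, t, m = h.shape
        h = h.permute(0, 3, 1, 2, 4).reshape(b, t, c * d * m)
        h, _ = self.blstm(h)                             # (B, T', 2 * hidden)
        w = torch.softmax(self.attn(h), dim=1)           # attention weights over time
        utt = (w * h).sum(dim=1)                         # utterance-level summary
        emb = torch.tanh(self.fc(utt))                   # deep emotional feature
        return emb, self.classifier(emb)                 # feature and emotion logits
```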
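The encoder-decoder stage can be sketched as follows. The encoder produces a latent code intended to carry only content and speaker information, and the decoder reconstructs spectral frames conditioned on that code, the deep emotional feature, and an F0 contour. The conditioning scheme, dimensions, and layer choices are assumptions for illustration; the Wasserstein discriminator and the VAW-GAN training losses are omitted for brevity.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps spectral frames to an emotion-independent latent code (illustrative)."""
    def __init__(self, n_spec=36, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_spec, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)

    def forward(self, x):                                # x: (B, T, n_spec)
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar

class Decoder(nn.Module):
    """Recomposes spectral frames from the latent code, conditioned on the
    deep emotional feature and F0 (illustrative conditioning)."""
    def __init__(self, n_spec=36, z_dim=64, emo_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + emo_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, n_spec),
        )

    def forward(self, z, emo_emb, f0):
        # z: (B, T, z_dim); emo_emb: (B, emo_dim); f0: (B, T, 1)
        emo = emo_emb.unsqueeze(1).expand(-1, z.size(1), -1)  # broadcast over frames
        return self.net(torch.cat([z, emo, f0], dim=-1))
```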
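At run time, the three components are chained: the SER extracts a deep emotional feature from a reference utterance of the target emotion, the encoder strips emotional style from the source, and the decoder resynthesizes spectral features under the reference style. The wiring below is a hypothetical composition of the sketches above; vocoder analysis, waveform synthesis, and any F0 transformation are outside its scope.

```python
import torch

@torch.no_grad()
def convert(source_spec, source_f0, reference_mel3d, ser, encoder, decoder):
    """Convert source spectral features toward the emotion of a reference utterance.

    source_spec:     (1, T, n_spec)           source spectral features (vocoder analysis)
    source_f0:       (1, T, 1)                source F0 contour
    reference_mel3d: (1, 1, 3, T_r, n_mels)   reference utterance in the target emotion
    """
    emo_emb, _ = ser(reference_mel3d)       # deep emotional feature of the target style
    z, _, _ = encoder(source_spec)          # emotion-independent content representation
    # NOTE: passing the source F0 unchanged is a simplifying assumption here.
    converted_spec = decoder(z, emo_emb, source_f0)
    return converted_spec                   # feed to a vocoder to synthesize the waveform
```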
Dataset Contribution
In addition to the framework, the paper announces the release of the Emotional Speech Dataset (ESD), which contains parallel utterances across multiple speakers and languages, providing a valuable resource for the voice conversion research community. The dataset offers an essential foundation for training and evaluating emotional voice conversion models under varied conditions.
Experimental Results
Objective evaluations using Mel-cepstral distortion (MCD) show that DeepEST consistently outperforms the VAW-GAN-EVC baseline in seen emotion transfer, and that it achieves comparable results for unseen emotion transfer, highlighting its robustness and flexibility. Subjective evaluations, including mean opinion score (MOS) and AB preference tests, corroborate the favorable performance of DeepEST in terms of speech quality and emotion similarity.
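For reference, Mel-cepstral distortion measures the spectral distance between converted and target utterances, with lower values indicating better spectral conversion. A common per-frame formulation, assuming time-aligned Mel-cepstral coefficient sequences with the 0th (energy) coefficient excluded, is sketched below; frame alignment (e.g., via dynamic time warping) is assumed to have been done beforehand.

```python
import numpy as np

def mel_cepstral_distortion(mc_converted, mc_target):
    """MCD in dB between two aligned Mel-cepstral sequences of shape (T, D).
    Assumes the 0th coefficient is dropped and frames are already aligned."""
    diff = mc_converted - mc_target
    # Standard constant 10 * sqrt(2) / ln(10) applied to per-frame Euclidean distance.
    const = 10.0 * np.sqrt(2.0) / np.log(10.0)
    return float(np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1))))
```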
Implications and Future Directions
This research offers promising advancements for emotional voice conversion, with potential impact on expressive speech synthesis and human-machine interaction. The findings motivate further work on improving SER performance across emotional states and on exploring multi-lingual and cross-lingual applications built on the ESD.
The framework's ability to generalize to unseen emotional styles without parallel data marks a significant step in emotional voice conversion and points to future research on synthesis systems with a broader emotional palette. Such work could move the field toward more natural, emotionally aware artificial speech, benefiting applications in storytelling, virtual assistants, and immersive digital experiences.
Overall, the paper makes a significant contribution to emotional voice conversion and lays the groundwork for further exploration of nuanced speech synthesis scenarios.