- The paper introduces a novel VAW-GAN framework that performs voice conversion using unaligned corpora.
- It decouples phonetic content from speaker characteristics by combining a variational autoencoder with a Wasserstein GAN.
- Subjective evaluations on the Voice Conversion Challenge 2016 dataset demonstrate significant improvements in naturalness over a VAE baseline.
Essay on Voice Conversion from Unaligned Corpora Using VAW-GAN
The paper by Chin-Cheng Hsu et al. presents a framework for voice conversion (VC) that leverages Variational Autoencoding Wasserstein Generative Adversarial Networks (VAW-GAN). It addresses the challenge of non-parallel VC, where no parallel corpus links the source and target speakers. Traditional VC systems require parallel corpora, in which both speakers utter identical sentences, a condition that is often infeasible in practice. The proposed method sidesteps this requirement by relying on generative models that learn directly from unaligned speech data.
Framework Overview
The VAW-GAN framework combines two components: a variational autoencoder (VAE) and a Wasserstein generative adversarial network (W-GAN). The VAE encodes speaker-independent phonetic content, while the W-GAN critic pushes the generated spectra to be distributionally close to real target speech, improving perceptual quality. This combination lets the framework dispense with explicit alignment or clustering of speech frames, which is what makes non-parallel VC possible.
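The VAE side of this combination optimizes the familiar evidence lower bound. As a minimal numpy sketch (not the paper's implementation; the function name, loss form, and dimensions are illustrative assumptions), a per-batch negative ELBO with a squared-error reconstruction term and a KL penalty toward a standard normal prior looks like:

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar):
    """Negative ELBO for one batch: squared-error reconstruction term
    plus KL divergence from N(mu, diag(exp(logvar))) to N(0, I)."""
    recon = np.sum((x - x_recon) ** 2, axis=1)
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    return float(np.mean(recon + kl))
```

In the full framework this reconstruction objective is combined with an adversarial loss on the decoder's output, so the decoder is trained both to reconstruct and to fool the critic.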
Model Formulation
The VAW-GAN divides the VC process into two stages. First, a speaker-independent encoder extracts latent phonetic representations from spectral frames. Then, a speaker-dependent decoder, conditioned on the target speaker's identity, reconstructs spectral frames from those latent codes. Mathematically, the conversion function is trained so that the distribution of its outputs approximates the target speaker's data distribution as closely as possible.
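The encode-then-decode decomposition can be sketched as follows. This is a toy numpy illustration in which simple linear maps stand in for the learned networks; all names and dimensions are assumptions for illustration, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): 513-bin spectral frames,
# a 64-dim latent phonetic code, and 10 speakers identified one-hot.
N_FFT_BINS, LATENT_DIM, N_SPEAKERS = 513, 64, 10

# Hypothetical linear stand-ins for the learned encoder/decoder networks.
W_enc = rng.standard_normal((N_FFT_BINS, LATENT_DIM)) * 0.01
W_dec = rng.standard_normal((LATENT_DIM + N_SPEAKERS, N_FFT_BINS)) * 0.01

def encode(frames):
    """Speaker-independent encoder: spectral frames -> latent phonetic codes."""
    return frames @ W_enc

def decode(z, speaker_id):
    """Speaker-dependent decoder: latent codes plus a one-hot target-speaker
    code are mapped back to spectral frames."""
    y = np.zeros((len(z), N_SPEAKERS))
    y[:, speaker_id] = 1.0  # condition the decoder on the target speaker
    return np.concatenate([z, y], axis=1) @ W_dec

# Conversion = encode source frames, then decode with the target identity.
src_frames = rng.standard_normal((100, N_FFT_BINS))
converted = decode(encode(src_frames), speaker_id=3)
```

The key point the sketch captures is that the speaker identity enters only at the decoder, so swapping the one-hot code changes the voice while the latent phonetic content is unchanged.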
The incorporation of the W-GAN introduces an adversarial loss tailored to non-parallel VC that directly optimizes the quality of the generated spectra. The Wasserstein objective is chosen over the conventional Jensen-Shannon divergence because the Wasserstein distance remains meaningful, and yields usable gradients, even when the real and generated distributions barely overlap.
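Under the standard WGAN formulation, the critic and generator objectives reduce to differences of mean critic scores. A minimal numpy sketch (with hypothetical score arrays rather than the paper's networks):

```python
import numpy as np

def wgan_critic_loss(d_real, d_fake):
    """Critic maximizes E[D(real)] - E[D(fake)]; we return the negation,
    so gradient descent on this value trains the critic."""
    return float(np.mean(d_fake) - np.mean(d_real))

def wgan_generator_loss(d_fake):
    """Generator (here, the VAE decoder) maximizes E[D(fake)],
    i.e. minimizes -E[D(fake)]."""
    return float(-np.mean(d_fake))
```

Unlike the saturating log-losses of the original GAN, these linear objectives keep a useful training signal even when the critic confidently separates real from generated spectra.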
Experimental Results
The authors evaluate their framework on the Voice Conversion Challenge 2016 dataset, covering both intra-gender and inter-gender conversion tasks. Subjective evaluation via Mean Opinion Score (MOS) reveals a significant improvement in naturalness for VAW-GAN outputs over the VAE baseline. Objective measures such as mel-cepstral distortion, however, did not always agree with the subjective results, highlighting a well-known difficulty in speech synthesis evaluation that merits further investigation.
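Mel-cepstral distortion is conventionally computed per frame from the mel-cepstral coefficients of time-aligned reference and converted utterances, excluding the 0th (energy) coefficient. A minimal numpy sketch of the common dB formula, assuming the coefficient sequences are already aligned (alignment itself, e.g. by DTW, is out of scope here):

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_conv):
    """Frame-averaged MCD in dB between two aligned mel-cepstral sequences
    of shape (frames, order); the 0th (energy) coefficient is excluded."""
    diff = mc_ref[:, 1:] - mc_conv[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower MCD means the converted spectra are closer to the reference, but as the results above illustrate, a lower MCD does not guarantee higher perceived naturalness.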
Implications and Future Directions
This paper establishes a framework that performs effective VC without parallel data while preserving the detailed spectral characteristics needed for intelligibility and speech quality. Potential applications span personalized voice assistants, cross-lingual speech synthesis, and other tailored communication tools.
Future research could explore alternative probabilistic graphical models (PGMs) to better decompose speaker and phonetic content, possibly improving speaker similarity in the synthesized voice. Further advances might also investigate dynamic modeling of speaker representations to accommodate more complex linguistic features, extending the system's applicability across diverse languages and tonal variations.
In summary, the paper provides a comprehensive exploration of the challenges faced in non-parallel VC and proposes a methodologically sound solution through the VAW-GAN. It sets a benchmark for future research aiming to develop more generalized and adaptable voice conversion systems.