- The paper introduces CycleGAN-VC, a novel method that leverages cycle-consistent adversarial networks to convert voices without requiring parallel training data.
- It employs gated CNNs and an identity-mapping loss to effectively capture speech structures and reduce the over-smoothing seen in traditional methods.
- Evaluations show improved global variance and natural-sounding audio, demonstrating competitive performance even with limited training data.
Overview of Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
This paper presents an approach to voice conversion (VC) based on Cycle-Consistent Adversarial Networks (CycleGANs) that requires no parallel data, termed CycleGAN-VC. Developed by Takuhiro Kaneko and Hirokazu Kameoka of NTT Communication Science Laboratories, the method aims to mitigate issues such as the over-smoothing effect commonly observed in conventional Gaussian mixture model (GMM)-based methods.
Key Features and Methodology
The method leverages CycleGAN, initially created for image-to-image translation using unpaired data, and adapts it to VC tasks. A noteworthy aspect of this approach is its use of gated convolutional neural networks (CNNs) and an identity-mapping loss, enabling it to learn transformations from source to target speech effectively.
- CycleGAN Architecture: The core idea is to use adversarial and cycle-consistency losses to learn forward and inverse mappings simultaneously. This framework finds an optimal pseudo pair from the unpaired data, removing the need for parallel training pairs altogether.
- Gated CNNs: They enhance the model's ability to capture sequential and hierarchical structures within speech data, which are critical for maintaining linguistic content.
- Identity-Mapping Loss: This feature aids in preserving the linguistic information during the conversion process by promoting the identity mapping of the data, thereby discouraging the generator from altering structures unnecessarily.
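The gated CNN mechanism mentioned above is, at its core, a gated linear unit: one convolution produces a linear path and a second convolution produces a data-driven sigmoid gate that modulates it elementwise. The sketch below illustrates the idea on a scalar 1-D sequence; the function names and the toy valid-mode convolution are my own simplifications, not the paper's architecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv1d(seq, kernel):
    """Valid 1-D convolution (cross-correlation) over a scalar sequence."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def gated_conv1d(seq, linear_kernel, gate_kernel):
    """Gated linear unit: the linear path is multiplied elementwise by a
    sigmoid gate, each path produced by its own convolution."""
    linear = conv1d(seq, linear_kernel)
    gate = conv1d(seq, gate_kernel)
    return [a * sigmoid(b) for a, b in zip(linear, gate)]
```

Because the gate is itself learned from the input, the network can selectively pass or suppress information at each position, which is what lets gated CNNs track the sequential and hierarchical structure of speech.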
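The three losses in the list above combine into a single training objective: adversarial losses for both mapping directions, an L1 cycle-consistency term, and an L1 identity-mapping term. A minimal sketch follows, with generators represented as plain callables on feature vectors; the weight values are illustrative defaults, and all names here are my own rather than the paper's.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(g_xy, g_yx, x, y):
    """||G_YX(G_XY(x)) - x||_1 + ||G_XY(G_YX(y)) - y||_1:
    a round trip through both generators should reconstruct the input."""
    return l1(g_yx(g_xy(x)), x) + l1(g_xy(g_yx(y)), y)

def identity_mapping_loss(g_xy, g_yx, x, y):
    """||G_XY(y) - y||_1 + ||G_YX(x) - x||_1: a generator fed an input
    already in its target domain should leave it unchanged."""
    return l1(g_xy(y), y) + l1(g_yx(x), x)

def full_objective(adv_xy, adv_yx, g_xy, g_yx, x, y,
                   lambda_cyc=10.0, lambda_id=5.0):
    """Adversarial terms (precomputed scalars here) plus weighted
    cycle-consistency and identity terms."""
    return (adv_xy + adv_yx
            + lambda_cyc * cycle_consistency_loss(g_xy, g_yx, x, y)
            + lambda_id * identity_mapping_loss(g_xy, g_yx, x, y))
```

The identity term acts as a regularizer: it penalizes the generator for altering inputs that already look like the target domain, which discourages it from changing linguistic content unnecessarily.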
Results and Evaluation
The method was rigorously evaluated using the Voice Conversion Challenge 2016 dataset. The results from both objective and subjective evaluations are noteworthy:
- Objective Evaluation: Quantitative analysis showed improvements in metrics such as global variance (GV) and modulation spectra (MS), indicating less over-smoothing and more natural-sounding converted speech features compared to GMM-based approaches.
- Subjective Evaluation: Listening tests confirmed that the audio quality was comparable to that of VC systems trained on parallel data. This is notable given that the method was evaluated under less favorable conditions, using half the amount of training data of the competing methods.
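The global variance metric cited in the objective evaluation is simply the per-dimension variance of the converted feature vectors over time; over-smoothed conversions show reduced GV relative to natural speech. A minimal sketch (the function name and plain-list representation are my own):

```python
def global_variance(frames):
    """Per-dimension variance of feature vectors over time.

    `frames` is a list of equal-length feature vectors (one per frame).
    Returns one variance value per feature dimension; low values relative
    to natural speech indicate over-smoothing.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [sum((f[d] - means[d]) ** 2 for f in frames) / n
            for d in range(dims)]
```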
Implications and Future Directions
The research establishes a framework for VC that operates independently of aligned parallel datasets, broadening the applicability of VC systems to scenarios where acquiring such data is impractical. CycleGAN-VC could benefit a range of VC applications, such as text-to-speech (TTS) systems and voice personalization.
Looking ahead, further refinement could involve experimenting with other acoustic features and extending the methodology to vocoder-free speech synthesis frameworks to improve the quality of the converted speech. The paper also suggests that the CycleGAN-VC framework may prove useful in many VC-related applications beyond speaker conversion, and could thus shape the future trajectory of research and development in the VC domain.