- The paper introduces a two-step adversarial loss that enriches voice texture and mitigates over-smoothing in converted speech.
- It proposes a novel 2-1-2D CNN generator architecture that balances temporal dependencies with local feature fidelity.
- The CycleGAN-VC2 framework achieves significant improvements in both objective (MCD) and subjective (MOS) evaluations on the VCC 2018 dataset.
Enhanced Non-Parallel Voice Conversion with CycleGAN-VC2
The paper "CycleGAN-VC2: Improved CycleGAN-Based Non-Parallel Voice Conversion" advances the field of non-parallel voice conversion (VC) by presenting a new framework, CycleGAN-VC2. This work builds upon the original CycleGAN-VC by introducing several key enhancements that improve conversion quality without requiring parallel data, a significant advantage due to the difficulty and expense associated with obtaining such data.
Enhancements in CycleGAN-VC2
CycleGAN-VC2 incorporates three novel techniques that refine both the generator and discriminator components and enhance the loss function:
- Two-Step Adversarial Losses: To mitigate the over-smoothing induced by the cycle-consistency constraint and promote a richer texture in the converted voice, the authors apply adversarial losses twice per conversion cycle: once to the one-step converted feature and once to the cyclically reconstructed feature, judged by an additional discriminator. This yields more distinct and natural-sounding outputs (see the loss sketch after this list).
- Improved Generator Architecture (2-1-2D CNN): The generator is redesigned as a 2-1-2D convolutional network: 2D convolutions handle downsampling and upsampling, preserving the local spectro-temporal structure of the acoustic features, while 1D convolutions in the main conversion body capture wide-range temporal dependencies (an architecture sketch appears after this list).
- PatchGAN Discriminator: The discriminator adopts a PatchGAN configuration, scoring local patches of the acoustic feature map rather than evaluating the whole map at once. This reduces the number of parameters and stabilizes training (also sketched below).
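A minimal sketch of how the two adversarial terms can be combined in one conversion cycle is shown below, assuming PyTorch and hypothetical generator/discriminator modules (`G_XY`, `G_YX`, `D_Y`, `D_X2`); the least-squares GAN objective and the loss weight are illustrative rather than the authors' exact training recipe.

```python
import torch
import torch.nn.functional as F

def generator_losses_x_to_y(G_XY, G_YX, D_Y, D_X2, x, lambda_cyc=10.0):
    """One direction (X -> Y) of a CycleGAN-VC2-style generator update.

    D_Y scores the one-step converted feature, while a second discriminator
    D_X2 scores the cyclically reconstructed feature, giving the extra
    ("second-step") adversarial term that counteracts the over-smoothing
    introduced by the cycle-consistency loss. Module names are hypothetical,
    a least-squares GAN objective is assumed, and the identity-mapping loss
    used in the paper is omitted for brevity.
    """
    fake_y = G_XY(x)           # first-step conversion:  X -> Y
    cycle_x = G_YX(fake_y)     # cyclic reconstruction:  X -> Y -> X

    # Step 1: adversarial loss on the converted feature.
    pred_fake = D_Y(fake_y)
    adv1 = F.mse_loss(pred_fake, torch.ones_like(pred_fake))

    # Step 2: additional adversarial loss on the reconstructed feature
    # (the new term introduced by the two-step adversarial losses).
    pred_cycle = D_X2(cycle_x)
    adv2 = F.mse_loss(pred_cycle, torch.ones_like(pred_cycle))

    # Cycle-consistency loss keeps the linguistic content intact.
    cyc = F.l1_loss(cycle_x, x)

    return adv1 + adv2 + lambda_cyc * cyc
```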
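The following sketch illustrates the general shape of a 2-1-2D generator and a PatchGAN discriminator under the same assumptions: 2D convolutions downsample and upsample to preserve local spectro-temporal structure, a 1D convolutional body (with the frequency axis folded into channels) models temporal dependencies, and the discriminator emits a grid of patch-level scores rather than a single decision. Channel widths, kernel sizes, normalizations, and activations are placeholders, not the paper's exact configuration (which, for instance, uses gated linear units).

```python
import torch
import torch.nn as nn

class Generator212D(nn.Module):
    """Toy 2-1-2D generator: 2D conv (down) -> 1D conv body -> 2D conv (up).
    Channel widths, kernels, and depth are illustrative placeholders."""

    def __init__(self, n_feats=36, channels=64):
        super().__init__()
        # 2D downsampling: preserves local time-frequency structure.
        self.down2d = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=5, stride=2, padding=2),
            nn.InstanceNorm2d(channels), nn.ReLU(),
        )
        feat_down = n_feats // 2
        # 1D body: frequency folded into channels; convolution over the time
        # axis captures wide-range temporal dependencies.
        self.body1d = nn.Sequential(
            nn.Conv1d(channels * feat_down, 256, kernel_size=5, padding=2),
            nn.InstanceNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, channels * feat_down, kernel_size=5, padding=2),
            nn.InstanceNorm1d(channels * feat_down), nn.ReLU(),
        )
        # 2D upsampling back to the original feature resolution.
        self.up2d = nn.ConvTranspose2d(channels, 1, kernel_size=4, stride=2, padding=1)

    def forward(self, x):                        # x: (batch, 1, n_feats, frames)
        h = self.down2d(x)                       # (batch, C, n_feats/2, frames/2)
        b, c, f, t = h.shape
        h = self.body1d(h.reshape(b, c * f, t))  # 1D convolutions over time
        h = h.reshape(b, c, f, t)
        return self.up2d(h)                      # (batch, 1, n_feats, frames)


class PatchDiscriminator(nn.Module):
    """Toy PatchGAN discriminator: outputs a grid of real/fake scores, one per
    local patch of the acoustic feature map, instead of a single scalar."""

    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels * 2, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(channels * 2), nn.LeakyReLU(0.2),
            nn.Conv2d(channels * 2, 1, kernel_size=3, padding=1),  # patch logits
        )

    def forward(self, x):                        # x: (batch, 1, n_feats, frames)
        return self.net(x)                       # (batch, 1, n_feats/4, frames/4)
```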
Evaluation and Results
The authors conducted extensive experiments on the Voice Conversion Challenge (VCC) 2018 dataset, comparing CycleGAN-VC2 with the preceding CycleGAN-VC and a frame-based CycleGAN. Both the objective measure, mel-cepstral distortion (MCD), and the subjective measures, MOS for naturalness and speaker similarity, showed that CycleGAN-VC2 delivers higher-quality converted speech, with consistent gains in both intra-gender and inter-gender conversion scenarios:
- Objective Evaluation: CycleGAN-VC2 achieved a lower MCD than the baselines, indicating reduced spectral distortion between converted and target speech. MCD is a standard measure of spectral distortion and a useful proxy for the naturalness and intelligibility of converted speech; the snippet after this list shows how it is computed.
- Subjective Evaluation: CycleGAN-VC2 consistently outperformed CycleGAN-VC in listener tests, achieving higher scores in both naturalness and speaker similarity. These results suggest that the proposed modifications yield perceptible improvements in conversion quality.
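For concreteness, a small NumPy sketch of the standard MCD computation between time-aligned mel-cepstral sequences is given below; the alignment step (e.g., DTW) and the exact coefficient range used in the paper's evaluation are assumptions not shown here.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, conv_mcep):
    """Frame-averaged MCD (dB) between time-aligned mel-cepstral sequences.

    ref_mcep, conv_mcep: arrays of shape (frames, dims); the 0th (energy)
    coefficient is assumed to be excluded, and the two sequences are assumed
    to be aligned (e.g., by DTW) beforehand.
    """
    diff = ref_mcep - conv_mcep
    # Standard definition: (10 / ln 10) * sqrt(2 * sum_d diff_d^2), per frame.
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Example with random stand-in features (24 mel-cepstral coefficients):
ref = np.random.randn(200, 24)
conv = ref + 0.05 * np.random.randn(200, 24)
print(f"MCD: {mel_cepstral_distortion(ref, conv):.2f} dB")
```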
Implications and Future Directions
The proposed CycleGAN-VC2 framework advances voice conversion without the constraint of parallel training data. The improvements in both objective metrics and listener evaluations mark a substantial step toward more natural voice transformation. Importantly, the proposed techniques are not tied to one-to-one conversion and may be adapted to broader settings, such as multi-domain VC or tasks beyond voice conversion, continuing to propel research on non-parallel learning methods. Further studies might explore the integration of advanced vocoders or multi-domain training to extend these applications.
In conclusion, CycleGAN-VC2 not only enhances CycleGAN-based voice conversion but also opens avenues for efficient and robust non-parallel voice conversion systems, contributing significant value to the areas of speech processing and machine learning.