- The paper introduces a novel adversarial framework that disentangles linguistic content from speaker identity for voice conversion without parallel data.
- It employs a two-stage method using an autoencoder with classifier regularization and a GAN to enhance speech naturalness and speaker fidelity.
- Experimental results on the VCTK dataset demonstrate improved linguistic fidelity and speaker similarity compared to Cycle-GAN-based models.
Multi-Target Voice Conversion Using Disentangled Representations
The paper "Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations" addresses the challenge of voice conversion (VC), aiming to transform an individual's speech to sound as if produced by a different speaker without altering the linguistic content. This task is traditionally hampered by two significant issues: the necessity for aligned, parallel data and the tendency towards signal over-smoothing, which degrades output quality. The authors propose a novel adversarial learning framework leveraging disentangled audio representations to perform VC without requiring parallel data, thus tackling these longstanding hurdles.
Methodology
The proposed approach departs from earlier models such as Cycle-GAN, which require a separate model for each source-target speaker pair, by employing a single model that can convert speech to any speaker in a predefined set. It does so by disentangling speaker characteristics from linguistic content, in two stages:
- Stage 1: Autoencoder with Classifier-1 Regularization
- Autoencoder: An autoencoder extracts speaker-independent latent representations from the audio. The encoder is meant to retain linguistic content while discarding speaker identity; the decoder then reconstructs speech from the latent representation combined with a target speaker embedding.
- Classifier-1: An auxiliary speaker classifier, trained jointly, keeps the latent representations independent of speaker characteristics. The classifier tries to identify the speaker from the latent features, while the encoder is trained adversarially to fool it, pushing speaker information out of the encoded features (a minimal training sketch follows this list).
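The following PyTorch sketch illustrates the stage-1 objective described above: reconstruction plus an adversarial speaker classifier on the latent code. The layer sizes, feature dimensionality, loss weight `LAMBDA`, and alternating update schedule are illustrative assumptions, not the authors' exact architecture or hyperparameters.

```python
# Minimal stage-1 sketch: autoencoder + adversarial speaker classifier-1.
# Shapes and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, LATENT_DIM, N_SPEAKERS, EMB_DIM = 513, 128, 20, 64

encoder = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM + EMB_DIM, 256), nn.ReLU(), nn.Linear(256, FEAT_DIM))
spk_emb = nn.Embedding(N_SPEAKERS, EMB_DIM)
clf1 = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, N_SPEAKERS))

opt_ae = torch.optim.Adam([*encoder.parameters(), *decoder.parameters(), *spk_emb.parameters()], lr=1e-4)
opt_clf = torch.optim.Adam(clf1.parameters(), lr=1e-4)
LAMBDA = 0.01  # weight of the adversarial term (assumed hyperparameter)

def stage1_step(x, speaker_id):
    """x: (batch, FEAT_DIM) spectral frames; speaker_id: (batch,) int labels."""
    # 1) Update classifier-1: predict the speaker from the (detached) latent code.
    z = encoder(x).detach()
    loss_clf = F.cross_entropy(clf1(z), speaker_id)
    opt_clf.zero_grad(); loss_clf.backward(); opt_clf.step()

    # 2) Update encoder/decoder: reconstruct the input while fooling
    #    classifier-1, squeezing speaker information out of the latent code.
    z = encoder(x)
    recon = decoder(torch.cat([z, spk_emb(speaker_id)], dim=-1))
    loss_recon = F.l1_loss(recon, x)
    loss_adv = -F.cross_entropy(clf1(z), speaker_id)  # maximize classifier loss
    loss = loss_recon + LAMBDA * loss_adv
    opt_ae.zero_grad(); loss.backward(); opt_ae.step()
    return loss_recon.item(), loss_clf.item()
```

The key design point is the sign flip on the classification loss: classifier-1 minimizes it while the encoder maximizes it, so any speaker cue classifier-1 can exploit is progressively removed from the encoded features.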
- Stage 2: Enhancing Output Quality with GAN
- A separate generator produces a residual signal that refines the output of the stage-1 decoder, improving the naturalness and perceptual quality of the converted speech.
- The generator is supervised by an adversarially trained discriminator, which pushes the synthesis toward realistic speech; a second classifier (classifier-2) ensures the synthesized speech carries the intended target speaker's identity (see the sketch after this list).
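A corresponding sketch of the stage-2 update, reusing the modules and imports from the stage-1 sketch above. The residual refinement and the discriminator/classifier-2 split follow the paper's description, while module shapes and the unit loss weights are again assumptions.

```python
# Minimal stage-2 sketch: residual generator + discriminator + classifier-2.
# Reuses encoder, decoder, spk_emb, and imports from the stage-1 sketch.
generator = nn.Sequential(nn.Linear(LATENT_DIM + EMB_DIM, 256), nn.ReLU(), nn.Linear(256, FEAT_DIM))
disc = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
clf2 = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, N_SPEAKERS))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam([*disc.parameters(), *clf2.parameters()], lr=1e-4)

def stage2_step(x, speaker_id, target_id):
    """Refine a conversion of x toward target_id; stage-1 modules stay frozen."""
    with torch.no_grad():
        cond = torch.cat([encoder(x), spk_emb(target_id)], dim=-1)
        coarse = decoder(cond)
    fake = coarse + generator(cond)  # residual refinement of the coarse output

    # 1) Discriminator/classifier-2 update: real vs. fake, speaker ID on real data.
    d_real, d_fake = disc(x), disc(fake.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
              + F.cross_entropy(clf2(x), speaker_id))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator update: fool the discriminator, and make classifier-2
    #    attribute the refined speech to the intended target speaker.
    d_fake = disc(fake)
    loss_g = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + F.cross_entropy(clf2(fake), target_id))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()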
Experimental Results and Evaluation
The authors conducted experiments on the VCTK dataset with 20 speakers, using no parallel data. Results demonstrate high quality in both linguistic fidelity and speaker similarity across conversion directions (male-to-male, female-to-male, etc.). An objective evaluation based on global variance analysis (sketched below) confirmed that the proposed method generates clearer, more natural speech spectra than variants lacking the disentangling components. Subjective listening tests corroborated these findings, indicating perceptible gains in naturalness and speaker similarity over benchmark methods, including a re-implementation of Cycle-GAN-VC.
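For concreteness, here is a minimal sketch of how a per-dimension global variance (GV) comparison might be computed with NumPy; the feature shape and the synthetic stand-in arrays are purely illustrative, since the paper's exact feature extraction is not reproduced here.

```python
# Minimal global-variance (GV) sketch: per-dimension variance of spectral
# features over all frames. Shapes and stand-in data are assumptions.
import numpy as np

def global_variance(frames: np.ndarray) -> np.ndarray:
    """frames: (n_frames, n_dims) spectral features; returns per-dimension GV."""
    return frames.var(axis=0)

natural = np.random.randn(1000, 40) * 2.0   # stand-in for natural speech features
converted = np.random.randn(1000, 40)       # stand-in for converted features
# Fraction of dimensions where conversion has lower variance than natural speech:
print((global_variance(converted) < global_variance(natural)).mean())
```

Lower GV in converted features than in natural speech is the classic signature of over-smoothing, which is exactly what this analysis checks for.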
Implications and Future Work
This work is a significant step toward streamlined, parallel-data-free voice conversion, relevant to speech synthesis, entertainment, and personalized virtual assistants. The disentangled-representation strategy improves model flexibility and resource efficiency, since a single model serves every target speaker. Future research could extend the framework to larger speaker sets, add fine-grained emotion or style control, and explore cross-lingual conversion. More broadly, advances in disentangled representation learning could improve robustness and naturalness across a range of speech tasks. The single model's adversarially learned adaptability to multi-speaker VC makes this a noteworthy contribution to the field.