
Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder (1610.04019v1)

Published 13 Oct 2016 in stat.ML, cs.LG, and cs.SD

Abstract: We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora. Many SC frameworks require parallel corpora, phonetic alignments, or explicit frame-wise correspondence for learning conversion functions or for synthesizing a target spectrum with the aid of alignments. However, these requirements gravely limit the scope of practical applications of SC due to scarcity or even unavailability of parallel corpora. We propose an SC framework based on variational auto-encoder which enables us to exploit non-parallel corpora. The framework comprises an encoder that learns speaker-independent phonetic representations and a decoder that learns to reconstruct the designated speaker. It removes the requirement of parallel corpora or phonetic alignments to train a spectral conversion system. We report objective and subjective evaluations to validate our proposed method and compare it to SC methods that have access to aligned corpora.

Citations (296)

Summary

  • The paper proposes a variational auto-encoder (VAE) framework for spectral conversion that works with non-parallel corpora, eliminating the need for frame alignment.
  • Experiments show the VAE method achieves performance comparable to baseline systems using aligned data, even with disjoint datasets.
  • This framework enables voice conversion from non-parallel data, expanding applicability and paving the way for potential many-to-many conversion capabilities.

Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder

The paper introduces a framework for spectral conversion (SC) designed to exploit non-parallel corpora for voice conversion. The approach leverages variational auto-encoders (VAEs) to perform spectral conversion without requiring parallel corpora or phonetic alignments. This significantly diverges from conventional SC systems, which depend on highly structured parallel datasets and on dynamic time warping (DTW) for frame alignment.
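To make concrete what the proposed framework removes, here is a minimal sketch of the DTW alignment step that conventional, parallel-corpus SC pipelines rely on. This is an illustrative implementation, not code from the paper:

```python
# Minimal DTW alignment of two spectral-frame sequences, as used by
# conventional parallel-corpus SC pipelines (and avoided by the VAE
# framework discussed here). Hypothetical illustration only.
import numpy as np

def dtw_align(source, target):
    """Align frame sequences of shape (T1, D) and (T2, D) by DTW.

    Returns a list of (source_index, target_index) pairs on the optimal path.
    """
    t1, t2 = len(source), len(target)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(source[i - 1] - target[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Backtrack from (t1, t2) to recover the alignment path.
    path, i, j = [], t1, t2
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```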

Key Contributions and Methodology

The proposed framework operates by decomposing the voice conversion task into encoding and decoding stages. The encoder is tasked with extracting speaker-independent phonetic representations from spectral frames, essentially filtering out speaker-specific information. Subsequently, the decoder reconstructs the target speaker's voice using a combination of this phonetic representation and a speaker-specific latent variable. This dual representation bypasses the need for explicit frame alignment, marking a practical advancement in leveraging non-parallel corpora for SC tasks.
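The sketch below illustrates this encode/decode decomposition in PyTorch: the encoder maps a spectral frame to a phonetic latent, and the decoder reconstructs a frame from that latent together with a one-hot speaker code, so conversion amounts to decoding with the target speaker's code. Layer sizes, module names, and the one-hot conditioning scheme are assumptions for illustration; the paper's exact architecture may differ.

```python
import torch
import torch.nn as nn

FRAME_DIM, LATENT_DIM, N_SPEAKERS, HIDDEN = 513, 64, 2, 256  # assumed sizes

class Encoder(nn.Module):
    """Maps a spectral frame to a speaker-independent phonetic latent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FRAME_DIM, HIDDEN), nn.ReLU())
        self.mu = nn.Linear(HIDDEN, LATENT_DIM)
        self.logvar = nn.Linear(HIDDEN, LATENT_DIM)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Reconstructs a frame from the phonetic latent plus a speaker code."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + N_SPEAKERS, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, FRAME_DIM))

    def forward(self, z, speaker_onehot):
        return self.net(torch.cat([z, speaker_onehot], dim=-1))

def convert(encoder, decoder, x_src, target_speaker_id):
    """Convert source frames by decoding their phonetic latents with the
    target speaker's code -- no frame alignment required."""
    mu, _ = encoder(x_src)  # use the posterior mean at conversion time
    y = torch.eye(N_SPEAKERS)[target_speaker_id].expand(x_src.size(0), -1)
    return decoder(mu, y)
```

Because the speaker identity enters only through the code fed to the decoder, swapping that code is all it takes to change the output speaker.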

From a technical perspective, the VAE formulation yields a probabilistic model in which the latent space captures phonetic characteristics devoid of speaker identity. The encoder and decoder are neural networks with ReLU activations, trained with stochastic gradient descent to maximize the variational lower bound (ELBO).
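For reference, the standard negative-ELBO objective such a system would minimize combines a frame-reconstruction term with a KL divergence pulling the phonetic posterior toward a unit Gaussian. Using squared error for the reconstruction term is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term: -log p(x | z, speaker) up to a constant,
    # assuming a Gaussian observation model (squared error).
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```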

Experimental Validation

The experiments demonstrate that the VAE-based method delivers spectral conversion performance comparable to baseline systems that rely on aligned data, such as exemplar-based non-negative matrix factorization (ENMF). Objective evaluations using mean Mel-cepstral distortion (MCD) and subjective listening tests affirm the efficacy of the proposed framework. Notably, even when trained on disjoint datasets (VAE-disj), where no sentence overlap exists between source and target speakers, the system maintains robust performance.
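As a reference for the objective metric, the sketch below computes mean MCD over already-aligned Mel-cepstral coefficient (MCC) frames. Excluding the 0th (energy) coefficient is the common convention; treat the details as assumptions rather than the paper's exact evaluation protocol:

```python
import numpy as np

def mean_mcd(mcc_converted, mcc_target):
    """Mean Mel-cepstral distortion in dB over MCC arrays of shape (T, D)."""
    diff = mcc_converted[:, 1:] - mcc_target[:, 1:]  # drop the energy term
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```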

Implications and Future Directions

This framework has significant implications for the field of voice conversion, especially in scenarios where parallel data is scarce or unavailable. The ability to learn from non-parallel data broadens the applicability of SC models across linguistic and speaker datasets. Furthermore, the VAE approach points toward many-to-many (M2M) voice conversion, enabling conversion among an arbitrary number of source and target speakers.

Future research could explore integrating an additional speaker recognition network to enhance the many-to-many conversion capability. This would entail developing methods to efficiently identify speaker characteristics from limited data inputs, further broadening the utility and adaptability of voice conversion systems.

Conclusion

Overall, the paper presents a compelling advancement in voice conversion technology through the innovative use of variational auto-encoders. By eliminating the dependency on parallel data for training, this work sets a foundation for more versatile and accessible SC systems. Continued exploration and refinement of this approach could lead to significant developments in cross-linguistic and multi-speaker voice conversion applications.