An Analysis of Speaker and Emotional Voice Conversion using VQ-VAE
This paper investigates enhancing voice conversion systems with the Vector Quantized Variational Autoencoder (VQ-VAE) architecture, focusing on the challenges of preserving speaker identity and prosody in converted speech. By integrating acoustic feature embeddings and exploring suitable speaker representations, the authors aim to build a system for high-quality, flexible voice conversion that can control multiple aspects of speech, including speaker identity, gender, emotion, and prosody.
Introduction
The central objective of voice conversion is to modify an utterance from a source speaker so that it sounds as if it were produced by a target speaker. Many contemporary voice conversion techniques, however, suffer from distortion of speaker identity and prosody. The VQ-VAE model, an extension of the classic VAE, has shown considerable promise in recent voice conversion work. Unlike traditional autoencoders and VAEs, VQ-VAE employs discrete latent representations, which make it better suited to capturing high-level features of speech such as its semantic content.
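To make the discrete-latent idea concrete, the following is a minimal sketch (not the paper's implementation) of the nearest-neighbour codebook lookup at the core of VQ-VAE; the PyTorch framing, function name, and tensor shapes are assumptions for illustration.

```python
# Minimal sketch of VQ-VAE's discrete-latent lookup, which distinguishes it
# from a standard VAE. Shapes and names are illustrative assumptions.
import torch

def nearest_code(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Replace each continuous latent in z_e (T, D) with its nearest
    entry in the codebook (K, D), yielding one discrete code per frame."""
    dist = torch.cdist(z_e, codebook)     # (T, K) pairwise Euclidean distances
    return codebook[dist.argmin(dim=-1)]  # (T, D) quantized latents
```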
Methodology
The authors propose several methodologies within the VQ-VAE framework to improve voice conversion quality. Key to their approach is the use of speaker representations, or embeddings, such as speaker identity vectors and emotional embeddings, to represent and control speaker identity and speaking style during conversion. An encoder network maps the input speech signal to continuous latent vectors, which are then quantized against a learned embedding dictionary (codebook). This discretization of the latent space allows the decoder to reconstruct speech more faithfully.
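A minimal sketch of how such a pipeline might be wired together is given below, with a learned speaker-embedding table conditioning the decoder; the class name, layer sizes, and concatenation scheme are hypothetical, not the paper's configuration.

```python
# Illustrative VQ-VAE voice-conversion skeleton with speaker conditioning.
# All module sizes and the conditioning scheme are assumptions for exposition.
import torch
import torch.nn as nn

class VQVoiceConverter(nn.Module):
    def __init__(self, n_mels=80, latent_dim=64, n_codes=512,
                 n_speakers=100, spk_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(            # mel frames -> continuous latents
            nn.Linear(n_mels, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.codebook = nn.Embedding(n_codes, latent_dim)   # embedding dictionary
        self.spk_table = nn.Embedding(n_speakers, spk_dim)  # speaker identity vectors
        self.decoder = nn.Sequential(            # latents + speaker -> mel frames
            nn.Linear(latent_dim + spk_dim, 256), nn.ReLU(), nn.Linear(256, n_mels))

    def forward(self, mels, speaker_id):
        # mels: (B, T, n_mels); speaker_id: (B,)
        z_e = self.encoder(mels)                                  # (B, T, D)
        codes = self.codebook.weight                              # (K, D)
        dist = torch.cdist(z_e, codes.expand(z_e.size(0), -1, -1))
        z_q = self.codebook(dist.argmin(dim=-1))                  # nearest codes
        z_q = z_e + (z_q - z_e).detach()                          # straight-through gradient
        spk = self.spk_table(speaker_id).unsqueeze(1)             # (B, 1, spk_dim)
        spk = spk.expand(-1, mels.size(1), -1)                    # broadcast over time
        return self.decoder(torch.cat([z_q, spk], dim=-1))        # reconstructed mels
```

Under this framing, conversion amounts to encoding a source utterance and decoding with the target speaker's embedding, so the discrete latents carry the content while the embedding carries the identity.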
The paper further explores integrating acoustic features such as the fundamental frequency (F0) and spectral encodings into this architecture. By quantizing these features and incorporating them into the VQ-VAE framework, the authors aim to control prosody systematically and enhance speech quality.
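As a concrete illustration of this kind of feature quantization, the sketch below discretizes a frame-level F0 contour into coarse log-scale bins that could serve as a prosody code; the bin count, frequency range, and function name are assumptions, not the paper's design.

```python
# Sketch of coarse F0 quantization for prosody conditioning.
# Bin count and log-scale range are illustrative assumptions.
import numpy as np

def quantize_f0(f0_hz: np.ndarray, n_bins: int = 32,
                fmin: float = 60.0, fmax: float = 400.0) -> np.ndarray:
    """Map an F0 contour (Hz, with 0 marking unvoiced frames) to discrete
    log-scale bins; bin 0 is reserved for unvoiced frames, and voiced
    frames fall into bins 1 .. n_bins - 1."""
    voiced = f0_hz > 0
    log_f0 = np.log(np.clip(f0_hz, fmin, fmax))        # clip avoids log(0)
    edges = np.linspace(np.log(fmin), np.log(fmax), n_bins - 1)
    bins = np.digitize(log_f0, edges)                  # 1 .. n_bins - 1 after clipping
    return np.where(voiced, bins, 0)
```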
Experimental Setting and Results
The authors conducted experiments with various datasets and configurations to validate their approach. Although detailed results are not outlined in the provided abstract, embedding acoustic features is expected to improve the flexibility and quality of the voice conversion system, addressing criticisms of previous models concerning speaker identity distortion and prosodic inaccuracy.
Discussion and Implications
The research provides substantial insights into the enhancement of voice conversion technologies. By developing a system that permits control over multiple dimensions of speech, such as speaker identity and emotion, the approach offered in this paper addresses significant shortcomings in traditional voice conversion frameworks.
The implications of this work are manifold. Practically, it can lead to voice conversion systems that offer not only improved fidelity in speaker identity conversion but also finely tuned control over prosodic and emotional attributes. Theoretically, it contributes to the body of knowledge on applying VQ-VAE to speech signal processing, particularly the utility of discrete embeddings in handling complex voice conversion tasks.
Future research could explore the scalability of this method across languages and speaker datasets of varying sizes and quality. Additionally, the effectiveness of different embedding strategies and architectures could be a rich vein for exploration in ongoing efforts to refine and expand upon these methodologies.
In conclusion, this research articulates a robust approach to voice conversion through methodological innovations in the VQ-VAE architecture, yielding systems with improved speech quality and flexible control over multiple speech attributes.