
Learning latent representations for style control and transfer in end-to-end speech synthesis (1812.04342v2)

Published 11 Dec 2018 in cs.CL, cs.SD, and eess.AS

Abstract: In this paper, we introduce the Variational Autoencoder (VAE) to an end-to-end speech synthesis model, to learn the latent representation of speaking styles in an unsupervised manner. The style representation learned through VAE shows good properties such as disentangling, scaling, and combination, which makes it easy for style control. Style transfer can be achieved in this framework by first inferring style representation through the recognition network of VAE, then feeding it into TTS network to guide the style in synthesizing speech. To avoid Kullback-Leibler (KL) divergence collapse in training, several techniques are adopted. Finally, the proposed model shows good performance of style control and outperforms Global Style Token (GST) model in ABX preference tests on style transfer.

Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis

In this paper, Zhang et al. introduce an approach that uses a Variational Autoencoder (VAE) to learn and manipulate latent representations of speaking styles in an end-to-end Text-to-Speech (TTS) synthesis framework. The method addresses the challenge of expressive speech synthesis by providing a way to control and transfer speaking styles in a data-driven manner, without requiring supervised style annotations.

Methodology Overview

The authors employ a VAE to capture latent style representations in a continuous space. The VAE enables disentanglement of style factors, allowing smooth control over the speaking style of synthesized speech. The model has two main components: a recognition network that infers a style representation from reference audio, and a modified Tacotron 2 network that generates speech conditioned on that representation. To counteract the common problem of Kullback-Leibler (KL) divergence collapse during VAE training, the authors apply KL annealing together with other adjustments to the training schedule.
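For concreteness, the following is a minimal PyTorch-style sketch of such a VAE style module, assuming an illustrative architecture (the module names, layer sizes, and annealing schedule here are hypothetical, not taken from the paper's implementation): a recognition network maps a reference mel-spectrogram to a Gaussian posterior, a style vector is sampled with the reparameterization trick, and an annealing weight ramps the KL term to mitigate collapse.

```python
# Hypothetical sketch of a VAE style module; names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class StyleRecognitionNet(nn.Module):
    """Maps a reference mel-spectrogram to the parameters of a Gaussian posterior q(z|x)."""

    def __init__(self, n_mels=80, hidden_dim=128, latent_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden_dim, batch_first=True)
        self.to_mean = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, ref_mel):                    # ref_mel: (batch, frames, n_mels)
        _, h = self.rnn(ref_mel)                   # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        return self.to_mean(h), self.to_logvar(h)


def reparameterize(mean, logvar):
    """Sample z ~ q(z|x) with the reparameterization trick."""
    eps = torch.randn_like(mean)
    return mean + torch.exp(0.5 * logvar) * eps


def kl_divergence(mean, logvar):
    """KL(q(z|x) || N(0, I)), averaged over the batch."""
    return -0.5 * torch.mean(torch.sum(1 + logvar - mean.pow(2) - logvar.exp(), dim=-1))


def kl_weight(step, warmup_steps=10_000):
    """Monotonic KL-annealing schedule: ramp the KL weight from 0 toward 1."""
    return min(1.0, step / warmup_steps)


# In the full model, the sampled style vector z would condition the Tacotron 2 decoder
# (e.g., concatenated with the text-encoder outputs), and the total training loss would be
# the reconstruction loss plus the annealed KL term:
#   loss = reconstruction_loss + kl_weight(step) * kl_divergence(mean, logvar)
```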

Experimental Evaluation

The proposed model is evaluated on a dataset of 105 hours of audiobook recordings with diverse storytelling styles. The experiments examine both style control, via interpolation of latent variables and inspection of disentangled factors, and style transfer, by synthesizing speech with style variables inferred from reference audio.

The authors highlight significant capabilities of their model:

  1. Continuous Latent Representation:
    • The model supports smooth interpolation between latent representations, resulting in gradual changes in style attributes such as pitch and speaking rate (illustrated in the sketch following this list).
  2. Disentangled Factors:
    • Certain dimensions of the latent space are found to correspond to independent style attributes, showing that the model learns interpretable representations.
  3. Effective Style Transfer:
    • The proposed model demonstrates superior performance in style transfer tasks compared to the baseline Global Style Token (GST) model. In particular, it excels in non-parallel style transfer, suggesting robustness and versatility in varying contexts.
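To make the first two capabilities concrete, the sketch below builds on the hypothetical modules above; the random reference spectrograms, the `tts_decoder` call left in comments, and the choice of `pitch_dim` are placeholders rather than details from the paper. It interpolates between two inferred style vectors and then perturbs a single latent dimension:

```python
# Illustrative style-control sketch, built on the hypothetical modules sketched above.
import torch

recognition_net = StyleRecognitionNet()
recognition_net.eval()

# Placeholder reference spectrograms standing in for two real utterances.
ref_mel_a = torch.randn(1, 200, 80)   # e.g., a calm reading style
ref_mel_b = torch.randn(1, 200, 80)   # e.g., an animated storytelling style

with torch.no_grad():
    # Infer style vectors (posterior means) for the two references.
    mean_a, _ = recognition_net(ref_mel_a)
    mean_b, _ = recognition_net(ref_mel_b)

    # 1. Continuous control: linear interpolation between the two style vectors.
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        z = (1 - alpha) * mean_a + alpha * mean_b
        # mel = tts_decoder(text_embedding, style=z)   # condition the synthesis network on z

    # 2. Disentangled control: shift one latent dimension while keeping the others fixed.
    pitch_dim = 3                                      # hypothetical dimension tied to pitch
    z = mean_a.clone()
    z[:, pitch_dim] += 2.0                             # exaggerate that attribute
    # mel = tts_decoder(text_embedding, style=z)
```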

Results and Implications

Through ABX preference tests, the authors substantiate their claim that the VAE-based model outperforms the GST model in both parallel and non-parallel style transfer scenarios, indicating that the VAE is more effective at modeling complex style distributions and thereby improves style transfer fidelity.
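As a rough illustration of how such ABX preferences can be tallied and checked for significance (this is not the authors' evaluation code, and the counts below are placeholders), a preference rate and an exact two-sided sign test could be computed as follows:

```python
# Hypothetical ABX tally; the listener counts are placeholders, not the paper's results.
from math import comb


def sign_test_p_value(prefer_a, prefer_b):
    """Two-sided exact sign test, ignoring 'no preference' responses."""
    n = prefer_a + prefer_b
    k = max(prefer_a, prefer_b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test (capped at 1).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Placeholder counts for one ABX comparison between the VAE and GST systems.
prefer_vae, prefer_gst, no_pref = 132, 78, 40
total = prefer_vae + prefer_gst + no_pref
print(f"VAE preferred: {prefer_vae / total:.1%}, GST preferred: {prefer_gst / total:.1%}")
print(f"sign-test p-value: {sign_test_p_value(prefer_vae, prefer_gst):.4f}")
```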

The results point to significant potential for increasing the diversity of synthetic speech data, which is valuable for training robust speech systems. The work also motivates further exploration of disentangled style representations, which could lead to more interpretable and adaptable speech synthesis systems.

Future Directions

Key future directions suggested by the authors include refining the disentanglement process to yield more interpretable latent variables and expanding the model's applicability to multi-speaker scenarios. Advancing these areas could lead to more generalized and scalable solutions for expressive speech synthesis.

Overall, this paper contributes a novel and promising method for incorporating VAEs into speech synthesis tasks for style control and transfer, marking a step forward in the pursuit of nuanced and expressive TTS systems.

Authors (4)
  1. Ya-Jie Zhang
  2. Shifeng Pan
  3. Lei He
  4. Zhen-Hua Ling
Citations (222)