Hierarchical Generative Modeling for Controllable Speech Synthesis
The paper "Hierarchical Generative Modeling for Controllable Speech Synthesis" addresses the challenge of generating synthesized speech with control over various speech attributes that aren't typically annotated in training datasets, such as speaking style, accent, and noise conditions. The work presents a model based on a sequence-to-sequence framework and uses a Variational Autoencoder (VAE) approach with hierarchical latent variables to manage these attributes effectively.
Model Architecture and Contribution
The core of the proposed model is its hierarchical latent structure, which has two levels. The first level is a categorical variable that indicates a broad attribute group, such as 'clean' versus 'noisy' speech, giving the model an interpretable, coarse-grained partition of the attribute space. The second level is a continuous multivariate Gaussian latent variable whose distribution is conditioned on the categorical choice, providing the fine-grained control needed to synthesize specific variations such as a particular level of background noise or speaking rate.
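To make this two-level structure concrete, the sketch below shows one minimal way such a hierarchical prior could look in PyTorch: a categorical variable selects an attribute group, and a continuous attribute vector is drawn from a Gaussian whose parameters depend on that group. This is an illustrative sketch, not the authors' implementation; the component count, latent dimensionality, and all names are assumptions.

```python
# Minimal sketch (not the paper's code) of a two-level latent hierarchy:
# a categorical variable picks a broad attribute group, and a continuous
# latent vector is sampled from a Gaussian conditioned on that group.
import torch
import torch.nn as nn


class HierarchicalLatentPrior(nn.Module):
    """Toy two-level prior: p(y) categorical, p(z | y) diagonal Gaussian."""

    def __init__(self, num_components=10, latent_dim=16):
        super().__init__()
        # p(y): mixture weights over broad attribute groups (e.g. clean vs. noisy).
        self.logits = nn.Parameter(torch.zeros(num_components))
        # p(z | y): one diagonal Gaussian per mixture component.
        self.means = nn.Parameter(0.1 * torch.randn(num_components, latent_dim))
        self.log_scales = nn.Parameter(torch.zeros(num_components, latent_dim))

    def sample(self, batch_size=1, component=None):
        """Draw (y, z); fixing `component` gives coarse control, z adds fine variation."""
        if component is None:
            y = torch.distributions.Categorical(logits=self.logits).sample((batch_size,))
        else:
            y = torch.full((batch_size,), component, dtype=torch.long)
        mu = self.means[y]
        scale = self.log_scales[y].exp()
        z = mu + scale * torch.randn_like(mu)  # reparameterized Gaussian sample
        return y, z
```

In a full model, the sampled continuous latent would condition the sequence-to-sequence decoder alongside the text input; only the prior over latents is sketched here.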
Key contributions of this model include:
- Hierarchical Variational Autoencoding: The categorical variable and its conditional Gaussians together form a Gaussian Mixture Model (GMM) prior within the VAE framework. This encourages a disentangled, interpretable latent representation and provides a systematic mechanism for sampling.
- Systematic Control and Sampling: Because the latent space is disentangled, individual generating factors can be varied independently during synthesis (see the sampling sketch after this list). This is particularly useful for data augmentation, since diverse examples can be generated without manual labeling.
- Real-world Data Application: Unlike many models that rely on carefully controlled studio corpora, this approach is demonstrated on 'found' data containing real noise and prosody variation. It can infer attribute representations from noisy recordings or previously unseen speakers and still synthesize clean speech.
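A short usage sketch of the systematic sampling idea, continuing the hypothetical prior above: hold the categorical component fixed to pin a broad attribute group, and vary the continuous latent to generate diverse renditions of the same text, for example for data augmentation. The component index and the `synthesize` call are placeholders for an assumed decoder plus vocoder, not part of the paper.

```python
# Usage sketch (hypothetical): coarse control via the categorical component,
# fine-grained diversity via the continuous latent.
prior = HierarchicalLatentPrior(num_components=10, latent_dim=16)

clean_component = 3  # index assumed, for illustration, to correspond to a "clean" cluster
for _ in range(5):
    _, z = prior.sample(batch_size=1, component=clean_component)
    # audio = synthesize(text="the same sentence", attribute_latent=z)  # hypothetical decoder + vocoder
```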
Experimental Evaluations and Implications
The experiments cover both single-speaker and multi-speaker datasets, including artificially noised conditions, to validate the model's ability to disentangle and control attributes such as noise level, speaker identity, and style. The latent mixture components are shown to cluster into meaningful categories, notably along accent and gender.
The paper reports on significant quantitative findings:
- For multi-speaker data, clustering utterances by their inferred latent components agrees with speaker identity 92.9% of the time, suggesting a strong ability to infer speaker characteristics (a toy illustration of one way such a consistency figure could be computed follows this list).
- Subjective evaluations, including Mean Opinion Score (MOS) tests on several dataset configurations, indicate improved naturalness and quality of the synthesized speech compared to baselines such as Global Style Token (GST) models and standard VAEs.
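The paper's exact evaluation protocol is not reproduced here; one plausible reading of a clustering consistency number is a purity-style measure, where each utterance is assigned to its most likely mixture component and consistency is the fraction of utterances falling in the majority component of their speaker. The sketch below uses made-up assignments purely to illustrate the arithmetic.

```python
# Toy sketch of a purity-style clustering consistency measure (an assumption
# about how such a figure could be computed, not the paper's exact protocol).
from collections import Counter


def cluster_consistency(speaker_ids, component_ids):
    """Fraction of utterances assigned to the majority component of their speaker."""
    per_speaker = {}
    for spk, comp in zip(speaker_ids, component_ids):
        per_speaker.setdefault(spk, []).append(comp)
    consistent = sum(Counter(comps).most_common(1)[0][1] for comps in per_speaker.values())
    return consistent / len(speaker_ids)


# Example with synthetic assignments:
speakers = ["A", "A", "A", "B", "B", "C", "C", "C"]
components = [2, 2, 5, 1, 1, 4, 4, 4]
print(cluster_consistency(speakers, components))  # 0.875 in this toy example
```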
Practical and Theoretical Implications
The paper opens several pathways for advancing controllable TTS systems:
- Practical Usability: The approach of leveraging hierarchical latent variables makes it feasible to synthesize clear and style-appropriate speech across diverse conditions without requiring explicitly annotated training data.
- Data Augmentation: The ability to sample diverse examples in a systematic way suggests applications beyond TTS, such as generating training data for robust speech recognition.
- Future Research Directions: A natural extension of this work would further optimize and perhaps compress the representation space to enhance computational efficiency while maintaining, or improving, model performance.
Overall, the research represents a clear advance toward flexible, high-quality, and controllable speech synthesis models that remain effective even under challenging real-world data conditions.