Flowtron: An Autoregressive Flow-Based Generative Network for Text-to-Speech Synthesis
The paper introduces Flowtron, a text-to-speech (TTS) synthesis model based on autoregressive flow-based generative networks. The model stands out by offering explicit control over speech variation and style transfer. Flowtron borrows ideas from inverse autoregressive flows (IAF) and reworks Tacotron, targeting high-quality mel-spectrogram synthesis.
Key Features and Methodologies
Flowtron distinguishes itself through several innovative aspects:
- Invertible Mapping: Flowtron learns an invertible mapping from data to a latent space. This mapping enables manipulation of speech attributes such as pitch, tone, and accent.
- Simple and Stable Training: The model is trained simply by maximizing the likelihood of the training data, which keeps training straightforward and stable.
- Variability and Expressivity: The model modulates speech variation by sampling from a zero-mean spherical Gaussian prior and scaling its variance, offering finer control than preceding models such as Tacotron.
- Speaker and Style Transfer: Flowtron demonstrates capabilities in style transfer, accommodating both seen and unseen speaker data during the training phase.
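The variability control described above amounts to sampling the latent prior with an adjustable standard deviation before inverting the flow. A minimal numpy sketch of the sampling step (the shapes and the interpretation of sigma are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def sample_latent(shape, sigma, seed=None):
    """Sample z ~ N(0, sigma^2 I); sigma controls speech variation.

    sigma = 0 collapses every sample to the prior mean, while larger
    sigma yields more varied pitch contours and durations once z is
    passed through the flow's inverse to produce a mel-spectrogram.
    """
    rng = np.random.default_rng(seed)
    return sigma * rng.standard_normal(shape)

# Hypothetical mel-spectrogram latent: 80 mel channels x 100 frames.
z_flat = sample_latent((80, 100), sigma=0.0)          # deterministic
z_varied = sample_latent((80, 100), sigma=0.5, seed=0)  # moderate variety
```

In Flowtron the sampled `z` would then be decoded by the learned inverse flow; dialing sigma up or down at inference time is what trades monotony against expressivity.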
Quantitative and Qualitative Performance
The paper presents mean opinion scores (MOS) confirming Flowtron's competitive speech quality against state-of-the-art models. Furthermore, Flowtron achieves this without the complex architectures or compound loss functions that many prior models require.
- Sample Control: Empirical results suggest that Flowtron provides control over the degree of variation, producing noticeably more variation in pitch contour and speech duration than Tacotron 2.
- Interpolation and Transfer: The model excels at interpolating between samples and transferring styles between speakers. It uses a zero-mean spherical Gaussian prior whose variance can be tuned for the desired variability, enabling expressive speech synthesis.
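The interpolation capability can be sketched as blending two latent codes before inverting the flow. The linear blend and the function name below are a simplified, hypothetical illustration of the idea:

```python
import numpy as np

def interpolate_latents(z_a, z_b, num_steps):
    """Linearly interpolate between two latent codes z_a and z_b.

    Decoding each intermediate z through the flow's inverse yields
    mel-spectrograms that morph between the two source styles.
    """
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]

# Toy stand-ins for latents of two utterances with different styles.
z_a = np.zeros((80, 50))   # e.g. a flat, calm sample
z_b = np.ones((80, 50))    # e.g. a more expressive sample
path = interpolate_latents(z_a, z_b, num_steps=5)
```

The endpoints of `path` reproduce the original latents exactly, while midpoints blend them; this is possible only because the mapping between data and latent space is invertible.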
Architectural Insights
Flowtron employs an autoregressive structure and makes use of affine coupling layers, maintaining invertibility while keeping computation efficient. The text encoder is derived from Tacotron, with modifications such as replacing batch normalization with instance normalization. The number of flow steps is configurable, and the prior can be a Gaussian mixture with either fixed or learnable parameters.
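The invertibility of an affine coupling step can be sketched as follows. Note that Flowtron conditions autoregressively on past mel frames; the split-in-half variant below, with a toy conditioning network, is a simplified stand-in that shows only the mechanism making the layer exactly invertible:

```python
import numpy as np

def affine_coupling_forward(x, scale_shift_fn):
    """One affine coupling step: transform the second half of x
    conditioned on the (unchanged) first half."""
    x_a, x_b = np.split(x, 2)
    log_s, t = scale_shift_fn(x_a)      # arbitrary function of x_a
    y_b = np.exp(log_s) * x_b + t       # invertible affine map
    return np.concatenate([x_a, y_b])

def affine_coupling_inverse(y, scale_shift_fn):
    """Exact inverse: x_a passed through unchanged, so the same
    scale and shift can be recomputed and undone."""
    y_a, y_b = np.split(y, 2)
    log_s, t = scale_shift_fn(y_a)
    x_b = (y_b - t) * np.exp(-log_s)
    return np.concatenate([y_a, x_b])

# Toy conditioning network (stand-in for Flowtron's learned network).
def toy_nn(x_a):
    return 0.1 * x_a, x_a ** 2          # (log-scale, shift)

x = np.arange(8, dtype=float)
y = affine_coupling_forward(x, toy_nn)
x_rec = affine_coupling_inverse(y, toy_nn)
# x_rec matches x up to floating-point precision.
```

Because the conditioning network is only ever evaluated on the untransformed part, it can be arbitrarily complex without breaking invertibility; this is what lets flow models train by exact likelihood.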
Implications and Future Directions
Flowtron advances the field of TTS by pushing the boundaries of expressive speech synthesis. Notably, it provides avenues for applications requiring realistic and expressive human-computer interactions without reliance on labeled expressive datasets. Future endeavors could explore the integration of Flowtron's versatile architecture in broader AI systems, enhancing conversational agents or applications within creative domains such as virtual storytelling and interactive media.
Conclusion
By embedding control and variability directly into the TTS synthesis process, Flowtron shifts the paradigm from mere text conversion to expressive auditory generation. Its thorough use of autoregressive flow-based modeling broadens the horizon for future generative audio models. Through careful experimentation and design, the work convincingly showcases the potential of normalizing flows for nuanced speech synthesis.