
Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (2005.11129v2)

Published 22 May 2020 in eess.AS and cs.SD

Abstract: Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis. Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive model, Tacotron 2, at synthesis with comparable speech quality. We further show that our model can be easily extended to a multi-speaker setting.

Authors (4)
  1. Jaehyeon Kim (16 papers)
  2. Sungwon Kim (32 papers)
  3. Jungil Kong (5 papers)
  4. Sungroh Yoon (163 papers)
Citations (433)

Summary

Glow-TTS: Parallel Text-to-Speech Synthesis

The field of text-to-speech (TTS) synthesis has advanced substantially with neural networks, particularly autoregressive models such as Tacotron 2 and Transformer TTS, which set benchmarks for natural, expressive speech. These models, however, are poorly suited to real-time applications: their sequential generation makes inference cost grow with the length of the input text. Parallel TTS models such as FastSpeech address this shortcoming by generating mel-spectrograms in parallel, improving synthesis speed and robustness, but they typically depend on external autoregressive models for alignment information, which complicates the training pipeline.

The research under review puts forth Glow-TTS, a novel contribution to parallel TTS that eliminates the dependency on external aligners. Glow-TTS is a flow-based generative model that exploits the invertibility and exact-likelihood properties of normalizing flows to learn the alignment between text and speech on its own, yielding a simpler training process than models that depend on autoregressive teachers.

Methodological Framework

Glow-TTS builds upon normalizing flows, a class of generative models that offer efficient sampling and exact likelihood computation, providing a strong foundation for parallel synthesis. The model uses dynamic programming to enforce monotonicity in alignments, which improves robustness across diverse inputs, including long passages where autoregressive counterparts like Tacotron 2 may falter. The Monotonic Alignment Search (MAS) algorithm lets Glow-TTS find the most probable alignment without relying on pre-trained models.
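The dynamic program at the heart of MAS can be sketched in a few lines. The NumPy sketch below is a simplified rendition (not the paper's optimized implementation): it takes a matrix of frame-wise log-likelihoods, where `log_p[i, j]` is the log-likelihood of latent frame `j` under the Gaussian predicted for text token `i`, and returns the hard monotonic alignment maximizing their sum.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Find the most probable monotonic alignment between T_text tokens
    and T_mel frames. log_p[i, j] = log-likelihood of frame j under the
    distribution predicted for token i. Returns a 0/1 alignment matrix."""
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)   # best cumulative log-likelihood
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):  # token index cannot exceed frame index
            stay = Q[i, j - 1]                           # same token, next frame
            move = Q[i - 1, j - 1] if i > 0 else -np.inf  # advance to next token
            Q[i, j] = max(stay, move) + log_p[i, j]
    # Backtrack from the last (token, frame) cell to recover the alignment.
    A = np.zeros((T_text, T_mel), dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        A[i, j] = 1
        if i > 0 and j > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return A
```

Summing each row of the returned matrix gives the per-token durations that supervise the duration predictor at training time.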

The model architecture includes a flow-based decoder that efficiently transforms latent representations into mel-spectrograms. By integrating grouped invertible 1×1 convolutions and affine coupling layers, Glow-TTS keeps both the forward and inverse transformations computationally efficient, which is crucial given the high-dimensional nature of spectrogram data.
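A minimal affine coupling layer illustrates why the decoder is cheap in both directions: the inverse reuses the same small network on the untouched half of the channels, and the log-determinant is simply the sum of predicted log-scales. This is a toy NumPy sketch with a linear map (`w`, `b`) standing in for the paper's WaveNet-style coupling network:

```python
import numpy as np

def coupling_forward(x, w, b):
    """Affine coupling: the first half of the channels passes through
    unchanged and parameterizes an affine transform of the second half."""
    xa, xb = np.split(x, 2, axis=-1)
    h = xa @ w + b                       # stand-in for the coupling network
    log_s, t = np.split(h, 2, axis=-1)
    yb = xb * np.exp(log_s) + t
    logdet = log_s.sum()                 # exact log-determinant, no matrix inverse
    return np.concatenate([xa, yb], axis=-1), logdet

def coupling_inverse(y, w, b):
    """Exact inverse: recompute log_s, t from the untouched half."""
    ya, yb = np.split(y, 2, axis=-1)
    h = ya @ w + b
    log_s, t = np.split(h, 2, axis=-1)
    xb = (yb - t) * np.exp(-log_s)
    return np.concatenate([ya, xb], axis=-1)
```

Training runs the forward direction (mel-spectrogram to latent) to compute the exact likelihood; synthesis runs the inverse on a sampled latent, at the same cost.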

Empirical Results

Quantitative evaluations show Glow-TTS performing competitively with Tacotron 2 in audio quality, as measured by Mean Opinion Score (MOS) from human listeners. A notable strength of Glow-TTS is synthesis speed: it outpaces Tacotron 2 by a factor of 15.7, which matters for applications requiring low-latency responses. The flow-based architecture also yields diverse and controllable outputs, allowing modifications to speech intonation, pitch, and duration without retraining, which is particularly useful for personalization and adaptive storytelling.
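Two of these controls fall out of the model directly: scaling the prior's standard deviation at sampling time (a temperature, which the paper links to pitch variation), and rescaling the predicted per-token durations to change speaking rate. A hedged sketch; the function names are illustrative, not from the paper's code:

```python
import numpy as np

def sample_latents(mu, sigma, temperature=1.0, seed=0):
    """Draw the latent z that the flow decoder inverts into a mel-spectrogram.
    Lowering the temperature scales the prior's standard deviation."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=np.shape(mu))
    return mu + sigma * temperature * eps

def stretch_durations(durations, speed=1.0):
    """Control speaking rate by rescaling predicted per-token durations
    (speed > 1 means faster speech, i.e. fewer frames per token)."""
    return np.maximum(np.round(np.asarray(durations) / speed), 1).astype(int)
```

At `temperature=0` the model synthesizes deterministically from the prior means; intermediate values trade off variability against fidelity.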

Experimentation also extends Glow-TTS into multi-speaker settings, showcasing the model's scalability and adaptability. Minor architectural adjustments allow the incorporation of speaker embeddings, enabling the model to synthesize speech across various speaker identities with minimal degradation in quality.

Theoretical Implications and Future Directions

The introduction of an internal, flow-based alignment mechanism presents a paradigm shift, highlighting the potential of generative flows in TTS applications. This architecture combines the stability and expressiveness of normalizing flows with robustness against alignment errors, a traditional weak point of TTS systems. The success of Glow-TTS invites further investigation into advanced flow models that can handle even more complex speech synthesis tasks, including emotional expression and contextual speech variation.

Possible avenues for future research include optimizing the computational demands of training flow-based models, investigating hybrid models that balance the benefits of autoregressive and flow-based methods, and expanding the multi-speaker capabilities with larger and more diverse datasets to improve the model's generalization capabilities.

Conclusion

Glow-TTS represents a notable advance in TTS technology, combining flow-based generative modeling with efficient optimization to deliver fast, flexible, and high-quality speech synthesis. Its ability to interpolate and extrapolate diverse vocal characteristics suits a wide range of applications, from accessibility technology to dynamic content generation. This research lays a foundation for eliminating dependencies on traditional autoregressive models, paving the way for more autonomous and efficient TTS solutions.
