
One TTS Alignment To Rule Them All (2108.10447v1)

Published 23 Aug 2021 in cs.SD, cs.CL, cs.LG, and eess.AS

Abstract: Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end-to-end TTS models rely on durations extracted from external sources. In this paper we leverage the alignment mechanism proposed in RAD-TTS as a generic alignment learning framework, easily applicable to a variety of neural TTS models. The framework combines the forward-sum algorithm, the Viterbi algorithm, and a simple and efficient static prior. In our experiments, the alignment learning framework improves all tested TTS architectures, both autoregressive (Flowtron, Tacotron 2) and non-autoregressive (FastPitch, FastSpeech 2, RAD-TTS). Specifically, it improves alignment convergence speed of existing attention-based mechanisms, simplifies the training pipeline, and makes the models more robust to errors on long utterances. Most importantly, the framework improves the perceived speech synthesis quality, as judged by human evaluators.

Authors (6)
  1. Rohan Badlani (13 papers)
  2. Adrian Łańcucki (15 papers)
  3. Kevin J. Shih (18 papers)
  4. Rafael Valle (31 papers)
  5. Wei Ping (51 papers)
  6. Bryan Catanzaro (123 papers)
Citations (80)

Summary

Text-to-Speech Alignment Framework Analysis

The paper "One TTS Alignment To Rule Them All" presents an alignment learning framework designed to enhance various neural text-to-speech (TTS) models, encompassing both autoregressive and non-autoregressive types. The authors introduce a methodology that is adaptable across these architectures, aiming to improve alignment convergence, training efficiency, and overall speech synthesis quality without relying on external aligners.

Overview of the Alignment Learning Framework

The proposed framework builds upon the alignment mechanism introduced in RAD-TTS, adopting a versatile, end-to-end alignment approach suitable for different TTS models. It combines the forward-sum algorithm, the Viterbi algorithm, and a static prior to learn speech-text alignments effectively. By incorporating these components, the framework tackles alignment challenges typical of TTS systems: brittle attention in autoregressive models and dependence on external aligners in non-autoregressive models.
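The forward-sum component can be sketched as a short dynamic program. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it scores all monotonic alignments of `T` mel frames to `N` text tokens, where the path starts at the first token, ends at the last, and at each frame either stays on the current token or advances by one.

```python
import numpy as np

def forward_sum(log_probs):
    """Log-likelihood of all monotonic frame-to-token alignments.

    log_probs: (T, N) array where log_probs[t, j] is the log-score of
    aligning mel frame t with text token j. Allowed transitions: stay
    on the same token or advance by exactly one token.
    """
    T, N = log_probs.shape
    alpha = np.full((T, N), -np.inf)   # alpha[t, j]: log-sum over paths ending at (t, j)
    alpha[0, 0] = log_probs[0, 0]      # every path starts on the first token
    for t in range(1, T):
        for j in range(N):
            stay = alpha[t - 1, j]
            advance = alpha[t - 1, j - 1] if j > 0 else -np.inf
            alpha[t, j] = log_probs[t, j] + np.logaddexp(stay, advance)
    return alpha[T - 1, N - 1]         # every path ends on the last token
```

In the paper's setting, `log_probs` would come from the model's soft alignment matrix (with the static prior added in log space), and the negative of this quantity serves as the alignment training loss.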

Experimental Evaluation and Results

The experimental evaluation demonstrates that the alignment learning framework tangibly improves a diverse set of TTS architectures, including Flowtron and Tacotron 2 (autoregressive) as well as FastPitch, FastSpeech 2, and RAD-TTS (non-autoregressive). The framework delivers clear gains in the following areas:

  1. Convergence Speed: The alignment framework, particularly when coupled with a static prior, accelerates alignment convergence across models. Flowtron benefits especially: the framework enables simultaneous training of multiple flow steps while converging substantially faster.
  2. Alignment Quality: The framework produces sharper and more continuous alignments than baseline models, as evidenced by the learned alignment matrices and evaluation metrics. The unsupervised alignment loss yields alignments that closely mirror annotated ground truth, indicating more confident attention distributions.
  3. Speech Quality: Human evaluators preferred audio samples from models trained with the alignment framework over baselines across all examined architectures. Models using the framework made fewer word-repetition and word-omission errors, resulting in more intelligible and natural synthesized speech.
  4. Robustness and Error Reduction: Character error rates (CER) were reduced when employing the alignment framework. This is particularly relevant for autoregressive models, which become more robust when synthesizing long and challenging utterances.
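For the non-autoregressive models above, the learned soft alignment must also be converted into hard per-token durations. A minimal, illustrative Viterbi sketch (not the paper's code) finds the single best monotonic path through the same lattice and counts how many frames land on each token:

```python
import numpy as np

def viterbi_durations(log_probs):
    """Extract per-token durations from the best monotonic alignment.

    log_probs: (T, N) array of frame-to-token log-scores, with the same
    stay-or-advance transition structure as the forward-sum lattice.
    Returns an (N,) array of frame counts per token, summing to T.
    """
    T, N = log_probs.shape
    v = np.full((T, N), -np.inf)       # v[t, j]: best path score ending at (t, j)
    back = np.zeros((T, N), dtype=int)  # backpointer to the previous token index
    v[0, 0] = log_probs[0, 0]
    for t in range(1, T):
        for j in range(N):
            stay = v[t - 1, j]
            advance = v[t - 1, j - 1] if j > 0 else -np.inf
            if stay >= advance:
                v[t, j], back[t, j] = log_probs[t, j] + stay, j
            else:
                v[t, j], back[t, j] = log_probs[t, j] + advance, j - 1
    # Backtrack from the final state (T-1, N-1) to recover the token path.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return np.bincount(path, minlength=N)
```

These hard durations play the role that external aligner outputs play in standard FastPitch or FastSpeech 2 training pipelines.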

Theoretical and Practical Implications

The alignment framework provides a unified approach to alignment learning that bypasses the need for resource-intensive external aligners. This is a significant advance, particularly for languages lacking robust alignment tools. Integrating the framework into TTS models not only simplifies the training pipeline but also broadens the models' applicability across domains.

By demonstrating improvements in alignment stability and training efficiency, this research lays the groundwork for more resilient and versatile TTS systems. Future work might fine-tune alignment strategies for specific language characteristics or extend the framework to cross-lingual TTS models.

In summary, the proposed alignment learning framework provides a comprehensive solution for improving both the training and inference aspects of diverse TTS models, promoting better user experiences through higher-quality, more robust speech synthesis.
