AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment (2003.01950v1)

Published 4 Mar 2020 in eess.AS, cs.CL, and cs.SD

Abstract: Targeting both high efficiency and performance, we propose AlignTTS to predict the mel-spectrum in parallel. AlignTTS is based on a Feed-Forward Transformer which generates mel-spectrum from a sequence of characters, and the duration of each character is determined by a duration predictor. Instead of adopting the attention mechanism in Transformer TTS to align text to mel-spectrum, the alignment loss is presented to consider all possible alignments in training by use of dynamic programming. Experiments on the LJSpeech dataset show that our model achieves not only state-of-the-art performance which outperforms Transformer TTS by 0.03 in mean opinion score (MOS), but also high efficiency which is more than 50 times faster than real-time.

An Overview of AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment

The paper introduces AlignTTS, a feed-forward text-to-speech (TTS) system designed to improve both the efficiency and the quality of speech synthesis. Unlike traditional autoregressive models such as Tacotron and Transformer TTS, which rely on sequence-to-sequence frameworks with attention mechanisms, AlignTTS employs a feed-forward network that predicts the mel-spectrum in parallel. The approach combines a duration predictor with an alignment loss inspired by the Baum-Welch algorithm for Hidden Markov Models (HMMs).

Contributions and Methodology

AlignTTS's architecture comprises a Feed-Forward Transformer together with a duration predictor and a mix density network. The Feed-Forward Transformer transforms input text into a mel-spectrum through a stack of FFT blocks, each of which combines self-attention with a two-layer 1D convolutional network, along with residual connections and layer normalization.
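As a rough illustration of the FFT block just described, here is a NumPy sketch of one block: single-head self-attention followed by a two-layer 1D convolutional network, with each sub-layer wrapped in a residual connection and layer normalization. The dimensions, the ReLU between the convolutions, and all parameter names are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each time step over the feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def conv1d(x, kernel):
    """'Same'-padded 1D convolution along the time axis.
    kernel shape: (width, d_in, d_out)."""
    width, d_in, d_out = kernel.shape
    pad = width // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], d_out))
    for t in range(x.shape[0]):
        window = xp[t:t + width]                    # (width, d_in)
        out[t] = np.einsum('wi,wio->o', window, kernel)
    return out

def fft_block(x, params):
    # Self-attention sub-layer with residual connection + layer norm.
    x = layer_norm(x + self_attention(x, params['w_q'], params['w_k'], params['w_v']))
    # Two-layer 1D convolutional sub-layer, also residual + layer norm.
    h = np.maximum(conv1d(x, params['conv1']), 0.0)  # ReLU between the conv layers
    h = conv1d(h, params['conv2'])
    return layer_norm(x + h)

rng = np.random.default_rng(0)
d = 8
params = {
    'w_q': rng.normal(size=(d, d)), 'w_k': rng.normal(size=(d, d)),
    'w_v': rng.normal(size=(d, d)),
    'conv1': rng.normal(size=(3, d, d)) * 0.1,
    'conv2': rng.normal(size=(3, d, d)) * 0.1,
}
chars = rng.normal(size=(5, d))    # 5 "character" embeddings
out = fft_block(chars, params)
print(out.shape)                   # sequence length is preserved: (5, 8)
```

Because every operation preserves the sequence length, blocks like this can be stacked both before and after length regulation, which is what lets the model emit all mel frames in parallel.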

A key distinction from conventional TTS models is how AlignTTS aligns text to the mel-spectrum. Instead of an attention mechanism, a duration predictor determines how many mel frames each character spans, while the alignment loss considers all possible alignments during training via dynamic programming. This yields more precise alignments than extracting them from a pre-trained autoregressive teacher model.

The mix density network further facilitates the learning of alignments by predicting a multi-dimensional Gaussian distribution for the mel-spectrum of each character. Once training converges, the Viterbi algorithm is used to extract the most probable alignment.
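The alignment machinery described above can be sketched as follows, assuming diagonal per-character Gaussians and a monotonic alignment in which each mel frame either stays on the current character or advances to the next; all names, shapes, and the toy data are illustrative assumptions, not the paper's implementation. The forward recursion plays the role of the alignment loss, and the Viterbi pass converts the single best alignment into per-character durations.

```python
import numpy as np

def gaussian_log_prob(frames, means, log_stds):
    """Score each mel frame under each character's diagonal Gaussian.
    frames: (T, D); means, log_stds: (N, D). Returns (T, N) log-densities."""
    diff = frames[:, None, :] - means[None, :, :]          # (T, N, D)
    var = np.exp(2.0 * log_stds)[None, :, :]
    return -0.5 * np.sum(diff ** 2 / var + 2.0 * log_stds[None] + np.log(2 * np.pi),
                         axis=-1)

def alignment_loss(log_probs):
    """Forward recursion marginalizing over all monotonic alignments."""
    T, N = log_probs.shape
    alpha = np.full((T, N), -np.inf)
    alpha[0, 0] = log_probs[0, 0]
    for t in range(1, T):
        alpha[t, 0] = alpha[t - 1, 0] + log_probs[t, 0]
        # Each state is reached by staying on the same character or advancing.
        alpha[t, 1:] = np.logaddexp(alpha[t - 1, 1:], alpha[t - 1, :-1]) + log_probs[t, 1:]
    return -alpha[T - 1, N - 1]        # negative log-likelihood over all alignments

def viterbi_durations(log_probs):
    """Single best alignment, returned as the number of frames per character."""
    T, N = log_probs.shape
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)  # 1 = best predecessor is the previous char
    delta[0, 0] = log_probs[0, 0]
    for t in range(1, T):
        delta[t, 0] = delta[t - 1, 0] + log_probs[t, 0]
        stay, move = delta[t - 1, 1:], delta[t - 1, :-1]
        back[t, 1:] = (move > stay).astype(int)
        delta[t, 1:] = np.maximum(stay, move) + log_probs[t, 1:]
    durations = np.zeros(N, dtype=int)
    j = N - 1                           # backtrack from (last frame, last character)
    for t in range(T - 1, -1, -1):
        durations[j] += 1
        if t > 0:
            j -= back[t, j]
    return durations

rng = np.random.default_rng(1)
T, N, D = 12, 4, 6
means = rng.normal(size=(N, D))
log_stds = np.zeros((N, D))
# Synthetic "mel frames": 3 frames per character, drawn near each character's mean.
frames = np.repeat(means, 3, axis=0) + 0.1 * rng.normal(size=(T, D))
lp = gaussian_log_prob(frames, means, log_stds)
loss = alignment_loss(lp)
durs = viterbi_durations(lp)
print(loss, durs)                       # durations sum to T
```

This is the same forward/Viterbi pair familiar from HMMs, which is why the overview draws the analogy to the Baum-Welch algorithm: training marginalizes over all alignments, and only after convergence is a single alignment committed to as duration targets.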

To optimize training, a multi-phased approach is undertaken:

  • Phase 1: Training the mix density network using alignment loss.
  • Phase 2: Converting alignments into duration sequences to train the rest of the Feed-Forward Transformer.
  • Phase 3: Joint training of the Feed-Forward Transformer and mix density network to fine-tune parameters.
  • Phase 4: Training the duration predictor based on the final mix density network outcomes.
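The four phases above can be summarized as a simple schedule mapping each phase to the modules being optimized and the loss driving them; the module and loss names below are placeholders, not identifiers from any released code.

```python
# Illustrative summary of the multi-phase training schedule described above.
# Names are placeholders for the paper's components, not real identifiers.
PHASES = {
    1: {"train": ["mix_density_network"], "loss": "alignment_loss"},
    2: {"train": ["feed_forward_transformer"], "loss": "mel_loss",
        "note": "durations extracted from phase-1 alignments"},
    3: {"train": ["feed_forward_transformer", "mix_density_network"],
        "loss": "mel_loss + alignment_loss"},
    4: {"train": ["duration_predictor"], "loss": "duration_loss",
        "note": "targets come from the final mix density network alignments"},
}

def trainable_modules(phase):
    """Return which modules receive gradient updates in a given phase."""
    return PHASES[phase]["train"]

print(trainable_modules(3))
```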

Experimental Insights

AlignTTS was evaluated on the LJSpeech dataset, where it achieves state-of-the-art results, outperforming Transformer TTS by 0.03 in mean opinion score (MOS) while synthesizing speech more than 50 times faster than real time: roughly 10 seconds of speech are generated in about 0.18 seconds, including the WaveGlow vocoder. Comparative analysis against Tacotron2, FastSpeech, and other models substantiates AlignTTS's advantages in both synthesis speed and quality.

Implications and Future Perspectives

The findings demonstrate the potential of non-autoregressive models for real-time applications: their rapid processing makes them practical to deploy in interactive dialogue systems and other settings where response time is critical. Because AlignTTS learns alignments algorithmically rather than distilling them from a pre-trained autoregressive model, it also suggests pathways for further research on non-autoregressive TTS. Future work may explore improving the model's robustness and adaptability to different languages and speech characteristics.

In summary, AlignTTS represents a significant stride in employing non-autoregressive frameworks for TTS, offering insights into improving the efficiency and quality of speech synthesis systems. The combination of innovative alignment strategies and powerful feed-forward architectures sets a promising direction for future text-to-speech technological advancements.

Authors (5)
  1. Zhen Zeng
  2. Jianzong Wang
  3. Ning Cheng
  4. Tian Xia
  5. Jing Xiao
Citations (55)