Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech (2410.22179v2)
Abstract: Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backpropagation and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
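The abstract describes cross-attention that is given relative location information around an alignment position learned via backpropagation, with no external alignment supervision. The sketch below is a minimal illustration of that idea under stated assumptions, not the paper's implementation: the monotonic position update via a softplus increment, the Gaussian-shaped distance bias and its width, and names such as `LocationRelativeCrossAttention` and `delta_proj` are all illustrative choices.

```python
# Minimal sketch (assumptions, not the paper's method): single-step cross-attention
# whose logits are biased by the distance between each text position and a learned,
# monotonically advancing alignment position.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationRelativeCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Predicts a non-negative increment of the alignment position per decoder
        # step, so the position can only move forward over the text (illustrative).
        self.delta_proj = nn.Linear(d_model, 1)

    def forward(self, dec_state, enc_out, prev_position):
        # dec_state: (B, d_model) current decoder step; enc_out: (B, T_text, d_model)
        B, T, _ = enc_out.shape
        # Alignment position learned as a latent quantity via backprop.
        delta = F.softplus(self.delta_proj(dec_state))            # (B, 1), >= 0
        position = prev_position + delta                          # (B, 1)

        q = self.q_proj(dec_state).view(B, self.n_heads, 1, self.d_head)
        k = self.k_proj(enc_out).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(enc_out).view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        logits = (q @ k.transpose(-1, -2)) / self.d_head ** 0.5   # (B, H, 1, T)
        # Relative-location bias: penalize text positions far from the alignment
        # position; the Gaussian shape and width of 5 are arbitrary choices here.
        text_pos = torch.arange(T, device=enc_out.device).float()
        rel = text_pos.view(1, 1, 1, T) - position.view(B, 1, 1, 1)
        logits = logits - 0.5 * (rel / 5.0) ** 2

        attn = logits.softmax(dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(B, -1)           # (B, d_model)
        return self.out_proj(ctx), position
```

In use, the module would be called once per decoder step, carrying `position` forward so the attended region advances monotonically over the text; that monotonic, location-relative behavior is the property the paper relies on to avoid dropped or repeated words while still allowing content-based attention within the window.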