Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech (2410.22179v2)

Published 29 Oct 2024 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backpropagation and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.

