
Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment (2406.17957v1)

Published 25 Jun 2024 in cs.SD, cs.AI, and eess.AS

Abstract: LLM-based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust: the generated output can contain repeated words, missing words, and misaligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained to predict speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves the robustness of LLM-based TTS models.
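
To make the idea of an attention prior that encourages monotonic cross-attention more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the form of the prior, so the Gaussian band around the diagonal, the `sigma` parameter, and the function names `diagonal_attention_prior` and `apply_prior` are all assumptions made for illustration. The CTC loss mentioned in the abstract is not shown.

```python
import math


def diagonal_attention_prior(num_speech_steps, num_text_tokens, sigma=0.1):
    """Build a row-normalized prior, one row per speech step.

    Assumption (not from the paper): each speech step gets a Gaussian bump
    centered on the text token at the same relative position, so probability
    mass concentrates near the diagonal of the alignment matrix.
    """
    prior = []
    for t in range(num_speech_steps):
        t_rel = t / max(num_speech_steps - 1, 1)
        row = []
        for n in range(num_text_tokens):
            n_rel = n / max(num_text_tokens - 1, 1)
            row.append(math.exp(-((n_rel - t_rel) ** 2) / (2 * sigma ** 2)))
        total = sum(row)
        prior.append([w / total for w in row])
    return prior


def apply_prior(attn_logits, prior, eps=1e-8):
    """Add the log-prior to raw cross-attention logits, then softmax each row."""
    biased = []
    for logit_row, prior_row in zip(attn_logits, prior):
        scores = [l + math.log(p + eps) for l, p in zip(logit_row, prior_row)]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        biased.append([e / norm for e in exps])
    return biased


# Usage: 8 speech steps attending over 4 text tokens, with uninformative logits.
prior = diagonal_attention_prior(8, 4)
uniform_logits = [[0.0] * 4 for _ in range(8)]
attention = apply_prior(uniform_logits, prior)
peaks = [row.index(max(row)) for row in attention]
```

With uninformative logits, the most-attended text token in `peaks` never moves backwards as the speech step advances, which is the monotonic behaviour the abstract describes. Biasing the attention this way adds no learnable parameters, consistent with the abstract's claim about guided attention training.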

Authors (7)
  1. Paarth Neekhara (21 papers)
  2. Shehzeen Hussain (17 papers)
  3. Subhankar Ghosh (41 papers)
  4. Jason Li (91 papers)
  5. Rafael Valle (31 papers)
  6. Rohan Badlani (13 papers)
  7. Boris Ginsburg (111 papers)
Citations (6)