An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS (2406.05699v1)

Published 9 Jun 2024 in eess.AS, cs.AI, and eess.SP

Abstract: Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audio generated from noisy audio prompts within the context of flow-matching-based zero-shot TTS. Our investigation includes comprehensive training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. The results of our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to the approach of applying speech enhancement to the audio prompt.
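
The "fine-tuning with random noise mixing" step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the SNR range, the mixing probability, or how noise clips are selected, so `snr_range`, `p_mix`, and both function names below are illustrative assumptions, not the authors' configuration.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR (dB)."""
    # Tile the noise if it is shorter than the speech, then crop to equal length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        # Nothing meaningful to scale; return the clean signal unchanged.
        return speech.copy()

    # Choose scale so that speech_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def random_noise_mix(speech, noise, rng, snr_range=(0.0, 20.0), p_mix=0.5):
    """With probability p_mix, mix noise at an SNR drawn uniformly from snr_range.

    snr_range and p_mix are hypothetical defaults chosen for illustration only.
    Returns (audio, snr_db); snr_db is None when no noise was added.
    """
    if rng.random() >= p_mix:
        return speech.copy(), None
    snr_db = rng.uniform(*snr_range)
    return mix_at_snr(speech, noise, snr_db), snr_db
```

In a fine-tuning loop, a function like `random_noise_mix` would be applied to each training utterance (or to the audio prompt portion) with a seeded `np.random.default_rng`, so that the model sees both clean and noisy conditions.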

Authors (11)
  1. Xiaofei Wang
  2. Sefik Emre Eskimez
  3. Manthan Thakker
  4. Hemin Yang
  5. Zirun Zhu
  6. Min Tang
  7. Yufei Xia
  8. Jinzhu Li
  9. Sheng Zhao
  10. Jinyu Li
  11. Naoyuki Kanda
Citations (3)