
DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling (2012.09547v2)

Published 17 Dec 2020 in eess.AS and cs.SD

Abstract: While neural-based text to speech (TTS) models can synthesize natural and intelligible voice, they usually require high-quality speech data, which is costly to collect. In many scenarios, only noisy speech of a target speaker is available, which presents challenges for TTS model training for this speaker. Previous works usually address the challenge using one of two methods: 1) training the TTS model using the speech denoised with an enhancement model; 2) taking a single noise embedding as input when training with noisy speech. However, they usually cannot handle speech with complicated real-world noise, such as noise that varies strongly over time. In this paper, we develop DenoiSpeech, a TTS system that can synthesize clean speech for a speaker with noisy speech data. In DenoiSpeech, we handle real-world noisy speech by modeling the fine-grained frame-level noise with a noise condition module, which is jointly trained with the TTS model. Experimental results on real-world data show that DenoiSpeech outperforms the previous two methods by 0.31 and 0.66 MOS, respectively.
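The contrast the abstract draws between a single noise embedding and frame-level noise modeling can be illustrated with a toy sketch. This is not the DenoiSpeech implementation: the function names, the 2-dimensional vectors, the numeric values, and the choice to combine the condition with the hidden states by element-wise addition are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]  # one vector per frame


def add_vectors(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch")
    return [x + y for x, y in zip(a, b)]


def condition_utterance_level(hidden: Sequence, noise_embedding: Vector) -> Sequence:
    """Single-embedding idea: one noise vector is shared by every frame."""
    return [add_vectors(frame, noise_embedding) for frame in hidden]


def condition_frame_level(hidden: Sequence, noise_condition: Sequence) -> Sequence:
    """Frame-level idea: each frame gets its own noise condition vector."""
    if len(hidden) != len(noise_condition):
        raise ValueError("frame count mismatch")
    return [add_vectors(h, n) for h, n in zip(hidden, noise_condition)]


# Toy input: 3 frames of 2-dimensional hidden states (hypothetical values).
hidden = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

# Noise that is present only in the middle frame (hypothetical values).
frame_noise = [[0.0, 0.0], [0.5, -0.5], [0.0, 0.0]]

# A single embedding has to summarize the whole utterance; here it is
# simply the per-dimension average over frames (an assumption).
n_frames = len(frame_noise)
avg_noise = [sum(f[d] for f in frame_noise) / n_frames for d in range(2)]

utt_out = condition_utterance_level(hidden, avg_noise)
frame_out = condition_frame_level(hidden, frame_noise)
```

In this toy setting, `utt_out` shifts all three frames by the same amount, so the information that only the middle frame was noisy is lost, whereas `frame_out` leaves the first and last frames unchanged and adjusts only the middle one.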

Authors (8)
  1. Chen Zhang (403 papers)
  2. Yi Ren (215 papers)
  3. Xu Tan (164 papers)
  4. Jinglin Liu (38 papers)
  5. Kejun Zhang (26 papers)
  6. Tao Qin (201 papers)
  7. Sheng Zhao (75 papers)
  8. Tie-Yan Liu (242 papers)
Citations (35)