Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters (2401.05111v1)

Published 10 Jan 2024 in cs.SD, cs.CL, cs.LG, and eess.AS

Abstract: The zero-shot text-to-speech (TTS) method based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations can reproduce speaker characteristics very accurately. However, this approach suffers from degraded speech synthesis quality when the reference speech contains noise. In this paper, we propose a noise-robust zero-shot TTS method. We incorporated adapters into the SSL model, which we fine-tuned together with the TTS model using noisy reference speech. In addition, to further improve performance, we adopted a speech enhancement (SE) front-end. With these improvements, our proposed SSL-based zero-shot TTS achieved high-quality speech synthesis with noisy reference speech. Through objective and subjective evaluations, we confirmed that the proposed method is highly robust to noise in the reference speech and works effectively in combination with SE.
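
The abstract describes inserting lightweight adapter modules into a pretrained SSL speech model and fine-tuning them jointly with the TTS model on noisy reference speech, while keeping the SSL backbone itself fixed. The following is a minimal PyTorch sketch of that general idea, not the authors' implementation: the residual bottleneck adapter, the wrapper class, the hidden size of 768, and the mean-pooled speaker embedding are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: down-project, non-linearity, up-project (hypothetical design)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection, so the frozen SSL features pass through unchanged at initialization.
        return x + self.up(self.act(self.down(x)))


class AdaptedSSLLayer(nn.Module):
    """Wraps one frozen SSL transformer layer with a trainable adapter."""

    def __init__(self, ssl_layer: nn.Module, dim: int):
        super().__init__()
        self.ssl_layer = ssl_layer
        for p in self.ssl_layer.parameters():
            p.requires_grad = False          # keep the pretrained SSL weights fixed
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.ssl_layer(x))


def speaker_embedding(frame_features: torch.Tensor) -> torch.Tensor:
    """Mean-pool frame-level SSL features into an utterance-level speaker embedding (assumed pooling)."""
    return frame_features.mean(dim=1)


if __name__ == "__main__":
    dim = 768                                   # typical SSL hidden size (assumption)
    # Stand-in for one pretrained SSL transformer layer.
    backbone_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
    adapted = AdaptedSSLLayer(backbone_layer, dim)

    noisy_reference = torch.randn(2, 200, dim)  # (batch, frames, dim) dummy reference features
    emb = speaker_embedding(adapted(noisy_reference))
    print(emb.shape)                            # torch.Size([2, 768])
```

Training only the small adapters while freezing the backbone is the usual way such modules limit catastrophic forgetting of the pretrained representations, which matches the paper's motivation for adapters. In the proposed pipeline, a speech enhancement front-end would additionally denoise the reference speech before it reaches the SSL model; that stage is omitted from this sketch.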

Authors (7)
  1. Kenichi Fujita (4 papers)
  2. Hiroshi Sato (40 papers)
  3. Takanori Ashihara (28 papers)
  4. Hiroki Kanagawa (3 papers)
  5. Marc Delcroix (94 papers)
  6. Takafumi Moriya (30 papers)
  7. Yusuke Ijima (11 papers)
Citations (6)