
A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS (2303.02719v2)

Published 5 Mar 2023 in eess.AS, cs.HC, cs.LG, and cs.SD

Abstract: Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
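
The recipe the abstract describes is to swap the mel-spectrogram for frozen SSL features from one encoder layer, keeping the rest of the two-stage pipeline fixed. As a minimal sketch of how such layer-wise features might be extracted (assuming the HuggingFace transformers library and the facebook/wav2vec2-base-960h checkpoint, a 12-layer ASR-finetuned wav2vec2.0; the paper does not specify this toolchain, and the input filename below is hypothetical):

```python
# Sketch: extract the 9th transformer layer of a 12-layer, ASR-finetuned
# wav2vec2.0 as a TTS intermediate representation. Assumes HuggingFace
# `transformers` and `torchaudio`; not necessarily the authors' toolchain.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-base-960h"  # assumed checkpoint choice
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2Model.from_pretrained(ckpt)
model.eval()

# Load audio and resample to the 16 kHz rate wav2vec2.0 expects.
waveform, sr = torchaudio.load("utterance.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = extractor(waveform.squeeze(0).numpy(),
                   sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs.input_values, output_hidden_states=True)

# hidden_states[0] is the CNN feature-encoder output; entries 1..12 are the
# transformer layers, so index 9 selects the 9th layer reported best in the
# paper's listening tests.
layer9 = outputs.hidden_states[9]  # shape (1, n_frames, 768), ~one frame / 20 ms
```

In a two-stage setup of this kind, the acoustic model is trained to predict these frame-level features from text, and a neural vocoder is trained to resynthesize the waveform from them, so the SSL layer plays exactly the role the mel-spectrogram normally does.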
