A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS (2303.02719v2)
Abstract: Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of the conventionally used mel-spectrograms. It is, however, unclear which speech SSL is the better fit for TTS, and whether performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study addresses these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while keeping the TTS model architecture and training settings constant. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR-finetuned) outperforms the other tested SSLs and mel-spectrograms in both read and spontaneous TTS. Our work sheds light both on how speech SSLs can readily improve current TTS systems and on how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
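The core operation the abstract describes, using a specific hidden layer of wav2vec2.0 as the intermediate representation instead of a mel-spectrogram, amounts to running audio through the SSL encoder and selecting one layer's activations. The following is a minimal sketch using the HuggingFace `transformers` Wav2Vec2 implementation; it builds a randomly initialised model from the default base config (12 transformer layers) so it runs without downloading weights. In practice one would load an ASR-finetuned checkpoint (e.g. `facebook/wav2vec2-base-960h`), which is an assumption here, not a detail stated in the abstract.

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# Default base config: 12 transformer layers, 768-dim hidden states.
config = Wav2Vec2Config()
model = Wav2Vec2Model(config).eval()

# 1 second of dummy 16 kHz audio standing in for a real utterance.
waveform = torch.randn(1, 16000)

with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# out.hidden_states has 13 entries: index 0 is the CNN feature-encoder
# output, indices 1-12 are the transformer layers. Index 9 is the
# 9th transformer layer, the one the study found to perform best.
layer9 = out.hidden_states[9]       # shape: (batch, frames, 768)
```

In a two-stage TTS system, an acoustic model would then be trained to predict `layer9`-style features from text, and a vocoder to reconstruct waveforms from them, in place of the usual text-to-mel and mel-to-waveform stages.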