An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis
Abstract: We propose a new model architecture for text-to-speech (TTS) synthesis that combines WavLM, a pre-trained self-supervised learning (SSL) speech model, with the BEST-RQ vector quantization framework. We assess whether WavLM's comparatively task-agnostic representations, together with the simplicity of BEST-RQ and its suitability for a wide array of downstream tasks, yield favorable outcomes. Experiments on the LibriSpeech dataset with SUPERB benchmarking show that the proposed model significantly underperforms. We speculate that the underlying cause is a mismatch between the two components: WavLM featurizes raw audio waveforms, whereas the BEST-RQ quantizer was designed to operate on spectrogram frames. We discuss the limitations of this approach to better guide future advancements in TTS.
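The abstract references the BEST-RQ quantizer without implementation detail, so the following is a minimal sketch of a BEST-RQ-style random-projection quantizer, written to illustrate the mechanism the abstract builds on. The dimensions and the log-mel input assumption follow the original BEST-RQ paper, not this work, and the `quantize` helper is a hypothetical name introduced here for illustration.

```python
import numpy as np

# Minimal sketch of a BEST-RQ-style random-projection quantizer.
# Both the projection matrix and the codebook are random and frozen;
# no quantizer parameters are ever trained.
rng = np.random.default_rng(0)

input_dim, code_dim, codebook_size = 80, 16, 8192  # assumed sizes, per BEST-RQ
projection = rng.standard_normal((input_dim, code_dim)) / np.sqrt(input_dim)
codebook = rng.standard_normal((codebook_size, code_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)  # l2-normalize entries

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map (T, input_dim) feature frames to (T,) discrete codebook indices."""
    z = frames @ projection
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # l2-normalize projections
    # Nearest codebook entry under cosine distance == argmax of inner product.
    return np.argmax(z @ codebook.T, axis=1)

# Example: 100 frames of 80-dim log-mel features -> 100 discrete targets.
targets = quantize(rng.standard_normal((100, input_dim)))
print(targets.shape)  # (100,)
```

Because the projection and codebook stay frozen, BEST-RQ adds no trainable quantizer parameters: an encoder is simply trained to predict these indices for masked frames, which is what makes the framework easy to pair with different encoders, or with input features other than the spectrogram frames it was designed for.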
References
- “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint, Mar. 2023.
- “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint, Jan. 2023.
- “Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision,” arXiv preprint arXiv:2302.03540, Feb. 2023.
- “SoundStorm: Efficient Parallel Audio Generation,” arXiv preprint arXiv:2305.09636, May 2023.
- “AudioLM: A Language Modeling Approach to Audio Generation,” arXiv preprint arXiv:2209.03143, Sept. 2022.
- “Conformer: Convolution-augmented Transformer for Speech Recognition,” arXiv preprint arXiv:2005.08100, May 2020.
- “Self-supervised Learning with Random-projection Quantizer for Speech Recognition,” arXiv preprint arXiv:2202.01855, Feb. 2022.
- “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” arXiv preprint arXiv:2006.11477, June 2020.
- “w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training,” arXiv preprint arXiv:2108.06209, Aug. 2021.
- “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, Oct. 2022.
- “LibriSpeech: An ASR corpus based on public domain audio books,” Apr. 2015, pp. 5206–5210.
- “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” CoRR, vol. abs/2106.07447, 2021.
- “Libri-Light: A Benchmark for ASR with Limited or No Supervision,” arXiv preprint arXiv:1912.07875, Dec. 2019.
- “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,” CoRR, vol. abs/2106.06909, 2021.
- “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, Aug. 2021, pp. 993–1003.
- “Self-Supervised Speech Representation Learning: A Review,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179–1210, Oct. 2022.
- “Pushing the limits of semi-supervised learning for automatic speech recognition,” 2022.
- “Fixing weight decay regularization in Adam,” CoRR, vol. abs/1711.05101, 2017.
- “SUPERB: Speech Processing Universal PERformance Benchmark,” in Proc. Interspeech 2021, 2021, pp. 1194–1198.
- “Snips Voice Platform: An embedded spoken language understanding system for private-by-design voice interfaces,” CoRR, vol. abs/1805.10190, 2018.
- “CoVoST: A diverse multilingual speech-to-text translation corpus,” in Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, May 2020, pp. 4197–4203.
- “Common Voice: A massively-multilingual speech corpus,” CoRR, vol. abs/1912.06670, 2019.