High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models (2309.15512v2)
Abstract: Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require large amounts of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic and acoustic) and using two sequence-to-sequence tasks, enabling training with minimal supervision. However, existing methods suffer from information redundancy and dimensionality explosion in their semantic representations, and from high-frequency waveform distortion in their discrete acoustic representations. Autoregressive frameworks exhibit instability and limited controllability, while non-autoregressive frameworks suffer from prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method in which every module is built on diffusion models. The non-autoregressive framework improves controllability, and the duration diffusion model enables diverse prosodic expression. Contrastive Token-Acoustic Pretraining (CTAP) serves as the intermediate semantic representation, resolving the information redundancy and dimensionality explosion of existing semantic coding methods. The mel-spectrogram serves as the acoustic representation. Both the semantic and acoustic representations are predicted by continuous regression tasks, avoiding the high-frequency, fine-grained waveform distortion introduced by discrete acoustic codecs. Experimental results show that our proposed method outperforms the baseline. We provide audio samples on our website.
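The approach rests on denoising diffusion probabilistic models (DDPMs) that regress continuous targets such as mel-spectrograms rather than discrete codec tokens. The sketch below is a minimal, hypothetical illustration of one DDPM training step conditioned on semantic features, not the authors' implementation; the `denoiser` network, tensor shapes, and schedule hyperparameters are all assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of one DDPM training step (Ho et al., 2020) for
# regressing a continuous acoustic representation (e.g. a mel-spectrogram)
# conditioned on semantic features. The denoiser network, shapes, and
# hyperparameters below are assumptions, not the paper's configuration.

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_step(denoiser, mel, semantic):
    """mel: (batch, n_mels, frames); semantic: conditioning features."""
    b = mel.size(0)
    t = torch.randint(0, T, (b,), device=mel.device)        # random timestep
    noise = torch.randn_like(mel)                           # eps ~ N(0, I)
    a_bar = alphas_cumprod.to(mel.device)[t].view(b, 1, 1)  # broadcast shape
    # forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    x_t = a_bar.sqrt() * mel + (1.0 - a_bar).sqrt() * noise
    # the network predicts the added noise, conditioned on semantics and t
    eps_pred = denoiser(x_t, t, semantic)
    return F.mse_loss(eps_pred, noise)                      # simple L2 objective
```

Because the regression target here is a continuous mel-spectrogram rather than discrete codec tokens, an objective of this form sidesteps the quantization-induced high-frequency distortion that the abstract attributes to discrete acoustic representations.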
Authors: Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang