
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models (2309.15512v2)

Published 27 Sep 2023 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic and acoustic) and using two sequence-to-sequence tasks to enable training with minimal supervision. However, existing methods suffer from information redundancy and dimension explosion in the semantic representation, and from high-frequency waveform distortion in the discrete acoustic representation. Autoregressive frameworks exhibit typical instability and uncontrollability issues, while non-autoregressive frameworks suffer from the prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method in which all modules are constructed from diffusion models. The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression. Contrastive Token-Acoustic Pretraining (CTAP) is used as the intermediate semantic representation to solve the problems of information redundancy and dimension explosion in existing semantic coding methods. The mel-spectrogram is used as the acoustic representation. Both semantic and acoustic representations are predicted by continuous variable regression tasks, which avoids high-frequency fine-grained waveform distortion. Experimental results show that our proposed method outperforms the baseline. We provide audio samples on our website.
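The pipeline the abstract describes is two-stage: text (phonemes) is first mapped to a CTAP-style semantic embedding, which is then mapped to mel-spectrogram frames, with each stage realized as a diffusion model regressing continuous targets rather than discrete codec tokens. The sketch below is a minimal illustration of that structure under standard DDPM epsilon-prediction; the tiny MLP denoisers, the dimensions, and names such as Denoiser and diffusion_loss are illustrative assumptions, not the authors' architecture.

    # Hypothetical sketch: two diffusion stages over continuous targets
    # (text -> semantic embedding -> mel frames). Shapes and modules are
    # stand-ins, not the paper's actual networks.
    import torch
    import torch.nn as nn

    T = 1000  # number of diffusion steps
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    class Denoiser(nn.Module):
        """Predicts the noise added to a continuous target, given a condition."""
        def __init__(self, dim, cond_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(),
                nn.Linear(256, 256), nn.SiLU(),
                nn.Linear(256, dim),
            )

        def forward(self, x_t, cond, t):
            t_emb = (t.float() / T).unsqueeze(-1)  # crude scalar timestep embedding
            return self.net(torch.cat([x_t, cond, t_emb], dim=-1))

    def diffusion_loss(model, x0, cond):
        """Standard DDPM epsilon-prediction loss on a continuous target x0."""
        t = torch.randint(0, T, (x0.shape[0],))
        eps = torch.randn_like(x0)
        ab = alpha_bar[t].unsqueeze(-1)
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward noising
        return ((model(x_t, cond, t) - eps) ** 2).mean()

    # Stage 1: phoneme embedding -> CTAP-style semantic embedding.
    text2sem = Denoiser(dim=256, cond_dim=256)
    # Stage 2: semantic embedding -> mel-spectrogram frames.
    sem2mel = Denoiser(dim=80, cond_dim=256)

    # One illustrative training step per stage with random stand-in data.
    phone_emb = torch.randn(8, 256)  # stand-in phoneme embeddings
    sem_emb = torch.randn(8, 256)    # stand-in CTAP semantic targets
    mel = torch.randn(8, 80)         # stand-in mel frames
    loss = diffusion_loss(text2sem, sem_emb, phone_emb) + \
           diffusion_loss(sem2mel, mel, sem_emb)
    loss.backward()

Note the design choice this toy version preserves: both stages regress continuous embeddings, so there is no vector quantization in the loop. This is what the abstract means by "continuous variable regression tasks," and it is why no quantization error can surface as high-frequency waveform distortion, unlike discrete codec-token targets.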

Authors (7)
  1. Chunyu Qiang (21 papers)
  2. Hao Li (803 papers)
  3. Yixin Tian (2 papers)
  4. Yi Zhao (222 papers)
  5. Ying Zhang (388 papers)
  6. Longbiao Wang (46 papers)
  7. Jianwu Dang (41 papers)
Citations (2)