Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding (2307.15484v3)

Published 28 Jul 2023 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging problem caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion problems of existing semantic encoding methods. To address these problems, three progressive methods are proposed. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a LLM and diffusion models, which models the semantic embedding into the mel-spectrogram based on a diffusion model to achieve higher audio quality. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Second, we propose Tetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion model-based modules that design a duration diffusion model to achieve diverse prosodic expressions. Finally, we propose Tri-Diff-Speech, a non-autoregressive structure consisting of three diffusion model-based modules that verify the non-necessity of existing semantic encoding models and achieve the best results. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (28)
  1. “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
  2. “Deep voice: Real-time neural text-to-speech,” in International Conference on Machine Learning. PMLR, 2017, pp. 195–204.
  3. “Neural speech synthesis with transformer network,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 6706–6713.
  4. “Fastspeech: Fast, robust and controllable text to speech,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  5. “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020.
  6. “Parallel tacotron: Non-autoregressive and controllable tts,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5709–5713.
  7. “Improving language understanding by generative pre-training,” 2018.
  8. “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  9. “Audiolm: a language modeling approach to audio generation,” arXiv preprint arXiv:2209.03143, 2022.
  10. “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
  11. “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint arXiv:2303.03926, 2023.
  12. “Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,” arXiv preprint arXiv:2302.03540, 2023.
  13. “Zero-shot voice conditioning for denoising diffusion tts models,” arXiv preprint arXiv:2206.02246, 2022.
  14. “Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers,” arXiv preprint arXiv:2304.09116, 2023.
  15. “Voicebox: Text-guided multilingual universal speech generation at scale,” arXiv preprint arXiv:2306.15687, 2023.
  16. “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, 2020.
  17. “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  18. “High fidelity neural audio compression,” ArXiv, vol. abs/2210.13438, 2022.
  19. “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2022.
  20. “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
  21. “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
  22. “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
  23. “Improving prosody for cross-speaker style transfer by semi-supervised style extractor and hierarchical modeling in speech synthesis,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  24. “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  25. “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
  26. “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
  27. “Aishell-3: A multi-speaker mandarin tts corpus and the baselines,” arXiv preprint arXiv:2010.11567, 2020.
  28. “Back-translation-style data augmentation for mandarin chinese polyphone disambiguation,” in 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2022, pp. 1915–1919.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (8)
  1. Chunyu Qiang (21 papers)
  2. Hao Li (803 papers)
  3. Hao Ni (43 papers)
  4. He Qu (4 papers)
  5. Ruibo Fu (54 papers)
  6. Tao Wang (700 papers)
  7. Longbiao Wang (46 papers)
  8. Jianwu Dang (41 papers)
Citations (7)