DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation (2310.01381v3)
Abstract: Diffusion models have recently been shown to be relevant for high-quality speech generation. Most prior work has focused on generating spectrograms and therefore requires a subsequent model (i.e., a vocoder) to convert the spectrogram to a waveform. This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, the model can synthesize speech of effectively unlimited duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has empirical advantages: specifically, it allows the creation of local acoustic behaviors, such as vocal fry, which makes the overall waveform sound more natural. Furthermore, the proposed diffusion model is stochastic rather than deterministic; each inference therefore generates a slightly different waveform variation, enabling an abundance of valid realizations. Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
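The autoregressive overlapping-frame scheme described above can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the real model runs a learned reverse-diffusion denoiser, whereas here a simple moving-average smoother stands in for the denoising update, and all names and parameters (`frame_len`, `overlap`, `denoise_steps`) are assumptions chosen for illustration. The key idea shown is that each new frame starts from noise, its first `overlap` samples are repeatedly clamped to the tail of the previously generated frame during refinement (inpainting-style conditioning), and only the non-overlapping portion is appended to the output.

```python
import numpy as np

def generate_waveform(num_frames, frame_len=1000, overlap=200,
                      denoise_steps=50, seed=0):
    """Toy sketch of autoregressive overlapping-frame generation.

    Illustrative only: a moving-average smoother replaces the learned
    reverse-diffusion denoiser of the actual model.
    """
    rng = np.random.default_rng(seed)
    prev_tail = None   # conditioning signal: tail of the previous frame
    chunks = []
    for _ in range(num_frames):
        frame = rng.standard_normal(frame_len)      # start from Gaussian noise
        for _ in range(denoise_steps):
            if prev_tail is not None:
                # Clamp the overlap region to the previously generated tail,
                # so the refinement stays temporally coherent across frames.
                frame[:overlap] = prev_tail
            # Stand-in "denoising" step (real model: one reverse-diffusion update).
            frame = np.convolve(frame, np.ones(5) / 5, mode="same")
        prev_tail = frame[-overlap:].copy()
        # Keep the full first frame; for later frames drop the duplicated overlap.
        chunks.append(frame if not chunks else frame[overlap:])
    return np.concatenate(chunks)

wav = generate_waveform(num_frames=4)
# Output length: frame_len + (num_frames - 1) * (frame_len - overlap)
```

Because generation is frame-by-frame with only a fixed-size conditioning tail, the loop can run for arbitrarily many frames, which is what enables unlimited-duration synthesis.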
- Roi Benita
- Michael Elad
- Joseph Keshet