Consistency of speech generation with latent diffusion audio models
Determine whether latent diffusion–based models for general audio and music generation (e.g., AudioLDM or Stable Audio–style approaches) can generate speech that stays consistent and intelligible over extended utterances, as speech generation tasks require.
References
However, these methods cannot be used in a streaming fashion, and it is unclear whether they could generate consistent speech.
— Moshi: a speech-text foundation model for real-time dialogue
(2410.00037 - Défossez et al., 17 Sep 2024) in Section 2, Related Work – Audio Language Modeling