Consistency of speech generation with latent diffusion audio models

Determine whether latent diffusion–based models for general audio and music generation (e.g., AudioLDM or Stable Audio–style approaches) can generate consistent, intelligible speech over extended utterances, and hence whether they are suitable for speech generation tasks.

Background

In discussing alternatives to autoregressive token models, the paper reviews latent diffusion approaches for audio and music modeling. While acknowledging their ability to reduce the need for hierarchical discrete tokens, the authors note that these methods are not compatible with streaming inference, and they explicitly flag uncertainty about their ability to produce coherent speech.
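To make the streaming limitation concrete, below is a minimal, hypothetical sketch (plain NumPy, a toy denoiser, made-up shapes and noise schedule; not the paper's models or any specific library API) of DDPM-style ancestral sampling over a latent sequence. Each reverse step refines the entire utterance jointly, so no prefix of the audio is final until the last step completes, in contrast to an autoregressive token model that emits frames left to right.

```python
import numpy as np

def denoiser(z_t, t):
    # Hypothetical stand-in for a trained latent-diffusion network
    # (in practice a conditioned U-Net or transformer). Returns a
    # prediction of the noise component of z_t at diffusion step t;
    # zeros here just so the sketch runs end to end.
    return np.zeros_like(z_t)

def ddpm_sample(num_latent_frames=500, latent_dim=64, num_steps=50, seed=0):
    """Minimal DDPM-style ancestral sampling over a full latent sequence.

    Every reverse step updates ALL latent frames jointly, so no prefix of
    the utterance can be decoded and played back until the final step
    finishes -- this is the sense in which such samplers are not streamable.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 2e-2, num_steps)            # toy noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from pure Gaussian noise covering the entire utterance at once.
    z = rng.standard_normal((num_latent_frames, latent_dim))

    for t in reversed(range(num_steps)):
        eps_hat = denoiser(z, t)                           # predict noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (z - coef * eps_hat) / np.sqrt(alphas[t])   # posterior mean
        noise = rng.standard_normal(z.shape) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise
    return z  # a separate latent decoder would map this to a waveform

latents = ddpm_sample()
print(latents.shape)  # (500, 64): the whole utterance is produced in one shot
```

Because the sampler only yields a usable latent after all denoising steps, latency grows with utterance length, which is why such approaches conflict with the real-time, full-duplex setting the paper targets.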

This uncertainty matters because the paper’s core contribution—real-time, full‑duplex speech generation—relies on models that maintain linguistic consistency across long sequences; establishing whether diffusion approaches can match this is important for future model design.

References

"However, these methods cannot be used in a streaming fashion, and it is unclear whether they could generate consistent speech."

Moshi: a speech-text foundation model for real-time dialogue (arXiv:2410.00037, Défossez et al., 17 Sep 2024), Section 2, Related Work – Audio Language Modeling