Determine the preferred modeling paradigm for music and audio generation
Determine which generative modeling paradigm should be preferred for music and audio generation. The contrast is between auto-regressive decoding over discrete audio tokens and non-auto-regressive continuous-latent approaches such as diffusion and conditional flow matching, compared under matched data, architecture, and evaluation conditions.
References
However, it is not clear what approach should we follow for music and audio generation.
— Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
(2506.08570 - Tal et al., 10 Jun 2025) in Section 1: Introduction