Determine the preferred modeling paradigm for music and audio generation

Determine which generative modeling paradigm should be preferred for music and audio generation, specifically contrasting auto-regressive decoding over discrete audio tokens with non-auto-regressive continuous-latent approaches such as diffusion and conditional flow matching, under matched data, architecture, and evaluation conditions.

Background

The paper situates music and audio generation between two successful traditions: auto-regressive decoding dominates natural language processing, while diffusion and flow-matching approaches dominate image generation. In audio, and music generation in particular, there has been no convergence on a dominant approach, with strong results reported across different paradigms.

This work provides a controlled comparison between auto-regressive decoding and conditional flow matching, holding data, representations, and backbone architecture constant to isolate the effect of the modeling paradigm. Although the comparison yields empirical insights, the authors explicitly state that it remains unclear which approach should be followed for music and audio generation.
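To make the contrast between the two paradigms concrete, the following is a minimal NumPy sketch of the training objective each one typically optimizes: next-token cross-entropy for auto-regressive decoding over discrete tokens, and a velocity-regression loss for conditional flow matching over continuous latents. It is illustrative only and is not taken from the paper. The function names (`ar_next_token_loss`, `cfm_loss`), the `velocity_fn` callable, and the choice of a linear interpolation path between Gaussian noise and data are assumptions made for this example.

```python
import numpy as np

def ar_next_token_loss(logits, tokens):
    """Mean cross-entropy for predicting tokens[t+1] from logits[t].

    logits: (T, V) unnormalized scores over a vocabulary of V audio tokens.
    tokens: (T,) integer token ids.
    """
    pred, target = logits[:-1], tokens[1:]
    # Numerically stable log-softmax over the vocabulary axis.
    pred = pred - pred.max(axis=-1, keepdims=True)
    log_probs = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target)), target].mean())

def cfm_loss(velocity_fn, x1, rng):
    """Conditional flow matching loss on continuous latents.

    Assumes a linear path x_t = (1 - t) * x0 + t * x1 from noise x0 to
    data x1, whose target velocity is x1 - x0 (an assumed, common choice).
    velocity_fn: hypothetical model, maps (x_t, t) -> predicted velocity.
    x1: (B, D) batch of data latents.
    """
    x0 = rng.standard_normal(x1.shape)       # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # one time per example
    xt = (1.0 - t) * x0 + t * x1             # point on the path
    target = x1 - x0                         # velocity of the linear path
    return float(((velocity_fn(xt, t) - target) ** 2).mean())
```

In a matched comparison of the kind described above, both losses would be driven by the same backbone and data; only the output head and the objective differ.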

References

However, it is not clear what approach should we follow for music and audio generation.

— Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation (2506.08570 - Tal et al., 10 Jun 2025) in Section 1: Introduction
