Ascertain the factors driving performance differences across text-conditioned music generation models
Ascertain whether observed performance differences among text-conditioned music generation models arise primarily from the choice of generative modeling paradigm (e.g., auto-regressive decoding versus conditional flow matching/diffusion) or from confounding factors such as training data, latent representations, architecture design, and optimization procedures.
References
"While a growing number of systems have demonstrated compelling capabilities in text-conditioned music generation, it is unclear what fundamentally accounts for performance differences across models."
Tal et al., "Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation" (arXiv:2506.08570, 10 Jun 2025), Section 1: Introduction