Interpretability of Text-to-Music Mapping
Investigate how text-conditioned music generation models, such as MusicGen, MusicLM, MuseCoco, and MusicLDM, internally translate natural language descriptions into representations of musical concepts and then produce audio from them, with the goal of making the mechanisms of text-to-music generation interpretable.
References
Understanding how these models implicitly translate textual descriptions into musical concepts and subsequently produce music remains an unsolved task.
— Foundation Models for Music: A Survey
(2408.14340 - Ma et al., 26 Aug 2024) in Section 4.5, Interpretability and Controllability on Music Generation