Interpretability of Text-to-Music Mapping

Investigate and elucidate how text-conditioned music generation models (such as MusicGen, MusicLM, MuseCoco, and MusicLDM) internally translate natural language descriptions into representations of musical concepts and subsequently produce audio, with the goal of making the text-to-music generation process interpretable.

Background

The survey notes that modern text-to-music systems deliver impressive results yet operate largely as black boxes, making it difficult to extract the musical knowledge embedded in them or to assess how well they understand musical concepts.

The need for interpretability is motivated by users’ desire for controllability and transparency, as well as ethical and practical considerations in musical AI. The authors explicitly state that understanding the internal translation from textual prompts to musical outputs is still an unsolved task.
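To make the question concrete, the sketch below shows the kind of internal mapping at issue. It assumes the Hugging Face transformers implementation of MusicGen and the facebook/musicgen-small checkpoint (illustrative choices not specified in the survey): it generates audio from a text prompt and also exposes the text-encoder states that condition the audio decoder, i.e., the intermediate representations whose interpretation the survey identifies as unsolved.

```python
# Minimal sketch, assuming the Hugging Face `transformers` MusicGen implementation
# and the `facebook/musicgen-small` checkpoint (illustrative choices, not taken
# from the survey).
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

prompt = ["a calm lo-fi beat with soft piano and vinyl crackle"]
inputs = processor(text=prompt, padding=True, return_tensors="pt")

# Text-to-music generation: the model maps the prompt to discrete audio tokens
# and decodes them into a waveform.
with torch.no_grad():
    audio = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
print("generated waveform:", audio.shape)  # (batch, channels, samples)

# Peek at the conditioning pathway: the text encoder's final hidden states are
# what the audio decoder cross-attends to, so probing these vectors (e.g., with
# linear probes for tempo, key, or instrumentation) is one possible way to study
# the text-to-music mapping.
with torch.no_grad():
    text_states = model.text_encoder(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    ).last_hidden_state
print("text conditioning states:", text_states.shape)  # (batch, tokens, hidden_dim)
```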

References

Understanding how these models implicitly translate textual descriptions into musical concepts and subsequently produce music remains an unsolved task.

Foundation Models for Music: A Survey (Ma et al., arXiv:2408.14340, 26 Aug 2024), Section 4.5: Interpretability and Controllability on Music Generation.