Long-Sequence Modelling in Music Foundation Models

Develop foundation-model innovations for long-sequence music modelling that can represent and learn from minutes-long, high-sampling-rate music recordings, by designing pre-training strategies and model architectures capable of handling extended temporal contexts efficiently and accurately.

Background

Music typically spans long durations at high sampling rates, making end-to-end modelling more challenging for foundation models than in the speech or image domains. The survey emphasizes that general techniques such as instruction tuning and in-context learning are still emerging for music, and that handling long sequences is a domain-specific difficulty requiring architectural and training innovations.
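To make the scale concrete, the short Python sketch below estimates how many time steps a model must handle for a single track under a few common audio representations. The clip length, codec frame rate, and hop size are illustrative assumptions for this example, not figures taken from the survey.

def sequence_length(duration_s: float, rate_hz: float) -> int:
    """Number of time steps a model must process for a clip of a given length."""
    return int(duration_s * rate_hz)

duration_s = 180.0  # a three-minute track (assumed example length)

representations = {
    "raw waveform @ 44.1 kHz": 44_100,      # CD-quality sample rate
    "neural codec tokens @ 50 Hz": 50,      # illustrative codec frame rate (assumed)
    "spectrogram frames, 10 ms hop": 100,   # illustrative hop size (assumed)
}

for name, rate in representations.items():
    n = sequence_length(duration_s, rate)
    # Vanilla self-attention scales quadratically with sequence length,
    # so even token-level rates imply very large attention matrices.
    print(f"{name:32s} -> {n:>9,d} steps, ~{n * n:,} attention pairs")

Even at codec-token rates, a three-minute track spans roughly 9,000 steps, and the raw waveform approaches eight million, which is why quadratic-cost architectures struggle at this scale.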

The authors explicitly identify long-sequence modelling as one of several open problems specific to music foundation model (FM) development, highlighting the need for solutions that can manage extended temporal dependencies without prohibitive computation or loss of coherence.
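One widely used family of architectural responses, shown here purely as an illustrative sketch and not as a technique proposed in the survey, restricts self-attention to a local window so that each step attends only to nearby context. The window size and tensor shapes below are arbitrary assumptions.

import torch
import torch.nn.functional as F

def local_attention(q, k, v, window: int):
    """Scaled dot-product attention where each position attends only to
    neighbours within `window` steps on either side (a banded mask)."""
    n = q.size(-2)
    idx = torch.arange(n, device=q.device)
    # Boolean mask: True where attention is allowed, i.e. |i - j| <= window.
    mask = (idx[None, :] - idx[:, None]).abs() <= window
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Toy usage: a 2,048-step token sequence with 4 heads of dimension 64 (assumed sizes).
q = k = v = torch.randn(1, 4, 2048, 64)
out = local_attention(q, k, v, window=128)
print(out.shape)  # torch.Size([1, 4, 2048, 64])

This dense-mask version only illustrates the attention pattern; practical long-context models exploit the band structure (or alternatives such as recurrence, state-space layers, or hierarchical tokenization) so the full n-by-n matrix is never materialized.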

References

"In addition to this, there remain several open problems specific to music, such as long-sequence modelling (Section \ref{subsec:long_sequence_modelling}), that require foundation model innovations in pre-training strategies or model architecture methodologies etc."

Foundation Models for Music: A Survey (Ma et al., arXiv:2408.14340, 26 Aug 2024), Section 4: Technical Details of Foundation Models.