Evaluation metrics for flexible but accurate music–multimodal alignment

Develop objective evaluation metrics for cross-modal alignment between music and other modalities (such as video) that tolerate multiple valid pairings for a given input while still accurately quantifying alignment quality, balancing flexibility in artistic pairing with rigorous assessment.

Background

The survey notes that many current objective metrics for multimodal music tasks borrow embeddings from general-purpose audio LLMs, which often underrepresent music-specific characteristics and overlook temporal structure. This limits their effectiveness for assessing the correspondence between music and other modalities.
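To make the temporal-structure criticism concrete, here is a minimal sketch of the alternative: comparing sequences of time-windowed embeddings rather than one pooled vector per clip, so a metric can reflect how alignment evolves over time. All function names and the frame-embedding representation are illustrative assumptions, not from the survey.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def temporal_alignment(music_frames, video_frames):
    """Average frame-wise similarity over time-aligned windows.

    Contrast with pooling each clip into a single embedding and taking one
    cosine score, which discards temporal structure entirely.
    """
    n = min(len(music_frames), len(video_frames))
    return sum(cosine(m, v) for m, v in zip(music_frames[:n], video_frames[:n])) / n
```

A pooled-embedding metric would score two clips identically even if their per-segment correspondences were shuffled in time; the frame-wise average above would not.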

Because music–video and other music–multimodal pairings can be artistically non-unique—where multiple different music tracks may appropriately match the same visual content—evaluation must allow for multiple valid alignments. The open problem is to create metrics that respect this flexibility while still providing a precise measure of alignment quality.
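One way to make "multiple valid alignments" operational is a set-tolerant retrieval score: count a pairing as correct if the retrieved music falls anywhere in an annotated set of acceptable tracks, while still reporting a continuous similarity for the best valid match. The sketch below is a hypothetical illustration under that assumption; the function names, candidate dictionary, and valid-set annotation are inventions for this example, not a metric proposed by the survey.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def flexible_alignment_score(video_emb, candidates, valid_ids, k=3):
    """Score a video against a pool of candidate music tracks.

    Returns (best_valid, hit):
      best_valid -- highest similarity over the annotated valid set, so any
                    valid pairing can earn full credit (flexibility);
      hit        -- whether any valid track appears in the top-k retrieval,
                    a strict accuracy check (rigor).
    """
    sims = {cid: cosine(video_emb, emb) for cid, emb in candidates.items()}
    best_valid = max(sims[cid] for cid in valid_ids)
    topk = sorted(sims, key=sims.get, reverse=True)[:k]
    hit = any(cid in valid_ids for cid in topk)
    return best_valid, hit
```

The design separates the two requirements: `hit` tolerates non-unique ground truth instead of demanding one canonical match, while `best_valid` still grades the quality of the closest acceptable pairing on a continuous scale.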

For example, a single video may be effectively paired with multiple music tracks, each creating a distinct but coherent audiovisual experience. This raises an open challenge: designing evaluation metrics that balance tolerance for flexible pairings with the need for alignment accuracy.

References

A Survey on Cross-Modal Interaction Between Music and Multimodal Data (2504.12796 - Li et al., 17 Apr 2025) in Subsection "Evaluation", Section 6 (Dataset and Evaluation)