- The paper presents a novel contrastive pre-training method that aligns natural language with symbolic music via a dual encoder architecture.
- It employs techniques such as text dropout, bar patching, and a masked music model to represent music data compactly and improve generalization.
- Experimental results demonstrate superior semantic search and zero-shot classification, validating the model’s effectiveness in symbolic MIR.
The paper presents CLaMP, a method for contrastive language-music pre-training, designed to advance cross-modal symbolic music information retrieval (MIR). The focus is on aligning natural language with symbolic music to enable semantic search and zero-shot classification without requiring fine-tuning on specific tasks.
Methodology Overview
CLaMP employs a dual encoder architecture, where both music and text encoders are trained jointly using a contrastive loss function. The model leverages a newly compiled dataset, WebMusicText, consisting of 1.4 million music-text pairs, which provides a rich foundation for pre-training. A number of innovative techniques are introduced to enhance the pre-training process, including:
- Text Dropout: A data augmentation strategy that randomly drops portions of the textual input during training to improve generalization.
- Bar Patching: An approach for efficiently encoding music data by transforming bars into patches, reducing sequence length by over 90% compared to traditional ABC notation.
- Masked Music Model (M3): A self-supervised pre-training objective aimed at capturing musical context and structure, which adds noise to randomly selected bar patches and trains the model to reconstruct them.
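The dual encoder is trained with a CLIP-style symmetric contrastive loss: matched music-text pairs sit on the diagonal of a batch similarity matrix, and cross-entropy is applied in both directions. The sketch below is a minimal NumPy illustration of that objective; the function name, temperature value, and batch layout are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def contrastive_loss(music_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    music_emb, text_emb: (batch, dim) arrays of L2-normalized embeddings,
    where row i of each array forms a matching music-text pair.
    """
    # Pairwise cosine similarities, scaled by temperature; positives on the diagonal.
    logits = music_emb @ text_emb.T / temperature  # (batch, batch)
    labels = np.arange(len(logits))

    def cross_entropy(l):
        # Log-softmax over each row, with max-subtraction for numerical stability.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the music-to-text and text-to-music directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each pair's music and text embeddings together while pushing apart the non-matching combinations within the batch.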
Experimental Results
The experiments demonstrate that CLaMP performs comparably to, and often exceeds, state-of-the-art models that require task-specific fine-tuning. Its semantic search capability is evaluated on the WikiMusicText (WikiMT) dataset, where it achieves a Mean Reciprocal Rank (MRR) of 0.2561 and a Hit Ratio at 1 (HR@1) of 0.1931.
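For reference, both retrieval metrics are simple functions of the rank at which the correct item appears for each query. A small, self-contained sketch (the function name is illustrative, not from the paper):

```python
def mrr_and_hr1(ranks):
    """Compute Mean Reciprocal Rank and Hit Ratio@1.

    ranks: 1-indexed rank of the correct item in each query's result list.
    MRR averages 1/rank; HR@1 is the fraction of queries ranked first.
    """
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hr1 = sum(1 for r in ranks if r == 1) / len(ranks)
    return mrr, hr1

# mrr_and_hr1([1, 3, 2]) → (≈0.611, ≈0.333)
```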
For zero-shot music classification, CLaMP performs notably well on score-oriented datasets such as WikiMT and VGMIDI, but less well on performance MIDI datasets like Pianist8, reflecting its training focus on score data. Linear probe evaluations, however, indicate that training a lightweight classifier on CLaMP's frozen representations yields substantial further improvements.
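Zero-shot classification with a dual encoder amounts to comparing each music embedding against text embeddings of the candidate class descriptions and picking the most similar, with no task-specific training. A minimal sketch under that assumption (function and variable names are hypothetical):

```python
import numpy as np

def zero_shot_classify(music_emb, class_text_embs, class_names):
    """Assign each music embedding the class whose text embedding
    is closest in cosine similarity.

    music_emb: (n, d) L2-normalized music embeddings.
    class_text_embs: (k, d) L2-normalized embeddings of class descriptions.
    class_names: list of k class labels.
    """
    sims = music_emb @ class_text_embs.T  # (n, k) cosine similarities
    return [class_names[i] for i in sims.argmax(axis=1)]
```

In practice the class descriptions would themselves be encoded by the text encoder, so classification quality hinges on how well the shared embedding space aligns the two modalities.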
Implications and Future Directions
The implications of this research are multifaceted. Practically, CLaMP can streamline the process of music retrieval and classification in digital libraries, potentially impacting music recommendation systems and automated music analysis. Theoretically, the integration of language and music modalities opens avenues for further research into cross-modal representation learning, which might enhance generative models in music composition and production.
Future work should investigate scaling the dataset to include performance MIDI and other symbolic formats, thus broadening the applicability of CLaMP across diverse musical contexts. Additionally, leveraging larger transformer architectures could further improve cross-modal understanding, enabling more sophisticated MIR tasks.
In conclusion, CLaMP represents a significant contribution to the field of symbolic MIR, illustrating the potential of contrastive learning frameworks in harnessing the semantic richness of language to inform and improve music information retrieval and classification.