MuLan: A Joint Embedding of Music Audio and Natural Language (2208.12415v1)

Published 26 Aug 2022 in eess.AS, cs.CL, cs.SD, and stat.ML

Abstract: Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.

Authors (6)
  1. Qingqing Huang (16 papers)
  2. Aren Jansen (25 papers)
  3. Joonseok Lee (39 papers)
  4. Ravi Ganti (15 papers)
  5. Judith Yue Li (6 papers)
  6. Daniel P. W. Ellis (16 papers)
Citations (117)

Summary

An Overview of MuLan: A Joint Embedding of Music Audio and Natural Language

Introduction

The paper "MuLan: A Joint Embedding of Music Audio and Natural Language" presents a significant contribution to the field of music information retrieval by introducing a new model for linking music audio with natural language descriptions. The MuLan model operates without predefined, rigid ontologies, using a joint audio-text embedding approach. By leveraging 44 million music recordings and their associated free-form text annotations, the model produces a representation that supports a variety of functionalities, such as zero-shot music tagging, language understanding, and cross-modal retrieval. This essay provides a detailed examination of the methodologies, experimental results, and implications of this research.

Methodology

MuLan employs a two-tower architecture in which audio and text inputs are processed independently but mapped into a shared embedding space. The audio embedding network is either a ResNet-50 or an Audio Spectrogram Transformer (AST), both operating on log mel spectrogram inputs. The text embedding network is a BERT (Bidirectional Encoder Representations from Transformers) model with a fixed-length input sequence. Training uses cross-modal contrastive learning on a large-scale dataset of paired music audio and text annotations.
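
To make the training setup concrete, the sketch below shows the general shape of a two-tower contrastive objective in PyTorch. The encoder modules, the 128-dimensional shared space, and the temperature value are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a two-tower audio-text contrastive objective (InfoNCE-style).
# AudioEncoder / TextEncoder stand in for the paper's ResNet-50 / AST audio tower
# and BERT text tower; their internals and the 128-d embedding size are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    def __init__(self, audio_encoder: nn.Module, text_encoder: nn.Module,
                 audio_dim: int, text_dim: int, embed_dim: int = 128):
        super().__init__()
        self.audio_encoder = audio_encoder          # e.g. spectrogram CNN or AST
        self.text_encoder = text_encoder            # e.g. BERT-style encoder
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, log_mel: torch.Tensor, token_ids: torch.Tensor):
        # Project both modalities into the shared space and L2-normalize.
        a = F.normalize(self.audio_proj(self.audio_encoder(log_mel)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(token_ids)), dim=-1)
        return a, t

def contrastive_loss(a: torch.Tensor, t: torch.Tensor, temperature: float = 0.07):
    # Paired (audio_i, text_i) items are positives; all other pairings in the
    # batch act as negatives. Symmetric cross-entropy over the similarity matrix.
    logits = a @ t.T / temperature                  # (batch, batch)
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```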

The dataset, central to this endeavor, combines text drawn from video metadata, comments, and playlists with music audio filtered by an audio classifier. This scale enables robust training and underpins the model's zero-shot learning and cross-modal retrieval capabilities.

Experiments and Results

Zero-shot Music Tagging

MuLan was evaluated on zero-shot music tagging using datasets such as MagnaTagATune (MTAT) and AudioSet, where tag names are scored directly against audio through the shared embedding space, and the model demonstrated strong generalization. In addition, linear probes trained on MuLan embeddings achieved state-of-the-art performance on downstream music tagging tasks, surpassing other pretrained models and end-to-end trained counterparts.
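
Conceptually, zero-shot tagging with such a joint embedding reduces to scoring each candidate tag string against a clip's audio embedding. The sketch below illustrates that scoring step; the `embed_audio` and `embed_text` helpers are hypothetical wrappers around the trained towers, not part of the paper.

```python
# Sketch of zero-shot tagging with a joint embedding: score each tag by its
# cosine similarity to the clip embedding. `embed_audio` and `embed_text` are
# hypothetical wrappers around the trained audio and text towers.
import numpy as np

def zero_shot_tags(clip_audio, tag_vocabulary, embed_audio, embed_text, top_k=5):
    a = embed_audio(clip_audio)                     # (d,), assumed L2-normalized
    T = np.stack([embed_text(tag) for tag in tag_vocabulary])  # (num_tags, d)
    scores = T @ a                                  # cosine similarity per tag
    order = np.argsort(-scores)[:top_k]
    return [(tag_vocabulary[i], float(scores[i])) for i in order]
```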

Cross-modal Music Retrieval

Retrieval experiments further underline MuLan's effectiveness. Evaluated against curated playlists, the model retrieved relevant music recordings from free-form textual queries, demonstrating its applicability in scenarios where metadata alone is insufficient.
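
Text-to-music retrieval follows the same similarity-ranking pattern in the other direction, typically over a catalogue of pre-computed audio embeddings. The helper names and brute-force ranking below are illustrative assumptions; a production system would more likely use an approximate nearest-neighbor index.

```python
# Sketch of text-to-music retrieval: pre-compute audio embeddings once, then
# rank them against the embedding of a free-form text query.
import numpy as np

def build_index(catalogue_audio, embed_audio):
    # (num_tracks, d) matrix of L2-normalized clip embeddings.
    return np.stack([embed_audio(clip) for clip in catalogue_audio])

def retrieve(query: str, index: np.ndarray, embed_text, top_k=10):
    q = embed_text(query)                           # (d,), assumed L2-normalized
    scores = index @ q                              # cosine similarity per track
    return np.argsort(-scores)[:top_k]              # indices of the best matches
```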

Text Embedding Evaluation

The paper also compares MuLan's text embeddings against several baselines, showing that fine-tuning with the cross-modal objective significantly improves text understanding in the music domain. A triplet classification task confirms that the MuLan text encoder outperforms generic sentence embedding models, substantively bridging the gap between music concepts and textual descriptions.
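
The triplet evaluation can be pictured as a simple closeness test: an anchor description should embed closer to its matching text than to a mismatched one. The sketch below assumes a hypothetical `embed_text` helper that returns L2-normalized vectors and uses cosine similarity as the comparison.

```python
# Sketch of a triplet classification check for text embeddings: a triplet
# (anchor, positive, negative) counts as correct when the anchor embeds
# closer to the positive text than to the negative one.
import numpy as np

def triplet_accuracy(triplets, embed_text):
    correct = 0
    for anchor, positive, negative in triplets:
        a, p, n = embed_text(anchor), embed_text(positive), embed_text(negative)
        # Embeddings are assumed L2-normalized, so dot product = cosine similarity.
        correct += int(a @ p > a @ n)
    return correct / len(triplets)
```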

Implications and Future Directions

MuLan's successful integration of music audio and natural language presents implications for advancing AI-driven music retrieval systems. Its ability to handle free-form text queries without relying on strict ontologies aligns with current trends toward more flexible, user-friendly AI applications. The model supports diverse music genres and text annotations, providing a robust tool for music recommendations and discovery.

The paper points to enhanced text filtering as a direction for improvement: better separating relevant annotations from noise could sharpen the handling of nuanced language constructs and improve detection of rare and subtle concepts.

Conclusion

The research on MuLan embodies a critical stride toward genuinely interactive and intelligent music information systems through its innovative joint embedding framework. Future work might expand on multimodal parameter sharing, extensive domain adaptation, or improvements to noise filtration, each promising further refinement of MuLan's capabilities. Such efforts will undoubtedly continue to reshape our interaction with music data in increasingly sophisticated ways.
