
MuLan: Joint Audio-Text Embedding Model

Updated 26 November 2025
  • The paper presents a novel dual-tower architecture that aligns music audio and text using a cosine similarity-based contrastive loss.
  • MuLan is designed for zero-shot tagging, cross-modal music retrieval, and music-domain language understanding by leveraging massive weakly-paired datasets.
  • Its efficient transfer learning capabilities and superior benchmark performance set a new state-of-the-art in flexible, open-vocabulary music information retrieval.

MuLan is a large-scale joint audio-text embedding model specifically designed for music information retrieval and tagging, developed to address the limitations of rigid ontology-based approaches in music content analysis. Employing a two-tower architecture, MuLan creates a unified 128-dimensional representation space in which both music audio and natural language descriptions are co-embedded, facilitating zero-shot tagging, cross-modal retrieval, and specialized music-domain language understanding—all without any fine-tuning at inference time. The model leverages massive weakly-paired datasets (370,000 hours across 44 million audio recordings with hundreds of millions of text annotations) and a contrastive cross-modal loss, thus subsuming existing ontologies and enabling unconstrained music-to-text and text-to-music tasks (Huang et al., 2022).

1. Architectural Overview

MuLan adopts a dual-tower structure for joint embedding: one tower encodes audio, the other encodes text. Each uses its own deep neural network backbone, and both converge to a shared, $\ell_2$-normalized latent space of fixed dimensionality ($d=128$). All distance and similarity operations use this shared space, with cross-modal alignment measured by cosine similarity (a minimal sketch of the shared projection heads follows the tower descriptions below).

Audio Tower:

  • Preprocessing: Each input audio clip is converted to a log-mel spectrogram ($F=64$ or $128$ mel bands; 25 ms window, 10 ms hop). During training, a random 10 s (1,000-frame) window is selected from each 30 s audio segment, with additional data augmentation via SpecAugment (time/frequency masking); see the preprocessing sketch after this list.
  • Backbones:
    • M-ResNet-50: A ResNet-50 variant with reduced stride and pooled outputs, linearly projected to 128 dimensions and $\ell_2$-normalized. Initialized with weights pre-trained on AudioSet (527 classes).
    • M-AST: An Audio Spectrogram Transformer in the ViT-Base configuration (12 Transformer blocks of width 768, 12 attention heads, 16×16 patches with stride 10, [CLS] token). The [CLS] output is projected to 128 dimensions and $\ell_2$-normalized, and the backbone is warm-started from AST.
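
As a concrete illustration of the audio preprocessing above, the following is a minimal sketch assuming a 16 kHz sample rate and librosa for feature extraction; the sample rate and helper names are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch: log-mel extraction and random 10 s cropping, roughly matching
# the 25 ms window / 10 ms hop / 64-band setup described above. The 16 kHz
# sample rate is an assumption for illustration only.
import numpy as np
import librosa

SR = 16_000                # assumed sample rate
N_MELS = 64                # F = 64 mel bands (128 for M-AST)
WIN = int(0.025 * SR)      # 25 ms analysis window
HOP = int(0.010 * SR)      # 10 ms hop -> 100 frames per second
CROP_FRAMES = 1000         # 10 s training window out of a 30 s segment

def log_mel(waveform: np.ndarray) -> np.ndarray:
    """Waveform -> log-mel spectrogram of shape (N_MELS, T)."""
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=SR, n_fft=WIN, win_length=WIN,
        hop_length=HOP, n_mels=N_MELS)
    return librosa.power_to_db(mel)

def random_crop(spec: np.ndarray, frames: int = CROP_FRAMES) -> np.ndarray:
    """Select a random fixed-length window along the time axis."""
    if spec.shape[1] <= frames:
        return spec
    start = np.random.randint(0, spec.shape[1] - frames)
    return spec[:, start:start + frames]
```

SpecAugment time/frequency masking would then be applied to the cropped spectrogram.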

Text Tower:

  • Preprocessing: Tokenization via BERT's WordPiece vocabulary, with a maximum sequence length of $n=512$ tokens.
  • Backbone: BERT-base (12 Transformer blocks of width 768, 12 attention heads) initialized from BERT-base-uncased; the final [CLS] embedding is extracted, projected to 128 dimensions, and $\ell_2$-normalized.
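
The shared embedding space can be sketched as follows in PyTorch: each tower wraps its backbone with a linear projection to 128 dimensions followed by $\ell_2$ normalization. The backbone modules and their output widths (2048 for the ResNet pooled features, 768 for the BERT [CLS] vector) are stand-ins for illustration, not the paper's exact implementation.

```python
# Hedged sketch of the two-tower projection heads onto the shared 128-d space.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128

class EmbeddingHead(nn.Module):
    """Backbone features -> linear projection -> l2-normalized embedding."""
    def __init__(self, backbone: nn.Module, backbone_dim: int):
        super().__init__()
        self.backbone = backbone
        self.proj = nn.Linear(backbone_dim, EMBED_DIM)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)     # (B, backbone_dim) pooled features
        z = self.proj(features)         # (B, 128)
        return F.normalize(z, dim=-1)   # project onto the unit sphere

# Illustrative wiring (the backbone objects are assumed, not defined here):
# audio_tower = EmbeddingHead(m_resnet50_backbone, backbone_dim=2048)
# text_tower  = EmbeddingHead(bert_cls_backbone, backbone_dim=768)
```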

2. Joint Embedding Formulation and Contrastive Alignment

Each audio and text sample is mapped to $\mathbb{R}^{128}$ and projected onto the unit sphere:

  • Audio encoder: $f_a:\mathbb{R}^{F\times T}\rightarrow\mathbb{R}^d$, with $z_a = f_a(x_a)/\|f_a(x_a)\|_2$
  • Text encoder: $f_t:\mathcal{A}^n\rightarrow\mathbb{R}^d$, with $z_t = f_t(x_t)/\|f_t(x_t)\|_2$

The similarity critic is $h(z_a, z_t) = \exp\left(\frac{z_a^\top z_t}{\tau}\right)$, where $\tau>0$ is a learnable temperature parameter. Because both embeddings are $\ell_2$-normalized, $z_a^\top z_t$ is exactly their cosine similarity.

Contrastive Loss: For a minibatch $\mathcal{B}$ of $B$ paired audio/text examples $\{(x_a^{(i)}, x_t^{(i)})\}_{i=1}^B$, the InfoNCE-style loss is

$$L(\mathcal{B}) = -\sum_{i=1}^B \log\frac{h(z_a^{(i)}, z_t^{(i)})}{\sum_{j=1}^B \left[h(z_a^{(i)}, z_t^{(j)}) + h(z_a^{(j)}, z_t^{(i)})\right]}$$

Positives are matched audio/text pairs from the same track; all other pairings within the batch serve as negatives.
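
The loss above can be written directly against a batch of normalized embeddings; the sketch below is a straightforward transcription (variable names are illustrative, and `tau` would typically be a learnable parameter).

```python
# Hedged sketch of the batch-wise contrastive loss defined above.
import torch

def mulan_contrastive_loss(z_a: torch.Tensor,
                           z_t: torch.Tensor,
                           tau: torch.Tensor) -> torch.Tensor:
    """z_a, z_t: (B, 128) l2-normalized audio/text embeddings; tau: temperature."""
    sims = torch.exp(z_a @ z_t.T / tau)      # h(z_a^(i), z_t^(j)) for all pairs
    positives = sims.diag()                  # matched audio/text pairs
    # denominator_i = sum_j [ h(z_a^(i), z_t^(j)) + h(z_a^(j), z_t^(i)) ]
    denom = sims.sum(dim=1) + sims.sum(dim=0)
    return -(positives / denom).log().sum()
```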

3. Data Scale, Sources, and Annotation Balancing

Scale:

  • 44 million music videos yield more than 370,000 hours of audio (segments are filtered to ensure $\geq 50\%$ music presence).

Text Annotations:

  • Short-form (SF): titles and tags; 31B tokens, ≈43 annotations per video.
  • Long-form (LF): descriptions and comments; 30.7B tokens, ≈71 annotations per video.
  • Playlist titles (PL): 171M playlists linked to videos; 2.5B tokens, ≈24 annotations per video.
  • AudioSet labels (ASET): ≈2M clips covering 527 classes, 1.8 labels per clip.

A BERT-based binary filter is used to denoise the SF and LF annotations, and batch composition is balanced (SF:LF:PL:ASET = 2:2:1:1) to trade annotation scale off against quality. Data augmentation includes SpecAugment and random cropping.
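
One plausible way to realize the 2:2:1:1 batch composition is source-weighted sampling, sketched below; the dataset objects and their `sample_pair` method are hypothetical stand-ins, not the paper's pipeline.

```python
# Hedged sketch: draw (audio, text) training pairs so that annotation sources
# appear in the fixed SF:LF:PL:ASET = 2:2:1:1 ratio, in expectation.
import random

SOURCE_RATIO = {"SF": 2, "LF": 2, "PL": 1, "ASET": 1}

def sample_source() -> str:
    """Pick an annotation source with probability proportional to its ratio."""
    sources = list(SOURCE_RATIO)
    weights = [SOURCE_RATIO[s] for s in sources]
    return random.choices(sources, weights=weights, k=1)[0]

def build_batch(datasets: dict, batch_size: int) -> list:
    """datasets maps source name -> dataset exposing a sample_pair() method."""
    return [datasets[sample_source()].sample_pair() for _ in range(batch_size)]
```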

4. Evaluation Benchmarks and Experimental Findings

MuLan's effectiveness is established using several downstream tasks and evaluation metrics:

Zero-Shot Tagging:

Given $K$ candidate labels $\{t_1, \dots, t_K\}$, each label is embedded as $z_{t_k}$, and an audio sample $x_a$ is scored against label $k$ as $s_k = \cos(f_a(x_a), z_{t_k})$. Evaluation uses MagnaTagATune (MTAT, both the Top-50 and the full 188-tag set), AudioSet Gen-25 (genres), and AudioSet Mu-141; a scoring sketch follows the results table below.

Model        Gen-25  Mu-141  MTAT Top-50  MTAT All-188
M-ResNet-50  0.840   0.899   0.782        0.772
M-AST        0.840   0.909   0.778        0.776
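
A minimal sketch of the zero-shot scoring procedure follows; the tower objects and their `embed` methods are illustrative assumptions standing in for the trained audio and text encoders.

```python
# Hedged sketch: score one audio clip against K free-text candidate labels.
import torch

def zero_shot_tag_scores(audio_clip, labels, audio_tower, text_tower):
    z_a = audio_tower.embed(audio_clip)                             # (128,), unit norm
    z_labels = torch.stack([text_tower.embed(t) for t in labels])   # (K, 128)
    # With l2-normalized embeddings, the dot product equals cosine similarity,
    # so this returns s_k = cos(f_a(x_a), z_{t_k}) for each candidate label.
    return z_labels @ z_a                                           # (K,) scores
```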

Adding the SF, LF, and PL annotations improves MTAT tagging and retrieval performance; training on ASET alone yields the best AudioSet metrics but poor MTAT results.

Linear Probes (Transfer Learning):

With frozen audio embeddings $z_a$, per-class linear (logistic regression) classifiers achieve Gen-25: 0.910, Mu-141: 0.940, MTAT Top-50: 0.927, and All-188: 0.954, surpassing the previous state of the art.
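
A sketch of this linear-probe protocol, assuming scikit-learn and per-class AUC-ROC as the reported metric (an assumption for illustration), is shown below; `X_*` are frozen 128-d embeddings and `Y_*` are binary tag matrices.

```python
# Hedged sketch: per-class logistic-regression probes on frozen embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def linear_probe_auc(X_train, Y_train, X_test, Y_test) -> float:
    """X: (N, 128) frozen audio embeddings; Y: (N, C) binary tag matrix."""
    aucs = []
    for c in range(Y_train.shape[1]):
        clf = LogisticRegression(max_iter=1000).fit(X_train, Y_train[:, c])
        scores = clf.predict_proba(X_test)[:, 1]
        aucs.append(roc_auc_score(Y_test[:, c], scores))
    return float(np.mean(aucs))       # macro-averaged AUC over tags
```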

Cross-modal Retrieval:

Retrieval is evaluated on 100K expert-curated tracks drawn from 7,000 playlists, using playlist titles and descriptions as text queries; a ranking sketch appears at the end of this subsection.

Query Type   AUC    mAP
Title        0.931  0.104
Description  0.901  0.084

Performance improves significantly when incorporating more diverse text sources (SF+LF+PL) compared to ASET only.
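
The retrieval setup can be sketched as ranking track embeddings by cosine similarity to a text-query embedding and scoring the ranking with AUC and average precision; the relevance mask and embedding arrays below are illustrative placeholders.

```python
# Hedged sketch: text-to-music retrieval over a corpus of track embeddings.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def retrieve_and_score(z_query, Z_tracks, relevant):
    """z_query: (128,); Z_tracks: (N, 128), both unit-norm; relevant: (N,) bool."""
    scores = Z_tracks @ z_query            # cosine similarity to the query
    ranking = np.argsort(-scores)          # best-matching tracks first
    auc = roc_auc_score(relevant, scores)
    ap = average_precision_score(relevant, scores)
    return ranking, auc, ap
```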

Music-domain Language Understanding:

Triplet classification accuracy (on playlist and AudioSet-ontology triplets) demonstrates that MuLan's text tower, despite having no explicit text-only loss, specializes in music semantics and outperforms generic text models on music-specific classification tasks (a sketch of the triplet test follows the table below).

Model        Playlist  AudioSet Onto.
M-ResNet-50  0.945     0.951
M-AST        0.959     0.962
SimCSE       0.950     0.938
SBERT        0.942     0.889
BERT-avg     0.850     0.847
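
One plausible form of the triplet test is sketched below: a triplet counts as correct when the anchor text embedding is closer, by cosine similarity, to the positive than to the negative. The exact triplet construction follows the paper's protocol, not this sketch.

```python
# Hedged sketch: triplet classification accuracy over text embeddings.
import torch

def triplet_accuracy(anchors, positives, negatives) -> float:
    """All inputs: (N, 128) l2-normalized text embeddings."""
    sim_pos = (anchors * positives).sum(dim=-1)   # cos(anchor, positive)
    sim_neg = (anchors * negatives).sum(dim=-1)   # cos(anchor, negative)
    return (sim_pos > sim_neg).float().mean().item()
```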

5. Zero-Shot and Transfer Capabilities

MuLan supports arbitrary, unconstrained zero-shot tagging of music concepts across genres, moods, and instruments. The cross-modal space enables retrieval of audio from text descriptions and vice versa, as well as transfer learning for downstream Music Information Retrieval (MIR) tasks directly via frozen audio embeddings.

The model recognizes and retrieves musical concepts unseen at training time and generalizes well across genre, tag, and descriptive queries, as validated by AUC-ROC and mean average precision on the benchmarks above.

6. Applications, Limitations, and Open Questions

Representative Use Cases:

  • Zero-shot tagging for new genres, moods, and instrumentation.
  • Text-driven music retrieval, including complex descriptive queries (e.g., “instrumental action movie soundtrack”).
  • Transfer learning via frozen embeddings for downstream tasks in MIR.
  • Music-domain language understanding tasks, such as triplet similarity assessment and automatic captioning.

Known Limitations:

  • BERT’s weak handling of linguistic negation propagates to MuLan; tags such as “no vocals” or “not rock” are poorly modeled.
  • The binary text filtering mechanism for denoising (especially on SF & LF annotations) remains basic; more sophisticated denoising could lead to improved modeling of rare or subtle musical concepts.
  • There is a trade-off between batch-size and negative sample hardness, affecting the richness of the contrastive learning signal.
  • The current architecture is not designed for hierarchical or compositional queries, which are open areas for extension.

7. Context and Significance

MuLan represents a shift from rigid, ontology-bound systems toward fully joint, large-scale audio-text embedding models in the music domain, accommodating the diverse and unconstrained nature of music descriptions in practice. By leveraging a massive, weakly-paired dataset and robust cross-modal contrastive objectives, MuLan subsumes previous systems that relied on fixed label sets and static attributes. Its embedding space, trained without any inference-time fine-tuning, establishes a new standard for zero-shot MIR, robust music tagging, and cross-modal retrieval, as demonstrated through extensive benchmarking and ablation studies (Huang et al., 2022). This approach opens the pathway for future research into more flexible, open-vocabulary music description, retrieval, and language-driven music understanding.

References

  • Huang, Q., Jansen, A., Lee, J., Ganti, R., Li, J. Y., & Ellis, D. P. W. (2022). MuLan: A Joint Embedding of Music Audio and Natural Language. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).
