- The paper eliminates negative samples from multimodal language-audio pretraining, addressing the high memory usage and modality misalignment of contrastive approaches to music tasks.
- It adapts a BYOL-inspired framework with dual encoders and balanced cosine similarity losses, achieving strong performance in retrieval, classification, and tagging.
- Empirical results show improved scalability, zero-shot capabilities, and enhanced transferability of learned representations for music understanding.
SLAP: Negative-Free Siamese Language-Audio Pretraining for Music Understanding
The paper introduces SLAP (Siamese Language-Audio Pretraining), a framework for learning joint text-audio representations for music tasks without relying on negative samples, addressing two major issues in multimodal contrastive learning: large memory requirements (and limited scalability) and the modality gap. SLAP adapts the Bootstrap Your Own Latent (BYOL) paradigm, originally designed for unimodal self-supervised learning, to the multimodal domain of language and audio, and demonstrates strong empirical results on several retrieval, classification, and tagging benchmarks.
Background and Motivation
Multimodal Contrastive Learning (MCL) frameworks, such as CLIP and CLAP, have been widely used for music-text alignment by maximizing the similarity of paired audio and textual data while using a contrastive loss to push unrelated (negative) samples apart. While contrastive objectives have enabled state-of-the-art results, they exhibit two key limitations:
- High Memory Requirements: Effective contrastive training for joint embedding spaces needs large batch sizes to supply diverse negatives, and the pairwise similarity matrix grows quadratically with batch size. On modern accelerators, this leads to prohibitive memory footprints, capping usable batch sizes and slowing training.
- Modality Gap: Contrastive paradigms often lead to embeddings from different modalities residing on disjoint manifolds, contrary to the goal of semantically meaningful joint spaces. This can reduce the effectiveness of the learned representations, especially for generative and transferable downstream tasks.
The SLAP framework is posited as a scalable and robust alternative—removing negative sampling and reducing the modality gap—without sacrificing downstream task performance.
Methodology
SLAP is structurally inspired by BYOL, using a siamese-like architecture comprising context encoders, target encoders (updated via Exponential Moving Average), and prediction heads for both modalities. The key distinctions are:
- Dual Modality, Dual Direction: Each audio-text input pair is processed through modality-specific encoders. For both audio and text, context encoders produce representations passed to predictors, and target encoders (EMA copies) provide target projections.
- Loss Design: Training optimizes a combination of intermodal (across modalities, within each audio-text pair) and intramodal (within a single modality) cosine-similarity losses:
- Intermodal: Audio-predicted embedding vs. target text embedding, and vice versa.
- Intramodal: Audio-predicted embedding vs. audio target, and text-predicted embedding vs. text target.
Balanced loss weighting (with λ) is crucial; extensive ablations demonstrate that omitting either component leads to collapse or an increased modality gap (the combined objective is written out after this list).
- Architectural Choices: SLAP employs the HTS-AT audio transformer encoder and the RoBERTa text encoder, matching LAION-CLAP for controlled comparison. Predictors are 1-layer, ReLU MLPs with batch normalization.
- Gradient Accumulation and Scalability: The method is amenable to gradient accumulation, enabling arbitrarily large effective batch sizes without the quadratic memory growth of contrastive losses. This facilitates large-scale pretraining on single GPUs or modest clusters (a minimal accumulation sketch follows this list).
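Concretely, writing q_a/q_t for the audio/text predictions and z̄_a/z̄_t for the corresponding target (EMA) projections, the combined objective can be sketched as below; this mirrors the simplified pseudocode later in this summary, and the paper's exact notation and normalization may differ:

$$
\mathcal{L} = \lambda\,\big[d_{\cos}(q_a,\bar{z}_t) + d_{\cos}(q_t,\bar{z}_a)\big] + (1-\lambda)\,\big[d_{\cos}(q_a,\bar{z}_a) + d_{\cos}(q_t,\bar{z}_t)\big],
\qquad d_{\cos}(x,y) = 1 - \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert}.
$$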
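To make the memory argument concrete, here is a minimal, hypothetical PyTorch sketch (not the released implementation): it builds a BYOL-style predictor head of the kind described above and accumulates gradients over micro-batches. The dimensions, optimizer, accumulation schedule, and random placeholder tensors are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

# BYOL-style predictor head as described above: a single hidden layer with batch
# normalization and ReLU; the dimensions here are illustrative assumptions.
emb_dim, hidden_dim = 512, 2048
audio_predictor = nn.Sequential(
    nn.Linear(emb_dim, hidden_dim), nn.BatchNorm1d(hidden_dim),
    nn.ReLU(), nn.Linear(hidden_dim, emb_dim),
)

# Gradient accumulation: the SLAP objective is a sum of per-pair regression terms
# (no in-batch negatives), so micro-batch gradients add up to the gradient of an
# arbitrarily large effective batch at constant memory cost.
optimizer = torch.optim.AdamW(audio_predictor.parameters(), lr=1e-4)
accum_steps = 8                            # effective batch = 8 * micro-batch size
optimizer.zero_grad()
for step in range(32):                     # stands in for iterating a real dataloader
    q_audio = audio_predictor(torch.randn(16, emb_dim))            # placeholder prediction
    z_text_target = F.normalize(torch.randn(16, emb_dim), dim=-1)  # placeholder EMA target
    # Only one intermodal term is shown; the full objective combines four terms.
    loss = (1 - F.cosine_similarity(q_audio, z_text_target, dim=-1)).mean()
    (loss / accum_steps).backward()        # scale so accumulated gradients form an average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()              # the EMA target update would also go here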
Experimental Results
SLAP is evaluated on a suite of music understanding benchmarks, both for retrieval and classification. The primary findings are:
Cross-modal Retrieval (Text-Music & Music-Text)
- On MusicCaps and Song Describer, SLAP consistently outperforms CLAP and other published baselines (e.g., MusCALL, LAION-CLAP) in Recall@K and mean/median normalized rank, especially when initialized from pre-trained checkpoints (a Recall@K computation sketch follows this list).
- For example, on Song Describer with pretrained encoders, SLAP achieves R@1/R@5/R@10 of 5.7%/18.1%/26.6% on A→T retrieval, surpassing the corresponding CLAP values of 5.3%/14.9%/22.2%.
- In ablations, either the predicted embeddings or the projection embeddings can be used for retrieval, with a slight edge for the predictions.
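For reference, a minimal sketch of how Recall@K for A→T retrieval can be computed from paired embedding matrices; the `audio_emb`/`text_emb` names and random placeholders are illustrative, and this generic metric code is not the paper's evaluation script.

import torch
import torch.nn.functional as F

def recall_at_k(audio_emb, text_emb, ks=(1, 5, 10)):
    # Audio-to-text retrieval: row i is an audio query whose correct caption has index i.
    sims = F.normalize(audio_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    ranking = sims.argsort(dim=-1, descending=True)                # captions sorted by similarity
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    rank_of_target = (ranking == targets).float().argmax(dim=-1)   # position of the true caption
    return {k: (rank_of_target < k).float().mean().item() for k in ks}

# Placeholder embeddings; in practice these come from the trained audio/text encoders.
print(recall_at_k(torch.randn(100, 512), torch.randn(100, 512)))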
Downstream Probing (Classification and Tagging)
- When classifiers are trained on frozen representations, SLAP offers improved accuracy, AUROC, and mean Average Precision on GTZAN, MagnaTagATune, and OpenMic relative to CLAP and self-supervised audio-only models (e.g., MATPAC); a minimal probing sketch follows this list.
- Notably, SLAP achieves 82.9% accuracy on GTZAN (genre classification), approaching the best supervised benchmarks, and 45.8% mAP on MTAT (tagging).
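The probing protocol can be sketched as follows, assuming embeddings have already been extracted with the frozen SLAP audio encoder; the random arrays, label counts, and the logistic-regression probe are placeholders rather than the paper's exact setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Placeholders: in practice, extract embeddings from the frozen SLAP audio encoder
# and pair them with, e.g., GTZAN genre labels.
X_train, y_train = rng.normal(size=(800, 512)), rng.integers(0, 10, size=800)
X_test, y_test = rng.normal(size=(200, 512)), rng.integers(0, 10, size=200)

# Train a linear probe on the frozen embeddings; the encoder itself is never updated.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))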
Zero-shot Performance
- On zero-shot classification (prompt-based, with no task-specific supervision) for genre, tagging, and instrument presence, SLAP outperforms CLAP and prior contrastive models. It yields 58.3% accuracy versus CLAP's 51.7% on GTZAN, and higher mAP/mAR across tasks (a prompt-based sketch follows).
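A minimal sketch of the prompt-based zero-shot protocol is given below; the `text_tower`/`audio_tower` stand-ins and the prompt template are assumptions and would be replaced by the trained SLAP encoders and the paper's actual prompts.

import torch
import torch.nn.functional as F

d = 512
def text_tower(prompts):            # stand-in for SLAP's RoBERTa text encoder + projection
    return torch.randn(len(prompts), d)
def audio_tower(waveform):          # stand-in for SLAP's HTS-AT audio encoder + projection
    return torch.randn(1, d)

genres = ["blues", "classical", "jazz", "rock"]
prompts = [f"a recording of {g} music" for g in genres]   # assumed prompt template
with torch.no_grad():
    text_emb = F.normalize(text_tower(prompts), dim=-1)                  # (num_genres, d)
    audio_emb = F.normalize(audio_tower(torch.randn(1, 48000)), dim=-1)  # placeholder clip
# The predicted genre is the prompt whose embedding is most similar to the audio embedding.
print(genres[(audio_emb @ text_emb.T).argmax(dim=-1).item()])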
Modality Gap Analysis
- UMAP visualizations and centroid-distance/linear-separability metrics indicate a substantially smaller modality gap for SLAP than for CLAP. On multiple datasets, linear discriminability of modality drops to near chance for SLAP, indicating better mixing of the two modalities in the learned joint space (a sketch of these metrics follows).
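The two gap proxies can be approximated with the following sketch; the centroid-distance and modality-classification definitions are generic reconstructions (with random placeholder embeddings), not the paper's exact formulation.

import numpy as np
from sklearn.linear_model import LogisticRegression

def modality_gap_metrics(audio_emb, text_emb):
    # Centroid distance: Euclidean distance between the mean embeddings of each modality.
    centroid_gap = np.linalg.norm(audio_emb.mean(axis=0) - text_emb.mean(axis=0))
    # Linear separability: accuracy of a linear classifier that predicts the modality of
    # each embedding; accuracy near 0.5 (chance) means the modalities are well mixed.
    X = np.concatenate([audio_emb, text_emb])
    y = np.concatenate([np.zeros(len(audio_emb)), np.ones(len(text_emb))])
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)  # train accuracy, for brevity
    return centroid_gap, acc

# Placeholders; in practice use paired audio/text embeddings from the evaluation set.
print(modality_gap_metrics(np.random.randn(200, 512), np.random.randn(200, 512)))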
Scalability
- Batch size scaling studies demonstrate that SLAP exhibits stable and even slightly improved retrieval metrics as batch size increases, while CLAP's performance saturates or degrades, making negative-free learning highly practical for large-scale training.
Practical Implications and Implementation Considerations
From a systems perspective, SLAP enables practical training and deployment of music joint-embedding models for retrieval, classification, and zero-shot tagging, and as a backbone for music generation or understanding pipelines. Specifically:
- Memory Efficiency & Scalability: The lack of quadratic memory scaling allows SLAP to scale to large batch sizes and datasets, enabling use on standard hardware and straightforward deployment.
- Robustness to Batch Size: Hyperparameter tuning is simplified as retrieval performance is less sensitive to batch size variation.
- Transferability: The minimized modality gap makes SLAP embeddings more suitable for downstream generative tasks and multi-modal transfer.
- Accessibility: The method is released open-source, facilitating reproducibility and further development for researchers and practitioners.
Pseudocode for SLAP Training (Simplified)
import torch
import torch.nn.functional as F

def cosine_distance(x, y):
    # 1 - cosine similarity, averaged over the batch
    return (1 - F.cosine_similarity(x, y, dim=-1)).mean()

def update_ema_parameters(online_encoder, ema_encoder, decay=0.999):
    # Target parameters track the context (online) parameters via an exponential
    # moving average; the decay value here is illustrative.
    with torch.no_grad():
        for p_online, p_ema in zip(online_encoder.parameters(), ema_encoder.parameters()):
            p_ema.mul_(decay).add_(p_online, alpha=1 - decay)

for batch in dataset:
    # Retrieve paired audio/text
    audio, text = batch

    # Forward pass through context encoders and predictors
    z_audio = context_audio_encoder(audio)
    z_text = context_text_encoder(text)
    q_audio = audio_predictor(z_audio)
    q_text = text_predictor(z_text)

    # Forward pass through target (EMA) encoders, without tracking gradients
    with torch.no_grad():
        z_audio_target = ema_audio_encoder(audio)
        z_text_target = ema_text_encoder(text)

    # Intermodal loss: each prediction regresses the other modality's target projection
    intermodal_loss = cosine_distance(q_audio, z_text_target) + cosine_distance(q_text, z_audio_target)
    # Intramodal loss: each prediction also regresses its own modality's target projection
    intramodal_loss = cosine_distance(q_audio, z_audio_target) + cosine_distance(q_text, z_text_target)
    total_loss = lambda_ * intermodal_loss + (1 - lambda_) * intramodal_loss

    # Backpropagation and optimizer step
    total_loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Update target encoders via EMA of the context encoders
    update_ema_parameters(context_audio_encoder, ema_audio_encoder)
    update_ema_parameters(context_text_encoder, ema_text_encoder)
Theoretical and Practical Implications
The empirical results suggest that BYOL-like negative-free paradigms are viable beyond unimodal domains. The removal of negatives results in more semantically meaningful joint spaces, improved downstream sample efficiency, and streamlined training.
Theoretically, the paper raises questions about the necessity and value of hard negatives in multi-modal representation learning and challenges assumptions about the role of contrastive objectives for modality alignment.
Practically, many music AI tasks involve retrieval, tagging, cross-modal generation, and conditional audio synthesis; SLAP's modality-aligned and semantically dense embedding space provides a stronger foundation for such pipelines. The framework is extensible to domains beyond music, including general audio, video, and other multi-modal retrieval/generation settings.
Future Directions
The paper highlights various open avenues:
- Architectural Exploration: The impact of alternative backbone architectures, more sophisticated predictors, or modality-specific design remains underexplored.
- Domain Extension: Generalizing SLAP to image-text, video-audio, or even higher-order modality spaces could provide insights into universally applicable joint representation learning.
- Integration with Generative Models: Embedding SLAP-trained spaces into text-to-music or music-to-text generation architectures (e.g., diffusion or transformer decoders) could leverage the reduced modality gap for improved control signals.
- Scaling Laws: Studying scaling properties w.r.t. data size, model size, and batch size may illuminate broader trends for negative-free multimodal learning regimes.
Summary
SLAP stands out as an effective and scalable strategy for multimodal language-audio pretraining in music understanding, achieving superior retrieval and tagging performance without negatives and addressing the modality gap inherent to prior contrastive approaches. The demonstrated memory efficiency and flexibility portend broad impact for multimodal foundation model development in music and beyond.