- The paper eliminates negative samples from multimodal language-audio pretraining, addressing the high memory usage and modality misalignment of contrastive approaches to music tasks.
- It adapts a BYOL-inspired framework with dual encoders and balanced cosine similarity losses, achieving strong performance in retrieval, classification, and tagging.
- Empirical results show improved scalability, zero-shot capabilities, and enhanced transferability of learned representations for music understanding.
SLAP: Negative-Free Siamese Language-Audio Pretraining for Music Understanding
The paper introduces SLAP (Siamese Language-Audio Pretraining), a framework for learning joint text-audio representations for music tasks without relying on negative samples, addressing two major issues in multimodal contrastive learning: large memory requirements (and limited scalability) and the modality gap. SLAP adapts the Bootstrap Your Own Latent (BYOL) paradigm, originally designed for unimodal self-supervised learning, to the multimodal domain of language and audio, and demonstrates strong empirical results on several retrieval, classification, and tagging benchmarks.
Background and Motivation
Multimodal Contrastive Learning (MCL) frameworks, such as CLIP and CLAP, have been widely used for music-text alignment by maximizing the similarity of paired audio and textual data while using a contrastive loss to push unrelated (negative) samples apart. While contrastive objectives have enabled state-of-the-art results, they exhibit two key limitations:
- High Memory Requirements: Effective contrastive training for joint embedding spaces needs large batch sizes to supply diverse negatives, and the pairwise similarity matrix grows quadratically with batch size. On modern accelerators, this leads to prohibitive memory footprints, capping usable batch sizes and slowing training.
- Modality Gap: Contrastive paradigms often lead to embeddings from different modalities residing on disjoint manifolds, contrary to the goal of semantically meaningful joint spaces. This can reduce the effectiveness of the learned representations, especially for generative and transferable downstream tasks.
The SLAP framework is posited as a scalable and robust alternative—removing negative sampling and reducing the modality gap—without sacrificing downstream task performance.
Methodology
SLAP is structurally inspired by BYOL, using a siamese-like architecture comprising context encoders, target encoders (updated via Exponential Moving Average), and prediction heads for both modalities. The key distinctions are:
- Dual Modality, Dual Direction: Each audio-text input pair is processed through modality-specific encoders. For both audio and text, context encoders produce representations passed to predictors, and target encoders (EMA copies) provide target projections.
- Loss Design: Training optimizes a combination of intermodal (across modalities, within each audio-text pair) and intramodal (within a single modality) cosine-similarity losses:
- Intermodal: Audio-predicted embedding vs. target text embedding, and vice versa.
- Intramodal: Audio-predicted embedding vs. audio target, and text-predicted embedding vs. text target.
Balanced loss weighting (with λ) is crucial; extensive ablations demonstrate that omitting either component leads to collapse or an increased modality gap (the combined objective is written out after this list).
- Architectural Choices: SLAP employs the HTS-AT audio transformer encoder and the RoBERTa text encoder, matching LAION-CLAP for controlled comparison. Predictors are 1-layer, ReLU MLPs with batch normalization.
- Gradient Accumulation and Scalability: The method is amenable to gradient accumulation, enabling arbitrarily large effective batch sizes without the quadratic memory growth of contrastive losses. This facilitates large-scale pretraining on single GPUs or modest clusters (a minimal accumulation sketch follows this list).
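Concretely, writing q_a/q_t for the audio/text predictions and z̄_a/z̄_t for the corresponding target (EMA) projections, the combined objective can be sketched as below; this mirrors the simplified pseudocode later in this summary, and the paper's exact notation and normalization may differ:

$$
\mathcal{L} = \lambda\,\big[d_{\cos}(q_a,\bar{z}_t) + d_{\cos}(q_t,\bar{z}_a)\big] + (1-\lambda)\,\big[d_{\cos}(q_a,\bar{z}_a) + d_{\cos}(q_t,\bar{z}_t)\big],
\qquad d_{\cos}(x,y) = 1 - \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert}.
$$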
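To make the memory argument concrete, here is a minimal, hypothetical PyTorch sketch (not the released implementation): it builds a BYOL-style predictor head of the kind described above and accumulates gradients over micro-batches. The dimensions, optimizer, accumulation schedule, and random placeholder tensors are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

# BYOL-style predictor head as described above: a single hidden layer with batch
# normalization and ReLU; the dimensions here are illustrative assumptions.
emb_dim, hidden_dim = 512, 2048
audio_predictor = nn.Sequential(
    nn.Linear(emb_dim, hidden_dim), nn.BatchNorm1d(hidden_dim),
    nn.ReLU(), nn.Linear(hidden_dim, emb_dim),
)

# Gradient accumulation: the SLAP objective is a sum of per-pair regression terms
# (no in-batch negatives), so micro-batch gradients add up to the gradient of an
# arbitrarily large effective batch at constant memory cost.
optimizer = torch.optim.AdamW(audio_predictor.parameters(), lr=1e-4)
accum_steps = 8                            # effective batch = 8 * micro-batch size
optimizer.zero_grad()
for step in range(32):                     # stands in for iterating a real dataloader
    q_audio = audio_predictor(torch.randn(16, emb_dim))            # placeholder prediction
    z_text_target = F.normalize(torch.randn(16, emb_dim), dim=-1)  # placeholder EMA target
    # Only one intermodal term is shown; the full objective combines four terms.
    loss = (1 - F.cosine_similarity(q_audio, z_text_target, dim=-1)).mean()
    (loss / accum_steps).backward()        # scale so accumulated gradients form an average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()              # the EMA target update would also go here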
Experimental Results
SLAP is evaluated on a suite of music understanding benchmarks, both for retrieval and classification. The primary findings are:
Cross-modal Retrieval (Text-Music & Music-Text)
- On MusicCaps and Song Describer, SLAP consistently outperforms CLAP and other published baselines (e.g., MusCALL, LAION-CLAP) in Recall@K and mean/median normalized rank, especially when initialized from pre-trained checkpoints (a Recall@K computation sketch follows this list).
- For example, on Song Describer with pretrained encoders, SLAP achieves R@1/R@5/R@10 of 5.7%/18.1%/26.6% on A→T retrieval, surpassing the corresponding CLAP values of 5.3%/14.9%/22.2%.
- In ablations, either the predicted embeddings or the projection embeddings can be used for retrieval, with a slight edge for the predictions.
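For reference, a minimal sketch of how Recall@K for A→T retrieval can be computed from paired embedding matrices; the `audio_emb`/`text_emb` names and random placeholders are illustrative, and this generic metric code is not the paper's evaluation script.

import torch
import torch.nn.functional as F

def recall_at_k(audio_emb, text_emb, ks=(1, 5, 10)):
    # Audio-to-text retrieval: row i is an audio query whose correct caption has index i.
    sims = F.normalize(audio_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    ranking = sims.argsort(dim=-1, descending=True)                # captions sorted by similarity
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    rank_of_target = (ranking == targets).float().argmax(dim=-1)   # position of the true caption
    return {k: (rank_of_target < k).float().mean().item() for k in ks}

# Placeholder embeddings; in practice these come from the trained audio/text encoders.
print(recall_at_k(torch.randn(100, 512), torch.randn(100, 512)))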
Downstream Probing (Classification and Tagging)
- When classifiers are trained on frozen representations, SLAP offers improved accuracy, AUROC, and mean Average Precision on GTZAN, MagnaTagATune, and OpenMic relative to CLAP and self-supervised audio-only models (e.g., MATPAC); a minimal probing sketch follows this list.
- Notably, SLAP achieves 82.9% accuracy on GTZAN (genre classification), approaching the best supervised benchmarks, and 45.8% mAP on MTAT (tagging).
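The probing protocol can be sketched as follows, assuming embeddings have already been extracted with the frozen SLAP audio encoder; the random arrays, label counts, and the logistic-regression probe are placeholders rather than the paper's exact setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Placeholders: in practice, extract embeddings from the frozen SLAP audio encoder
# and pair them with, e.g., GTZAN genre labels.
X_train, y_train = rng.normal(size=(800, 512)), rng.integers(0, 10, size=800)
X_test, y_test = rng.normal(size=(200, 512)), rng.integers(0, 10, size=200)

# Train a linear probe on the frozen embeddings; the encoder itself is never updated.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))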
Zero-shot Performance
- On zero-shot classification (prompt-based, with no task-specific supervision) for genre, tagging, and instrument presence, SLAP outperforms CLAP and prior contrastive models. It yields 58.3% accuracy versus CLAP's 51.7% on GTZAN, and higher mAP/mAR across tasks (a prompt-based sketch follows).
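A minimal sketch of the prompt-based zero-shot protocol is given below; the `text_tower`/`audio_tower` stand-ins and the prompt template are assumptions and would be replaced by the trained SLAP encoders and the paper's actual prompts.

import torch
import torch.nn.functional as F

d = 512
def text_tower(prompts):            # stand-in for SLAP's RoBERTa text encoder + projection
    return torch.randn(len(prompts), d)
def audio_tower(waveform):          # stand-in for SLAP's HTS-AT audio encoder + projection
    return torch.randn(1, d)

genres = ["blues", "classical", "jazz", "rock"]
prompts = [f"a recording of {g} music" for g in genres]   # assumed prompt template
with torch.no_grad():
    text_emb = F.normalize(text_tower(prompts), dim=-1)                  # (num_genres, d)
    audio_emb = F.normalize(audio_tower(torch.randn(1, 48000)), dim=-1)  # placeholder clip
# The predicted genre is the prompt whose embedding is most similar to the audio embedding.
print(genres[(audio_emb @ text_emb.T).argmax(dim=-1).item()])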
Modality Gap Analysis
- UMAP visualizations and centroid-distance/linear-separability metrics indicate a substantially smaller modality gap for SLAP than for CLAP. On multiple datasets, linear discriminability of modality drops to near chance for SLAP, indicating better mixing of the two modalities in the learned joint space (a sketch of these metrics follows).
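The two gap proxies can be approximated with the following sketch; the centroid-distance and modality-classification definitions are generic reconstructions (with random placeholder embeddings), not the paper's exact formulation.

import numpy as np
from sklearn.linear_model import LogisticRegression

def modality_gap_metrics(audio_emb, text_emb):
    # Centroid distance: Euclidean distance between the mean embeddings of each modality.
    centroid_gap = np.linalg.norm(audio_emb.mean(axis=0) - text_emb.mean(axis=0))
    # Linear separability: accuracy of a linear classifier that predicts the modality of
    # each embedding; accuracy near 0.5 (chance) means the modalities are well mixed.
    X = np.concatenate([audio_emb, text_emb])
    y = np.concatenate([np.zeros(len(audio_emb)), np.ones(len(text_emb))])
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)  # train accuracy, for brevity
    return centroid_gap, acc

# Placeholders; in practice use paired audio/text embeddings from the evaluation set.
print(modality_gap_metrics(np.random.randn(200, 512), np.random.randn(200, 512)))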
Scalability
- Batch size scaling studies demonstrate that SLAP exhibits stable and even slightly improved retrieval metrics as batch size increases, while CLAP's performance saturates or degrades, making negative-free learning highly practical for large-scale training.
Practical Implications and Implementation Considerations
From a systems perspective, SLAP enables practical training and deployment of music joint-embedding models for retrieval, classification, and zero-shot tagging, and as a backbone for music generation or understanding pipelines. Specifically:
- Memory Efficiency & Scalability: The lack of quadratic memory scaling allows SLAP to scale to large batch sizes and datasets, enabling use on standard hardware and straightforward deployment.
- Robustness to Batch Size: Hyperparameter tuning is simplified as retrieval performance is less sensitive to batch size variation.
- Transferability: The minimized modality gap makes SLAP embeddings more suitable for downstream generative tasks and multi-modal transfer.
- Accessibility: The method is released open-source, facilitating reproducibility and further development for researchers and practitioners.
Pseudocode for SLAP Training (Simplified)
import torch
import torch.nn.functional as F

def cosine_distance(x, y):
    # 1 - cosine similarity, averaged over the batch
    return (1 - F.cosine_similarity(x, y, dim=-1)).mean()

def update_ema_parameters(online_encoder, ema_encoder, decay=0.999):
    # Target parameters track the context (online) parameters via an exponential
    # moving average; the decay value here is illustrative.
    with torch.no_grad():
        for p_online, p_ema in zip(online_encoder.parameters(), ema_encoder.parameters()):
            p_ema.mul_(decay).add_(p_online, alpha=1 - decay)

for batch in dataset:
    # Retrieve paired audio/text
    audio, text = batch

    # Forward pass through context encoders and predictors
    z_audio = context_audio_encoder(audio)
    z_text = context_text_encoder(text)
    q_audio = audio_predictor(z_audio)
    q_text = text_predictor(z_text)

    # Forward pass through target (EMA) encoders, without tracking gradients
    with torch.no_grad():
        z_audio_target = ema_audio_encoder(audio)
        z_text_target = ema_text_encoder(text)

    # Intermodal loss: each prediction regresses the other modality's target projection
    intermodal_loss = cosine_distance(q_audio, z_text_target) + cosine_distance(q_text, z_audio_target)
    # Intramodal loss: each prediction also regresses its own modality's target projection
    intramodal_loss = cosine_distance(q_audio, z_audio_target) + cosine_distance(q_text, z_text_target)
    total_loss = lambda_ * intermodal_loss + (1 - lambda_) * intramodal_loss

    # Backpropagation and optimizer step
    total_loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Update target encoders via EMA of the context encoders
    update_ema_parameters(context_audio_encoder, ema_audio_encoder)
    update_ema_parameters(context_text_encoder, ema_text_encoder)
Theoretical and Practical Implications
The empirical results suggest that BYOL-like negative-free paradigms are viable beyond unimodal domains. The removal of negatives results in more semantically meaningful joint spaces, improved downstream sample efficiency, and streamlined training.
Theoretically, the paper raises questions about the necessity and value of hard negatives in multi-modal representation learning and challenges assumptions about the role of contrastive objectives for modality alignment.
Practically, many music AI tasks involve retrieval, tagging, cross-modal generation, and conditional audio synthesis; SLAP's modality-aligned and semantically dense embedding space provides a stronger foundation for such pipelines. The framework is extensible to domains beyond music, including general audio, video, and other multi-modal retrieval/generation settings.
Future Directions
The paper highlights various open avenues:
- Architectural Exploration: The impact of alternative backbone architectures, more sophisticated predictors, or modality-specific design remains underexplored.
- Domain Extension: Generalizing SLAP to image-text, video-audio, or even higher-order modality spaces could provide insights into universally applicable joint representation learning.
- Integration with Generative Models: Embedding SLAP-trained spaces into text-to-music or music-to-text generation architectures (e.g., diffusion or transformer decoders) could leverage the reduced modality gap for improved control signals.
- Scaling Laws: Studying scaling properties w.r.t. data size, model size, and batch size may illuminate broader trends for negative-free multimodal learning regimes.
Summary
SLAP stands out as an effective and scalable strategy for multimodal language-audio pretraining in music understanding, achieving superior retrieval and tagging performance without negatives and addressing the modality gap inherent to prior contrastive approaches. The demonstrated memory efficiency and flexibility portend broad impact for multimodal foundation model development in music and beyond.