
SSLAM: Self-Supervised Audio Mixtures

Updated 10 November 2025
  • SSLAM is a self-supervised audio representation method that generates synthetic polyphony during pre-training to bridge the gap between monophonic benchmarks and real-world scenes.
  • It employs a masked latent bootstrapping architecture with a Vision Transformer backbone and an auxiliary CNN decoder to optimize patch-level and global features.
  • By using a Source Retention Loss, SSLAM effectively disentangles overlapping sources, achieving state-of-the-art performance on both monophonic and polyphonic tasks.

Self-Supervised Learning from Audio Mixtures (SSLAM) is a method for audio representation learning specifically designed to address the limitations of standard self-supervised learning (SSL) approaches in polyphonic, real-world audio environments. Whereas most prior SSL methods are developed and evaluated using predominantly monophonic data, SSLAM explicitly targets the generalization gap to natural polyphonic scenarios—such as crowded soundscapes or multi-instrument recordings—by mixing unlabeled audio during pre-training and imposing novel representation constraints. The framework consistently improves polyphonic robustness while preserving or exceeding state-of-the-art monophonic performance, redefining the standard protocol for general-purpose audio SSL.

1. Motivation and Problem Formulation

In practical audio settings, polyphony—not monophony—is the norm, with multiple distinct sources frequently active simultaneously. Standard SSL models commonly employed in large-scale systems, including multi-modal LLMs, are generally benchmarked on datasets like AudioSet, which contains a predominance of monophonic or only weakly polyphonic clips—AudioSet analysis shows only ≈42.5% of examples at hierarchy level 1 feature two distinct events. As a result, backbones pre-trained on such data demonstrate impaired generalization when deployed in scenarios with dense overlapping sources, such as urban soundscapes, music ensembles, or environments with concurrent speech and ambient noise.

SSLAM was introduced to explicitly address these issues: (a) through in-batch mixing to generate synthetic polyphony during pre-training, and (b) via a Source Retention Loss (SRL) that forces the student network to preserve the latent representations of each original source throughout training. This formulation targets the fundamental challenge of learning disentangled, robust representations in complex auditory scenes.

2. Model Architecture

SSLAM employs a masked latent bootstrapping architecture based on the Vision Transformer-Base (ViT-Base) encoder for both the student and teacher networks, instantiated with 93M parameters during pre-training and 88M for fine-tuning. The core ViT components are left unchanged; SSLAM’s innovations are confined to the data pipeline, loss computation, and the auxiliary lightweight 6-layer CNN decoder for patch-level feature regression. A key architectural decision was discarding MixIT-style mixture-invariant concept-separation, as convergence and independence-constraint issues were observed.

The model diagram may be represented as follows:

AudioSet Clip(s) --> log-mel spectrogram --> Mask(s)
                                        |          |
                              +------ViT Encoder----+ (Student and EMA Teacher)
                              |                    |
                          CNN Decoder          Latent Targets

The SSLAM framework modifies how inputs are mixed and masked, and how objective functions are defined, but retains compatibility with off-the-shelf ViT backbones.
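
To make the student/teacher setup concrete, here is a minimal PyTorch sketch of how an EMA teacher for masked latent bootstrapping is typically maintained; the helper names and the decay value are assumptions for illustration, not the released SSLAM code.

```python
import copy

import torch


def make_ema_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """Start the teacher as a frozen copy of the student encoder."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)  # the teacher is never updated by gradients
    return teacher


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               decay: float = 0.999) -> None:
    """teacher <- decay * teacher + (1 - decay) * student (decay value is illustrative)."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)
```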

3. Self-Supervised Learning Objectives

3.1 Masked Latent Bootstrapping

The baseline objective operates with inverse block multi-masking (block size 5×5, 80% mask ratio), producing $n_m=16$ (Stage 1) or $n_m=8$ (Stage 2) masked clones per example. The student model receives masked spectrograms, generating predictions both per-patch ($\hat{Y}^{patch}$) and as a global CLS embedding ($\hat{Y}^{CLS}$), while the Exponential Moving Average (EMA) teacher processes the unmasked input to provide latent targets, the layer-wise features $Z_\ell$. The global loss is computed as the MSE between CLS embeddings (student vs. teacher), either averaged across all teacher layers (unmixed) or using only the last teacher layer (mixed):

$$L_{global,unmixed} = \frac{1}{B \cdot n_m} \sum_{i=1}^{B} \sum_{j=1}^{n_m} \left\| \hat{Y}^{CLS}_{i,j} - Z^{CLS}_{i} \right\|^2$$

The local loss is a patch-wise squared error, evaluated similarly.
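
For concreteness, the following is a minimal PyTorch sketch of the global (CLS) and local (patch) bootstrapping losses under assumed tensor shapes; the function names, shapes, and masking convention are illustrative rather than taken from the SSLAM implementation.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: student_cls (B, n_m, D), teacher_cls (B, D),
# student_patch (B, n_m, P, D), teacher_patch (B, P, D), mask (B, n_m, P) boolean.


def global_loss(student_cls: torch.Tensor, teacher_cls: torch.Tensor) -> torch.Tensor:
    """MSE between each masked clone's CLS prediction and the teacher's CLS target."""
    target = teacher_cls.unsqueeze(1).expand_as(student_cls)
    return F.mse_loss(student_cls, target)


def local_loss(student_patch: torch.Tensor, teacher_patch: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """Patch-wise squared error, averaged over the masked positions only."""
    target = teacher_patch.unsqueeze(1).expand_as(student_patch)
    per_patch = (student_patch - target).pow(2).mean(dim=-1)  # (B, n_m, P)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```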

3.2 Audio Mixing

To generate synthetic polyphony, SSLAM mixes two log-mel spectrograms $S_1(f, \tau)$ and $S_2(f, \tau)$ using an element-wise maximum:

$$S_{mixed}(f, \tau) = \max\left(S_1(f, \tau),\ S_2(f, \tau)\right)$$

Partial mixing is employed: only half of each clip’s duration (two contiguous $t/4$ segments) is mixed, preserving distinct unmixed regions and thereby retaining dominant sound events. Losses on mixed inputs are analogous to the unmixed case, with the global loss referencing only the teacher’s last layer (to avoid over-compression) and the local loss averaging across all teacher layers.
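
A minimal sketch of this mixing scheme is given below, assuming (freq, time) spectrogram tensors; the position of the mixed half is an assumption, since the text above only fixes its total duration (two contiguous t/4 segments).

```python
import torch


def mix_full(s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
    """Element-wise maximum of two log-mel spectrograms of shape (freq, time)."""
    return torch.maximum(s1, s2)


def mix_partial(s1: torch.Tensor, s2: torch.Tensor, start_quarter: int = 0) -> torch.Tensor:
    """Mix only a contiguous half of s1 (two adjacent t/4 segments) with s2."""
    t = s1.shape[-1]
    q = t // 4
    start = start_quarter * q  # segment placement is an assumption
    mixed = s1.clone()
    mixed[..., start:start + 2 * q] = torch.maximum(
        s1[..., start:start + 2 * q], s2[..., start:start + 2 * q]
    )
    return mixed
```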

3.3 Source Retention Loss

The Source Retention Loss (SRL) encourages explicit retention of information pertaining to each individual source. The student’s patch-level predictions on a mixed input are optimized to match the average of the corresponding teacher representations from both unmixed sources:

$$L_{SRL} = \frac{1}{B \cdot n_m \cdot |\mathcal{M}|} \sum_{i,j,k \in \mathcal{M}} \left\| \hat{Y}^{patch,mixed}_{i,j,k} - \frac{Z^{S_1}_{i,k} + Z^{S_2}_{i,k}}{2} \right\|^2$$

This loss directly enforces that information about each contributing source is preserved in the mixture representation.
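
A minimal PyTorch sketch of this loss is shown below; the tensor shapes, masking convention, and helper name are assumptions for illustration.

```python
import torch

# Assumed shapes: student_patch_mixed (B, n_m, P, D), each teacher view (B, P, D),
# mask (B, n_m, P) boolean over masked patch positions.


def source_retention_loss(student_patch_mixed: torch.Tensor,
                          teacher_patch_s1: torch.Tensor,
                          teacher_patch_s2: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Regress mixed-input patch predictions onto the average of the two
    unmixed teacher representations."""
    target = 0.5 * (teacher_patch_s1 + teacher_patch_s2)
    target = target.unsqueeze(1).expand_as(student_patch_mixed)
    per_patch = (student_patch_mixed - target).pow(2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```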

4. Training Protocol

4.1 Data Processing and Mixture Generation

SSLAM is pre-trained on AudioSet-2M (1.91M clips), with the following preprocessing: waveforms are resampled to 16 kHz and transformed into 128-dimensional log-mel spectrograms (25 ms window, 10 ms hop). Mixtures are generated in-batch by circularly rolling the batch elements and applying an element-wise maximum on the spectrograms.
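
The pipeline can be sketched roughly as follows; the mel-filterbank settings beyond those stated (16 kHz, 128 mels, 25 ms window, 10 ms hop), the log offset, and showing the full rather than partial mix are simplifying assumptions.

```python
import torch
import torchaudio

# A 25 ms window and 10 ms hop at 16 kHz correspond to n_fft=400 and hop_length=160.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=400, hop_length=160, n_mels=128
)


def to_log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """Waveform (channels, samples) -> log-mel spectrogram; the 1e-6 offset is assumed."""
    return torch.log(mel(waveform) + 1e-6)


def in_batch_mix(spectrograms: torch.Tensor) -> torch.Tensor:
    """Pair each batch element with its circularly rolled neighbour and take
    the element-wise maximum (full mix shown for brevity)."""
    partner = torch.roll(spectrograms, shifts=1, dims=0)
    return torch.maximum(spectrograms, partner)
```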

4.2 Two-Stage Curriculum

Training is organized as a curriculum:

  • Stage 1 (10 epochs): Training on unmixed audio using $L_{global,unmixed} + L_{local,unmixed}$, learning monophonic structure.
  • Stage 2 (5 epochs): Resuming from Stage 1 weights, training on 50% mixed and 50% unmixed batches, optimizing the suite of five losses $\{ L_{global,unmixed},\ L_{local,unmixed},\ L_{global,mixed},\ L_{local,mixed},\ L_{SRL} \}$ (a sketch of the combined objective follows this list).
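
Assuming the loss terms sketched in Section 3 and equal weighting (the exact loss weights are not restated here), the Stage 2 objective on a mixed batch could be combined as follows.

```python
import torch


def stage2_loss(l_global_unmixed: torch.Tensor, l_local_unmixed: torch.Tensor,
                l_global_mixed: torch.Tensor, l_local_mixed: torch.Tensor,
                l_srl: torch.Tensor) -> torch.Tensor:
    """Sum of the five Stage 2 terms; equal weighting is an assumption."""
    return (l_global_unmixed + l_local_unmixed
            + l_global_mixed + l_local_mixed + l_srl)
```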

4.3 Optimization Details

The AdamW optimizer is used ($\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0.05) with cosine learning-rate annealing and warm-up. Stage 1 employs a peak learning rate of $5 \times 10^{-4}$; Stage 2 uses $5 \times 10^{-5}$, both decaying to $1 \times 10^{-6}$. Batch sizes are 12 clips with 16 (Stage 1) or 8 (Stage 2) masked clones per example. Computation is performed on 4× NVIDIA 3090 GPUs, requiring approximately 7 hours per epoch in Stage 1 and 7.5 hours in Stage 2.
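
A minimal sketch of this setup is shown below; the warm-up length is an assumption, while the betas, weight decay, and learning-rate endpoints follow the values quoted above.

```python
import math

import torch


def build_optimizer(model: torch.nn.Module, peak_lr: float = 5e-4) -> torch.optim.AdamW:
    """AdamW with the stated betas and weight decay (Stage 1 peak LR shown)."""
    return torch.optim.AdamW(model.parameters(), lr=peak_lr,
                             betas=(0.9, 0.95), weight_decay=0.05)


def cosine_lr(step: int, total_steps: int, warmup_steps: int,
              peak_lr: float = 5e-4, final_lr: float = 1e-6) -> float:
    """Linear warm-up followed by cosine annealing to the final learning rate."""
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```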

5. Evaluation Benchmarks and Protocols

The evaluation spans both monophonic and polyphonic scenarios:

  • Monophonic:
    • AudioSet-2M (AS-2M) and AS-20K (multi-label tagging, mAP)
    • ESC-50 (environmental sounds, accuracy)
    • Google Speech Commands KS1/KS2 (accuracy)
  • Polyphonic:
    • SPASS (urban scenario tagging, mAP)
    • IDMT-DESED-FL (sound event detection)
    • URBAN-SED (filtered for >1 event)
    • Degrees of Polyphony sets (differing numbers of concurrent events)

Protocols:

  • Linear evaluation: the backbone is frozen and a linear classifier is trained for 50 epochs (polyphonic) or up to 400k steps (AS-2M); a minimal sketch follows this list.
  • Fine-tuning: All backbone and head parameters are optimized on labeled data.
  • Metrics: mAP for tagging tasks; accuracy for ESC-50 and Speech Commands.
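
A minimal sketch of the linear-evaluation protocol, assuming a backbone that returns pooled clip-level features; the classifier learning rate and the multi-label binary cross-entropy objective are illustrative choices rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def build_linear_probe(backbone: torch.nn.Module, feature_dim: int, num_classes: int):
    """Freeze the backbone and attach a trainable linear classifier."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    backbone.eval()
    classifier = torch.nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)  # lr is illustrative
    return classifier, optimizer


def probe_step(backbone, classifier, optimizer, spectrograms, labels):
    """One training step on frozen features; labels are multi-hot floats for tagging."""
    with torch.no_grad():
        features = backbone(spectrograms)  # assumed pooled clip-level features (B, D)
    logits = classifier(features)
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```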

6. Empirical Results and Ablation Analyses

6.1 Monophonic Performance

SSLAM achieves state-of-the-art results on standard benchmarks:

| Model | #Params | AS-2M mAP | AS-20K mAP | ESC-50 Acc | KS2 Acc | KS1 Acc |
|---|---|---|---|---|---|---|
| MaskSpec (2023) | 86 M | 47.1 | 32.3 | 89.6 | 97.7 | – |
| Audio-MAE (2022) | 86 M | 47.3 | 37.1 | 94.1 | 98.3 | 96.9 |
| BEATs iter3 (2022) | 90 M | 48.0 | 38.3 | 95.6 | 98.3 | 97.7 |
| ASiT (2024) | 86 M | 48.0 | 38.6 | 95.3 | 98.9 | 98.2 |
| A-JEPA (2024) | 86 M | 48.6 | 38.4 | 96.3 | 98.5 | 97.7 |
| EAT (2024) | 88 M | 48.6 | 40.2 | 95.9 | 98.3 | – |
| SSLAM | 88 M | 50.2 | 40.9 | 96.2 | 98.1 | 98.8 |

(– indicates a value not reported in the source summary.)

SSLAM reports up to a 3.9% improvement in AS-2M mAP, establishing a new state of the art at 50.2 mAP.

6.2 Polyphonic Evaluation and Ablations

SSLAM surpasses baselines in linear evaluation and fine-tuning on SPASS, IDMT, URBAN, and varied-polyphony subsets; on SPASS Market, for example, fine-tuning improves mAP from 89.7 to 90.2. Notably, SSLAM's performance advantage over the MB-UA baseline increases with higher polyphony (up to +9.7% mAP for 8–9 concurrent events).

Ablation results indicate:

  • Best teacher layer aggregation is last layer for global, all layers for local loss.
  • Partial mixing of spectrograms is superior to full mixing.
  • MixIT variants yielded inferior convergence and lower mAP.
  • Averaging in SRL is preferred over element-wise max aggregation.
  • Robustness is high: ±0.1 mAP std. across three independent runs.

7. Significance, Limitations, and Future Directions

SSLAM advances audio SSL by bridging the gap between monophonic benchmarks and real-world polyphonic environments. Its principal contributions are (1) injecting controlled polyphony during pre-training, and (2) retaining event-specific representations within mixtures via SRL—all achieved without architectural complexity or tuning of the underlying transformer backbone. The two-stage training curriculum, partial mixture strategy, and considered teacher-layer selection are identified as key contributors to its empirical success.

Limitations include current restriction to two-source mixtures (potentially limiting multi-entity disentanglement), reliance on spectrogram-domain mixing (absence of spatial or binaural cues), and lack of robust concept-separation objectives overcoming the independence assumptions in MixIT.

Future research directions comprise extending mixture complexity, integrating spatial information for scene analysis, and developing stronger, independence-agnostic concept separation mechanisms for source disentanglement.
