Self-Supervised Audio Models
- Self-supervised audio models are techniques that learn general-purpose embeddings from unlabeled data using proxy tasks like masked spectrogram modeling.
- They employ varied neural architectures, including Transformers, Mamba, and xLSTM, to effectively capture temporal and frequency relationships.
- Their transferability has been demonstrated across speech, music, and environmental sound tasks through fine-tuning and robust few-shot adaptation.
Self-supervised audio models are a class of representation learning techniques that acquire general-purpose or task-specialized audio embeddings from unlabelled data by solving proxy objectives that exploit intrinsic structure in audio signals. These models have become the foundation for speech, environmental sound, music, and multimodal audio-visual analysis. The dominant pretext tasks include masked spectrogram modeling, contrastive learning, and predictive coding, implemented via neural architectures such as Transformers, selective structured state space models (Mamba), and extended LSTMs (xLSTM). This paradigm enables scaling to massive unlabeled audio corpora and transfer to diverse downstream tasks through fine-tuning or linear probing.
1. Core Paradigms of Self-Supervised Audio Representation Learning
Self-supervised audio representations are typically learned by training neural networks to predict, reconstruct, or align parts of the input audio that are withheld or transformed. The most influential frameworks include:
- Masked Spectrogram Modeling (MSM): The input spectrogram is split into patches and a random proportion of patches is masked. The model, via an encoder-decoder structure (Transformer, Mamba, or xLSTM), reconstructs the missing patches, minimizing mean-squared error on the masked regions. Empirically, unstructured random masking at a 50–80% ratio yields the richest representations for transfer (Chong et al., 2022, Yadav et al., 23 Sep 2025); a minimal sketch follows this list.
- Contrastive Learning: Positive pairs are created by augmenting the same audio clip (via time-reversal, amplitude scaling, spectral mask, or mixing), while remaining clips in the batch serve as negative examples. The encoder and projection head are optimized via InfoNCE or margin-based objectives to cluster positive representations and separate negatives (Verma et al., 2020, Korbar et al., 2018, Wang et al., 2021).
- Predictive Coding and Masked Prediction: Models such as wav2vec, HuBERT, and wav2vec 2.0 predict future or masked latent frames from past or unmasked context, using quantized targets or cluster assignments derived from k-means/VQ (Vaidya et al., 2022, Heggan et al., 2 Feb 2024).
- Mixture and Polyphonic Pretexts: Recent advances incorporate audio mixtures in the pretext task, requiring that embeddings of mixtures retain information from constituent sources—addressing polyphony and source overlap common in real-world soundscapes (Alex et al., 13 Jun 2025).
- Bootstrapped Latent Prediction: Student–teacher bootstrap models (e.g., Data2Vec, EAT, SSLAM) use a moving-average teacher network to provide regression targets (at frame, patch, or utterance levels) for the masked student network, accelerating convergence and decoupling the model from hard class labels (Chen et al., 7 Jan 2024, Alex et al., 13 Jun 2025).
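A minimal sketch of the masked spectrogram modeling pretext referenced above, assuming an SSAST-style learnable mask token fed through a small Transformer encoder; the patch dimensionality, mask ratio, and layer counts are illustrative placeholders rather than any cited model's configuration:

```python
import torch
import torch.nn as nn

class MaskedSpectrogramModel(nn.Module):
    """Toy masked spectrogram modeling: embed patches, replace a high ratio of
    them with a learnable mask token, encode, and reconstruct the masked patches."""
    def __init__(self, patch_dim=256, embed_dim=192, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, embed_dim)
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decoder = nn.Linear(embed_dim, patch_dim)

    def forward(self, patches):                      # patches: (B, N, patch_dim)
        B, N, _ = patches.shape
        tokens = self.embed(patches)
        mask = torch.rand(B, N, device=patches.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),     # swap in the mask token
                             self.mask_token.expand(B, N, -1), tokens)
        recon = self.decoder(self.encoder(tokens))
        # mean-squared error is computed on masked positions only
        return ((recon - patches) ** 2)[mask].mean()
```

In practice the flattened log-mel patches would also carry positional embeddings, omitted here for brevity.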
2. Neural Sequence Modeling Architectures
The architecture of the encoder is central to the inductive biases and computational tractability of self-supervised audio models. Recent research has systematically compared:
- Transformers: Self-attention over embedding sequences enables modeling arbitrary time–frequency relationships but incurs quadratic ($O(L^2)$) complexity in the sequence length $L$. Variants include SSAST (Self-Supervised Audio Spectrogram Transformer) and MaskSpec (Chong et al., 2022, Yadav et al., 23 Sep 2025).
- Selective Structured State Space Models (Mamba/SSAM): Mamba stacks blocks that perform adaptive, content-dependent state space convolutions across sequences, achieving linear-time scaling, better extrapolation to long sequences, and a reduced computational footprint; Mamba encoders achieve substantial gains over Transformers at all model sizes and sequence lengths (Yadav et al., 4 Jun 2024, Yadav et al., 23 Sep 2025). A toy scan-vs-attention comparison follows this list.
- Extended LSTM (xLSTM/AxLSTM): Enhanced LSTM cells with matrix storage, exponential gates, and normalizer states capture long-term temporal dependencies, scale linearly in sequence length $L$, and are effective, particularly for musical and pitch-driven tasks (Yadav et al., 23 Sep 2025).
- Convolutional Backbones: Simpler SSL models (e.g., BYOL-A) use convolutional encoders with temporal pooling and perform well in low-resource and domain-flexible settings, especially when computational budget or deployment constraints are strong (Ogg, 4 Feb 2025, Tagliasacchi et al., 2019).
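To make the complexity contrast concrete, the toy comparison below sets full self-attention against a linear-time gated recurrent scan; it is illustrative only and does not reproduce the actual Mamba or xLSTM state updates, whose gating and parameterization are considerably richer:

```python
import torch

def self_attention(x):
    """Full self-attention: materializing the (L, L) score matrix is quadratic in L."""
    scores = torch.softmax(x @ x.transpose(-1, -2) / x.shape[-1] ** 0.5, dim=-1)
    return scores @ x                              # x, output: (B, L, D)

def gated_scan(x, a, b):
    """Linear-time recurrent scan h_t = a_t * h_{t-1} + b_t * x_t: one pass over L."""
    B, L, D = x.shape
    h = torch.zeros(B, D, device=x.device)
    outputs = []
    for t in range(L):                             # cost grows linearly with L
        h = a[:, t] * h + b[:, t] * x[:, t]        # a, b: content-dependent gates, (B, L, D)
        outputs.append(h)
    return torch.stack(outputs, dim=1)             # (B, L, D)
```

The attention path touches all $L^2$ token pairs, whereas the scan visits each timestep once; selective SSMs additionally evaluate such scans with parallel, hardware-aware kernels rather than a Python loop.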
3. Training Protocols, Objectives, and Masking Strategies
Unified training regimes across architectures have been developed for fair benchmarking:
- Input Preprocessing: Standardized log-mel spectrogram extraction ($16$–$32$ kHz sample rates, $80$–$128$ mel bins, $2$–$10$ s windows), followed by splitting the spectrogram into fixed-size time–frequency patches or frame-level tokens (Yadav et al., 4 Jun 2024, Alex et al., 13 Jun 2025).
- Masking: Unstructured random, block, and inverse block masks are used, with empirical evidence favoring high mask ratios (≥75%) for spectrograms, maximizing sample efficiency (Chong et al., 2022, Chen et al., 7 Jan 2024).
- Optimization: Large batch sizes (on the order of 1000), the AdamW optimizer, linear warmup followed by cosine decay, and roughly 80–100 epochs are standard.
- Objective Functions:
- Reconstruction loss: mean-squared error over masked positions only, e.g. $\mathcal{L}_{\mathrm{rec}} = \frac{1}{|M|} \sum_{i \in M} \lVert \hat{x}_i - x_i \rVert_2^2$, where $M$ is the set of masked patches, $x_i$ the original patch, and $\hat{x}_i$ its reconstruction.
- Latent regression: Student regression on teacher features, both locally (patch) and globally (utterance-level) (Chen et al., 7 Jan 2024).
- Contrastive loss (InfoNCE): Pulls positive pairs together and pushes apart all other clips in the batch as negatives (Verma et al., 2020); see the sketch after this list.
- Mixture and source retention losses: For mixture pretexts, regularize the student to preserve average source features (Alex et al., 13 Jun 2025).
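A minimal sketch of the InfoNCE objective referenced above, assuming two augmented embedding views per clip with in-batch negatives; the temperature value is an illustrative placeholder:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE: z1[i] and z2[i] are embeddings of two augmented views of clip i;
    every other clip in the batch acts as a negative."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)
```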
4. Transfer Learning, Downstream Evaluation and Polyphony
Self-supervised models are conventionally evaluated by transferring the pretrained encoder to a suite of downstream benchmarks:
| Task type | Dataset(s) | Typical Metric |
|---|---|---|
| General audio | AudioSet, FSD50K, ESC-50 | mAP, accuracy |
| Speech | LibriSpeech, SpeechCommands | accuracy |
| Music | NSynth, MAESTRO | pitch/tonic acc. |
| Polyphonic sound | SPASS, URBAN-SED, DESED | multi-label mAP |
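Linear probing is the most common protocol behind these benchmarks: the pretrained encoder is frozen, clip-level embeddings are extracted, and a lightweight classifier is fit on top. A minimal sketch, assuming a generic encoder callable that maps batches of spectrograms to frame-level embeddings (a placeholder interface, not any specific model's API):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_embeddings(encoder, loader):
    """Frozen-encoder feature extraction: mean-pool over time to one vector per clip."""
    feats, labels = [], []
    for spec, y in loader:                    # spec: (B, T, F), y: (B,)
        emb = encoder(spec)                   # assumed to return (B, T', D)
        feats.append(emb.mean(dim=1).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# linear probe on frozen features (encoder and loaders are placeholders):
# X_train, y_train = extract_embeddings(encoder, train_loader)
# X_test, y_test = extract_embeddings(encoder, test_loader)
# probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# print("probe accuracy:", probe.score(X_test, y_test))
```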
- Polyphonic Robustness: Most conventional SSL models are trained on monophonic or weakly-labeled event datasets. SSLAM, by including element-wise max spectrogram mixtures and a source retention loss during pretraining, improves mean average precision on AudioSet-2M and yields larger gains still on strongly polyphonic soundscapes, while maintaining or improving monophonic task performance (Alex et al., 13 Jun 2025).
- Domain Specificity and Flexibility: Convolutional SSL models (e.g., BYOL-A) trained on speech-only, non-speech, or mixed data diets show near-equal performance across speech and non-speech tasks, with only modest domain-specificity effects (<3% accuracy) (Ogg, 4 Feb 2025).
- Few-Shot Adaptation: Large-scale SSL models (wav2vec 2.0, HuBERT, WavLM) demonstrate strong transferability to few-shot audio classification tasks, achieving or surpassing meta-learned baselines, with best results linked to masked prediction objectives and Transformer encoders (Heggan et al., 2 Feb 2024).
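One common way such frozen SSL embeddings are used in the few-shot setting is nearest-prototype classification: each class prototype is the mean of its support embeddings and queries are assigned to the closest prototype. A minimal sketch, with cosine similarity as an assumed (not cited) distance choice:

```python
import torch
import torch.nn.functional as F

def prototype_classify(support, support_labels, queries, n_classes):
    """Few-shot classification on frozen embeddings via nearest class prototype."""
    support = F.normalize(support, dim=-1)                # (N_support, D)
    queries = F.normalize(queries, dim=-1)                # (N_query, D)
    prototypes = torch.stack([
        support[support_labels == c].mean(dim=0)          # mean embedding per class
        for c in range(n_classes)
    ])
    sims = queries @ F.normalize(prototypes, dim=-1).t()  # cosine similarity to prototypes
    return sims.argmax(dim=-1)                            # predicted class per query
```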
5. Quantitative Performance and Model Trade-offs
Experimental comparisons across backbone types and scales reveal that:
| Model family | Params | Aggregate score $s(m)$ | Notable strengths |
|---|---|---|---|
| SSAST-Base | 89 M | 68.2 | Transformer baseline for masked spectrogram modeling |
| SSAM-Base | 86 M | 87.9 | Linear-time scaling, long sequences (Yadav et al., 23 Sep 2025) |
| AxLSTM-Base | 83 M | 85.8 | Recurrent memory, musical tasks |
| EAT | 90 M | — | SOTA mAP on AudioSet, 10–15x pretraining speedup |
| SSLAM | 88 M | — | Best polyphonic performance, SOTA AudioSet mAP |
- Sequence scaling: Transformers degrade with long sequences due to quadratic scaling; Mamba and xLSTM are robust and computationally more efficient for long audio, scaling linearly in input length (Yadav et al., 23 Sep 2025, Yadav et al., 4 Jun 2024).
- Masking and block design: Inverse or large block masking strategies (EAT) facilitate high-ratio masked prediction with minimal performance drop, enabling faster, more efficient pretraining (Chen et al., 7 Jan 2024).
- Downstream generalization: SSM-based and xLSTM-based models outperform Transformer baselines (SSAST) on both short and long tasks in min-max normalized aggregated score, especially for speech and polyphonic sound (Yadav et al., 23 Sep 2025).
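For reference, a min–max normalized aggregate of this kind is typically computed per task and then averaged; a plausible form, assuming $x_{m,t}$ is model $m$'s raw metric on task $t$ and $T$ is the task set (the cited benchmark's exact task selection and weighting may differ), is:

$$
s(m) = \frac{1}{|T|} \sum_{t \in T} \frac{x_{m,t} - \min_{m'} x_{m',t}}{\max_{m'} x_{m',t} - \min_{m'} x_{m',t}}
$$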
6. Multimodal and Weakly-Supervised Extensions
- Audio–Visual & Cross-Modal Models: Joint contrastive objectives over audio (spectrogram and/or raw waveform) and video encoder branches align representations across modalities, leveraging natural audio–visual synchronization (temporal alignment); this is instrumental for audiovisual scene understanding (Wang et al., 2021, Korbar et al., 2018, Liu et al., 2022).
- Attention and Weak Labeling: Self-supervised attention models, which generate pseudo-strong segment labels for attention supervision from weakly labeled clips, close much of the gap to strongly supervised systems for audio event and short-duration detection (Kim et al., 2019). These schemes are practical where strong temporal labels are expensive or unavailable; an attention-pooling sketch follows this list.
- Embedding-Level SSL: Masked reconstruction can be applied not only at waveform or spectrogram but also on compact feature-embedding sequences, retaining transfer advantage in few-shot and small label regimes (Nimitsurachat et al., 2023).
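A minimal sketch of the kind of attention pooling such weakly supervised schemes build on, where the learned per-frame weights can double as pseudo-strong temporal labels; the module layout is an illustrative assumption, not the cited architecture:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Clip-level prediction from frame embeddings via learned attention;
    the per-frame weights can serve as pseudo-strong (segment-level) labels."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.cls = nn.Linear(dim, n_classes)   # frame-wise class logits
        self.att = nn.Linear(dim, n_classes)   # frame-wise attention logits

    def forward(self, frames):                 # frames: (B, T, dim)
        logits = self.cls(frames)              # (B, T, C)
        weights = torch.softmax(self.att(frames), dim=1)  # normalize over time
        clip_logits = (weights * logits).sum(dim=1)       # (B, C) weak-label prediction
        return clip_logits, weights            # weights ~ pseudo temporal labels
```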
7. Outlook and Practical Recommendations
- Model and Pretext Selection: Transformer-based masked spectrogram models remain the reference, but selective state space models (Mamba) and extended LSTM backbones have demonstrated clear quantitative and scaling advantages and are advocated especially for longer or more complex audio.
- Polyphonic Audio: Integrating mixture-invariant objectives and explicit polyphonic mixtures in pretraining is necessary to achieve SOTA performance in realistic, polyphonic environments (Alex et al., 13 Jun 2025).
- Data and Domain Transfer: Mixed-domain or universal pretraining (speech + non-speech) provides strong robustness, supporting flexible downstream use with only marginal loss in matched-domain performance. For applications with limited annotations, SSL enables strong few-shot adaptation and cross-modal transfer (Ogg, 4 Feb 2025, Heggan et al., 2 Feb 2024).
- Computation and Efficiency: Masked modeling with block/inverse block strategies and student–teacher bootstrapping (EAT, SSLAM) can reduce pre-training time by an order of magnitude without accuracy degradation (Chen et al., 7 Jan 2024).
- Guidance for Practitioners: For general transfer and scale, SSAM (Mamba) or AxLSTM backbones are recommended for modern masked spectrogram modeling. For resource-constrained settings or rapid deployment, lightweight convolutional models remain practical, especially with BYOL-A- or Audio2Vec-style training (Tagliasacchi et al., 2019, Ogg, 4 Feb 2025). For polyphonic or complex environments, mixture-enhanced models such as SSLAM are required.
Self-supervised audio modeling has matured to deliver domain-flexible, state-of-the-art foundation models, enabling effective transfer to speech, event, music, affect, and multimodal applications, and providing a rigorous, practical toolkit for exploiting unlabelled audio (Yadav et al., 23 Sep 2025, Yadav et al., 4 Jun 2024, Alex et al., 13 Jun 2025, Chen et al., 7 Jan 2024).