Audio-JEPA: Self-Supervised Audio Learning

Updated 18 November 2025
  • Audio-JEPA is a self-supervised audio representation learning paradigm based on JEPA that predicts latent representations from masked spectrogram patches.
  • It employs Transformer-based context and target encoders with a lightweight predictor, using random patch masking to encourage high-level feature extraction.
  • Empirical evaluations on diverse benchmarks for speech, music, and environmental sounds show competitive performance with significantly less pre-training data than comparable models.

Audio-JEPA (Audio Joint-Embedding Predictive Architecture) is a self-supervised learning paradigm for audio representation learning built on the Joint-Embedding Predictive Architecture (JEPA) principle. The method predicts high-level latent representations of masked regions in spectrogram space, eschewing raw waveform or spectrogram-pixel reconstruction. Audio-JEPA relies on Transformer-based context and target encoders—typically Vision Transformer (ViT) backbones adapted for audio spectrogram input—alongside a lightweight predictor. It uses purely random masking of spectrogram patches and minimizes a latent-space prediction loss against targets produced by an exponential-moving-average (EMA) copy of the context encoder. The paradigm has demonstrated competitive performance on speech, music, and environmental sound benchmarks with significantly less pre-training data than comparable models such as wav2vec 2.0 and data2vec (Tuncay et al., 25 Jun 2025).

1. Architectural Formulation

Audio-JEPA processes raw audio by converting each 10 s, 32 kHz clip to a 128-band Mel-spectrogram with 256 time bins. The spectrogram "image" is decomposed into 16×16 non-overlapping patches, resulting in 128 discrete regions per example. Each patch $x_i \in \mathbb{R}^{16 \times 16}$ is flattened and linearly projected to a $d = 768$ dimensional embedding, augmented with a positional encoding.
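
A minimal sketch of this patchify-and-embed step, assuming PyTorch; the tensor shapes follow the stated 128-band, 256-frame spectrogram and 16×16 patches, while the module and variable names are illustrative rather than taken from the released implementation.

```python
import torch
import torch.nn as nn

PATCH = 16        # 16x16 non-overlapping patches
EMBED_DIM = 768   # d = 768 token embedding

class PatchEmbed(nn.Module):
    """Flatten 16x16 spectrogram patches and project them to d=768 tokens."""
    def __init__(self, patch=PATCH, dim=EMBED_DIM, n_patches=128):
        super().__init__()
        self.proj = nn.Linear(patch * patch, dim)                 # linear projection of flattened patch
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))   # learned positional encoding

    def forward(self, spec):                                      # spec: (B, 128 mel bands, 256 frames)
        B, _, _ = spec.shape
        # unfold frequency then time: (B, 8 freq blocks, 16 time blocks, 16, 16)
        patches = spec.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)
        patches = patches.reshape(B, -1, PATCH * PATCH)           # (B, 128 patches, 256 values)
        return self.proj(patches) + self.pos                      # (B, 128, 768)

tokens = PatchEmbed()(torch.randn(2, 128, 256))                   # -> torch.Size([2, 128, 768])
```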

The core model comprises:

  • Context encoder $f_\phi$ and target encoder $f_{\bar{\phi}}$: both implement a ViT backbone with 12 layers, 12 attention heads, MLP ratio 4.0, and a per-encoder parameter count of ≈85.4 M.
  • Predictor $g_\theta$: a 6-layer ViT with embedding dimension 384, re-projecting back to 768, with ≈11.3 M parameters. $g_\theta$ is active only during training.

The context encoder processes only the visible patches (a random subset is masked out); the predictor then generates estimates of the masked regions in latent space. The target encoder always receives the full spectrogram (both masked and unmasked regions) and is updated as an exponential moving average (EMA) of the context encoder weights. The forward computation proceeds as follows (a code sketch appears after the list):

  1. Randomly mask a subset $M$ of the 128 patches.
  2. $x_{\setminus M} \xrightarrow{f_\phi} C$ (context embeddings of visible patches)
  3. $C \xrightarrow{g_\theta} P$ (mask-position predictions)
  4. $x \xrightarrow{f_{\bar{\phi}}} Z$ (full-spectrogram targets)
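
These four steps can be sketched as follows, again assuming PyTorch. Generic Transformer encoder stacks of the stated widths stand in for the ViT backbones, masked tokens are zeroed rather than dropped for brevity, and all names are illustrative.

```python
import copy
import torch
import torch.nn as nn

def vit_stack(dim, layers, heads):
    """Stand-in for a ViT backbone: a stack of pre-norm Transformer encoder layers."""
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,   # MLP ratio 4.0
                                       batch_first=True, norm_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

context_encoder = vit_stack(768, 12, 12)              # f_phi: 12 layers, width 768, 12 heads
target_encoder = copy.deepcopy(context_encoder)       # f_phi_bar: EMA copy, no gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

predictor_in = nn.Linear(768, 384)                    # down-project to predictor width
predictor = vit_stack(384, 6, 6)                      # g_theta: 6 layers at width 384 (head count assumed)
predictor_out = nn.Linear(384, 768)                   # re-project back to 768

tokens = torch.randn(4, 128, 768)                     # patch embeddings for a batch of 4 clips
mask = torch.rand(4, 128) < 0.5                       # step 1: random mask M (~50% of patches)

# Step 2: context encoder sees only visible patches (masked tokens zeroed here; the model drops them).
visible = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
C = context_encoder(visible)

# Step 3: predictor estimates latent content at the masked positions.
P = predictor_out(predictor(predictor_in(C)))

# Step 4: target encoder embeds the full, unmasked spectrogram.
with torch.no_grad():
    Z = target_encoder(tokens)
```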

2. Masking Strategies and Domain Differences

Audio-JEPA adopts purely random patch masking (as opposed to block or contiguous masking), with masking ratio $\rho \sim \mathcal{U}(0.4, 0.6)$ sampled per batch, averaging approximately 50% hidden patches. Ablations revealed that traditional block masking strategies beneficial in image JEPA (I-JEPA) architectures are suboptimal for audio; audio events frequently span wide frequency bands, making local contiguity less critical (Tuncay et al., 25 Jun 2025, Riou et al., 14 May 2024).
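
A small sketch of this per-batch mask sampling, assuming PyTorch; the function name and defaults are illustrative.

```python
import torch

def sample_random_mask(batch_size: int, n_patches: int = 128) -> torch.Tensor:
    """Boolean mask of shape (batch_size, n_patches); True marks a hidden patch."""
    rho = torch.empty(1).uniform_(0.4, 0.6).item()     # masking ratio shared by the batch
    n_masked = int(round(rho * n_patches))
    scores = torch.rand(batch_size, n_patches)         # random score per patch
    idx = scores.argsort(dim=1)[:, :n_masked]          # lowest-scoring patches are masked
    mask = torch.zeros(batch_size, n_patches, dtype=torch.bool)
    return mask.scatter_(1, idx, True)

mask = sample_random_mask(4)   # ~40-60% of the 128 patches hidden, unstructured
```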

Different masking domains are possible:

  • Audio-domain masking: Target patches are removed from the spectrogram prior to encoder input.
  • Latent-domain masking: Target encoder receives the full (unmasked) spectrogram; only the context encoder input is masked.

Empirical results favor unstructured random masking in the audio domain, outperforming time-only, multi-block, or latent-domain masks (Riou et al., 14 May 2024). Masking in the audio domain enforces high-level feature learning, mitigating leakage of low-level detail.

3. Latent-Prediction Objective and Optimization

The model is optimized to minimize the distance between predictor outputs and target-encoder embeddings over masked patches. For Audio-JEPA, the loss is the average squared $\ell_2$ distance:

$$\mathcal{L}(\theta, \phi) = \frac{1}{|M|} \sum_{i \in M} \|P_i - Z_i\|_2^2 = \mathbb{E}_{x, M}\, \big\|g_\theta\big(f_\phi(x_{\setminus M})\big) - f_{\bar{\phi}}(x)\big\|_2^2$$

Only $\phi$ and $\theta$ are updated by gradient descent; $\bar{\phi}$ is updated via EMA:

$$\bar{\phi} \leftarrow \tau\,\bar{\phi} + (1 - \tau)\,\phi$$

with $\tau$ annealed as in Bootstrap Your Own Latent (BYOL), typically approaching 1 over training (Tuncay et al., 25 Jun 2025).
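
The objective and EMA update translate directly into code; in this sketch (assuming PyTorch), P, Z, and mask denote the predictor outputs, target embeddings, and masked-patch indicator from the forward pass, and all names are illustrative.

```python
import torch

def jepa_loss(P: torch.Tensor, Z: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean squared l2 distance between predictions and targets over masked patches only."""
    diff = (P - Z.detach()) ** 2          # targets carry no gradient
    per_patch = diff.sum(dim=-1)          # squared l2 distance per patch
    return per_patch[mask].mean()         # average over the |M| masked patches

@torch.no_grad()
def ema_update(target_encoder, context_encoder, tau: float) -> None:
    """phi_bar <- tau * phi_bar + (1 - tau) * phi, with tau annealed toward 1 during training."""
    for p_bar, p in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_bar.mul_(tau).add_(p, alpha=1.0 - tau)
```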

Alternative JEPA variants (e.g., using a Huber $L_1$ loss or curriculum masking schedules) have been explored, improving robustness to context masking and enhancing semantic feature focus (Fei et al., 2023, Riou et al., 14 May 2024).
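
As an illustration of the Huber-style variant, the squared-error term can be swapped for a smooth-L1 loss; this is a drop-in sketch under that assumption, not the exact formulation used in the cited works.

```python
import torch
import torch.nn.functional as F

def jepa_huber_loss(P: torch.Tensor, Z: torch.Tensor, mask: torch.Tensor, beta: float = 1.0):
    """Huber / smooth-L1 variant of the latent prediction loss over masked patches."""
    return F.smooth_l1_loss(P[mask], Z[mask].detach(), beta=beta)
```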

4. Pre-training and Downstream Evaluation

Audio-JEPA models are pretrained on large-scale unlabeled corpora, typically AudioSet (≈1.9 M clips, 5,338 hours of audio). Training proceeds for 100k steps (≈13 epochs) with a batch size of 256 clips, the AdamW optimizer with weight decay 0.05, and a warmup/cosine learning-rate schedule peaking at $3 \times 10^{-4}$ (Tuncay et al., 25 Jun 2025).
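
A sketch of this optimization recipe, assuming PyTorch; the warmup length is a placeholder, since it is not specified above.

```python
import math
import torch

def build_optimizer_and_schedule(model, total_steps: int = 100_000,
                                 warmup_steps: int = 10_000,    # placeholder; not stated in the source
                                 peak_lr: float = 3e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.05)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                                 # linear warmup to the peak LR
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine decay to zero

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```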

Evaluation encompasses 21 datasets (the X-ARES suite), spanning:

  • Speech: ASV2015, Fluent Speech Commands, Speech Commands v1, VoxCeleb1, VoxLingua33, Librispeech.
  • Music: GTZAN genre, Free Music Archive, NSynth.
  • Environmental sound: ESC-50, UrbanSound8K, FSD50K.

Two evaluation protocols are employed (both sketched in code after the list):

  • Linear probe: Single-layer MLP head on frozen context encoder outputs.
  • k-Nearest Neighbor (kNN): Distance-based retrieval in embedding space.
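
A compact sketch of both protocols on frozen clip-level embeddings, assuming scikit-learn; a logistic-regression head stands in for the single-layer probe, and the synthetic arrays are placeholders for real encoder outputs and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# X_*: frozen context-encoder embeddings (e.g., mean-pooled patch tokens); y_*: task labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 768)), rng.integers(0, 10, 500)
X_test, y_test = rng.normal(size=(100, 768)), rng.integers(0, 10, 100)

# Linear probe: a logistic-regression head stands in for the single-layer MLP on frozen features.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear probe acc:", probe.score(X_test, y_test))

# kNN: distance-based retrieval directly in the embedding space.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("kNN acc:", knn.score(X_test, y_test))
```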

Key findings:

  • Audio-JEPA matches or exceeds wav2vec 2.0 / data2vec on many music and ambient tasks under kNN, despite using less than 20% of the pre-training data.
  • Linear probe accuracy is competitive for non-speech, but lags slightly for fine-grained speech distinctions, attributed to the lack of explicit linear separability enforced by the loss (Tuncay et al., 25 Jun 2025, Fei et al., 2023).

5. Ablation Studies and Recommended Configuration

Ablation studies have investigated the impact of masking strategies, context-target segment duration, loss-function variants, and encoder/predictor size. The recommended configuration for a robust general-purpose audio JEPA, per (Riou et al., 14 May 2024), is listed below and collected into a configuration sketch after the list:

  • Unstructured random masking (~50% of patches).
  • Audio-domain masking (mask input to both encoders).
  • ViT-Base context encoder (12 × 768), small ViT predictor (typically 8 × 512).
  • Training segment: ~2 s for speech/instrument, ~6 s for music/environmental sound.
  • Track targets via EMA of context weights.
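
For reference, these recommended settings can be collected into a single configuration sketch; the dictionary keys are illustrative and the values are transcribed from the list above.

```python
RECOMMENDED_AUDIO_JEPA = {
    "masking": "unstructured_random",      # ~50% of patches, applied in the audio domain
    "mask_ratio": 0.5,
    "context_encoder": {"arch": "ViT-Base", "layers": 12, "dim": 768},
    "predictor": {"arch": "ViT-Small", "layers": 8, "dim": 512},   # "typically 8 x 512"
    "segment_seconds": {"speech_instrument": 2.0, "music_environmental": 6.0},
    "target_update": "ema_of_context_weights",
}
```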

Table: Linear-probe accuracies for representative models (Riou et al., 14 May 2024):

Model            ESC-50   US8K   GTZAN   NSynth
A-JEPA (2.1 s)   89.3     87.2   82.1    76.8
A-JEPA (6.4 s)   90.0     87.7   86.9    74.9
M2D-SOTA         89.7     87.6   83.3    75.3

6. Extensions, Variants, and Broader Impact

Audio-JEPA has catalyzed the development of both task-specific and more general architectures:

  • WavJEPA generalizes JEPA to raw waveforms, eliminating spectrogram computation and phase loss, resulting in highly robust, low-latency time-domain models. WavJEPA-Nat extends this to multichannel naturalistic acoustic scenes for binaural perception, outperforming other time-domain foundation models on HEAR and ARCH tasks (Yuksel et al., 27 Sep 2025).
  • Stem-JEPA adapts JEPA principles to multi-track audio for musical stem compatibility estimation, utilizing context/target encoders and instrument-label conditioning to predict stem embeddings from a mix. This framework achieves strong retrieval performance on MUSDB18 and competitive downstream MIR task transfer, contingent on separated stem availability (Riou et al., 5 Aug 2024).

Computational efficiency: Audio-JEPA and WavJEPA attain competitive benchmark results with one-fifth or less of the training data and model size of prior approaches. A plausible implication is that latent-space prediction, rather than raw-signal reconstruction, is sufficient for learning semantically rich, transferable audio features.

7. Limitations and Future Research Directions

Current limitations include:

  • Absence of explicit label information during pre-training, potentially reducing linear probe accuracy for certain spoken-language tasks.
  • No systematic hyperparameter tuning in foundational works, leaving mask ratio, EMA rate, and optimizer choices underexplored (Tuncay et al., 25 Jun 2025).
  • Dependence on stem separation quality and instrument taxonomy granularity in musical compatibility approaches (Riou et al., 5 Aug 2024).
  • Generative use cases (e.g., separation, enhancement, synthesis) are at a nascent stage; direct application of JEPA embeddings is plausible but unproven for streaming generative models (Yuksel et al., 27 Sep 2025).

Potential future directions:

  • Hybrid training with both general audio and speech-specific corpora to close the ASR/speaker ID gap in WavJEPA.
  • Incorporation of attention pooling mechanisms (e.g., as in V-JEPA) to improve fine-grained classification performance in downstream probing (Tuncay et al., 25 Jun 2025).
  • Expansion of musical stem categories and joint cross-modal prediction.
  • Scaling WavJEPA-Nat to more diverse acoustic environments and higher-order ambisonics for spatial audio understanding.

Audio-JEPA and its derivatives constitute a robust paradigm for audio representation learning, capable of efficient generalization across diverse audio domains with modest data and compute. The field continues to evolve toward phase-aware, context-rich, and compatibility-driven representations underpinning next-generation audio foundation models.
