A-JEPA: Self-Supervised Audio Embedding

Updated 11 June 2026
  • A-JEPA is a self-supervised audio framework that predicts embeddings of masked audio segments using context and target encoders.
  • It employs Vision Transformer or Conformer backbones with curriculum masking strategies to enhance representation learning and achieve competitive benchmarks.
  • Value-guided variants integrate latent space regularization for effective tokenization and action planning, outperforming prior models on key audio tasks.

A-JEPA (Audio Joint-Embedding Predictive Architecture) encompasses a family of architectures that adapt the Joint-Embedding Predictive Architecture (JEPA) paradigm, originally developed for vision models, to audio and sequential domains. These approaches perform masked latent prediction in high-level feature spaces via context and target encoders, using self-supervised pretraining for representation learning, tokenization, and, in certain variants, value-aligned world modeling for action planning.

1. Fundamental Principles and Core Architecture

A-JEPA is grounded in the JEPA principle of predicting embeddings of masked regions from visible context, operating in latent space rather than reconstructing inputs. The typical architectural backbone is a Vision Transformer (ViT) or Conformer, supporting masked patch- or segment-level prediction on time-frequency representations.

The canonical A-JEPA setup consists of:

  • Context Encoder ($E_\theta$): Processes visible (unmasked) audio spectrogram or convolutional feature patches, outputting context embeddings.
  • Target Encoder ($E_{\tilde\theta}$): Shares its architecture with $E_\theta$, but its weights are updated by exponential moving average (EMA), producing stable target embeddings.
  • Prediction Head ($P_\phi$): A lightweight Transformer-based head that receives context features (with explicit mask tokens for missing patches in some variants) and predicts the latent representation of masked patches in the same embedding space as the target encoder.
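
The EMA relationship between the two encoders can be illustrated with a minimal sketch. It treats the weights as flat lists of floats; the function name `ema_update` and the momentum values are assumptions for illustration and are not taken from the cited papers.

```python
def ema_update(target_params, context_params, momentum=0.996):
    """Return new target weights: momentum * target + (1 - momentum) * context.

    The default momentum is an assumed, commonly used value, not one from the source.
    """
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]


# Toy weights: the target encoder drifts slowly toward the context encoder.
target = [0.0, 1.0]
context = [1.0, 1.0]
target = ema_update(target, context, momentum=0.9)
print(target)  # first weight moves about 10% of the way toward the context value
```

Because the target weights change slowly, the prediction targets stay stable while the context encoder is trained by gradient descent.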

The self-supervised objective is the mean squared error between predicted and target patch embeddings over the masked positions:

$$\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i\in\mathcal{M}} \|\hat{z}_i - z_i\|_2^2$$

where $\mathcal{M}$ is the set of masked patch indices and $z_i$, $\hat{z}_i$ are the target and predicted embeddings for patch $i$.
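
A minimal sketch of this objective in plain Python follows, with embeddings stored as lists of floats. The function name `masked_latent_mse` and the toy values are hypothetical; a real implementation would use a tensor library and stop gradients through the target encoder.

```python
def masked_latent_mse(pred, target, masked_idx):
    """Average, over masked patches, of the squared L2 distance between embeddings."""
    if not masked_idx:
        raise ValueError("masked_idx must be non-empty")
    total = 0.0
    for i in masked_idx:
        total += sum((p - t) ** 2 for p, t in zip(pred[i], target[i]))
    return total / len(masked_idx)


# Three patches with 2-D embeddings; only patches 1 and 2 are masked,
# so the large mismatch on patch 0 does not contribute to the loss.
pred = [[0.0, 0.0], [1.0, 2.0], [3.0, 3.0]]
target = [[9.0, 9.0], [1.0, 0.0], [3.0, 4.0]]
print(masked_latent_mse(pred, target, masked_idx=[1, 2]))  # 2.5
```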

Masking strategies range from fixed random blocks to adaptive curriculum schedules incorporating time- and frequency-aware masking, depending on the implementation (Fei et al., 2023, Tuncay et al., 25 Jun 2025).

2. Architectural Instantiations and Variants

Latent-predictive A-JEPA models (Tuncay et al., 25 Jun 2025; Fei et al., 2023) process input audio as Mel-spectrograms partitioned into $N$ non-overlapping patches (the example configuration yields 128 patches). Each patch is embedded via linear projection and positional encoding. Typical backbone configurations:

| Module | Depth | Embedding Dim | #Heads | MLP Ratio | Parameters |
| --- | --- | --- | --- | --- | --- |
| Context/Target ViT | 12 | 768 | 12 | 4.0 | 85.4M each |
| Predictor ViT | 6 | 384 | 12 | 4.0 | 11.3M |

(Tuncay et al., 25 Jun 2025)
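
As a rough sanity check on the parameter counts in the table, a Transformer block with embedding dimension d and MLP ratio r holds about (4 + 2r)d² weights: four d×d attention projections plus two MLP matrices. This is a standard back-of-the-envelope estimate, not a formula from the cited paper. It ignores biases, layer norms, and patch or positional embeddings, so it lands slightly below the reported figures. The function name is hypothetical.

```python
def approx_transformer_params(depth, dim, mlp_ratio=4.0):
    """Approximate weight count: per block, 4*d^2 (attention) + 2*r*d^2 (MLP)."""
    attn = 4 * dim * dim
    mlp = 2 * int(mlp_ratio * dim) * dim
    return depth * (attn + mlp)


encoder = approx_transformer_params(depth=12, dim=768)
predictor = approx_transformer_params(depth=6, dim=384)
print(f"{encoder / 1e6:.1f}M")    # 84.9M (table reports 85.4M)
print(f"{predictor / 1e6:.1f}M")  # 10.6M (table reports 11.3M)
```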

A-JEPA with DAAM (Ioannides et al., 8 Dec 2025) incorporates a Density Adaptive Attention Mechanism after each Conformer block, using Gaussian mixture-based gating to focus on salient temporal frames. Masking is performed on temporally downsampled convolutional features, enabling low frame rate (2.5 Hz) masked prediction.

Action-planning A-JEPA (Destrade et al., 28 Dec 2025) extends JEPA to sequential decision problems by coupling the latent space geometry to a goal-conditioned value function. The standard JEPA predictor is augmented with a value-guided regularizer that encourages Euclidean or quasi-metric distances in latent space to approximate negative goal-reaching costs. This enables planning via gradient-based or sampling-based optimizers over action sequences in latent space.

3. Masking and Curriculum Strategies

A-JEPA implementations employ diverse masking schedules to optimize semantic coverage and context reasoning:

  • Random block masking: Masking rectangular regions at random (Fei et al., 2023), serving as an "easy" regime in early training.
  • Time-frequency masking: Masking entire time- or frequency-bands, promoting robustness to occlusion and capturing locally correlated structure; introduced via curriculum to gradually increase difficulty.
  • Regularized masking (fine-tuning): During downstream training, attention-based masking is applied in the self-attention layers. A small fraction (e.g., 10%) of tokens is blocked while their representations are retained, forcing the model to rely on contextual information (Fei et al., 2023).

Curriculum masking schedules dynamically transition from block to band masking as model competence increases, formalized as an annealing probability over training steps.
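
The curriculum can be sketched as a sampler that picks band masking with a probability that grows over training. The linear schedule, the 2×2 block size, and all function names below are assumptions for illustration; the cited papers may use different schedules and mask shapes.

```python
import random


def band_probability(step, total_steps):
    """Assumed linear annealing: 0 means all block masks, 1 means all band masks."""
    return min(1.0, max(0.0, step / total_steps))


def block_mask(n_time, n_freq, rng, width=2, height=2):
    """Mask one random rectangle of patches (the 'easy' regime)."""
    t0 = rng.randrange(n_time - width + 1)
    f0 = rng.randrange(n_freq - height + 1)
    return {(t, f) for t in range(t0, t0 + width) for f in range(f0, f0 + height)}


def band_mask(n_time, n_freq, rng):
    """Mask one entire time step or one entire frequency band."""
    if rng.random() < 0.5:
        t = rng.randrange(n_time)
        return {(t, f) for f in range(n_freq)}
    f = rng.randrange(n_freq)
    return {(t, f) for t in range(n_time)}


def sample_mask(step, total_steps, n_time, n_freq, rng):
    """Choose band masking with the annealed probability, else block masking."""
    if rng.random() < band_probability(step, total_steps):
        return band_mask(n_time, n_freq, rng)
    return block_mask(n_time, n_freq, rng)


rng = random.Random(0)
early = sample_mask(step=0, total_steps=100, n_time=8, n_freq=6, rng=rng)
late = sample_mask(step=100, total_steps=100, n_time=8, n_freq=6, rng=rng)
print(len(early), len(late))  # block of 4 patches early; a full band of 6 or 8 late
```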

4. Applications: Representation Learning, Tokenization, and World Modeling

Self-supervised Audio Representation Learning

A-JEPA approaches have been shown to yield state-of-the-art or competitive audio representations, suitable for both speech and non-speech domains.

Neural Tokenization and Compression

In Ioannides et al. (8 Dec 2025), A-JEPA embeddings are quantized using Finite Scalar Quantization (FSQ) and packed via mixed-radix encoding, producing highly compressed, tokenized representations (47.5 tokens/sec at a 2.5 Hz frame rate). HiFi-GAN decoders reconstruct the waveform from these tokens, with perceptual quality competitive with state-of-the-art neural codecs such as SoundStream and EnCodec at significantly lower token rates.
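
The quantize-then-pack idea can be sketched as follows. This is a simplified illustration, not the cited implementation. It assumes each latent coordinate is bounded with tanh and rounded to one of a fixed number of levels, and that a group of per-dimension indices is packed into a single integer using mixed-radix place values. How dimensions are grouped into tokens in the paper is not specified here, and all function names are hypothetical.

```python
import math


def fsq_indices(z, levels):
    """Bound each coordinate with tanh, then round to one of levels[i] bins."""
    out = []
    for value, n_levels in zip(z, levels):
        unit = (math.tanh(value) + 1.0) / 2.0      # squashed into (0, 1)
        out.append(round(unit * (n_levels - 1)))   # integer in {0, ..., n_levels - 1}
    return out


def pack_mixed_radix(indices, levels):
    """Combine per-dimension indices into one integer token."""
    token, base = 0, 1
    for idx, n_levels in zip(indices, levels):
        token += idx * base
        base *= n_levels
    return token


def unpack_mixed_radix(token, levels):
    """Invert pack_mixed_radix."""
    indices = []
    for n_levels in levels:
        token, idx = divmod(token, n_levels)
        indices.append(idx)
    return indices


levels = [4, 4, 4]                      # 4 levels per code dimension
idx = fsq_indices([-3.0, 0.2, 3.0], levels)
token = pack_mixed_radix(idx, levels)
print(idx, token)                       # [0, 2, 3] 56
print(unpack_mixed_radix(token, levels))  # [0, 2, 3]

# The reported rates imply the number of tokens emitted per latent frame.
print(47.5 / 2.5)                       # 19.0
```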

Value-guided World Modeling and Planning

Destrade et al. (28 Dec 2025) introduce a variant for control that shapes latent distances to reflect negative goal-values. Planning then amounts to optimizing an action sequence so that the predicted latent state minimizes the (quasi-)distance to the goal embedding. The induced latent geometry enables more effective gradient-based and sampling-based planning than unregularized JEPA, with the best success rates observed for quasi-metric embeddings.
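
A sampling-based planner of this kind can be sketched with random shooting in a toy 2-D latent space. Everything here is an assumption for illustration: the dynamics function `predict_next` is a stand-in for a learned predictor, the cost is plain Euclidean distance rather than a learned quasi-metric, and the hyperparameters are arbitrary.

```python
import math
import random


def predict_next(z, action):
    """Hypothetical latent dynamics: the action is a displacement in latent space."""
    return (z[0] + action[0], z[1] + action[1])


def rollout(z, actions):
    """Apply the predictor step by step and return the final latent state."""
    for a in actions:
        z = predict_next(z, a)
    return z


def plan_random_shooting(z0, goal, horizon=5, n_samples=256, max_step=0.5, seed=0):
    """Sample action sequences and keep the one whose final latent is closest to the goal."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_samples):
        actions = [(rng.uniform(-max_step, max_step), rng.uniform(-max_step, max_step))
                   for _ in range(horizon)]
        cost = math.dist(rollout(z0, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


z0, goal = (0.0, 0.0), (1.0, 1.0)
actions, cost = plan_random_shooting(z0, goal)
print(f"start distance {math.dist(z0, goal):.3f}, planned distance {cost:.3f}")
```

The planner is only as good as the cost it minimizes, which is why the source emphasizes shaping latent distances to track goal-reaching cost.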

5. Implementation and Training Details

  • Preprocessing: Audio is resampled as required (16 kHz or 32 kHz), transformed to Mel-spectrograms with typical band and frame configurations depending on stride/hop.
  • Context & masking: Uniform random masking ratios (e.g., 40–60% in Tuncay et al., 25 Jun 2025; 50% for time series in Ioannides et al., 8 Dec 2025); curriculum and band masking for structured context occlusion.
  • Optimization: AdamW optimizer with warmup-cosine schedules, large batch sizes (256–512), weight decay of 0.05, and EMA momentum for the target encoder.
  • Hyperparameters: Typical ViT context/target encoder depths: 12 layers; predictor: 6 layers or 2–16 for smaller heads. DAAM uses K=4 Gaussian mixtures; FSQ levels per code dimension typically set to 4.
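
The warmup-cosine schedule mentioned above can be sketched as a pure function of the training step. The linear warmup shape, the peak learning rate, and the step counts are assumptions for illustration, not values from the cited papers.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed settings: 1,000 total steps, 100 warmup steps, peak LR of 1e-3.
for step in (0, 50, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, total_steps=1000, warmup_steps=100, peak_lr=1e-3))
```

In a training loop, the returned value would be written to the optimizer's learning rate each step, alongside the EMA update of the target encoder.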

6. Empirical Findings, Limitations, and Prospects

  • Data and compute efficiency: A-JEPA achieves strong performance with orders of magnitude less training data and compute than prior audio SSL models (Tuncay et al., 25 Jun 2025).
  • Representational strengths and limitations: A-JEPA variants excel on music/environmental audio with kNN probes; for speech-centric tasks (e.g., speaker verification, keyword spotting), further domain-specific tuning or architectural modifications (e.g., attentive pooling, audio-specialized transformers) are beneficial.
  • Masking/content alignment: Fixed block masking may fail to align with linguistic events; adaptive or content-aware strategies could improve fine-grained latent structure (Ioannides et al., 8 Dec 2025).
  • Latent geometry for planning: Value-guided regularization (VF_quasi) improves planning success, particularly in local latent neighborhoods; global alignment in high-dimensional spaces remains challenging, and stochastic environments can bias value-shaped embeddings (Destrade et al., 28 Dec 2025).
  • Extensibility: Future work includes cross-modal JEPA (audio-text, audio-vision), hierarchical or multi-resolution latents, and semi-supervised or multilingual downstream adaptation.

7. Summary Table of Representative A-JEPA Variants

| Paper/Variant | Key Features | Benchmark/Highlight |
| --- | --- | --- |
| Audio-JEPA (Tuncay et al., 25 Jun 2025) | ViT backbone, random masking, X-ARES | Outperforms wav2vec2 on kNN |
| A-JEPA Can Listen (Fei et al., 2023) | Curriculum masking, reg. attention mask | +1.3 mAP over Audio-MAE |
| JEPA+DAAM as Tokenizer (Ioannides et al., 8 Dec 2025) | DAAM, FSQ, mixed-radix, HiFi-GAN decode | Efficient compression, quality |
| Value-guided A-JEPA (Destrade et al., 28 Dec 2025) | Goal-value regularization, planning | Success rate: up to 0.96 |

The A-JEPA paradigm unifies masked latent prediction, context-aware masking, and, in certain instantiations, value-guided latent space shaping to advance self-supervised audio and sequential world modeling. Empirical evidence demonstrates competitive representation learning, robust tokenization, and action planning in resource-efficient regimes (Tuncay et al., 25 Jun 2025, Fei et al., 2023, Ioannides et al., 8 Dec 2025, Destrade et al., 28 Dec 2025).
