A-JEPA: Self-Supervised Audio Embedding
- A-JEPA is a self-supervised audio framework that predicts embeddings of masked audio segments using context and target encoders.
- It employs Vision Transformer or Conformer backbones with curriculum masking strategies to improve representation learning and achieve competitive benchmark results.
- Value-guided variants integrate latent space regularization for effective tokenization and action planning, outperforming prior models on key audio tasks.
A-JEPA (Audio Joint-Embedding Predictive Architecture) is a family of architectures that adapt the Joint-Embedding Predictive Architecture (JEPA) paradigm, originally developed for vision models, to audio and sequential domains. These approaches predict masked content in a high-level latent feature space using context and target encoders, and use self-supervised pretraining for representation learning, tokenization, and, in certain variants, value-aligned world modeling for action planning.
1. Fundamental Principles and Core Architecture
A-JEPA is grounded in the JEPA principle of predicting embeddings of masked regions from visible context, operating in latent space rather than reconstructing inputs. The typical architectural backbone is a Vision Transformer (ViT) or Conformer, supporting masked patch- or segment-level prediction on time-frequency representations.
The canonical A-JEPA setup consists of:
- Context Encoder ($f_\theta$): Processes visible (unmasked) audio spectrogram or convolutional feature patches, outputting context embeddings.
- Target Encoder ($f_{\bar\theta}$): Shares its architecture with $f_\theta$, but its weights are updated by exponential moving average (EMA) of the context encoder's weights, producing stable target embeddings.
- Prediction Head ($g_\phi$): A lightweight Transformer-based head that receives context features (with explicit mask tokens for missing patches in some variants) and predicts the latent representation of masked patches in the same embedding space as the target encoder.
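The self-supervised objective is the mean squared error between predicted and target patch embeddings over the masked positions:

$$\mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \lVert \hat{z}_i - z_i \rVert_2^2$$

where $M$ is the set of masked patch indices and $z_i$, $\hat{z}_i$ are the target and predicted embeddings for patch $i$.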
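The objective and the EMA target update described above can be illustrated with a minimal sketch in plain Python. The function names, the toy embeddings, and the momentum value are assumptions made for illustration, and real implementations operate on tensors and treat the target embeddings as constants during backpropagation.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def masked_mse(pred: Dict[int, Vector], target: Dict[int, Vector],
               masked: Sequence[int]) -> float:
    """Average, over masked patches, of the squared L2 distance between
    the predicted and the target embedding of each patch."""
    total = 0.0
    for i in masked:
        total += sum((p - t) ** 2 for p, t in zip(pred[i], target[i]))
    return total / len(masked)


def ema_update(target_w: Vector, context_w: Vector, momentum: float) -> Vector:
    """Blend target-encoder weights with context-encoder weights.
    A momentum close to 1 makes the target encoder change slowly."""
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_w, context_w)]


# Toy example (hypothetical values): 4 patches, 2-dim embeddings,
# patches 1 and 3 are masked and must be predicted.
target = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
pred = {1: [1.0, 1.0], 3: [0.0, 1.0]}
loss = masked_mse(pred, target, masked=[1, 3])
new_target_w = ema_update([0.0, 2.0], [1.0, 0.0], momentum=0.5)
```

Whether the squared error is additionally averaged over the embedding dimension is a normalization choice that varies between implementations; it rescales the loss without changing its minimizer.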
Masking strategies range from fixed random blocks to adaptive curriculum schedules incorporating time- and frequency-aware masking, depending on the implementation (Fei et al., 2023, Tuncay et al., 25 Jun 2025).
2. Architectural Instantiations and Variants
Latent-predictive A-JEPA models (Tuncay et al., 25 Jun 2025; Fei et al., 2023) process input audio as Mel-spectrograms partitioned into non-overlapping patches (for example, one configuration yields 128 patches per spectrogram). Each patch is embedded via linear projection and positional encoding. Typical backbone configurations:
| Module | Depth | Embedding Dim | #Heads | MLP Ratio | Parameters |
|---|---|---|---|---|---|
| Context/Target ViT | 12 | 768 | 12 | 4.0 | 85.4M each |
| Predictor ViT | 6 | 384 | 12 | 4.0 | 11.3M |
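As a rough sanity check on the patch count and the parameter figures in the table, the sketch below counts non-overlapping patches and the weights of a standard pre-norm Transformer block. The 128 x 256 spectrogram and 16 x 16 patch sizes are assumptions chosen only because they reproduce the 128-patch count quoted above; the function names are likewise hypothetical.

```python
def num_patches(n_mels: int, n_frames: int, patch: int) -> int:
    """Number of non-overlapping patch x patch tiles in a spectrogram."""
    if n_mels % patch or n_frames % patch:
        raise ValueError("spectrogram size must be divisible by patch size")
    return (n_mels // patch) * (n_frames // patch)


def block_params(dim: int, mlp_ratio: float = 4.0) -> int:
    """Weights and biases in one Transformer block: self-attention
    (QKV and output projection), a two-layer MLP, and two LayerNorms."""
    hidden = int(dim * mlp_ratio)
    attention = 3 * dim * dim + 3 * dim + dim * dim + dim
    mlp = dim * hidden + hidden + hidden * dim + dim
    norms = 2 * 2 * dim
    return attention + mlp + norms


def stack_params(depth: int, dim: int, mlp_ratio: float = 4.0) -> int:
    """Parameters in `depth` identical blocks (embeddings excluded)."""
    return depth * block_params(dim, mlp_ratio)


patches = num_patches(n_mels=128, n_frames=256, patch=16)  # assumed sizes
encoder_blocks = stack_params(depth=12, dim=768)           # from the table
predictor_blocks = stack_params(depth=6, dim=384)          # from the table
```

Under these assumptions the encoder blocks alone account for about 85.1M parameters and the predictor blocks for about 10.6M. The gap to the tabulated 85.4M and 11.3M would plausibly come from patch embeddings, positional embeddings, final normalization, and input/output projections, which this estimate leaves out. The number of attention heads does not affect the count.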
A-JEPA with DAAM (Ioannides et al., 8 Dec 2025) incorporates a Density Adaptive Attention Mechanism after each Conformer block, using Gaussian mixture-based gating to focus on salient temporal frames. Masking is performed on temporally downsampled convolutional features, enabling low frame rate (2.5 Hz) masked prediction.
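To give a feel for Gaussian mixture-based gating, here is a heavily simplified sketch that is not the DAAM implementation from the cited paper. It assumes one scalar feature per frame, K Gaussian components whose centers are the sequence mean plus a learnable offset, and a gate formed by averaging the K unnormalized densities; the function names, `offsets`, and `scales` are all hypothetical.

```python
import math
from typing import List, Sequence


def gaussian_mixture_gate(frames: Sequence[float],
                          offsets: Sequence[float],
                          scales: Sequence[float],
                          eps: float = 1e-6) -> List[float]:
    """Return one gate value in [0, 1] per frame: the average of K Gaussian
    bumps centered at (sequence mean + offset_k), with width scale_k * std."""
    n = len(frames)
    mean = sum(frames) / n
    var = sum((x - mean) ** 2 for x in frames) / n
    gates = []
    for x in frames:
        bumps = [
            math.exp(-((x - (mean + off)) ** 2) / (2.0 * s * s * var + eps))
            for off, s in zip(offsets, scales)
        ]
        gates.append(sum(bumps) / len(bumps))
    return gates


def apply_gate(frames: Sequence[float], gates: Sequence[float]) -> List[float]:
    """Scale each frame feature by its gate."""
    return [x * g for x, g in zip(frames, gates)]


# K = 4 components, matching the mixture count quoted for DAAM;
# the offsets and scales here are made-up stand-ins for learned values.
frames = [1.0, 2.0, 3.0]
gates = gaussian_mixture_gate(frames,
                              offsets=[-1.0, -0.5, 0.5, 1.0],
                              scales=[1.0, 1.0, 1.0, 1.0])
gated = apply_gate(frames, gates)
```

Frames whose values fall near a component center receive gates close to 1, while frames far from every center are attenuated, which is the sense in which such a gate can emphasize some temporal frames over others.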
Action-planning A-JEPA (Destrade et al., 28 Dec 2025) extends JEPA to sequential decision problems by coupling the latent space geometry to a goal-conditioned value function. The standard JEPA predictor is augmented with a value-guided regularizer that encourages Euclidean or quasi-metric distances in latent space to approximate the negated goal-conditioned value, that is, the cost of reaching the goal. This enables planning via gradient-based or sampling-based optimizers over action sequences in latent space.
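A minimal sketch of such a regularizer follows, assuming a squared-error penalty between the latent distance and the negated value. The specific quasi-metric used here (a sum of one-sided coordinate gaps) is only a simple asymmetric example chosen for illustration, not the parametrization of the cited work, and all function names are hypothetical.

```python
import math
from typing import Callable, Sequence


def euclidean(z: Sequence[float], g: Sequence[float]) -> float:
    """Symmetric L2 distance between a state embedding and a goal embedding."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, g)))


def quasi_metric(z: Sequence[float], g: Sequence[float]) -> float:
    """Asymmetric distance: only coordinates where the goal exceeds the state
    contribute, so d(z, g) can differ from d(g, z)."""
    return sum(max(0.0, b - a) for a, b in zip(z, g))


def value_regularizer(z: Sequence[float], g: Sequence[float], value: float,
                      distance: Callable[..., float] = euclidean) -> float:
    """Squared gap between the latent distance and the negated goal value."""
    return (distance(z, g) - (-value)) ** 2


# Toy check: a state whose value is -5 (five cost units from the goal)
# incurs no penalty when its embedding lies at Euclidean distance 5.
state, goal = [0.0, 0.0], [3.0, 4.0]
penalty = value_regularizer(state, goal, value=-5.0)
forward = quasi_metric(state, goal)
backward = quasi_metric(goal, state)
```

The asymmetry of a quasi-metric is useful when reaching B from A is cheaper than the reverse, which a symmetric Euclidean distance cannot express.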
3. Masking and Curriculum Strategies
A-JEPA implementations employ diverse masking schedules to optimize semantic coverage and context reasoning:
- Random block masking: Masking rectangular regions at random (Fei et al., 2023), serving as an "easy" regime in early training.
- Time-frequency masking: Masking entire time- or frequency-bands, promoting robustness to occlusion and capturing locally correlated structure; introduced via curriculum to gradually increase difficulty.
- Regularized Masking (fine-tuning): During downstream training, an attention-based masking is performed in the self-attention layers, blocking a small fraction (e.g., 10%) of tokens but retaining their representations, forcing the model to use contextual information (Fei et al., 2023).
Curriculum masking schedules dynamically transition from block to band masking as model competence increases, formalized as an annealing probability over training steps.
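One way to picture such a schedule is sketched below, assuming a linear annealing of the band-masking probability and fixed mask sizes. The linear form, the sizes, and every function name are illustrative assumptions rather than the schedule used in the cited papers.

```python
import random
from typing import List

Mask = List[List[bool]]  # mask[f][t] is True where a patch is hidden


def band_probability(step: int, total_steps: int) -> float:
    """Chance of drawing a band mask, rising linearly from 0 to 1."""
    return min(1.0, max(0.0, step / total_steps))


def block_mask(n_freq: int, n_time: int, height: int, width: int,
               rng: random.Random) -> Mask:
    """Hide one random height x width rectangle of patches."""
    top = rng.randrange(n_freq - height + 1)
    left = rng.randrange(n_time - width + 1)
    return [[top <= f < top + height and left <= t < left + width
             for t in range(n_time)] for f in range(n_freq)]


def band_mask(n_freq: int, n_time: int, width: int, rng: random.Random) -> Mask:
    """Hide a full time band (all frequencies) or a full frequency band
    (all time steps), chosen with equal probability."""
    if rng.random() < 0.5:
        start = rng.randrange(n_time - width + 1)
        return [[start <= t < start + width for t in range(n_time)]
                for _ in range(n_freq)]
    start = rng.randrange(n_freq - width + 1)
    return [[start <= f < start + width for _ in range(n_time)]
            for f in range(n_freq)]


def curriculum_mask(step: int, total_steps: int, n_freq: int, n_time: int,
                    rng: random.Random) -> Mask:
    """Draw a block mask early in training and a band mask later on."""
    if rng.random() < band_probability(step, total_steps):
        return band_mask(n_freq, n_time, width=2, rng=rng)
    return block_mask(n_freq, n_time, height=2, width=2, rng=rng)


rng = random.Random(0)
early = curriculum_mask(step=0, total_steps=100, n_freq=8, n_time=16, rng=rng)
late = curriculum_mask(step=100, total_steps=100, n_freq=8, n_time=16, rng=rng)
```

At step 0 the sketch always returns a block mask and at the final step always a band mask; in between the two are mixed in proportion to the annealed probability.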
4. Applications: Representation Learning, Tokenization, and World Modeling
Self-supervised Audio Representation Learning
A-JEPA approaches have been shown to yield state-of-the-art or competitive audio representations, suitable for both speech and non-speech domains.
- Evaluation protocols: Linear probing and k-Nearest Neighbor (kNN) probes on frozen encoders across speech, music, and environmental sound classification tasks.
- Empirical outcomes: A-JEPA matches or surpasses baselines (e.g., wav2vec2, data2vec) with only one-fifth the training data and no hyperparameter tuning on the X-ARES benchmark suite (Tuncay et al., 25 Jun 2025).
- Notable results: On AudioSet-2M, A-JEPA (ViT-B) achieves mean average precision (mAP) of 48.6 versus Audio-MAE’s 47.3; on ESC-50, 96.3% accuracy versus 94.1% for Audio-MAE (Fei et al., 2023).
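The kNN probe mentioned in the evaluation protocol can be sketched as follows, assuming clip-level embeddings have already been extracted from a frozen encoder. The cosine similarity, majority vote, toy embeddings, and labels are illustrative assumptions and do not reproduce the X-ARES evaluation code.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(query: Sequence[float], bank: Sequence[Sequence[float]],
                labels: Sequence[str], k: int = 3) -> str:
    """Majority vote among the k bank embeddings most similar to the query."""
    ranked = sorted(range(len(bank)),
                    key=lambda i: cosine(query, bank[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(test_x: Sequence[Sequence[float]], test_y: Sequence[str],
                 bank: Sequence[Sequence[float]], labels: Sequence[str],
                 k: int = 3) -> float:
    """Fraction of test embeddings whose kNN prediction matches the label."""
    hits = sum(knn_predict(x, bank, labels, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)


# Hypothetical frozen-encoder embeddings for two classes.
bank: List[List[float]] = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                           [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
labels = ["speech"] * 3 + ["music"] * 3
accuracy = knn_accuracy([[1.0, 0.05], [0.05, 1.0]], ["speech", "music"],
                        bank, labels)
```

Because no parameters are trained, a kNN probe measures how well classes are already separated in the frozen embedding space, whereas a linear probe measures linear separability after fitting a classifier.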
Neural Tokenization and Compression
In Ioannides et al. (8 Dec 2025), A-JEPA embeddings are quantized using Finite Scalar Quantization (FSQ) and packed via mixed-radix encoding, producing highly compressed, tokenized representations (47.5 tokens/sec at a 2.5 Hz frame rate). HiFi-GAN decoders reconstruct the waveform from these tokens, with perceptual quality competitive with state-of-the-art neural codecs such as SoundStream and EnCodec at significantly lower token rates.
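The quantize-then-pack idea can be sketched in a few lines, assuming a simplified FSQ that bounds each latent dimension with tanh and rounds it to one of a fixed number of levels, followed by mixed-radix packing of the per-dimension codes into one integer. This is an illustrative approximation (training-time details such as straight-through gradients are omitted) and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> List[int]:
    """Map each latent value to an integer code in [0, levels[d] - 1]
    by squashing with tanh and rounding to the nearest level."""
    codes = []
    for value, n_levels in zip(z, levels):
        unit = (math.tanh(value) + 1.0) / 2.0  # squashed into [0, 1]
        codes.append(min(n_levels - 1, round(unit * (n_levels - 1))))
    return codes


def mixed_radix_pack(codes: Sequence[int], levels: Sequence[int]) -> int:
    """Combine per-dimension codes into a single integer token."""
    token = 0
    for code, n_levels in zip(codes, levels):
        token = token * n_levels + code
    return token


def mixed_radix_unpack(token: int, levels: Sequence[int]) -> List[int]:
    """Invert mixed_radix_pack, recovering the per-dimension codes."""
    codes = []
    for n_levels in reversed(levels):
        token, code = divmod(token, n_levels)
        codes.append(code)
    return codes[::-1]


levels = [4, 4, 4, 4]  # 4 levels per dimension, as quoted for FSQ
codes = fsq_quantize([-3.0, -0.2, 0.2, 3.0], levels)
token = mixed_radix_pack(codes, levels)
restored = mixed_radix_unpack(token, levels)

# Rate arithmetic from the quoted figures: tokens per latent frame.
tokens_per_frame = 47.5 / 2.5
```

With four dimensions of four levels each, one packed token distinguishes 4^4 = 256 values. The quoted rates imply 19 tokens per latent frame, although how those tokens are grouped in the cited system is not specified here.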
Value-guided World Modeling and Planning
Destrade et al. (28 Dec 2025) introduce a variant for control that shapes latent distances to reflect negative goal values. Planning involves optimizing action sequences to minimize the (quasi-)distance in latent space to the goal embedding. The induced JEPA latent geometry enables more effective gradient-based and sampling-based planning than unregularized JEPA, with the best success rates observed for quasi-metric embeddings.
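Sampling-based planning of this kind can be illustrated with a random-shooting sketch: sample candidate action sequences, roll each one forward through a latent dynamics model, and keep the sequence whose final latent is closest to the goal embedding. The additive toy dynamics, the 2-D latent space, and all names below are assumptions standing in for a learned predictor.

```python
import random
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]


def step(z: Vector, action: Vector) -> Vector:
    """Toy latent dynamics (an assumption): the action shifts the state."""
    return (z[0] + action[0], z[1] + action[1])


def rollout(z: Vector, actions: Sequence[Vector]) -> Vector:
    """Apply a sequence of actions and return the final latent state."""
    for a in actions:
        z = step(z, a)
    return z


def distance(z: Vector, goal: Vector) -> float:
    """Euclidean latent distance, used here as the planning cost."""
    return ((z[0] - goal[0]) ** 2 + (z[1] - goal[1]) ** 2) ** 0.5


def plan_random_shooting(z0: Vector, goal: Vector, horizon: int = 4,
                         n_candidates: int = 256,
                         seed: int = 0) -> Tuple[List[Vector], float]:
    """Return the sampled action sequence with the lowest final distance."""
    rng = random.Random(seed)
    best_actions: List[Vector] = []
    best_cost = float("inf")
    for _ in range(n_candidates):
        actions = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
                   for _ in range(horizon)]
        cost = distance(rollout(z0, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


start, goal = (0.0, 0.0), (2.0, 1.0)
best_actions, best_cost = plan_random_shooting(start, goal)
```

A gradient-based planner would instead differentiate the final distance with respect to the actions. In both cases the planner is only as good as the distance it minimizes, which is why shaping that distance with a value function matters.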
5. Implementation and Training Details
- Preprocessing: Audio is resampled as required (16 kHz or 32 kHz), transformed to Mel-spectrograms with typical band and frame configurations depending on stride/hop.
- Context & masking: Uniform random masking ratios (e.g., 40–60% in (Tuncay et al., 25 Jun 2025), 50% for time series in (Ioannides et al., 8 Dec 2025)); curriculum and band masking for structured context occlusion.
- Optimization: AdamW optimizer with warmup-cosine schedules, large batch sizes (256–512), weight decay of 0.05, and EMA momentum for the target encoder.
- Hyperparameters: Typical ViT context/target encoder depths: 12 layers; predictor: 6 layers or 2–16 for smaller heads. DAAM uses K=4 Gaussian mixtures; FSQ levels per code dimension typically set to 4.
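The warmup-cosine schedule named in the optimization settings can be written as a small function. The linear warmup form, the peak learning rate, and the step counts below are assumptions for illustration; only the weight decay and batch size range are taken from the list above.

```python
import math


def warmup_cosine_lr(step: int, total_steps: int, warmup_steps: int,
                     peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Settings quoted in the text, plus hypothetical schedule values.
config = {
    "optimizer": "AdamW",
    "weight_decay": 0.05,       # from the text
    "batch_size": 256,          # text gives a 256-512 range
    "peak_lr": 1e-3,            # assumption
    "warmup_steps": 10,         # assumption
    "total_steps": 100,         # assumption
}

schedule = [
    warmup_cosine_lr(s, config["total_steps"], config["warmup_steps"],
                     config["peak_lr"])
    for s in range(config["total_steps"] + 1)
]
```

The learning rate rises during the first `warmup_steps` steps, peaks, and then decays smoothly to `min_lr` at the final step.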
6. Empirical Findings, Limitations, and Prospects
- Data and compute efficiency: A-JEPA achieves strong performance with substantially less training data (about one-fifth in the X-ARES comparison) and less compute than prior audio SSL models (Tuncay et al., 25 Jun 2025).
- Representational strengths and limitations: A-JEPA variants excel on music/environmental audio with kNN probes; for speech-centric tasks (e.g., speaker verification, keyword spotting), further domain-specific tuning or architectural modifications (e.g., attentive pooling, audio-specialized transformers) are beneficial.
- Masking/content alignment: Fixed block masking may fail to align with linguistic events; adaptive or content-aware strategies could improve fine-grained latent structure (Ioannides et al., 8 Dec 2025).
- Latent geometry for planning: Value-guided regularization (VF_quasi) improves planning success, particularly in local latent neighborhoods; global alignment in high-dimensional spaces remains challenging, and stochastic environments can bias value-shaped embeddings (Destrade et al., 28 Dec 2025).
- Extensibility: Future work includes cross-modal JEPA (audio-text, audio-vision), hierarchical or multi-resolution latents, and semi-supervised or multilingual downstream adaptation.
7. Summary Table of Representative A-JEPA Variants
| Paper/Variant | Key Features | Benchmark/Highlight |
|---|---|---|
| Audio-JEPA (Tuncay et al., 25 Jun 2025) | ViT backbone, random masking, X-ARES | Outperforms wav2vec2 on kNN |
| A-JEPA Can Listen (Fei et al., 2023) | Curriculum masking, reg. attention mask | +1.3 mAP over Audio-MAE |
| JEPA+DAAM as Tokenizer (Ioannides et al., 8 Dec 2025) | DAAM, FSQ, mixed-radix, HiFi-GAN decode | Efficient compression, quality |
| Value-guided A-JEPA (Destrade et al., 28 Dec 2025) | Goal-value regularization, planning | Success rate: up to 0.96 |
The A-JEPA paradigm unifies masked latent prediction, context-aware masking, and, in certain instantiations, value-guided latent space shaping to advance self-supervised audio and sequential world modeling. Empirical evidence demonstrates competitive representation learning, robust tokenization, and action planning in resource-efficient regimes (Tuncay et al., 25 Jun 2025, Fei et al., 2023, Ioannides et al., 8 Dec 2025, Destrade et al., 28 Dec 2025).