Vision NEPA: Next-Embedding Predictive Learning
- Vision NEPA is a self-supervised visual learning framework that reformulates representation learning as autoregressive next-patch embedding prediction.
- It employs a causal Vision Transformer backbone with stop-gradient stabilization to prevent collapse and ensure efficient training.
- Empirical results show state-of-the-art performance on ImageNet-1K classification (up to 85.3% accuracy) and ADE20K segmentation benchmarks.
Vision NEPA (Next-Embedding Predictive Autoregression) is a self-supervised learning framework for visual representation learning that casts the core objective as autoregressive next-latent-patch prediction. Inspired by generative pretraining paradigms in natural language processing, NEPA eschews pixel reconstruction, discrete tokenization, and contrastive losses, focusing instead on predicting future patch embeddings from prior ones within an image, using causal masking and a stop-gradient mechanism to stabilize training and prevent representational collapse. Vision NEPA achieves state-of-the-art transfer across ImageNet-1K classification and ADE20K segmentation benchmarks, demonstrating scalable performance and modality-agnostic potential (Xu et al., 18 Dec 2025).
1. Next-Embedding Predictive Objective
The fundamental learning paradigm in Vision NEPA reformulates self-supervised visual representation learning as autoregressive embedding prediction. An input image $x \in \mathbb{R}^{H \times W \times C}$ is tokenized into $N$ non-overlapping patches, where $N = HW/P^2$ for patch size $P$. Each patch $x_i$ is mapped by a learned embedding layer $f_\theta$ to a $d$-dimensional vector $e_i = f_\theta(x_i)$. The resulting token for patch $i$ is $z_i = e_i + p_i$, where $p_i$ is a positional encoding (absolute or RoPE).
A causal Transformer predictor $g_\phi$ autoregressively produces a hypothesis for the next patch embedding,

$$\hat{e}_{i+1} = g_\phi(z_1, \ldots, z_i).$$

The core self-supervised loss is the mean negative cosine similarity between the predicted embedding $\hat{e}_{i+1}$ and the stop-gradient version of the ground truth $e_{i+1}$:

$$\mathcal{L} = -\frac{1}{N-1} \sum_{i=1}^{N-1} \frac{\hat{e}_{i+1} \cdot \mathrm{sg}(e_{i+1})}{\lVert \hat{e}_{i+1} \rVert \, \lVert e_{i+1} \rVert},$$

where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator. This next-embedding prediction objective avoids collapse and removes the need for auxiliary targets such as pixels, discrete codes, or negative pairs. The stop-gradient operation ensures targets are fixed with respect to encoder parameters, which is crucial for stable dynamics.
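A minimal PyTorch sketch of this objective under simplified assumptions (the names `patch_embed` and `predictor` are placeholders, and positional encodings are assumed to be added inside the predictor); `.detach()` plays the role of $\mathrm{sg}(\cdot)$:

```python
import torch.nn.functional as F

def nepa_loss(patches, patch_embed, predictor):
    """patches: (B, N, P*P*C) flattened patches; patch_embed and predictor
    are placeholder modules (embedding layer and causal Transformer)."""
    e = patch_embed(patches)            # (B, N, d) patch embeddings e_1..e_N
    pred = predictor(e)                 # (B, N, d); causal masking happens inside
    pred_next = pred[:, :-1]            # hypotheses for patches 2..N
    target = e[:, 1:].detach()          # stop-gradient targets sg(e_2..e_N)
    # mean negative cosine similarity between predictions and fixed targets
    return -F.cosine_similarity(pred_next, target, dim=-1).mean()
```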
2. Architecture and Pretraining Pipeline
Patch Embedding and Positional Encoding
A single Conv2d layer projects image patches to embedding space. Either learned absolute position embeddings or rotary position encodings (RoPE) are used, with RoPE also providing relative positional bias in attention layers.
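A minimal sketch of this patchification step, assuming a ViT-B-like embedding width of 768; the position encodings (absolute or RoPE) would be applied on top of the resulting token sequence:

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Single Conv2d patchifier: kernel size = stride = patch size (14 for /14 models)."""
    def __init__(self, patch_size=14, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim) patch-token sequence
```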
Vision Transformer Backbone
NEPA employs a pre-norm Vision Transformer (ViT) backbone, with two backbone variants:
- ViT-B/14 (Base)
- ViT-L/14 (Large)
Key architectural modifications include (a combined sketch follows the list):
- SwiGLU activation replacing GeLU in MLP blocks
- LayerScale (init=1e-5) on residuals for improved optimization
- QK-Norm (parameter-free LayerNorm for queries/keys) per attention layer
- Causal masking during pretraining: token $i$ attends only to indices $j \le i$
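The sketch below combines these modifications in a single pre-norm block (hypothetical module names; the MLP width is left as a parameter and RoPE is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SiLU-gated MLP used in place of the standard GeLU MLP."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden)
        self.w_up = nn.Linear(dim, hidden)
        self.w_down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class CausalBlock(nn.Module):
    """Pre-norm ViT block with QK-Norm, causal attention, and LayerScale residuals."""
    def __init__(self, dim, heads, mlp_hidden, layerscale_init=1e-5):
        super().__init__()
        self.heads = heads
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # parameter-free LayerNorm on per-head queries/keys (QK-Norm)
        self.q_norm = nn.LayerNorm(dim // heads, elementwise_affine=False)
        self.k_norm = nn.LayerNorm(dim // heads, elementwise_affine=False)
        self.mlp = SwiGLU(dim, mlp_hidden)
        self.ls1 = nn.Parameter(layerscale_init * torch.ones(dim))  # LayerScale
        self.ls2 = nn.Parameter(layerscale_init * torch.ones(dim))

    def forward(self, x):                                  # x: (B, N, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q = self.q_norm(q.view(B, N, self.heads, -1)).transpose(1, 2)
        k = self.k_norm(k.view(B, N, self.heads, -1)).transpose(1, 2)
        v = v.view(B, N, self.heads, -1).transpose(1, 2)
        # is_causal=True: token i attends only to indices j <= i
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, N, D)
        x = x + self.ls1 * self.proj(out)
        x = x + self.ls2 * self.mlp(self.norm2(x))
        return x
```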
Prediction and Fine-Tuning
No auxiliary prediction head is required; a linear projection of the final Transformer activations produces $\hat{e}_{i+1}$. For downstream tasks:
- Classification: the backbone's final-layer representation is passed through a linear classifier and softmax (see the sketch below)
- Segmentation: a UPerNet head is attached, using features at multiple scales, with bidirectional attention enabled
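As a rough sketch of the classification path (the average-pooling choice is an assumption, not a detail given above):

```python
import torch.nn as nn

class LinearClassifier(nn.Module):
    """Hypothetical fine-tuning head: average-pool the final-layer patch
    features and apply a single linear layer (softmax lives in the loss)."""
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, features):               # features: (B, N, dim) from the backbone
        return self.fc(self.norm(features.mean(dim=1)))
```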
3. Stop-Gradient Dynamics and Ablations
The use of the stop-gradient operator in the loss is central to preventing trivial collapse. By writing the loss target as $\mathrm{sg}(e_{i+1})$, gradients are blocked from propagating back into the patch embedding layer $f_\theta$ via the target branch, so the predictor learns to regress onto a fixed target.
Ablation studies confirm that removing stop-gradient leads to collapse (the loss saturates at its trivial optimum and downstream accuracy does not recover); including it enables stable training and successful fine-tuning, with ViT-B achieving 76.8% top-1 classification accuracy after only 50k pretraining steps. The causal shift (predicting $e_{i+1}$ rather than an identity mapping onto the current embedding $e_i$) and the retention of strict causal masking are both necessary; for example, omitting the causal mask drops accuracy to 73.6% (Xu et al., 18 Dec 2025).
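A toy helper makes the two ablation axes concrete (illustrative only; `pred` and `emb` are placeholder tensors of shape (batch, N, d)):

```python
import torch.nn.functional as F

def ablation_loss(pred, emb, causal_shift=True, stop_grad=True):
    """causal_shift=False regresses each position onto its own embedding
    (the identity mapping); stop_grad=False lets gradients flow into the
    targets, the setting reported above as collapsing."""
    if causal_shift:
        pred, target = pred[:, :-1], emb[:, 1:]   # predict e_{i+1} from the prefix
    else:
        target = emb                              # identity target e_i
    if stop_grad:
        target = target.detach()                  # sg(.): block gradients to targets
    return -F.cosine_similarity(pred, target, dim=-1).mean()
```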
4. Training Protocol and Data Handling
Pretraining Setup
- Dataset: ImageNet-1K (unlabeled)
- Batch Size: 4,096
- Optimizer: AdamW ($\beta_1$ = 0.9, $\beta_2$ = 0.95, weight decay = 0.05)
- Learning Rate: scaled linearly by batch size/256, with a cosine schedule and 40-epoch linear warmup (a sketch of this recipe follows the list)
- Data Augmentation: RandomResizedCrop only; no multi-view or negative pairs
- Training Duration: ViT-B (1,600 epochs, ~3 days on 8×H100); ViT-L (800 epochs, ~5 days)
- EMA: Exponential Moving Average of weights (decay=0.9999) used for evaluation
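A hedged sketch of this recipe (placeholder names; the base learning rate and the per-epoch schedule granularity are assumptions):

```python
import math
import torch

def build_optimizer(model, base_lr, batch_size):
    """AdamW with the reported betas and weight decay; the learning rate
    follows the linear scaling rule (base_lr is a placeholder)."""
    lr = base_lr * batch_size / 256
    return torch.optim.AdamW(model.parameters(), lr=lr,
                             betas=(0.9, 0.95), weight_decay=0.05)

def lr_at_epoch(epoch, peak_lr, warmup=40, total=1600):
    """Linear warmup for `warmup` epochs, then cosine decay toward zero."""
    if epoch < warmup:
        return peak_lr * (epoch + 1) / warmup
    t = (epoch - warmup) / max(1, total - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))

@torch.no_grad()
def ema_update(ema_model, model, decay=0.9999):
    """Exponential moving average of the weights, used for evaluation."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```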
Fine-Tuning Regimes
- Classification: Batch size 1,024, AdamW, learning rate 1×10⁻³ (scaled), cosine schedule, 5-epoch warmup, 100 fine-tuning epochs (ViT-B), 50 (ViT-L)
- Augmentations: RandAugment, MixUp (0.8), CutMix (1.0), label smoothing (0.1), DropPath (0.1/0.2)
- Layer-wise learning rate decay (ViT-B: 0.35→1.0; ViT-L: 0.60→1.0); see the sketch after this list
- Segmentation (ADE20K): 512×512 crops, batch 16, UPerNet, bidirectional attention, 160k iterations
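Returning to the layer-wise learning-rate decay above, a sketch consistent with the reported ranges; matching parameter names against a `blocks.<i>.` prefix is an assumption about the module layout:

```python
import torch

def layerwise_lr_optimizer(model, base_lr, num_layers=12, min_scale=0.35,
                           weight_decay=0.05):
    """Build AdamW parameter groups whose LR ramps geometrically from
    min_scale * base_lr (earliest block) up to base_lr (last block)."""
    def scale_for(name):
        for i in range(num_layers):
            if f"blocks.{i}." in name:
                return min_scale ** ((num_layers - 1 - i) / (num_layers - 1))
        # non-block parameters: classifier head at full LR, embeddings at the lowest
        return 1.0 if "head" in name else min_scale
    groups = [{"params": [p], "lr": base_lr * scale_for(n), "weight_decay": weight_decay}
              for n, p in model.named_parameters() if p.requires_grad]
    return torch.optim.AdamW(groups)
```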
5. Comparison with Alternative Self-Supervised Paradigms
Vision NEPA differs fundamentally from other major self-supervised strategies:
| Method | Core Mechanism | Auxiliary Heads/Targets | Negatives/Contrastive | Masking |
|---|---|---|---|---|
| NEPA | Predict next latent embedding | None | No | No* |
| MAE | Pixel reconstruction | Transformer decoder | No | Yes (75%) |
| BEiT/Discrete | Autoregressive discrete codes | Separate codebook, predictor | No | Optional |
| SimCLR/MoCo | Contrastive learning | Siamese encoders | Yes | No |
| JEPA | Predict across augmented views | Two encoders + head | No | Optional |
* Random masking can be ablated but is not required; NEPA works best without it.
NEPA uses a single-stream causal predictor (no momentum encoder, negative sampling, or codebook), predicting in continuous latent space, and does not require masking or pixel-wise losses as in MAE (Xu et al., 18 Dec 2025).
6. Empirical Performance and Ablation Outcomes
Classification (ImageNet-1K, Top-1 Accuracy)
- ViT-B/14 (NEPA-B): 83.8%
- ViT-L/14 (NEPA-L): 85.3%
Comparative scores for established self-supervised methods (all on ViT-B): MoCo v3-B (83.2%), BEiT-B (83.4%), DINO-B (83.6%), MAE-B (83.6%).
Semantic Segmentation (ADE20K, mIoU)
- NEPA-B: 48.3%
- NEPA-L: 54.0%
For reference, ViT-B with MoCo v3 achieves 47.3%, BEiT+DALLE reaches 47.1%, and MAE-B attains 48.1% mIoU.
Architectural Ablations
- Causal masking, the shift to predicting the next embedding $e_{i+1}$, and stop-gradient are all critical to convergence and accuracy.
- LayerScale, RoPE, QK-Norm, and SwiGLU each yield small, additive gains; their combination produces ~81.3% top-1 (with limited pretraining).
- Training the patch embedding layer outperforms freezing it.
Longer pretraining consistently improves validation accuracy, with no observed overfitting under threefold increased compute.
7. Scalability and Extensibility Across Modalities
NEPA scales favorably: larger backbones (e.g., ViT-L) yield consistently higher downstream accuracy, and increasing pretraining epochs continues to improve results without overfitting.
The autoregressive next-embedding prediction objective mirrors next-token language modeling (e.g., GPT), with input/output embedding tying in LLMs functioning as a formal analogue. Because the NEPA paradigm operates on continuous embeddings, its core concept is extensible to audio, video, and multi-modal autoregressive learning with minimal adaptation.
Potential generative extensions include appending pixel-decoder modules or diffusion models, presenting a unified approach to both representation learning and conditional generation under a single autoregressive modeling framework (Xu et al., 18 Dec 2025).