Vision NEPA: Next-Embedding Predictive Learning
- Vision NEPA is a self-supervised visual learning framework that reformulates representation learning as autoregressive next-patch embedding prediction.
- It employs a causal Vision Transformer backbone with stop-gradient stabilization to prevent collapse and ensure efficient training.
- Empirical results show state-of-the-art performance on ImageNet-1K classification (up to 85.3% accuracy) and ADE20K segmentation benchmarks.
Vision NEPA (Next-Embedding Predictive Autoregression) is a self-supervised learning framework for visual representation learning that casts the core objective as autoregressive next-latent-patch prediction. Inspired by generative pretraining paradigms in natural language processing, NEPA eschews pixel reconstruction, discrete tokenization, and contrastive losses, focusing instead on predicting future patch embeddings from prior ones within an image, using causal masking and a stop-gradient mechanism to stabilize training and prevent representational collapse. Vision NEPA achieves state-of-the-art transfer across ImageNet-1K classification and ADE20K segmentation benchmarks, demonstrating scalable performance and modality-agnostic potential (Xu et al., 18 Dec 2025).
1. Next-Embedding Predictive Objective
The fundamental learning paradigm in Vision NEPA reformulates self-supervised visual representation learning as autoregressive embedding prediction. An input image $x \in \mathbb{R}^{H \times W \times C}$ is tokenized into $N$ non-overlapping patches, where $N = HW/P^2$ for patch size $P$. Each patch $x_i$ is mapped by a learned embedding layer $f_\theta$ to a $d$-dimensional vector $e_i = f_\theta(x_i)$. The resulting token for patch $i$ is $z_i = e_i + p_i$, where $p_i$ is a positional encoding (absolute or RoPE).
A causal Transformer predictor $g_\phi$ autoregressively produces a hypothesis for the next patch embedding,

$$\hat{e}_{i+1} = g_\phi(z_1, \ldots, z_i).$$

The core self-supervised loss is the mean negative cosine similarity between the predicted embedding $\hat{e}_{i+1}$ and the stop-gradient version of the ground truth $e_{i+1}$:

$$\mathcal{L} = -\frac{1}{N-1} \sum_{i=1}^{N-1} \frac{\hat{e}_{i+1} \cdot \mathrm{sg}(e_{i+1})}{\lVert \hat{e}_{i+1} \rVert \, \lVert e_{i+1} \rVert},$$

where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator. This next-embedding prediction objective avoids collapse and removes the need for auxiliary targets such as pixels, discrete codes, or negative pairs. The stop-gradient operation ensures targets are fixed with respect to encoder parameters, which is crucial for stable dynamics.
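A minimal PyTorch sketch of this objective under simplified assumptions (the names `patch_embed` and `predictor` are placeholders, and positional encodings are assumed to be added inside the predictor); `.detach()` plays the role of $\mathrm{sg}(\cdot)$:

```python
import torch.nn.functional as F

def nepa_loss(patches, patch_embed, predictor):
    """patches: (B, N, P*P*C) flattened patches; patch_embed and predictor
    are placeholder modules (embedding layer and causal Transformer)."""
    e = patch_embed(patches)            # (B, N, d) patch embeddings e_1..e_N
    pred = predictor(e)                 # (B, N, d); causal masking happens inside
    pred_next = pred[:, :-1]            # hypotheses for patches 2..N
    target = e[:, 1:].detach()          # stop-gradient targets sg(e_2..e_N)
    # mean negative cosine similarity between predictions and fixed targets
    return -F.cosine_similarity(pred_next, target, dim=-1).mean()
```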
2. Architecture and Pretraining Pipeline
Patch Embedding and Positional Encoding
A single Conv2d layer projects image patches to embedding space. Either learned absolute position embeddings or rotary position encodings (RoPE) are used, with RoPE also providing relative positional bias in attention layers.
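A minimal sketch of this patchification step, assuming a ViT-B-like embedding width of 768; the position encodings (absolute or RoPE) would be applied on top of the resulting token sequence:

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Single Conv2d patchifier: kernel size = stride = patch size (14 for /14 models)."""
    def __init__(self, patch_size=14, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim) patch-token sequence
```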
Vision Transformer Backbone
NEPA employs a pre-norm Vision Transformer (ViT) backbone, with two backbone variants:
- ViT-B/14 (Base)
- ViT-L/14 (Large)
Key architectural modifications include (a combined sketch follows the list):
- SwiGLU activation replacing GeLU in MLP blocks
- LayerScale (init=1e-5) on residuals for improved optimization
- QK-Norm (parameter-free LayerNorm for queries/keys) per attention layer
- Causal masking during pretraining: token $i$ attends only to indices $j \le i$
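The sketch below combines these modifications in a single pre-norm block (hypothetical module names; the MLP width is left as a parameter and RoPE is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SiLU-gated MLP used in place of the standard GeLU MLP."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden)
        self.w_up = nn.Linear(dim, hidden)
        self.w_down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class CausalBlock(nn.Module):
    """Pre-norm ViT block with QK-Norm, causal attention, and LayerScale residuals."""
    def __init__(self, dim, heads, mlp_hidden, layerscale_init=1e-5):
        super().__init__()
        self.heads = heads
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # parameter-free LayerNorm on per-head queries/keys (QK-Norm)
        self.q_norm = nn.LayerNorm(dim // heads, elementwise_affine=False)
        self.k_norm = nn.LayerNorm(dim // heads, elementwise_affine=False)
        self.mlp = SwiGLU(dim, mlp_hidden)
        self.ls1 = nn.Parameter(layerscale_init * torch.ones(dim))  # LayerScale
        self.ls2 = nn.Parameter(layerscale_init * torch.ones(dim))

    def forward(self, x):                                  # x: (B, N, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q = self.q_norm(q.view(B, N, self.heads, -1)).transpose(1, 2)
        k = self.k_norm(k.view(B, N, self.heads, -1)).transpose(1, 2)
        v = v.view(B, N, self.heads, -1).transpose(1, 2)
        # is_causal=True: token i attends only to indices j <= i
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, N, D)
        x = x + self.ls1 * self.proj(out)
        x = x + self.ls2 * self.mlp(self.norm2(x))
        return x
```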
Prediction and Fine-Tuning
No auxiliary prediction head is required; a linear projection of the final Transformer activations produces $\hat{e}_{i+1}$. For downstream tasks:
- Classification: the backbone's final-layer representation is passed through a linear classifier and softmax (see the sketch below)
- Segmentation: a UPerNet head is attached, using features at multiple scales, with bidirectional attention enabled
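As a rough sketch of the classification path (the average-pooling choice is an assumption, not a detail given above):

```python
import torch.nn as nn

class LinearClassifier(nn.Module):
    """Hypothetical fine-tuning head: average-pool the final-layer patch
    features and apply a single linear layer (softmax lives in the loss)."""
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, features):               # features: (B, N, dim) from the backbone
        return self.fc(self.norm(features.mean(dim=1)))
```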
3. Stop-Gradient Dynamics and Ablations
The use of the stop-gradient operator in the loss is central to preventing trivial collapse. By writing the loss target as $\mathrm{sg}(e_{i+1})$, gradients are blocked from propagating back into the patch embedding layer $f_\theta$ via the target branch, so the predictor learns to regress onto a fixed target.
Ablation studies confirm that removing stop-gradient leads to collapse (the loss saturates at its trivial optimum and downstream accuracy does not recover); including it enables stable training and successful fine-tuning, with ViT-B achieving 76.8% top-1 classification accuracy after only 50k pretraining steps. The causal shift (predicting $e_{i+1}$ rather than an identity mapping onto the current embedding $e_i$) and the retention of strict causal masking are both necessary; for example, omitting the causal mask drops accuracy to 73.6% (Xu et al., 18 Dec 2025).
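A toy helper makes the two ablation axes concrete (illustrative only; `pred` and `emb` are placeholder tensors of shape (batch, N, d)):

```python
import torch.nn.functional as F

def ablation_loss(pred, emb, causal_shift=True, stop_grad=True):
    """causal_shift=False regresses each position onto its own embedding
    (the identity mapping); stop_grad=False lets gradients flow into the
    targets, the setting reported above as collapsing."""
    if causal_shift:
        pred, target = pred[:, :-1], emb[:, 1:]   # predict e_{i+1} from the prefix
    else:
        target = emb                              # identity target e_i
    if stop_grad:
        target = target.detach()                  # sg(.): block gradients to targets
    return -F.cosine_similarity(pred, target, dim=-1).mean()
```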
4. Training Protocol and Data Handling
Pretraining Setup
- Dataset: ImageNet-1K (unlabeled)
- Batch Size: 4,096
- Optimizer: AdamW ($\beta_1$ = 0.9, $\beta_2$ = 0.95, weight decay = 0.05)
- Learning Rate: scaled linearly by batch size/256, with a cosine schedule and 40-epoch linear warmup (a sketch of this recipe follows the list)
- Data Augmentation: RandomResizedCrop only; no multi-view or negative pairs
- Training Duration: ViT-B (1,600 epochs, ~3 days on 8×H100); ViT-L (800 epochs, ~5 days)
- EMA: Exponential Moving Average of weights (decay=0.9999) used for evaluation
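A hedged sketch of this recipe (placeholder names; the base learning rate and the per-epoch schedule granularity are assumptions):

```python
import math
import torch

def build_optimizer(model, base_lr, batch_size):
    """AdamW with the reported betas and weight decay; the learning rate
    follows the linear scaling rule (base_lr is a placeholder)."""
    lr = base_lr * batch_size / 256
    return torch.optim.AdamW(model.parameters(), lr=lr,
                             betas=(0.9, 0.95), weight_decay=0.05)

def lr_at_epoch(epoch, peak_lr, warmup=40, total=1600):
    """Linear warmup for `warmup` epochs, then cosine decay toward zero."""
    if epoch < warmup:
        return peak_lr * (epoch + 1) / warmup
    t = (epoch - warmup) / max(1, total - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))

@torch.no_grad()
def ema_update(ema_model, model, decay=0.9999):
    """Exponential moving average of the weights, used for evaluation."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```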
Fine-Tuning Regimes
- Classification: Batch size 1,024, AdamW, learning rate 1×10⁻³ (scaled), cosine schedule, 5-epoch warmup, 100 fine-tuning epochs (ViT-B), 50 (ViT-L)
- Augmentations: RandAugment, MixUp (0.8), CutMix (1.0), label smoothing (0.1), DropPath (0.1/0.2)
- Layer-wise learning rate decay (ViT-B: 0.35→1.0; ViT-L: 0.60→1.0); see the sketch after this list
- Segmentation (ADE20K): 512×512 crops, batch 16, UPerNet, bidirectional attention, 160k iterations
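Returning to the layer-wise learning-rate decay above, a sketch consistent with the reported ranges; matching parameter names against a `blocks.<i>.` prefix is an assumption about the module layout:

```python
import torch

def layerwise_lr_optimizer(model, base_lr, num_layers=12, min_scale=0.35,
                           weight_decay=0.05):
    """Build AdamW parameter groups whose LR ramps geometrically from
    min_scale * base_lr (earliest block) up to base_lr (last block)."""
    def scale_for(name):
        for i in range(num_layers):
            if f"blocks.{i}." in name:
                return min_scale ** ((num_layers - 1 - i) / (num_layers - 1))
        # non-block parameters: classifier head at full LR, embeddings at the lowest
        return 1.0 if "head" in name else min_scale
    groups = [{"params": [p], "lr": base_lr * scale_for(n), "weight_decay": weight_decay}
              for n, p in model.named_parameters() if p.requires_grad]
    return torch.optim.AdamW(groups)
```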
5. Comparison with Alternative Self-Supervised Paradigms
Vision NEPA differs fundamentally from other major self-supervised strategies:
| Method | Core Mechanism | Auxiliary Heads/Targets | Negatives/Contrastive | Masking |
|---|---|---|---|---|
| NEPA | Predict next latent embedding | None | No | No* |
| MAE | Pixel reconstruction | Transformer decoder | No | Yes (75%) |
| BEiT/Discrete | Autoregressive discrete codes | Separate codebook, predictor | No | Optional |
| SimCLR/MoCo | Contrastive learning | Siamese encoders | Yes | No |
| JEPA | Predict across augmented views | Two encoders + head | No | Optional |
* Random masking can be ablated but is not required; NEPA works best without it.
NEPA uses a single-stream causal predictor (no momentum encoder, negative sampling, or codebook), predicting in continuous latent space, and does not require masking or pixel-wise losses as in MAE (Xu et al., 18 Dec 2025).
6. Empirical Performance and Ablation Outcomes
Classification (ImageNet-1K, Top-1 Accuracy)
- ViT-B/14 (NEPA-B): 83.8%
- ViT-L/14 (NEPA-L): 85.3%
Comparative scores for established self-supervised methods (all on ViT-B): MoCo v3-B (83.2%), BEiT-B (83.4%), DINO-B (83.6%), MAE-B (83.6%).
Semantic Segmentation (ADE20K, mIoU)
- NEPA-B: 48.3%
- NEPA-L: 54.0%
For reference, ViT-B with MoCo v3 achieves 47.3%, BEiT+DALLE reaches 47.1%, and MAE-B attains 48.1% mIoU.
Architectural Ablations
- Causal masking, the shift to predicting the next embedding $e_{i+1}$, and stop-gradient are all critical to convergence and accuracy.
- LayerScale, RoPE, QK-Norm, and SwiGLU each yield small, additive gains; their combination produces ~81.3% top-1 (with limited pretraining).
- Training the patch embedding layer outperforms freezing it.
Longer pretraining consistently improves validation accuracy, with no observed overfitting under threefold increased compute.
7. Scalability and Extensibility Across Modalities
NEPA scales favorably: larger backbones (e.g., ViT-L) yield consistently higher downstream accuracy, and increasing pretraining epochs continues to improve results without overfitting.
The autoregressive next-embedding prediction objective mirrors next-token language modeling (e.g., GPT), with input/output embedding tying in LLMs functioning as a formal analogue. Because the NEPA paradigm operates on continuous embeddings, its core concept is extensible to audio, video, and multi-modal autoregressive learning with minimal adaptation.
Potential generative extensions include appending pixel-decoder modules or diffusion models, presenting a unified approach to both representation learning and conditional generation under a single autoregressive modeling framework (Xu et al., 18 Dec 2025).