Vision NEPA: Next-Embedding Predictive Learning

Updated 23 December 2025
  • Vision NEPA is a self-supervised visual learning framework that reformulates representation learning as autoregressive next-patch embedding prediction.
  • It employs a causal Vision Transformer backbone with stop-gradient stabilization to prevent collapse and ensure efficient training.
  • Empirical results show state-of-the-art performance on ImageNet-1K classification (up to 85.3% accuracy) and ADE20K segmentation benchmarks.

Vision NEPA (Next-Embedding Predictive Autoregression) is a self-supervised learning framework for visual representation learning that casts the core objective as autoregressive next-latent-patch prediction. Inspired by generative pretraining paradigms in natural language processing, NEPA eschews pixel reconstruction, discrete tokenization, and contrastive losses. Instead, it predicts future patch embeddings from prior ones within an image, using causal masking and a stop-gradient mechanism to stabilize training and prevent representational collapse. Vision NEPA achieves state-of-the-art transfer across ImageNet-1K classification and ADE20K segmentation benchmarks, demonstrating scalable performance and modality-agnostic potential (Xu et al., 18 Dec 2025).

1. Next-Embedding Predictive Objective

The fundamental learning paradigm in Vision NEPA reformulates self-supervised visual representation learning as autoregressive embedding prediction. For an input image $x \in \mathbb{R}^{H \times W \times C}$, the image is tokenized into $T$ non-overlapping patches, where $T = (H/P) \cdot (W/P)$ for patch size $P \times P$. Each patch is mapped by a learned embedding layer $f$ to a $D$-dimensional vector. The encoder output for patch $t$ is $z_t = f(\mathrm{patch}_t) + p_t$, where $p_t$ is a positional encoding (absolute or RoPE).
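
For example, at a hypothetical 224×224 input resolution (the standard ImageNet-1K crop, not stated explicitly here) with $P = 14$ as in ViT-B/14, $T = (224/14) \cdot (224/14) = 16 \cdot 16 = 256$ patches, each embedded as a $D$-dimensional vector.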

A causal Transformer predictor $h_\theta$ autoregressively produces hypotheses for the next patch embedding:

$$\hat{z}_{t+1} = h_\theta(z_1, \dots, z_t)$$

The core self-supervised loss is the mean negative cosine similarity between the predicted embedding $\hat{z}_{t+1}$ and the stop-gradient version of the ground truth $z_{t+1}$:

$$\mathcal{L} = \frac{1}{T-1} \sum_{t=1}^{T-1} \ell_{\mathrm{cos}}\!\left( \hat{z}_{t+1},\ \mathrm{stopgrad}(z_{t+1}) \right)$$

where $\ell_{\mathrm{cos}}(a, b) = -\frac{a}{\|a\|_2} \cdot \frac{b}{\|b\|_2}$. This next-embedding-prediction objective avoids collapse and removes the need for auxiliary targets such as pixels, discrete codes, or negative pairs. The stop-gradient operation ensures targets are fixed with respect to encoder parameters, which is crucial for stable dynamics.
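
A minimal PyTorch sketch of this loss, assuming the encoder's per-patch embeddings and the predictor outputs arrive as tensors of shape [B, T, D] and that the output at position $t$ is the hypothesis for patch $t+1$; the function name `nepa_loss` is illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def nepa_loss(z_pred: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Mean negative cosine similarity against stop-gradient targets.

    z_pred: [B, T, D] predictor outputs; z_pred[:, t] is the hypothesis for z[:, t+1]
    z:      [B, T, D] patch embeddings from the encoder
    """
    pred = z_pred[:, :-1]            # hypotheses for positions 2..T
    target = z[:, 1:].detach()       # stop-gradient targets z_2..z_T
    # cosine_similarity L2-normalizes both arguments; negate and average over batch and positions
    return -F.cosine_similarity(pred, target, dim=-1).mean()
```

Because the targets are detached, gradients reach the embedding layer only through the prediction path, which is the stop-gradient behavior discussed in Section 3.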

2. Architecture and Pretraining Pipeline

Patch Embedding and Positional Encoding

A single Conv2d layer projects $P \times P$ image patches to embedding space. Either learned absolute position embeddings or rotary position encodings (RoPE) are used, with RoPE also providing relative positional bias in attention layers.
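
A minimal sketch of this patchify-and-embed step, assuming learned absolute position embeddings (the simpler of the two options) and illustrative ViT-B-style dimensions; the class and attribute names are not from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Conv2d patchify: each P x P patch is projected to a D-dimensional embedding."""
    def __init__(self, img_size=224, patch_size=14, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # Learned absolute position table; the RoPE variant instead rotates
        # queries/keys inside each attention layer and needs no table here.
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                    # x: [B, C, H, W]
        z = self.proj(x)                     # [B, D, H/P, W/P]
        z = z.flatten(2).transpose(1, 2)     # [B, T, D]
        return z + self.pos                  # z_t = f(patch_t) + p_t
```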

Vision Transformer Backbone

NEPA employs a pre-norm Vision Transformer (ViT) backbone in two variants:

  • ViT-B/14 (Base)
  • ViT-L/14 (Large)

Key architectural modifications include the following (several are sketched after the list):

  • SwiGLU activation replacing GeLU in MLP blocks
  • LayerScale (init=1e-5) on residuals for improved optimization
  • QK-Norm (parameter-free LayerNorm for queries/keys) per attention layer
  • Causal masking during pretraining: token $t$ attends only to indices $\leq t$
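
A minimal sketch of the SwiGLU MLP and LayerScale residual scaling as generic PyTorch modules, under the stated settings (LayerScale init of 1e-5); this illustrates the named techniques rather than reproducing the paper's exact blocks. The causal mask itself can be obtained with the built-in `is_causal` flag of scaled dot-product attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SiLU-gated MLP replacing the usual GeLU MLP block."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden)
        self.w_up = nn.Linear(dim, hidden)
        self.w_down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class LayerScaledResidual(nn.Module):
    """Residual branch scaled by a learnable per-channel gamma, initialized to 1e-5."""
    def __init__(self, dim: int, module: nn.Module, init: float = 1e-5):
        super().__init__()
        self.module = module
        self.gamma = nn.Parameter(init * torch.ones(dim))

    def forward(self, x):
        return x + self.gamma * self.module(x)

# Causal masking during pretraining (token t attends only to positions <= t), e.g.:
# out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```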

Prediction and Fine-Tuning

No auxiliary prediction head is required; a linear projection of the final Transformer activations produces $\hat{z}_{t+1}$. For downstream tasks (a minimal readout sketch follows this list):

  • Classification: $z_T$ is passed through a linear classifier and softmax
  • Segmentation: UPerNet head is attached, using features at multiple scales, with bidirectional attention enabled
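
A minimal sketch of the two readouts, assuming the backbone returns final per-patch activations of shape [B, T, D]; the dimensions and variable names are illustrative.

```python
import torch
import torch.nn as nn

dim, num_classes = 768, 1000              # ViT-B width, ImageNet-1K classes (illustrative)

to_next_embed = nn.Linear(dim, dim)       # pretraining: linear projection producing the next-embedding hypothesis
classifier = nn.Linear(dim, num_classes)  # fine-tuning: linear classifier on z_T

h = torch.randn(2, 256, dim)              # [B, T, D] final Transformer activations
z_pred = to_next_embed(h)                 # next-embedding hypotheses for every position
logits = classifier(h[:, -1])             # classify from the last token z_T
probs = logits.softmax(dim=-1)
```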

3. Stop-Gradient Dynamics and Ablations

The use of the stop-gradient operator in the loss is central to preventing trivial collapse. Writing the loss target as $\mathrm{stopgrad}(z_{t+1})$ blocks gradients from propagating back into the patch embedding layer $f$ through the target branch, so the predictor learns to regress onto a fixed target.

Ablation studies confirm that removing the stop-gradient leads to collapse (the loss falls to $-1$ with no downstream accuracy); including it enables stable training and successful fine-tuning, with ViT-B achieving 76.8% top-1 classification accuracy after only 50k pretraining steps. The causal shift (predicting $z_{t+1}$ rather than an identity mapping) and the retention of strict causal masking are both necessary; for example, omitting the causal mask drops accuracy to 73.6% (Xu et al., 18 Dec 2025).

4. Training Protocol and Data Handling

Pretraining Setup

  • Dataset: ImageNet-1K (unlabeled)
  • Batch Size: 4,096
  • Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay = 0.05)
  • Learning Rate: $3 \times 10^{-4}$ (scaled by batch/256), cosine schedule, 40-epoch linear warmup (see the optimizer sketch after this list)
  • Data Augmentation: RandomResizedCrop only; no multi-view or negative pairs
  • Training Duration: ViT-B (1,600 epochs, ~3 days on 8×H100); ViT-L (800 epochs, ~5 days)
  • EMA: Exponential Moving Average of weights (decay=0.9999) used for evaluation
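
A minimal sketch of these optimizer settings with the linear scaling rule applied, assuming a PyTorch training loop; the per-epoch lambda schedule and the stand-in `backbone` module are illustrative simplifications.

```python
import math
import torch

base_lr, batch_size = 3e-4, 4096
lr = base_lr * batch_size / 256                  # linear scaling rule: 3e-4 * 16 = 4.8e-3

backbone = torch.nn.Linear(768, 768)             # stand-in for the ViT backbone
optimizer = torch.optim.AdamW(backbone.parameters(), lr=lr,
                              betas=(0.9, 0.95), weight_decay=0.05)

warmup_epochs, total_epochs = 40, 1600           # ViT-B pretraining schedule
def lr_lambda(epoch: int) -> float:
    if epoch < warmup_epochs:                    # linear warmup
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))   # cosine decay to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
```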

Fine-Tuning Regimes

  • Classification: Batch size 1,024, AdamW, learning rate 1×10⁻³ (scaled), cosine schedule, 5-epoch warmup, 100 fine-tuning epochs (ViT-B), 50 (ViT-L)
    • Augmentations: RandAugment, MixUp (0.8), CutMix (1.0), label smoothing (0.1), DropPath (0.1/0.2)
    • Layer-wise learning rate decay (ViT-B: 0.35→1.0; ViT-L: 0.60→1.0); a parameter-group sketch follows this list
  • Segmentation (ADE20K): 512×512 crops, batch 16, UPerNet, bidirectional attention, 160k iterations
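
A minimal sketch of layer-wise learning-rate decay via optimizer parameter groups, assuming the listed ranges denote per-block multipliers interpolated geometrically from the bottom block (e.g., 0.35 for ViT-B) up to 1.0 at the top; the paper's exact grouping of patch embedding, norms, and head may differ, and `TinyBackbone` is a placeholder.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Placeholder with .blocks and .head attributes standing in for a ViT."""
    def __init__(self, dim=768, depth=12, num_classes=1000):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

def llrd_param_groups(model, base_lr, bottom_scale, weight_decay=0.05):
    """Geometric layer-wise LR decay: the first block trains at base_lr * bottom_scale,
    the last block and the head at base_lr."""
    depth = len(model.blocks)
    decay = bottom_scale ** (1.0 / (depth - 1))
    groups = []
    for i, block in enumerate(model.blocks):
        scale = decay ** (depth - 1 - i)          # i = 0 -> bottom_scale, i = depth-1 -> 1.0
        groups.append({"params": block.parameters(), "lr": base_lr * scale,
                       "weight_decay": weight_decay})
    groups.append({"params": model.head.parameters(), "lr": base_lr,
                   "weight_decay": weight_decay})
    return groups

optimizer = torch.optim.AdamW(llrd_param_groups(TinyBackbone(), base_lr=1e-3, bottom_scale=0.35))
```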

5. Comparison with Alternative Self-Supervised Paradigms

Vision NEPA differs fundamentally from other major self-supervised strategies:

| Method | Core Mechanism | Auxiliary Heads/Targets | Negatives/Contrastive | Masking |
|---|---|---|---|---|
| NEPA | Predict next latent embedding | None | No | No* |
| MAE | Pixel reconstruction | Transformer decoder | No | Yes (75%) |
| BEiT/Discrete | Autoregressive discrete codes | Separate codebook, predictor | No | Optional |
| SimCLR/MoCo | Contrastive learning | Siamese encoders | Yes | No |
| JEPA | Predict across augmented views | Two encoders + head | No | Optional |

* Random masking can be ablated but is not required; NEPA works best without it.

NEPA uses a single-stream causal predictor (no momentum encoder, negative sampling, or codebook), predicts in continuous latent space, and does not require the masking or pixel-wise losses used in MAE (Xu et al., 18 Dec 2025).

6. Empirical Performance and Ablation Outcomes

Classification (ImageNet-1K, Top-1 Accuracy)

  • ViT-B/14 (NEPA-B): 83.8%
  • ViT-L/14 (NEPA-L): 85.3%

Comparative scores for established self-supervised methods (all on ViT-B): MoCo v3-B (83.2%), BEiT-B (83.4%), DINO-B (83.6%), MAE-B (83.6%).

Semantic Segmentation (ADE20K, mIoU)

  • NEPA-B: 48.3%
  • NEPA-L: 54.0%

For reference, ViT-B with MoCo v3 achieves 47.3%, BEiT+DALLE reaches 47.1%, and MAE-B attains 48.1% mIoU.

Architectural Ablations

  • Causal masking, the shift to predicting $z_{t+1}$, and stop-gradient are all critical to convergence and accuracy.
  • LayerScale, RoPE, QK-Norm, and SwiGLU each yield small, additive gains; their combination produces ~81.3% top-1 (with limited pretraining).
  • Training the patch embedding layer outperforms freezing it.

Longer pretraining consistently improves validation accuracy, with no observed overfitting under threefold increased compute.

7. Scalability and Extensibility Across Modalities

NEPA scales favorably: using larger backbones (e.g., ViT-L) results in proportionally higher downstream task accuracy. Increasing pretraining epochs continues to yield improvements without overfitting.

The autoregressive next-embedding prediction objective mirrors next-token language modeling (e.g., GPT), with input/output embedding tying in LLMs functioning as a formal analogue. Because the NEPA paradigm operates on continuous embeddings, its core concept is extensible to audio, video, and multi-modal autoregressive learning with minimal adaptation.

Potential generative extensions include appending pixel-decoder modules or diffusion models, presenting a unified approach to both representation learning and conditional generation under a single autoregressive modeling framework (Xu et al., 18 Dec 2025).

References

  • Xu et al. (18 Dec 2025). Vision NEPA: Next-Embedding Predictive Autoregression.