
Next-Embedding Predictive Autoregression

Updated 19 December 2025
  • NEPA is a class of autoregressive models that predicts future latent embeddings instead of raw inputs, enabling effective modeling of nonlinear dependencies.
  • The paradigm admits instantiations in kernel-based nonlinear time series forecasting and in transformer-based self-supervised learning for vision, replacing prediction in raw input space with prediction in embedding space.
  • Empirical results demonstrate that NEPA achieves superior forecasting accuracy and competitive visual representation learning with reduced complexity.

Next-Embedding Predictive Autoregression (NEPA) designates a class of autoregressive modeling approaches in which future latent representations (embeddings) are predicted directly on the basis of prior observed embeddings. This paradigm admits instantiations in both kernel-based nonlinear time series modeling and self-supervised learning for high-dimensional data such as images. In each case, prediction in embedding space replaces traditional prediction in the raw input space, facilitating learning of nonlinear dependencies and leveraging the inductive biases of the chosen embedding machinery. The NEPA methodology has been demonstrated to achieve state-of-the-art performance in both nonlinear time series forecasting and large-scale vision representation learning (Valencia et al., 2016, Xu et al., 18 Dec 2025).

1. Mathematical Formulation

Two mathematically rigorous instantiations of NEPA have emerged:

Kernel-based NEPA for time series (Valencia et al., 2016):

  • Let $\{x_t\}_{t=1}^T$ be a stationary sequence and fix the model order $p$.
  • Map history windows $X = [x_{t-p+1}, \dots, x_t]$ and their successors $Y = x_{t+1}$ into a reproducing kernel Hilbert space (RKHS) via a feature map $\phi$ associated with the kernel $k$.
  • Estimate the linear operator $A$ acting in the RKHS by

$$A = C_{YX}\, C_{XX}^{-1},$$

estimated empirically from data as

$$\widehat{A} = \widehat{C}_{YX}\,(\widehat{C}_{XX} + \gamma I)^{-1},$$

with empirical covariance operators computed from the training tuples.

  • The one-step-ahead predictor is then

$$\widehat{\phi}(x_{t+1}) = A\,\phi\!\left([x_{t-p+1}, \dots, x_t]^\top\right).$$

The representer theorem gives

$$\widehat{\phi}(x_{t+1}) = \sum_{i=1}^N \alpha_i\, k\!\left(X_i, [x_{t-p+1}, \dots, x_t]\right),$$

where $\{\alpha_i\}$ are coefficients obtained from regularized least squares.

Next-embedding prediction in vision Transformers (Xu et al., 18 Dec 2025):

  • Let $x$ be an image split into patches $x_1, \dots, x_T$.
  • Encode $x$ via a shared patch embedding $f$: $e_t = f(x_t)$.
  • An autoregressive transformer $h_\theta$ predicts

$$\hat{e}_{t+1} = h_\theta(e_1, \dots, e_t).$$

  • The NEPA self-supervised loss is

$$L_{\text{NEPA}}(x;\theta) = -\frac{1}{T-1} \sum_{t=1}^{T-1} \left\langle \frac{\mathrm{stopgrad}(e_{t+1})}{\|e_{t+1}\|_2},\; \frac{\hat{e}_{t+1}}{\|\hat{e}_{t+1}\|_2} \right\rangle,$$

enforcing cosine similarity between predicted and ground-truth patch embeddings, using a stop-gradient on the target.
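
A minimal PyTorch sketch of this loss (illustrative, not the authors' code), assuming ground-truth embeddings `e` and predictions `e_hat` of shape [B, T, D], with `e_hat[:, t]` predicting `e[:, t+1]`:

```python
import torch
import torch.nn.functional as F

def nepa_loss(e: torch.Tensor, e_hat: torch.Tensor) -> torch.Tensor:
    """Negative mean cosine similarity between predictions and targets.

    e:     ground-truth patch embeddings, shape [B, T, D]
    e_hat: autoregressive predictions,    shape [B, T, D],
           where e_hat[:, t] is the prediction of e[:, t+1]
    """
    target = e[:, 1:].detach()             # stopgrad(e_{t+1})
    pred = e_hat[:, :-1]                   # \hat{e}_{t+1}
    target = F.normalize(target, dim=-1)   # unit l2-normalization
    pred = F.normalize(pred, dim=-1)
    return -(pred * target).sum(dim=-1).mean()
```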

2. Algorithmic Details

The two NEPA formulations provide precise computational recipes.

Time series NEPA (Valencia et al., 2016); a NumPy sketch follows the steps below:

  1. Build sliding windows $X_i$ and $Y_i$ from the data.
  2. Form kernel matrices $K_{ij} = k(X_i, X_j)$ and $H_{ij} = k(Y_i, X_j)$.
  3. Solve for the operator coefficients via

$$A = H\,(K + \gamma N I)^{-1}$$

and store its rows $\alpha_i$.

  4. Prediction: for a new history $X_*$, output

$$\sum_{i=1}^N \alpha_i\, k(X_i, X_*).$$
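
A compact NumPy sketch of these four steps for a scalar series. This is a hedged reading of the recipe: the target feature map is taken to be the identity, which collapses the operator estimate to regularized (kernel ridge) least squares, and the hyperparameters `p`, `gamma`, and `bandwidth` are illustrative defaults, not values from the paper:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    """Squared-exponential kernel between rows of A [N, p] and B [M, p]."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def fit_kernel_nepa(x, p=5, gamma=1e-2, bandwidth=1.0):
    """Steps 1-3: sliding windows, kernel matrix, regularized coefficients."""
    X = np.stack([x[t - p:t] for t in range(p, len(x))])    # histories X_i
    y = x[p:]                                               # successors Y_i
    N = len(X)
    K = rbf_kernel(X, X, bandwidth)
    # Identity target embedding: alpha = (K + gamma*N*I)^{-1} y
    alpha = np.linalg.solve(K + gamma * N * np.eye(N), y)
    return X, alpha

def predict_next(X_train, alpha, history, bandwidth=1.0):
    """Step 4: one-step-ahead prediction sum_i alpha_i k(X_i, X_*)."""
    p = X_train.shape[1]
    k_star = rbf_kernel(X_train, np.asarray(history)[None, -p:], bandwidth)
    return float(alpha @ k_star[:, 0])
```

For example, `X, alpha = fit_kernel_nepa(series)` followed by `predict_next(X, alpha, series[-5:])` yields the one-step forecast for the series.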

Vision NEPA (Xu et al., 18 Dec 2025); a full training-step sketch follows the list:

  • For a batch $x$:
    1. Compute embeddings $z = f(x)$ of shape $[B, T, D]$.
    2. Compute autoregressive predictions $\hat{z} = h(z)$ with a causal mask.
    3. Shift to align $\hat{z}[t-1]$ with $z[t]$; apply a stop-gradient to the targets.
    4. Normalize to unit $\ell_2$ norm.
    5. Compute the negative mean cosine similarity as the loss:

       loss = - (pred_norm * target_norm).sum(dim=-1).mean()

    6. Backpropagate and update parameters.
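
Folding these steps (and the loss from Section 1) into one routine gives the following hedged PyTorch sketch; `f` (patch embedder) and `h` (causally masked transformer) are assumed user-supplied modules with the shapes noted above:

```python
import torch
import torch.nn.functional as F

def nepa_training_step(f, h, x, optimizer):
    """One NEPA pretraining step (sketch; module names are illustrative).

    f: patch embedder, images [B, C, H, W] -> embeddings [B, T, D]
    h: causally masked transformer, [B, T, D] -> predictions [B, T, D]
    """
    z = f(x)                                            # step 1
    z_hat = h(z)                                        # step 2
    target = F.normalize(z[:, 1:].detach(), dim=-1)     # steps 3-4
    pred = F.normalize(z_hat[:, :-1], dim=-1)
    loss = -(pred * target).sum(dim=-1).mean()          # step 5
    optimizer.zero_grad()
    loss.backward()                                     # step 6
    optimizer.step()
    return loss.item()
```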

3. Architectural and Modeling Choices

Kernel NEPA leverages the kernel trick to capture nonlinear dependencies by lifting AR modeling into an RKHS, typically employing a squared-exponential (SE-RBF) kernel with bandwidth chosen by cross-validation. The regularization parameter $\gamma$ is tuned in $[10^{-4}, 10^{-1}]$ to ensure invertibility and an optimal bias-variance tradeoff. Computational complexity is $O(N^3)$ for training ($N$: number of training windows) and $O(N \cdot p)$ per prediction.
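
A minimal sketch of this hyperparameter selection, assuming the `fit_kernel_nepa` and `predict_next` helpers from Section 2 and an illustrative held-out validation split (the grids below are assumptions, apart from the $\gamma$ range quoted above):

```python
import numpy as np

def select_hyperparams(x, p=5, n_val=50):
    """Grid-search bandwidth and gamma on a held-out tail of the series."""
    train = x[:len(x) - n_val]
    val = x[len(x) - n_val - p:]                 # last n_val points plus context
    best = (np.inf, None, None)
    for bandwidth in (0.1, 0.3, 1.0, 3.0):       # illustrative grid
        for gamma in (1e-4, 1e-3, 1e-2, 1e-1):   # range quoted in the text
            X_tr, alpha = fit_kernel_nepa(train, p, gamma, bandwidth)
            preds = np.array([predict_next(X_tr, alpha, val[i:i + p], bandwidth)
                              for i in range(n_val)])
            mse = float(np.mean((preds - val[p:]) ** 2))
            if mse < best[0]:
                best = (mse, bandwidth, gamma)
    return best                                  # (val MSE, bandwidth, gamma)
```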

Vision NEPA is implemented via a pre-norm Vision Transformer:

  • Patch-embed the input via Conv2d (kernel $(P, P)$, stride $P$).
  • ViT-B: 12 layers, 768-dim embeddings, 3072 hidden, 12 heads; ViT-L: 24 layers, 1024-dim, 4096 hidden, 16 heads.
  • Pretraining uses causal attention masks.
  • Stabilization: RoPE, LayerScale (init $10^{-5}$), SwiGLU, QK-Norm.
  • Downstream heads: linear for classification, UPerNet for segmentation.

No decoder, discrete tokenization, or negative sampling is required; a single forward pass suffices.
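
A hedged PyTorch sketch of such a backbone, using the ViT-B dimensions from the list above; the stabilization components (RoPE, LayerScale, SwiGLU, QK-Norm) are omitted for brevity, so this is a structural illustration rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

class CausalPatchTransformer(nn.Module):
    """Minimal causal, pre-norm ViT-B-style backbone (stabilizers omitted)."""

    def __init__(self, patch=16, dim=768, depth=12, heads=12, hidden=3072):
        super().__init__()
        # Patch embedding: Conv2d with kernel (P, P) and stride P.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=hidden,
            norm_first=True, batch_first=True)       # pre-norm blocks
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        # [B, 3, H, W] -> [B, D, H/P, W/P] -> [B, T, D]
        z = self.patch_embed(x).flatten(2).transpose(1, 2)
        T = z.size(1)
        # Causal mask: position t attends only to positions <= t.
        # (The paper relies on RoPE for positional information; none is added here.)
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=z.device), 1)
        return self.blocks(z, mask=causal)
```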

4. Empirical Results

Kernel NEPA achieves strong one-step-ahead forecasting accuracy; the table reports prediction error per dataset (lower is better):

| Dataset | Linear AR | Kernel AR (KAM) | NEPA |
|---|---|---|---|
| Earthrot | 689.3 | 313.6 | 254.1 |
| CO$_2$ | 0.657 | 0.612 | 0.519 |
| MG$_{30}$ | 372.1 | 11.09 | 2.39 |
| Lorenz $x$ | 0.305 | 0.0284 | 0.0239 |
| Lorenz $y$ | 0.991 | 0.0454 | 0.0252 |
| Lorenz $z$ | 0.437 | 0.1242 | 0.1276 |

NEPA outperforms linear AR throughout and improves on previous kernel AR approaches on all series except Lorenz $z$, with the largest gains on nonlinear or chaotic series (Valencia et al., 2016).

Vision NEPA delivers competitive self-supervised visual representation learning on ImageNet-1K and ADE20K:

| Method | ViT-B Top-1 (%) | ViT-L Top-1 (%) | ADE20K mIoU (B / L) |
|---|---|---|---|
| Supervised-IN1K | — | — | 47.4 / 49.9 |
| MoCo v3 | 83.2 | 84.1 | 47.3 / 49.1 |
| BEiT | 83.4 | 85.2 | 47.1 / 53.3 |
| MAE | 83.6 | 85.6 | 48.1 / 53.6 |
| NEPA (no decoder, 1 pass) | 82.5 | 84.1 | — |
| NEPA | 83.8 | 85.3 | 48.3 / 54.0 |

NEPA attains comparable or superior results with less architectural and procedural complexity (Xu et al., 18 Dec 2025).

5. Comparative Perspective and Applications

NEPA in time series constitutes a nonlinear alternative to classical AR modeling, generalizing conditional expectation by learning linear operators in RKHS. This permits effective modeling of highly nonlinear and chaotic dynamics, especially for moderate data regimes where regularization is effective and high-parameter neural models or GP regressors risk overfitting (Valencia et al., 2016).

In visual self-supervised learning, NEPA offers an alternative to pixel reconstruction, masked token prediction, and contrastive or distillation approaches by instead training a causal predictor on sequences of patch embeddings. Notably, NEPA eschews complex decoders, negative pairs, and tokenization steps, enabling a simple and scalable pretraining pipeline. The modeling principle is directly analogous to the GPT paradigm in language modeling, with next-token prediction replaced by next-embedding prediction (Xu et al., 18 Dec 2025).

6. Limitations and Extensions

Kernel NEPA's computational bottleneck is $O(N^3)$ scaling in the number of training windows, with memory and inversion constraints for large $N$. Practical application is favored for moderate $N$ and when the signal is highly nonlinear or chaotic; regularized operator estimation and cross-validated hyperparameters are essential for robust generalization (Valencia et al., 2016).

Vision NEPA, although competitive at end-to-end fine-tuning, exhibits relatively poor performance under linear probing (approx. 14% top-1), suggesting that the learned representations are tightly coupled to the pretrained predictor and may not universally transfer. Additional limitations include challenges modeling complex spatial phenomena (e.g., reflections) and the need for scaling to larger or more diverse data regimes. Potential extensions include hybridization with generative decoders or diffusion models, scaling to multimodal or cross-modal predictive tasks, and integration with language autoregressive objectives (Xu et al., 18 Dec 2025).

7. Significance and Future Directions

The NEPA paradigm—predicting next embeddings rather than raw tokens—demonstrates efficacy across divergent domains, preserving scalability and simplicity while permitting the direct modeling of nonlinear dependencies. In time series, RKHS embedding harnesses nonparametric flexibility; in vision, transformer-based NEPA unifies autoregressive prediction and self-supervision without auxiliary heads or losses. A plausible implication is that next-embedding prediction may form a modality-agnostic foundation for generative and self-supervised learning frameworks, and further research into larger-scale, multimodal, or hybrid NEPA instantiations is indicated (Valencia et al., 2016, Xu et al., 18 Dec 2025).
