
Next-Embedding Predictive Autoregression

Updated 19 December 2025
  • NEPA is a class of autoregressive models that predicts future latent embeddings instead of raw inputs, enabling effective modeling of nonlinear dependencies.
  • The paradigm admits instantiations in kernel-based nonlinear time series forecasting and in transformer-based self-supervised learning for vision, replacing prediction in raw input space with prediction in embedding space.
  • Empirical results demonstrate that NEPA achieves superior forecasting accuracy and competitive visual representation learning with reduced complexity.

Next-Embedding Predictive Autoregression (NEPA) designates a class of autoregressive modeling approaches in which future latent representations (embeddings) are predicted directly on the basis of prior observed embeddings. This paradigm admits instantiations in both kernel-based nonlinear time series modeling and self-supervised learning for high-dimensional data such as images. In each case, prediction in embedding space replaces traditional prediction in the raw input space, facilitating learning of nonlinear dependencies and leveraging the inductive biases of the chosen embedding machinery. The NEPA methodology has been demonstrated to achieve state-of-the-art performance in both nonlinear time series forecasting and large-scale vision representation learning (Valencia et al., 2016, Xu et al., 18 Dec 2025).

1. Mathematical Formulation

Two mathematically rigorous instantiations of NEPA have emerged:

Kernel-based NEPA for time series (Valencia et al., 2016):

  • Let $\{x_t\}_{t=1}^T$ be a stationary sequence and fix the model order $p$.
  • Map history windows $X = [x_{t-p+1}, \dots, x_t]$ and their successors $Y = x_{t+1}$ into a reproducing kernel Hilbert space (RKHS) via a feature map $\phi$ associated with the kernel $k$.
  • Estimate the linear operator $A$ acting in the RKHS by

$$A = C_{YX}\, C_{XX}^{-1},$$

estimated empirically from data as

$$\widehat{A} = \widehat{C}_{YX}\,(\widehat{C}_{XX} + \gamma I)^{-1},$$

with empirical covariance operators computed from the training tuples.

  • The one-step-ahead predictor is then

$$\widehat{\phi}(x_{t+1}) = A\,\phi\!\left([x_{t-p+1}, \dots, x_t]^\top\right).$$

The representer theorem gives

$$\widehat{\phi}(x_{t+1}) = \sum_{i=1}^N \alpha_i\, k\!\left(X_i, [x_{t-p+1}, \dots, x_t]\right),$$

where $\{\alpha_i\}$ are coefficients obtained from regularized least squares.

Next-embedding prediction in vision Transformers (Xu et al., 18 Dec 2025):

  • Let $x$ be an image split into patches $x_1, \dots, x_T$.
  • Encode $x$ via a shared patch embedding $f$: $e_t = f(x_t)$.
  • An autoregressive transformer $h_\theta$ predicts

$$\hat{e}_{t+1} = h_\theta(e_1, \dots, e_t).$$

  • The NEPA self-supervised loss is

$$L_{\text{NEPA}}(x;\theta) = -\frac{1}{T-1} \sum_{t=1}^{T-1} \left\langle \frac{\mathrm{stopgrad}(e_{t+1})}{\|e_{t+1}\|_2},\; \frac{\hat{e}_{t+1}}{\|\hat{e}_{t+1}\|_2} \right\rangle,$$

enforcing cosine similarity between predicted and ground-truth patch embeddings, using a stop-gradient on the target.
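
A minimal PyTorch sketch of this loss (illustrative, not the authors' code), assuming ground-truth embeddings `e` and predictions `e_hat` of shape [B, T, D], with `e_hat[:, t]` predicting `e[:, t+1]`:

```python
import torch
import torch.nn.functional as F

def nepa_loss(e: torch.Tensor, e_hat: torch.Tensor) -> torch.Tensor:
    """Negative mean cosine similarity between predictions and targets.

    e:     ground-truth patch embeddings, shape [B, T, D]
    e_hat: autoregressive predictions,    shape [B, T, D],
           where e_hat[:, t] is the prediction of e[:, t+1]
    """
    target = e[:, 1:].detach()             # stopgrad(e_{t+1})
    pred = e_hat[:, :-1]                   # \hat{e}_{t+1}
    target = F.normalize(target, dim=-1)   # unit l2-normalization
    pred = F.normalize(pred, dim=-1)
    return -(pred * target).sum(dim=-1).mean()
```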

2. Algorithmic Details

The two NEPA formulations provide precise computational recipes.

Time series NEPA (Valencia et al., 2016); a NumPy sketch follows the steps below:

  1. Build sliding windows $X_i$ and $Y_i$ from the data.
  2. Form kernel matrices $K_{ij} = k(X_i, X_j)$ and $H_{ij} = k(Y_i, X_j)$.
  3. Solve for the operator coefficients via

$$A = H\,(K + \gamma N I)^{-1}$$

and store its rows $\alpha_i$.

  4. Prediction: for a new history $X_*$, output

$$\sum_{i=1}^N \alpha_i\, k(X_i, X_*).$$
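
A compact NumPy sketch of these four steps for a scalar series. This is a hedged reading of the recipe: the target feature map is taken to be the identity, which collapses the operator estimate to regularized (kernel ridge) least squares, and the hyperparameters `p`, `gamma`, and `bandwidth` are illustrative defaults, not values from the paper:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    """Squared-exponential kernel between rows of A [N, p] and B [M, p]."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def fit_kernel_nepa(x, p=5, gamma=1e-2, bandwidth=1.0):
    """Steps 1-3: sliding windows, kernel matrix, regularized coefficients."""
    X = np.stack([x[t - p:t] for t in range(p, len(x))])    # histories X_i
    y = x[p:]                                               # successors Y_i
    N = len(X)
    K = rbf_kernel(X, X, bandwidth)
    # Identity target embedding: alpha = (K + gamma*N*I)^{-1} y
    alpha = np.linalg.solve(K + gamma * N * np.eye(N), y)
    return X, alpha

def predict_next(X_train, alpha, history, bandwidth=1.0):
    """Step 4: one-step-ahead prediction sum_i alpha_i k(X_i, X_*)."""
    p = X_train.shape[1]
    k_star = rbf_kernel(X_train, np.asarray(history)[None, -p:], bandwidth)
    return float(alpha @ k_star[:, 0])
```

For example, `X, alpha = fit_kernel_nepa(series)` followed by `predict_next(X, alpha, series[-5:])` yields the one-step forecast for the series.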

Vision NEPA (Xu et al., 18 Dec 2025); a full training-step sketch follows the list:

  • For a batch $x$:
    1. Compute embeddings $z = f(x)$ of shape $[B, T, D]$.
    2. Compute autoregressive predictions $\hat{z} = h(z)$ with a causal mask.
    3. Shift to align $\hat{z}[t-1]$ with $z[t]$; apply a stop-gradient to the targets.
    4. Normalize to unit $\ell_2$ norm.
    5. Compute the negative mean cosine similarity as the loss:

       loss = - (pred_norm * target_norm).sum(dim=-1).mean()

    6. Backpropagate and update parameters.
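
Folding these steps (and the loss from Section 1) into one routine gives the following hedged PyTorch sketch; `f` (patch embedder) and `h` (causally masked transformer) are assumed user-supplied modules with the shapes noted above:

```python
import torch
import torch.nn.functional as F

def nepa_training_step(f, h, x, optimizer):
    """One NEPA pretraining step (sketch; module names are illustrative).

    f: patch embedder, images [B, C, H, W] -> embeddings [B, T, D]
    h: causally masked transformer, [B, T, D] -> predictions [B, T, D]
    """
    z = f(x)                                            # step 1
    z_hat = h(z)                                        # step 2
    target = F.normalize(z[:, 1:].detach(), dim=-1)     # steps 3-4
    pred = F.normalize(z_hat[:, :-1], dim=-1)
    loss = -(pred * target).sum(dim=-1).mean()          # step 5
    optimizer.zero_grad()
    loss.backward()                                     # step 6
    optimizer.step()
    return loss.item()
```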

3. Architectural and Modeling Choices

Kernel NEPA leverages the kernel trick to capture nonlinear dependencies by lifting AR modeling into an RKHS, typically employing a squared-exponential (SE-RBF) kernel with bandwidth chosen by cross-validation. The regularization parameter $\gamma$ is tuned in $[10^{-4}, 10^{-1}]$ to ensure invertibility and an optimal bias-variance tradeoff. Computational complexity is $O(N^3)$ for training ($N$: number of training windows) and $O(N \cdot p)$ per prediction.
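
A minimal sketch of this hyperparameter selection, assuming the `fit_kernel_nepa` and `predict_next` helpers from Section 2 and an illustrative held-out validation split (the grids below are assumptions, apart from the $\gamma$ range quoted above):

```python
import numpy as np

def select_hyperparams(x, p=5, n_val=50):
    """Grid-search bandwidth and gamma on a held-out tail of the series."""
    train = x[:len(x) - n_val]
    val = x[len(x) - n_val - p:]                 # last n_val points plus context
    best = (np.inf, None, None)
    for bandwidth in (0.1, 0.3, 1.0, 3.0):       # illustrative grid
        for gamma in (1e-4, 1e-3, 1e-2, 1e-1):   # range quoted in the text
            X_tr, alpha = fit_kernel_nepa(train, p, gamma, bandwidth)
            preds = np.array([predict_next(X_tr, alpha, val[i:i + p], bandwidth)
                              for i in range(n_val)])
            mse = float(np.mean((preds - val[p:]) ** 2))
            if mse < best[0]:
                best = (mse, bandwidth, gamma)
    return best                                  # (val MSE, bandwidth, gamma)
```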

Vision NEPA is implemented via a pre-norm Vision Transformer:

  • Patch-embed the input via Conv2d (kernel $(P, P)$, stride $P$).
  • ViT-B: 12 layers, 768-dim embeddings, 3072 hidden, 12 heads; ViT-L: 24 layers, 1024-dim, 4096 hidden, 16 heads.
  • Pretraining uses causal attention masks.
  • Stabilization: RoPE, LayerScale (init $10^{-5}$), SwiGLU, QK-Norm.
  • Downstream heads: linear for classification, UPerNet for segmentation.

No decoder, discrete tokenization, or negative sampling is required; a single forward pass suffices.
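
A hedged PyTorch sketch of such a backbone, using the ViT-B dimensions from the list above; the stabilization components (RoPE, LayerScale, SwiGLU, QK-Norm) are omitted for brevity, so this is a structural illustration rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

class CausalPatchTransformer(nn.Module):
    """Minimal causal, pre-norm ViT-B-style backbone (stabilizers omitted)."""

    def __init__(self, patch=16, dim=768, depth=12, heads=12, hidden=3072):
        super().__init__()
        # Patch embedding: Conv2d with kernel (P, P) and stride P.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=hidden,
            norm_first=True, batch_first=True)       # pre-norm blocks
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        # [B, 3, H, W] -> [B, D, H/P, W/P] -> [B, T, D]
        z = self.patch_embed(x).flatten(2).transpose(1, 2)
        T = z.size(1)
        # Causal mask: position t attends only to positions <= t.
        # (The paper relies on RoPE for positional information; none is added here.)
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=z.device), 1)
        return self.blocks(z, mask=causal)
```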

4. Empirical Results

Kernel NEPA achieves strong one-step-ahead forecasting accuracy; the table reports prediction error per dataset (lower is better):

| Dataset | Linear AR | Kernel AR (KAM) | NEPA |
|---|---|---|---|
| Earthrot | 689.3 | 313.6 | 254.1 |
| CO$_2$ | 0.657 | 0.612 | 0.519 |
| MG$_{30}$ | 372.1 | 11.09 | 2.39 |
| Lorenz $x$ | 0.305 | 0.0284 | 0.0239 |
| Lorenz $y$ | 0.991 | 0.0454 | 0.0252 |
| Lorenz $z$ | 0.437 | 0.1242 | 0.1276 |

NEPA outperforms linear AR throughout and improves on previous kernel AR approaches on all series except Lorenz $z$, with the largest gains on nonlinear or chaotic series (Valencia et al., 2016).

Vision NEPA delivers competitive self-supervised visual representation learning on ImageNet-1K and ADE20K:

| Method | ViT-B Top-1 (%) | ViT-L Top-1 (%) | ADE20K mIoU (B / L) |
|---|---|---|---|
| Supervised-IN1K | — | — | 47.4 / 49.9 |
| MoCo v3 | 83.2 | 84.1 | 47.3 / 49.1 |
| BEiT | 83.4 | 85.2 | 47.1 / 53.3 |
| MAE | 83.6 | 85.6 | 48.1 / 53.6 |
| NEPA (no decoder, 1 pass) | 82.5 | 84.1 | — |
| NEPA | 83.8 | 85.3 | 48.3 / 54.0 |

NEPA attains comparable or superior results with less architectural and procedural complexity (Xu et al., 18 Dec 2025).

5. Comparative Perspective and Applications

NEPA in time series constitutes a nonlinear alternative to classical AR modeling, generalizing conditional expectation by learning linear operators in RKHS. This permits effective modeling of highly nonlinear and chaotic dynamics, especially for moderate data regimes where regularization is effective and high-parameter neural models or GP regressors risk overfitting (Valencia et al., 2016).

In visual self-supervised learning, NEPA offers an alternative to pixel reconstruction, masked token prediction, and contrastive or distillation approaches by instead training a causal predictor on sequences of patch embeddings. Notably, NEPA eschews complex decoders, negative pairs, and tokenization steps, enabling a simple and scalable pretraining pipeline. The modeling principle is directly analogous to the GPT paradigm in language modeling, with next-token prediction replaced by next-embedding prediction (Xu et al., 18 Dec 2025).

6. Limitations and Extensions

Kernel NEPA's computational bottleneck is $O(N^3)$ scaling in the number of training windows, with memory and inversion constraints for large $N$. Practical application is favored for moderate $N$ and when the signal is highly nonlinear or chaotic; regularized operator estimation and cross-validated hyperparameters are essential for robust generalization (Valencia et al., 2016).

Vision NEPA, although competitive at end-to-end fine-tuning, exhibits relatively poor performance under linear probing (approx. 14% top-1), suggesting that the learned representations are tightly coupled to the pretrained predictor and may not universally transfer. Additional limitations include challenges modeling complex spatial phenomena (e.g., reflections) and the need for scaling to larger or more diverse data regimes. Potential extensions include hybridization with generative decoders or diffusion models, scaling to multimodal or cross-modal predictive tasks, and integration with language autoregressive objectives (Xu et al., 18 Dec 2025).

7. Significance and Future Directions

The NEPA paradigm—predicting next embeddings rather than raw tokens—demonstrates efficacy across divergent domains, preserving scalability and simplicity while permitting the direct modeling of nonlinear dependencies. In time series, RKHS embedding harnesses nonparametric flexibility; in vision, transformer-based NEPA unifies autoregressive prediction and self-supervision without auxiliary heads or losses. A plausible implication is that next-embedding prediction may form a modality-agnostic foundation for generative and self-supervised learning frameworks, and further research into larger-scale, multimodal, or hybrid NEPA instantiations is indicated (Valencia et al., 2016, Xu et al., 18 Dec 2025).
