Next-Embedding Predictive Autoregression
- NEPA is a class of autoregressive models that predicts future latent embeddings instead of raw inputs, enabling effective modeling of nonlinear dependencies.
- The paradigm has been instantiated both in kernel-based nonlinear time series forecasting and in transformer-based self-supervised learning for vision, in each case replacing prediction in raw input space with prediction in embedding space.
- Empirical results demonstrate that NEPA achieves superior forecasting accuracy and competitive visual representation learning with reduced complexity.
Next-Embedding Predictive Autoregression (NEPA) designates a class of autoregressive modeling approaches in which future latent representations (embeddings) are predicted directly on the basis of prior observed embeddings. This paradigm admits instantiations in both kernel-based nonlinear time series modeling and self-supervised learning for high-dimensional data such as images. In each case, prediction in embedding space replaces traditional prediction in the raw input space, facilitating learning of nonlinear dependencies and leveraging the inductive biases of the chosen embedding machinery. The NEPA methodology has been demonstrated to achieve state-of-the-art performance in both nonlinear time series forecasting and large-scale vision representation learning (Valencia et al., 2016, Xu et al., 18 Dec 2025).
1. Mathematical Formulation
Two mathematically rigorous instantiations of NEPA have emerged:
Kernel-based NEPA for time series (Valencia et al., 2016):
- Let $\{x_t\}$ be a stationary sequence and fix an order $p$.
- Map histories $\mathbf{x}_t = (x_{t-p+1}, \dots, x_t)$ and their successors $x_{t+1}$ into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ via a feature map $\varphi$ associated to a kernel $k$.
- Estimate the linear operator $A$ acting in the RKHS by requiring $\varphi(x_{t+1}) \approx A\,\varphi(\mathbf{x}_t)$, empirically given data via
$$\hat{A} = \hat{C}_{YX}\,(\hat{C}_{XX} + \lambda I)^{-1},$$
with empirical covariance operators $\hat{C}_{XX}$, $\hat{C}_{YX}$ obtained from training tuples $(\mathbf{x}_i, y_i)$, $y_i = x_{i+p}$, $i = 1, \dots, n$.
- The one-step-ahead predictor in embedding space is then
$$\hat{\varphi}(x_{t+1}) = \hat{A}\,\varphi(\mathbf{x}_t).$$
- The Representer Theorem gives the explicit point forecast
$$\hat{x}_{t+1} = \sum_{i=1}^{n} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}_t), \qquad \boldsymbol{\alpha} = (K + n\lambda I)^{-1}\,\mathbf{y},$$
where the $\alpha_i$ are coefficients from regularized least squares, $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is the Gram matrix of training histories, and $\mathbf{y} = (y_1, \dots, y_n)^{\top}$ collects the targets (a derivation sketch follows this list).
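The operator estimate above can be read as the closed-form minimizer of a ridge-regularized least-squares problem over the training tuples; the display below is a standard derivation sketch consistent with the definitions given here, not an equation reproduced from the source:
$$
\hat{A} \;=\; \arg\min_{A}\; \frac{1}{n}\sum_{i=1}^{n} \bigl\lVert \varphi(y_i) - A\,\varphi(\mathbf{x}_i) \bigr\rVert_{\mathcal{H}}^{2} \;+\; \lambda\,\lVert A \rVert_{\mathrm{HS}}^{2}
\;=\; \hat{C}_{YX}\,\bigl(\hat{C}_{XX} + \lambda I\bigr)^{-1},
$$
$$
\hat{C}_{XX} \;=\; \frac{1}{n}\sum_{i=1}^{n} \varphi(\mathbf{x}_i)\otimes\varphi(\mathbf{x}_i),
\qquad
\hat{C}_{YX} \;=\; \frac{1}{n}\sum_{i=1}^{n} \varphi(y_i)\otimes\varphi(\mathbf{x}_i).
$$
Expanding $\hat{A}\,\varphi(\mathbf{x}_t)$ with these empirical operators and applying the push-through identity $(\Phi\Phi^{\top} + \lambda I)^{-1}\Phi = \Phi(\Phi^{\top}\Phi + \lambda I)^{-1}$ yields the kernelized representer form quoted above.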
Next-embedding prediction in vision Transformers (Xu et al., 18 Dec 2025):
- Let $\mathbf{x}$ be an image split into $N$ patches $\mathbf{x}_1, \dots, \mathbf{x}_N$.
- Encode via a shared patch embedding $E$: $\mathbf{z}_i = E(\mathbf{x}_i)$.
- An autoregressive transformer $f_\theta$ predicts $\hat{\mathbf{z}}_{i+1} = f_\theta(\mathbf{z}_1, \dots, \mathbf{z}_i)$.
- The NEPA self-supervised loss is
$$\mathcal{L} = -\frac{1}{N-1}\sum_{i=1}^{N-1} \frac{\hat{\mathbf{z}}_{i+1}^{\top}\,\mathrm{sg}[\mathbf{z}_{i+1}]}{\lVert\hat{\mathbf{z}}_{i+1}\rVert_2\,\lVert\mathbf{z}_{i+1}\rVert_2},$$
enforcing cosine similarity between predicted and ground-truth patch embeddings, using a stop-gradient $\mathrm{sg}[\cdot]$ on the target (see the loss sketch after this list).
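This objective admits a very short implementation. The following is a minimal PyTorch sketch, assuming predictions and targets are already aligned as $(B, N-1, D)$ tensors; the function and variable names are illustrative and not taken from the source:

```python
import torch
import torch.nn.functional as F

def nepa_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Negative mean cosine similarity between predicted and true patch embeddings.

    pred, target: (B, N-1, D) tensors, already shifted so that pred[:, i]
    estimates target[:, i].
    """
    target = target.detach()                   # stop-gradient on the ground-truth embeddings
    pred_norm = F.normalize(pred, dim=-1)      # unit l2-norm
    target_norm = F.normalize(target, dim=-1)
    return -(pred_norm * target_norm).sum(dim=-1).mean()
```

Detaching the target implements the stop-gradient, so gradients reach the shared patch embedding only through the prediction path.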
2. Algorithmic Details
The two NEPA formulations provide precise computational recipes.
Time series NEPA (Valencia et al., 2016):
- Build sliding windows $\mathbf{x}_i = (x_i, \dots, x_{i+p-1})$ and one-step targets $y_i = x_{i+p}$ from the data, $i = 1, \dots, n$.
- Form the Gram matrix $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ and, for each query history $\mathbf{x}_*$, the kernel vector $\mathbf{k}(\mathbf{x}_*)$ with entries $k(\mathbf{x}_i, \mathbf{x}_*)$.
- Solve for the operator coefficients via
$$\boldsymbol{\alpha} = (K + n\lambda I)^{-1}\,\mathbf{y}$$
and store them.
- Prediction: for a new history $\mathbf{x}_*$, output
$$\hat{x}_* = \boldsymbol{\alpha}^{\top}\,\mathbf{k}(\mathbf{x}_*) = \sum_{i=1}^{n} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}_*)$$
(a minimal implementation sketch follows this list).
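As a concrete illustration of this recipe, the following NumPy sketch implements the fit/predict steps under the assumptions stated here (SE-RBF kernel, ridge regularization); the function names, the toy series, and the hyperparameter values are illustrative and not taken from the source.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    """Squared-exponential kernel between window sets A (n, p) and B (m, p)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def fit_kernel_nepa(series, p, bandwidth, lam):
    """Build sliding windows, form the Gram matrix, and solve for the coefficients."""
    n = len(series) - p
    X = np.stack([series[i:i + p] for i in range(n)])    # histories (n, p)
    y = np.array([series[i + p] for i in range(n)])      # one-step targets (n,)
    K = rbf_kernel(X, X, bandwidth)                       # Gram matrix (n, n)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # regularized least-squares coefficients
    return X, alpha

def predict_kernel_nepa(X_train, alpha, history, bandwidth):
    """One-step-ahead forecast for a new length-p history."""
    k_star = rbf_kernel(X_train, np.asarray(history)[None, :], bandwidth)[:, 0]
    return float(alpha @ k_star)

# Toy usage on a synthetic nonlinear series.
t = np.linspace(0.0, 12 * np.pi, 600)
series = np.sin(t) + 0.5 * np.sin(3 * t)
X_train, alpha = fit_kernel_nepa(series[:-1], p=10, bandwidth=1.0, lam=1e-3)
x_hat = predict_kernel_nepa(X_train, alpha, series[-11:-1], bandwidth=1.0)
```

In practice, `bandwidth` and `lam` would be selected by cross-validation, as described in the modeling choices below.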
Vision NEPA (Xu et al., 18 Dec 2025):
- For each batch of images:
- Compute patch embeddings $Z$ of shape $(B, N, D)$.
- Compute autoregressive predictions $\hat{Z}$ with a causal attention mask.
- Shift $\hat{Z}$ to align with $Z$; apply a stop-gradient to the targets.
- Normalize predictions and targets to unit $\ell_2$-norm.
- Compute the negative mean cosine similarity as the loss:

```python
loss = -(pred_norm * target_norm).sum(dim=-1).mean()
```

- Backpropagate and update the parameters (a fuller training-step sketch follows below).
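Putting the steps together, a minimal, self-contained PyTorch sketch of one NEPA training step might look as follows. `TinyNEPA`, its dimensions, and the random batch are hypothetical stand-ins for the shared patch embedding and pre-norm ViT backbone described in the next section; positional encoding (RoPE in the source) and other stabilization tricks are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNEPA(nn.Module):
    """Toy NEPA model: Conv2d patch embedding + causal transformer predictor (illustrative only)."""

    def __init__(self, patch=4, dim=64, depth=2, heads=4):
        super().__init__()
        # Shared patch embedding; kernel size and stride equal the patch size.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.predictor = nn.TransformerEncoder(layer, depth)

    def forward(self, images):
        z = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, D)
        n = z.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        z_hat = self.predictor(z, mask=causal)                     # causal next-embedding predictions
        return z, z_hat

model = TinyNEPA()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

images = torch.randn(8, 3, 32, 32)                 # stand-in batch
z, z_hat = model(images)
pred = F.normalize(z_hat[:, :-1], dim=-1)          # prediction for patch i+1 from prefix <= i
target = F.normalize(z[:, 1:].detach(), dim=-1)    # shifted targets with stop-gradient
loss = -(pred * target).sum(dim=-1).mean()         # negative mean cosine similarity
opt.zero_grad()
loss.backward()
opt.step()
```

Note that the entire step requires a single forward pass and no decoder, matching the pipeline described above.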
3. Architectural and Modeling Choices
Kernel NEPA leverages the kernel trick to capture nonlinear dependencies by lifting AR modeling to an RKHS, typically employing a squared-exponential (SE-RBF) kernel with bandwidth $\sigma$ chosen by cross-validation. The regularization parameter $\lambda$ is likewise tuned by cross-validation to ensure invertibility and a good bias-variance tradeoff. Computational complexity is $O(n^3)$ for training ($n$: number of training windows) and $O(n)$ per prediction, as in standard kernel ridge regression.
Vision NEPA is implemented via a pre-norm Vision Transformer:
- Patch-embed the input via a Conv2d layer whose kernel size and stride equal the patch size (non-overlapping patches).
- ViT-B: 12 layers, 768-dim embeddings, 3072 hidden, 12 heads; ViT-L: 24 layers, 1024-dim, 4096 hidden, 16 heads.
- Pretraining uses causal attention masks.
- Stabilization: RoPE, LayerScale, SwiGLU, QK-Norm.
- Downstream heads: linear for classification, UPerNet for segmentation.
No decoder, discrete tokenization, or negative sampling is required; a single forward pass suffices.
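For quick reference, the backbone dimensions reported above can be summarized in a small configuration mapping; the dictionary below is an illustrative summary of those figures, not a configuration file from the source.

```python
# Illustrative summary of the reported NEPA backbone sizes.
NEPA_BACKBONES = {
    "ViT-B": {"layers": 12, "embed_dim": 768,  "mlp_hidden": 3072, "heads": 12},
    "ViT-L": {"layers": 24, "embed_dim": 1024, "mlp_hidden": 4096, "heads": 16},
}
```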
4. Empirical Results
Kernel NEPA achieves strong one-step-ahead forecasting accuracy on time series benchmarks; the table reports one-step prediction error (lower is better):
| Database | Linear AR | Kernel AR (KAM) | NEPA |
|---|---|---|---|
| Earthrot | 689.3 | 313.6 | 254.1 |
| CO | 0.657 | 0.612 | 0.519 |
| MG | 372.1 | 11.09 | 2.39 |
| Lorenz | 0.305 | 0.0284 | 0.0239 |
| Lorenz | 0.991 | 0.0454 | 0.0252 |
| Lorenz | 0.437 | 0.1242 | 0.1276 |
NEPA outperforms linear AR and previous kernel AR approaches, especially for nonlinear or chaotic series (Valencia et al., 2016).
Vision NEPA delivers competitive self-supervised visual representation learning on ImageNet-1K and ADE20K:
| Method | ViT-B Top-1 (%) | ViT-L Top-1 (%) | ADE20K mIoU (B/L) |
|---|---|---|---|
| Supervised-IN1K | — | — | 47.4 / 49.9 |
| MoCo v3 | 83.2 | 84.1 | 47.3 / 49.1 |
| BEiT | 83.4 | 85.2 | 47.1 / 53.3 |
| MAE | 83.6 | 85.6 | 48.1 / 53.6 |
| NEPA (no decoder, 1 pass) | 82.5 | 84.1 | — |
| NEPA | 83.8 | 85.3 | 48.3 / 54.0 |
NEPA attains comparable or superior results with less architectural and procedural complexity (Xu et al., 18 Dec 2025).
5. Comparative Perspective and Applications
NEPA in time series constitutes a nonlinear alternative to classical AR modeling, generalizing conditional expectation by learning linear operators in RKHS. This permits effective modeling of highly nonlinear and chaotic dynamics, especially for moderate data regimes where regularization is effective and high-parameter neural models or GP regressors risk overfitting (Valencia et al., 2016).
In visual self-supervised learning, NEPA offers an alternative to pixel reconstruction, masked token prediction, and contrastive or distillation approaches by instead training a causal predictor on sequences of patch embeddings. Notably, NEPA eschews complex decoders, negative pairs, and tokenization steps, enabling a simple and scalable pretraining pipeline. The modeling principle is directly analogous to the GPT paradigm in language modeling, with next-token prediction replaced by next-embedding prediction (Xu et al., 18 Dec 2025).
6. Limitations and Extensions
Kernel NEPA's computational bottleneck is the cubic scaling of training in the number of training windows $n$, with attendant memory and matrix-inversion constraints for large $n$. Practical application is favored for moderate $n$ and when the signal is highly nonlinear or chaotic; regularized operator estimation and cross-validated hyperparameters are essential for robust generalization (Valencia et al., 2016).
Vision NEPA, although competitive at end-to-end fine-tuning, exhibits relatively poor performance under linear probing (approx. 14% top-1), suggesting that the learned representations are tightly coupled to the pretrained predictor and may not universally transfer. Additional limitations include challenges modeling complex spatial phenomena (e.g., reflections) and the need for scaling to larger or more diverse data regimes. Potential extensions include hybridization with generative decoders or diffusion models, scaling to multimodal or cross-modal predictive tasks, and integration with language autoregressive objectives (Xu et al., 18 Dec 2025).
7. Significance and Future Directions
The NEPA paradigm—predicting next embeddings rather than raw tokens—demonstrates efficacy across divergent domains, preserving scalability and simplicity while permitting the direct modeling of nonlinear dependencies. In time series, RKHS embedding harnesses nonparametric flexibility; in vision, transformer-based NEPA unifies autoregressive prediction and self-supervision without auxiliary heads or losses. A plausible implication is that next-embedding prediction may form a modality-agnostic foundation for generative and self-supervised learning frameworks, and further research into larger-scale, multimodal, or hybrid NEPA instantiations is indicated (Valencia et al., 2016, Xu et al., 18 Dec 2025).