
Autoregression on Input Embedding

Updated 5 January 2026
  • Autoregression on Input Embedding designates a family of methods that model sequential dependencies on feature representations of the inputs, rather than on the raw observations, using kernel methods and deep networks.
  • Operating in a high-dimensional embedding space allows non-linear dynamics to be modeled, with the learned embeddings acting as predictive sufficient statistics that support improved accuracy.
  • The approach finds practical applications in language modeling, financial forecasting, and event sequence analysis through contrastive and generative training objectives.

Autoregression on Input Embedding encompasses a class of methodologies in which an autoregressive model operates directly on feature representations or embeddings of input data, rather than raw observations. This perspective unifies developments in kernel methods, deep neural architectures—particularly LLMs—and structured probabilistic modeling. Unlike classical AR models, which operate on the input space (e.g., $\mathbb{R}^d$ for time series), these methods embed inputs into a feature space (possibly implicit, as in an RKHS, or explicit, as with learned deep embeddings) and perform autoregressive modeling or prediction within that space.

1. Foundational Principles and Theoretical Framework

Autoregression on input embedding restructures the standard autoregressive assumption $x_t = f(x_{t-1}, x_{t-2}, \ldots)$, replacing $x_t$ with a transformed or embedded version. In formal terms, for an input space $\mathcal{X}$ and feature map $\phi: \mathcal{X} \to \mathcal{H}$ (where $\mathcal{H}$ is a Hilbert or vector space), the process is defined as:

$$\phi(x_t) = \sum_{i=1}^{p} A_i\,\phi(x_{t-i}) + e_t$$

where the $A_i$ are operators (or matrices) and $e_t$ is an innovation noise term. This setup generalizes classical linear AR processes to arbitrary feature spaces, allowing non-linear dynamics to be modeled through the properties of $\phi$ and the structure of $\mathcal{H}$, as in kernel AR embedding or deep network embeddings (Valencia et al., 2016).
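To make the construction concrete, the following minimal sketch (illustrative, not drawn from the cited papers) fits an order-$p$ autoregression directly in an explicit feature space; the random Fourier feature map, the ridge penalty, and the nearest-neighbour pre-image step are assumptions standing in for whatever embedding and estimator a given method uses.

```python
# Minimal sketch (illustrative, not from the cited papers): an order-p
# autoregression fitted directly in an explicit feature space.  phi is a
# random Fourier feature map standing in for a generic embedding; the
# operators [A_1 ... A_p] are estimated jointly by ridge regression.
import numpy as np

rng = np.random.default_rng(0)

def phi(x, W, b):
    """Explicit feature map R^d -> R^D (random Fourier features)."""
    return np.cos(x @ W + b)

# Toy nonlinear scalar series
T, d, D, p, lam = 500, 1, 64, 2, 1e-3
x = np.zeros((T, d))
for t in range(1, T):
    x[t] = np.sin(2.5 * x[t - 1]) + 0.05 * rng.standard_normal(d)

W = rng.standard_normal((d, D))
b = rng.uniform(0, 2 * np.pi, D)
Phi = phi(x, W, b)                                   # embedded series, (T, D)

# Lagged design: row t holds [phi(x_{t-1}), ..., phi(x_{t-p})]
Z = np.hstack([Phi[p - i - 1:T - i - 1] for i in range(p)])   # (T-p, p*D)
Y = Phi[p:]                                                    # (T-p, D)

# Joint ridge estimate of the stacked operators
A = np.linalg.solve(Z.T @ Z + lam * np.eye(p * D), Z.T @ Y)    # (p*D, D)

# One-step prediction in feature space, then a crude nearest-neighbour pre-image
z_last = np.hstack([Phi[T - 1 - i] for i in range(p)])
phi_pred = z_last @ A
x_pred = x[np.argmin(np.linalg.norm(Phi - phi_pred, axis=1))]
print("predicted next value (via pre-image):", x_pred)
```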

Theoretical investigations in LLMs reveal that autoregressive embeddings act as predictive sufficient statistics. For any sequence, the embedding $h_t = f(x_{\leq t})$ constructed by an autoregressive model discards all information not required to recover the next-step predictive distribution $p(x_{t+1} \mid x_{\leq t})$ (Zhang et al., 2024). For exchangeable data, $h_t$ encodes minimal sufficient statistics; for Markov/state-space processes, $h_t$ represents posterior distributions over latent states.
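As an illustrative instance of the exchangeable case (a standard textbook example, not quoted from (Zhang et al., 2024)), consider Bernoulli observations under a $\mathrm{Beta}(a, b)$ prior: the next-step predictive depends on the prefix $x_{\leq t}$ only through the running count $s_t$, so a predictive-sufficient embedding need only track the pair $(t, s_t)$:

$$p(x_{t+1} = 1 \mid x_{\leq t}) = \frac{a + s_t}{a + b + t}, \qquad s_t = \sum_{i=1}^{t} x_i$$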

2. Kernel and Hilbert Space Embeddings: Nonlinear AR Models

Kernel methods provide a principled, non-parametric generalization of autoregression on embeddings, replacing input vectors by their images under a reproducing kernel feature map $\phi$. In the framework of (Valencia et al., 2016), the core autoregressive process in a reproducing kernel Hilbert space (RKHS) allows one to express dynamics as

$$\phi(x_t) = \sum_{i=1}^{p} \alpha_i\,\phi(x_{t-i}) + e_t$$

for scalar coefficients $\alpha_i$ on the data span. The estimation procedure involves constructing empirical cross-covariance operators:

$$\widehat{C}_{\phi_t,\phi_{t-k}} = \frac{1}{m}\,\Phi_t \Phi_{t-k}^{\top}$$

with $\Phi_t$ a matrix of $m$ embedded samples, and solving block-wise Yule–Walker equations for the AR coefficients. Prediction requires computing a pre-image: finding $x_{t+1}$ such that $\phi(x_{t+1})$ closely matches the autoregressive prediction in $\mathcal{H}$. The method achieves root-mean-square-error improvements over linear AR, kernel AR via inner products, GP, and neural models on nonlinear benchmarks, enabled by the full non-parametric representational capacity of the kernel embedding (Valencia et al., 2016).
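A compact sketch of this pipeline, under simplifying assumptions (scalar AR coefficients obtained from a Toeplitz Yule–Walker system built from kernel autocovariances, and a grid-search pre-image), is given below; it is an illustrative reading of the construction rather than the exact estimator of (Valencia et al., 2016).

```python
# Hedged sketch of kernel AR on embeddings: scalar coefficients alpha_i are
# obtained from Yule-Walker-style equations built from feature-space
# autocovariances c_k = (1/m) sum_t k(x_t, x_{t-k}); prediction is a
# pre-image search over a candidate grid using kernel evaluations only.
import numpy as np

def rbf(a, b, gamma=2.0):
    return np.exp(-gamma * (a - b) ** 2)

rng = np.random.default_rng(1)
T, p = 400, 3
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.9 * np.sin(x[t - 1]) + 0.1 * rng.standard_normal()

def autocov(k):
    """Empirical feature-space autocovariance at lag k."""
    return np.mean(rbf(x[k:], x[:T - k])) if k > 0 else np.mean(rbf(x, x))

c = np.array([autocov(k) for k in range(p + 1)])
R = np.array([[c[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz
alpha = np.linalg.solve(R + 1e-9 * np.eye(p), c[1:])                  # YW solve

# Pre-image: pick x* minimising ||phi(x*) - sum_i alpha_i phi(x_{T-i})||^2,
# which expands into kernel evaluations on a grid of candidates.
grid = np.linspace(x.min() - 1.0, x.max() + 1.0, 2000)
obj = rbf(grid, grid) - 2 * sum(alpha[i] * rbf(grid, x[T - 1 - i]) for i in range(p))
print("kernel-AR next-step prediction:", grid[np.argmin(obj)])
```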

3. Deep Autoregressive Embeddings in Sequence Modeling and LLMs

In transformer-based LLMs and sequence models, each input sequence $x_{1:t}$ is mapped by an embedding function $f$ to a latent $h_t$ such that $p(x_{t+1} \mid x_{1:t}) = g(h_t)$. Analysis demonstrates that, under next-token prediction losses, these embeddings learn to encode exactly the information required for optimal prediction, i.e., they implement predictive sufficient statistics over the underlying generative process (Zhang et al., 2024). This structure has been empirically validated in settings with:

  • IID data: $h_t$ captures, for example, the sample mean and variance;
  • Hidden Markov/state-space models: $h_t$ encodes the posterior over latent states;
  • Mixture hypotheses: $h_t$ represents posterior probabilities over mixture components.

Probing experiments with transformers show linear decodability of these statistics or posteriors from network embeddings, confirming that autoregression on embeddings yields optimally compressed, prediction-oriented representations (Zhang et al., 2024).
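The probing recipe itself is simple and can be sketched as follows: collect hidden states from the model, regress the candidate statistic on them with a ridge penalty, and inspect the held-out fit. The tensor names, the 50/50 split, and the synthetic check below are assumptions for illustration only.

```python
# Probing sketch (hypothetical tensors, not tied to a specific model): given
# hidden states H[t] collected from an autoregressive model, test whether a
# statistic s[t] (running sample mean, posterior probability, ...) is
# linearly decodable from them.
import numpy as np

def linear_probe(H, s, lam=1e-2, seed=0):
    """Ridge-regression probe; returns held-out R^2 of the linear readout."""
    n = H.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    tr, te = idx[: n // 2], idx[n // 2:]
    Hb = np.hstack([H, np.ones((n, 1))])                     # bias feature
    W = np.linalg.solve(Hb[tr].T @ Hb[tr] + lam * np.eye(Hb.shape[1]),
                        Hb[tr].T @ s[tr])
    pred = Hb[te] @ W
    return 1.0 - np.sum((s[te] - pred) ** 2) / np.sum((s[te] - s[te].mean()) ** 2)

# Synthetic check: a statistic that is a noisy linear function of the states.
H = np.random.default_rng(1).standard_normal((2000, 32))
s = H @ np.linspace(-1, 1, 32) + 0.1 * np.random.default_rng(2).standard_normal(2000)
print("held-out R^2 of the probe:", round(linear_probe(H, s), 3))
```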

4. Practical Implementations: Contrastive and Generative Training of Embedding Models

Contemporary approaches leverage the autoregressive structure of LLM embeddings to design new contrastive and generative training objectives. A salient example is the AutoRegEmbed scheme (Deng et al., 17 Feb 2025), which adheres to the autoregressive inductive bias by defining embeddings as parametrizations of entire conditional distributions:

$$p(d \mid e(x)) = \prod_{t=1}^{T} p(d_t \mid d_{<t}, e(x))$$

This yields two complementary training phases:

  1. Information Compression: Learning “compressed” embeddings via special tokens so that a frozen LLM decoder can reconstruct global target semantics. The loss:

$$\mathcal{L}_{\text{IC}} = -\sum_{t=1}^{T} \log p(d_t \mid d_{<t}, e(x))$$

  2. Conditional Distribution Alignment: Aligning conditional distributions between query and positive pairs, using an InfoNCE-style loss on absolute log-probability differences over sampled document triplets.

Empirical results demonstrate that AutoRegEmbed achieves superior semantic alignment and uniformity, outperforming baselines dependent on cosine or KL-based alignment in both data and computational efficiency. Ablation studies corroborate the necessity of both losses for optimal performance (Deng et al., 17 Feb 2025).
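A hedged sketch of the information-compression phase follows: a small trainable compressor produces $k$ embedding vectors standing in for $e(x)$, and a frozen causal LM must reconstruct the target $d$ conditioned only on them. The choice of GPT-2, the toy mean-pooling compressor, and the masked-label `inputs_embeds` interface are assumptions, not the authors' implementation.

```python
# Hedged sketch of the information-compression phase: a trainable compressor
# produces k embedding vectors e(x); a frozen causal LM must reconstruct the
# target d conditioned only on them.  GPT-2, the mean-pooling compressor, and
# the inputs_embeds / label-masking interface are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                        # stand-in frozen decoder
tok = AutoTokenizer.from_pretrained(name)
decoder = AutoModelForCausalLM.from_pretrained(name).eval()
for prm in decoder.parameters():
    prm.requires_grad_(False)                        # decoder stays frozen

k, d_model = 4, decoder.config.hidden_size
compress = torch.nn.Linear(d_model, k * d_model)     # toy trainable compressor

def information_compression_loss(x_text, d_text):
    emb = decoder.get_input_embeddings()
    x_ids = tok(x_text, return_tensors="pt").input_ids
    d_ids = tok(d_text, return_tensors="pt").input_ids
    # e(x): a linear map of the mean input embedding, reshaped to k vectors
    e_x = compress(emb(x_ids).mean(dim=1)).view(1, k, d_model)
    inputs = torch.cat([e_x, emb(d_ids)], dim=1)
    # compressed-prefix positions are excluded from the loss (label -100)
    labels = torch.cat([torch.full((1, k), -100, dtype=torch.long), d_ids], dim=1)
    return decoder(inputs_embeds=inputs, labels=labels).loss

loss = information_compression_loss("a long input passage", "its target text")
loss.backward()                                      # gradients reach only the compressor
```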

5. Specialized Applications: Financial and Event Sequence Embedding

Autoregression over input embeddings extends to domains such as financial transactions and time-series sensor data (Skalski et al., 2024). Each event is mapped to a vector via numeric feature normalization and categorical embeddings, and sequenced through a unidirectional RNN (GRU), ensuring that at each time step the embedding $e_t$ encodes all information from current and prior events. The model is trained with both next-step prediction and past reconstruction objectives:

$$\mathcal{L} = \sum_{t=0}^{T-1}\left[(1-\alpha)\,\mathcal{L}_t^{\text{NP}} + \alpha\,\mathcal{L}_t^{\text{PR}}\right]$$

where $\mathcal{L}_t^{\text{NP}}$ is the next-event prediction loss and $\mathcal{L}_t^{\text{PR}}$ the past-event reconstruction loss, the latter weighted by an exponential kernel in the time gaps. This approach produces embeddings that yield state-of-the-art self-supervised representations, transfer across institutions, and deliver large gains on downstream detection and classification tasks (Skalski et al., 2024).
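The following sketch illustrates this overall structure under stated assumptions: events are encoded as categorical embeddings plus numeric features, a unidirectional GRU produces the per-step embedding, and the loss mixes next-event prediction with time-decayed past-event reconstruction. The MSE reconstruction targets, dimensions, and decay hyper-parameter are illustrative; the published model's heads and losses may differ.

```python
# Illustrative event-sequence encoder (dimensions, MSE targets, and the decay
# hyper-parameter are assumptions): categorical embedding + numeric features,
# a unidirectional GRU for the running embedding e_t, and a loss mixing
# next-event prediction with time-decayed past-event reconstruction.
import torch
import torch.nn as nn

class EventEncoder(nn.Module):
    def __init__(self, n_cat=100, d_cat=16, d_num=4, d_hid=64):
        super().__init__()
        self.cat_emb = nn.Embedding(n_cat, d_cat)
        self.gru = nn.GRU(d_cat + d_num, d_hid, batch_first=True)
        self.next_head = nn.Linear(d_hid, d_cat + d_num)   # predicts event t+1
        self.past_head = nn.Linear(d_hid, d_cat + d_num)   # reconstructs event t-1

    def forward(self, cat_ids, num_feats):
        x = torch.cat([self.cat_emb(cat_ids), num_feats], dim=-1)  # (B, T, d)
        h, _ = self.gru(x)                                          # e_t per step
        return x, h

def loss_fn(model, cat_ids, num_feats, times, alpha=0.3, decay=1.0):
    x, h = model(cat_ids, num_feats)
    # next-event prediction: e_t -> event t+1
    l_np = ((model.next_head(h[:, :-1]) - x[:, 1:]) ** 2).mean(dim=-1)
    # past reconstruction: e_t -> event t-1, down-weighted by the time gap
    w = torch.exp(-decay * (times[:, 1:] - times[:, :-1]).clamp(min=0))
    l_pr = w * ((model.past_head(h[:, 1:]) - x[:, :-1]) ** 2).mean(dim=-1)
    return ((1 - alpha) * l_np + alpha * l_pr).mean()
```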

6. Limits and Extensions: Compression Capacity and Non-Autoregressive Reconstruction

A recent direction explores the fundamental information capacity of autoregressive embeddings. LLMs trained to reconstruct entire texts from a small set of optimized “proto-token” embeddings can recover hundreds or even thousands of tokens with near-perfect fidelity (Mezentsev et al., 27 May 2025). In standard autoregressive settings, a single memory token can compress up to approximately 2000 nats, while a non-autoregressive, two-token scheme (one “entry” and one “memory” token) can store approximately 800 nats, about 40% of the autoregressive capacity for natural texts. The geometry of embedding space, probed via cosine distances and Bezier interpolations, reveals clustered, locally connected solution sets, indicating that non-autoregressive representations could form the basis for rapid context compression and one-step decoding. A plausible implication is the feasibility of learning dedicated encoders that map arbitrary sequences to proto-embeddings for fast, single-pass generation in frozen LLMs—a significant extension beyond strict autoregressive pipelines (Mezentsev et al., 27 May 2025).
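A per-text "proto-token" experiment of this flavour can be sketched as follows: a single trainable embedding vector is prepended to a frozen causal LM and optimized by gradient descent until the model reconstructs the target text from that one vector. The model choice, optimizer settings, and step count below are assumptions for illustration, not the cited paper's setup.

```python
# Hedged sketch of per-text proto-token compression: one trainable embedding
# vector is prepended to a frozen causal LM and optimised until the model can
# reconstruct the target text from it.  Model, learning rate, and step count
# are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for prm in lm.parameters():
    prm.requires_grad_(False)                     # the LM itself stays frozen

text_ids = tok("the text to be memorised ...", return_tensors="pt").input_ids
memory = torch.randn(1, 1, lm.config.hidden_size, requires_grad=True)
opt = torch.optim.Adam([memory], lr=1e-2)
emb = lm.get_input_embeddings()
labels = torch.cat([torch.tensor([[-100]]), text_ids], dim=1)  # ignore the memory slot

for step in range(500):                           # optimise only the memory token
    inputs = torch.cat([memory, emb(text_ids)], dim=1)
    loss = lm(inputs_embeds=inputs, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```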

7. Outlook and Research Frontiers

Autoregression on input embedding provides a unified lens for understanding predictive representation learning across kernel methods, deep neural architectures, and structured probabilistic models. Key frontiers include:

  • Extending the sufficiency principle to implicitly learned representations in massive, real-world models and data regimes (Zhang et al., 2024),
  • Jointly optimizing for efficient contrastive and generative representation objectives guided by autoregressive capacity (Deng et al., 17 Feb 2025),
  • Exploiting the geometric properties of high-capacity embedding spaces for amortized or few-shot encoding of complex input sequences (Mezentsev et al., 27 May 2025),
  • Applying autoregressive embedding schemes in domains beyond text and finance, including multidimensional sensor streams, biological sequences, and high-frequency transactional systems (Valencia et al., 2016, Skalski et al., 2024).

The ongoing synthesis of kernel embedding theory, neural sequence modeling, and sufficiency characterizations continues to clarify the theoretical and practical landscape for next-generation predictive representation models.
