
Autoregression on Input Embedding

Updated 5 January 2026
  • Autoregression on Input Embedding designates a family of methods that model sequential dependencies on feature representations of the inputs, rather than on the raw observations, using kernel methods and deep networks.
  • Operating in a high-dimensional embedding space allows non-linear dynamics to be modeled, with the learned embeddings acting as predictive sufficient statistics that support improved accuracy.
  • The approach finds practical applications in language modeling, financial forecasting, and event sequence analysis through contrastive and generative training objectives.

Autoregression on Input Embedding encompasses a class of methodologies in which an autoregressive model operates directly on feature representations or embeddings of input data, rather than raw observations. This perspective unifies developments in kernel methods, deep neural architectures—particularly LLMs—and structured probabilistic modeling. Unlike classical AR models, which operate on the input space (e.g., $\mathbb{R}^d$ for time series), these methods embed inputs into a feature space (possibly implicit, as in an RKHS, or explicit, as with learned deep embeddings) and perform autoregressive modeling or prediction within that space.

1. Foundational Principles and Theoretical Framework

Autoregression on input embedding restructures the standard autoregressive assumption $x_t = f(x_{t-1}, x_{t-2}, \ldots)$, replacing $x_t$ with a transformed or embedded version. In formal terms, for an input space $\mathcal{X}$ and feature map $\phi: \mathcal{X} \to \mathcal{H}$ (where $\mathcal{H}$ is a Hilbert or vector space), the process is defined as:

$$\phi(x_t) = \sum_{i=1}^{p} A_i\,\phi(x_{t-i}) + e_t$$

where the $A_i$ are operators (or matrices) and $e_t$ is an innovation noise term. This setup generalizes classical linear AR processes to arbitrary feature spaces, allowing non-linear dynamics to be modeled through the properties of $\phi$ and the structure of $\mathcal{H}$, as in kernel AR embedding or deep network embeddings (Valencia et al., 2016).
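To make the construction concrete, the following minimal sketch (illustrative, not drawn from the cited papers) fits an order-$p$ autoregression directly in an explicit feature space; the random Fourier feature map, the ridge penalty, and the nearest-neighbour pre-image step are assumptions standing in for whatever embedding and estimator a given method uses.

```python
# Minimal sketch (illustrative, not from the cited papers): an order-p
# autoregression fitted directly in an explicit feature space.  phi is a
# random Fourier feature map standing in for a generic embedding; the
# operators [A_1 ... A_p] are estimated jointly by ridge regression.
import numpy as np

rng = np.random.default_rng(0)

def phi(x, W, b):
    """Explicit feature map R^d -> R^D (random Fourier features)."""
    return np.cos(x @ W + b)

# Toy nonlinear scalar series
T, d, D, p, lam = 500, 1, 64, 2, 1e-3
x = np.zeros((T, d))
for t in range(1, T):
    x[t] = np.sin(2.5 * x[t - 1]) + 0.05 * rng.standard_normal(d)

W = rng.standard_normal((d, D))
b = rng.uniform(0, 2 * np.pi, D)
Phi = phi(x, W, b)                                   # embedded series, (T, D)

# Lagged design: row t holds [phi(x_{t-1}), ..., phi(x_{t-p})]
Z = np.hstack([Phi[p - i - 1:T - i - 1] for i in range(p)])   # (T-p, p*D)
Y = Phi[p:]                                                    # (T-p, D)

# Joint ridge estimate of the stacked operators
A = np.linalg.solve(Z.T @ Z + lam * np.eye(p * D), Z.T @ Y)    # (p*D, D)

# One-step prediction in feature space, then a crude nearest-neighbour pre-image
z_last = np.hstack([Phi[T - 1 - i] for i in range(p)])
phi_pred = z_last @ A
x_pred = x[np.argmin(np.linalg.norm(Phi - phi_pred, axis=1))]
print("predicted next value (via pre-image):", x_pred)
```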

Theoretical investigations in LLMs reveal that autoregressive embeddings act as predictive sufficient statistics. For any sequence, the embedding $h_t = f(x_{\leq t})$ constructed by an autoregressive model discards all information not required to recover the next-step predictive distribution $p(x_{t+1} \mid x_{\leq t})$ (Zhang et al., 2024). For exchangeable data, $h_t$ encodes minimal sufficient statistics; for Markov/state-space processes, $h_t$ represents posterior distributions over latent states.
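As an illustrative instance of the exchangeable case (a standard textbook example, not quoted from (Zhang et al., 2024)), consider Bernoulli observations under a $\mathrm{Beta}(a, b)$ prior: the next-step predictive depends on the prefix $x_{\leq t}$ only through the running count $s_t$, so a predictive-sufficient embedding need only track the pair $(t, s_t)$:

$$p(x_{t+1} = 1 \mid x_{\leq t}) = \frac{a + s_t}{a + b + t}, \qquad s_t = \sum_{i=1}^{t} x_i$$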

2. Kernel and Hilbert Space Embeddings: Nonlinear AR Models

Kernel methods provide a principled, non-parametric generalization of autoregression on embeddings, replacing input vectors by their images under a reproducing kernel feature map $\phi$. In the framework of (Valencia et al., 2016), the core autoregressive process in a reproducing kernel Hilbert space (RKHS) allows one to express dynamics as

$$\phi(x_t) = \sum_{i=1}^{p} \alpha_i\,\phi(x_{t-i}) + e_t$$

for scalar coefficients $\alpha_i$ on the data span. The estimation procedure involves constructing empirical cross-covariance operators:

$$\widehat{C}_{\phi_t,\phi_{t-k}} = \frac{1}{m}\,\Phi_t \Phi_{t-k}^{\top}$$

with $\Phi_t$ a matrix of $m$ embedded samples, and solving block-wise Yule–Walker equations for the AR coefficients. Prediction requires computing a pre-image: finding $x_{t+1}$ such that $\phi(x_{t+1})$ closely matches the autoregressive prediction in $\mathcal{H}$. The method achieves root-mean-square-error improvements over linear AR, kernel AR via inner products, GP, and neural models on nonlinear benchmarks, enabled by the full non-parametric representational capacity of the kernel embedding (Valencia et al., 2016).
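A compact sketch of this pipeline, under simplifying assumptions (scalar AR coefficients obtained from a Toeplitz Yule–Walker system built from kernel autocovariances, and a grid-search pre-image), is given below; it is an illustrative reading of the construction rather than the exact estimator of (Valencia et al., 2016).

```python
# Hedged sketch of kernel AR on embeddings: scalar coefficients alpha_i are
# obtained from Yule-Walker-style equations built from feature-space
# autocovariances c_k = (1/m) sum_t k(x_t, x_{t-k}); prediction is a
# pre-image search over a candidate grid using kernel evaluations only.
import numpy as np

def rbf(a, b, gamma=2.0):
    return np.exp(-gamma * (a - b) ** 2)

rng = np.random.default_rng(1)
T, p = 400, 3
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.9 * np.sin(x[t - 1]) + 0.1 * rng.standard_normal()

def autocov(k):
    """Empirical feature-space autocovariance at lag k."""
    return np.mean(rbf(x[k:], x[:T - k])) if k > 0 else np.mean(rbf(x, x))

c = np.array([autocov(k) for k in range(p + 1)])
R = np.array([[c[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz
alpha = np.linalg.solve(R + 1e-9 * np.eye(p), c[1:])                  # YW solve

# Pre-image: pick x* minimising ||phi(x*) - sum_i alpha_i phi(x_{T-i})||^2,
# which expands into kernel evaluations on a grid of candidates.
grid = np.linspace(x.min() - 1.0, x.max() + 1.0, 2000)
obj = rbf(grid, grid) - 2 * sum(alpha[i] * rbf(grid, x[T - 1 - i]) for i in range(p))
print("kernel-AR next-step prediction:", grid[np.argmin(obj)])
```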

3. Deep Autoregressive Embeddings in Sequence Modeling and LLMs

In transformer-based LLMs and sequence models, each input sequence $x_{1:t}$ is mapped by an embedding function $f$ to a latent $h_t$ such that $p(x_{t+1} \mid x_{1:t}) = g(h_t)$. Analysis demonstrates that, under next-token prediction losses, these embeddings learn to encode exactly the information required for optimal prediction, i.e., they implement predictive sufficient statistics over the underlying generative process (Zhang et al., 2024). This structure has been empirically validated in settings with:

  • IID data: $h_t$ captures, for example, the sample mean and variance;
  • Hidden Markov/state-space models: $h_t$ encodes the posterior over latent states;
  • Mixture hypotheses: $h_t$ represents posterior probabilities over mixture components.

Probing experiments with transformers show linear decodability of these statistics or posteriors from network embeddings, confirming that autoregression on embeddings yields optimally compressed, prediction-oriented representations (Zhang et al., 2024).
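The probing recipe itself is simple and can be sketched as follows: collect hidden states from the model, regress the candidate statistic on them with a ridge penalty, and inspect the held-out fit. The tensor names, the 50/50 split, and the synthetic check below are assumptions for illustration only.

```python
# Probing sketch (hypothetical tensors, not tied to a specific model): given
# hidden states H[t] collected from an autoregressive model, test whether a
# statistic s[t] (running sample mean, posterior probability, ...) is
# linearly decodable from them.
import numpy as np

def linear_probe(H, s, lam=1e-2, seed=0):
    """Ridge-regression probe; returns held-out R^2 of the linear readout."""
    n = H.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    tr, te = idx[: n // 2], idx[n // 2:]
    Hb = np.hstack([H, np.ones((n, 1))])                     # bias feature
    W = np.linalg.solve(Hb[tr].T @ Hb[tr] + lam * np.eye(Hb.shape[1]),
                        Hb[tr].T @ s[tr])
    pred = Hb[te] @ W
    return 1.0 - np.sum((s[te] - pred) ** 2) / np.sum((s[te] - s[te].mean()) ** 2)

# Synthetic check: a statistic that is a noisy linear function of the states.
H = np.random.default_rng(1).standard_normal((2000, 32))
s = H @ np.linspace(-1, 1, 32) + 0.1 * np.random.default_rng(2).standard_normal(2000)
print("held-out R^2 of the probe:", round(linear_probe(H, s), 3))
```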

4. Practical Implementations: Contrastive and Generative Training of Embedding Models

Contemporary approaches leverage the autoregressive structure of LLM embeddings to design new contrastive and generative training objectives. A salient example is the AutoRegEmbed scheme (Deng et al., 17 Feb 2025), which adheres to the autoregressive inductive bias by defining embeddings as parametrizations of entire conditional distributions:

$$p(d \mid e(x)) = \prod_{t=1}^{T} p(d_t \mid d_{<t}, e(x))$$

This yields two complementary training phases:

  1. Information Compression: Learning “compressed” embeddings via special tokens so that a frozen LLM decoder can reconstruct global target semantics. The loss:

$$\mathcal{L}_{\text{IC}} = -\sum_{t=1}^{T} \log p(d_t \mid d_{<t}, e(x))$$

  2. Conditional Distribution Alignment: Aligning conditional distributions between query and positive pairs, using an InfoNCE-style loss on absolute log-probability differences over sampled document triplets.

Empirical results demonstrate that AutoRegEmbed achieves superior semantic alignment and uniformity, outperforming baselines dependent on cosine or KL-based alignment in both data and computational efficiency. Ablation studies corroborate the necessity of both losses for optimal performance (Deng et al., 17 Feb 2025).
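A hedged sketch of the information-compression phase follows: a small trainable compressor produces $k$ embedding vectors standing in for $e(x)$, and a frozen causal LM must reconstruct the target $d$ conditioned only on them. The choice of GPT-2, the toy mean-pooling compressor, and the masked-label `inputs_embeds` interface are assumptions, not the authors' implementation.

```python
# Hedged sketch of the information-compression phase: a trainable compressor
# produces k embedding vectors e(x); a frozen causal LM must reconstruct the
# target d conditioned only on them.  GPT-2, the mean-pooling compressor, and
# the inputs_embeds / label-masking interface are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                        # stand-in frozen decoder
tok = AutoTokenizer.from_pretrained(name)
decoder = AutoModelForCausalLM.from_pretrained(name).eval()
for prm in decoder.parameters():
    prm.requires_grad_(False)                        # decoder stays frozen

k, d_model = 4, decoder.config.hidden_size
compress = torch.nn.Linear(d_model, k * d_model)     # toy trainable compressor

def information_compression_loss(x_text, d_text):
    emb = decoder.get_input_embeddings()
    x_ids = tok(x_text, return_tensors="pt").input_ids
    d_ids = tok(d_text, return_tensors="pt").input_ids
    # e(x): a linear map of the mean input embedding, reshaped to k vectors
    e_x = compress(emb(x_ids).mean(dim=1)).view(1, k, d_model)
    inputs = torch.cat([e_x, emb(d_ids)], dim=1)
    # compressed-prefix positions are excluded from the loss (label -100)
    labels = torch.cat([torch.full((1, k), -100, dtype=torch.long), d_ids], dim=1)
    return decoder(inputs_embeds=inputs, labels=labels).loss

loss = information_compression_loss("a long input passage", "its target text")
loss.backward()                                      # gradients reach only the compressor
```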

5. Specialized Applications: Financial and Event Sequence Embedding

Autoregression over input embeddings extends to domains such as financial transactions and time-series sensor data (Skalski et al., 2024). Each event is mapped to a vector via numeric feature normalization and categorical embeddings, and sequenced through a unidirectional RNN (GRU), ensuring that at each time step the embedding $e_t$ encodes all information from current and prior events. The model is trained with both next-step prediction and past reconstruction objectives:

$$\mathcal{L} = \sum_{t=0}^{T-1}\left[(1-\alpha)\,\mathcal{L}_t^{\text{NP}} + \alpha\,\mathcal{L}_t^{\text{PR}}\right]$$

where $\mathcal{L}_t^{\text{NP}}$ is the next-event prediction loss and $\mathcal{L}_t^{\text{PR}}$ the past-event reconstruction loss, the latter weighted by an exponential kernel in the time gaps. This approach produces embeddings that yield state-of-the-art self-supervised representations, transfer across institutions, and deliver large gains on downstream detection and classification tasks (Skalski et al., 2024).
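The following sketch illustrates this overall structure under stated assumptions: events are encoded as categorical embeddings plus numeric features, a unidirectional GRU produces the per-step embedding, and the loss mixes next-event prediction with time-decayed past-event reconstruction. The MSE reconstruction targets, dimensions, and decay hyper-parameter are illustrative; the published model's heads and losses may differ.

```python
# Illustrative event-sequence encoder (dimensions, MSE targets, and the decay
# hyper-parameter are assumptions): categorical embedding + numeric features,
# a unidirectional GRU for the running embedding e_t, and a loss mixing
# next-event prediction with time-decayed past-event reconstruction.
import torch
import torch.nn as nn

class EventEncoder(nn.Module):
    def __init__(self, n_cat=100, d_cat=16, d_num=4, d_hid=64):
        super().__init__()
        self.cat_emb = nn.Embedding(n_cat, d_cat)
        self.gru = nn.GRU(d_cat + d_num, d_hid, batch_first=True)
        self.next_head = nn.Linear(d_hid, d_cat + d_num)   # predicts event t+1
        self.past_head = nn.Linear(d_hid, d_cat + d_num)   # reconstructs event t-1

    def forward(self, cat_ids, num_feats):
        x = torch.cat([self.cat_emb(cat_ids), num_feats], dim=-1)  # (B, T, d)
        h, _ = self.gru(x)                                          # e_t per step
        return x, h

def loss_fn(model, cat_ids, num_feats, times, alpha=0.3, decay=1.0):
    x, h = model(cat_ids, num_feats)
    # next-event prediction: e_t -> event t+1
    l_np = ((model.next_head(h[:, :-1]) - x[:, 1:]) ** 2).mean(dim=-1)
    # past reconstruction: e_t -> event t-1, down-weighted by the time gap
    w = torch.exp(-decay * (times[:, 1:] - times[:, :-1]).clamp(min=0))
    l_pr = w * ((model.past_head(h[:, 1:]) - x[:, :-1]) ** 2).mean(dim=-1)
    return ((1 - alpha) * l_np + alpha * l_pr).mean()
```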

6. Limits and Extensions: Compression Capacity and Non-Autoregressive Reconstruction

A recent direction explores the fundamental information capacity of autoregressive embeddings. LLMs trained to reconstruct entire texts from a small set of optimized “proto-token” embeddings can recover hundreds or even thousands of tokens with near-perfect fidelity (Mezentsev et al., 27 May 2025). In standard autoregressive settings, a single memory token can compress up to approximately 2000 nats, while a non-autoregressive, two-token scheme (one “entry” and one “memory” token) can store approximately 800 nats, about 40% of the autoregressive capacity for natural texts. The geometry of embedding space, probed via cosine distances and Bezier interpolations, reveals clustered, locally connected solution sets, indicating that non-autoregressive representations could form the basis for rapid context compression and one-step decoding. A plausible implication is the feasibility of learning dedicated encoders that map arbitrary sequences to proto-embeddings for fast, single-pass generation in frozen LLMs—a significant extension beyond strict autoregressive pipelines (Mezentsev et al., 27 May 2025).
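A per-text "proto-token" experiment of this flavour can be sketched as follows: a single trainable embedding vector is prepended to a frozen causal LM and optimized by gradient descent until the model reconstructs the target text from that one vector. The model choice, optimizer settings, and step count below are assumptions for illustration, not the cited paper's setup.

```python
# Hedged sketch of per-text proto-token compression: one trainable embedding
# vector is prepended to a frozen causal LM and optimised until the model can
# reconstruct the target text from it.  Model, learning rate, and step count
# are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for prm in lm.parameters():
    prm.requires_grad_(False)                     # the LM itself stays frozen

text_ids = tok("the text to be memorised ...", return_tensors="pt").input_ids
memory = torch.randn(1, 1, lm.config.hidden_size, requires_grad=True)
opt = torch.optim.Adam([memory], lr=1e-2)
emb = lm.get_input_embeddings()
labels = torch.cat([torch.tensor([[-100]]), text_ids], dim=1)  # ignore the memory slot

for step in range(500):                           # optimise only the memory token
    inputs = torch.cat([memory, emb(text_ids)], dim=1)
    loss = lm(inputs_embeds=inputs, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```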

7. Outlook and Research Frontiers

Autoregression on input embedding provides a unified lens for understanding predictive representation learning across kernel methods, deep neural architectures, and structured probabilistic models. Key frontiers include:

  • Extending the sufficiency principle to implicitly learned representations in massive, real-world models and data regimes (Zhang et al., 2024),
  • Jointly optimizing for efficient contrastive and generative representation objectives guided by autoregressive capacity (Deng et al., 17 Feb 2025),
  • Exploiting the geometric properties of high-capacity embedding spaces for amortized or few-shot encoding of complex input sequences (Mezentsev et al., 27 May 2025),
  • Applying autoregressive embedding schemes in domains beyond text and finance, including multidimensional sensor streams, biological sequences, and high-frequency transactional systems (Valencia et al., 2016, Skalski et al., 2024).

The ongoing synthesis of kernel embedding theory, neural sequence modeling, and sufficiency characterizations continues to clarify the theoretical and practical landscape for next-generation predictive representation models.
