
Generative Pretraining from Embeddings

Updated 19 December 2025
  • Generative Pretraining from Embeddings is a paradigm that predicts future embeddings in a learned space, enabling efficient self-supervised and autoregressive modeling across modalities.
  • It employs specialized architectures in vision, text, audio, and multimodal setups, utilizing techniques like stop-gradient and contrastive refinement to avoid embedding collapse.
  • Empirical results show improved scalability and transfer performance, with applications demonstrated in ImageNet classification, language embedding refinement, and audio reconstruction.

Generative Pretraining from Embeddings refers to a family of techniques where models are trained to generate semantically meaningful embedding vectors—rather than raw observations or discrete tokens—as part of the generative or predictive process. This paradigm shifts the focus from reconstructing data in pixel, waveform, or token space to learning to predict future embeddings (typically in a self-supervised or autoregressive manner), or to perform generative modeling directly within an embedding manifold. The approach has recently advanced across vision, text, audio, and multimodal domains, enabling efficient large-scale pretraining, simplified architectures, enhanced transfer performance, and unified model frameworks.

1. Task Formulation and Core Objectives

Generative pretraining from embeddings redefines the prediction target for generative models. Instead of reconstructing input data in its original space, the model predicts future or missing components in a learned embedding space. In the context of vision, Next-Embedding Predictive Autoregression (NEPA) formalizes this idea by partitioning an image $x$ into $T$ non-overlapping patches, embedding each patch as $e_t = f(\text{patch}_t(x)) \in \mathbb{R}^D$, and autoregressively training a Transformer $h_\theta$ to predict $\hat e_{t+1} = h_\theta(e_1, \ldots, e_t)$ under a causal mask. The loss aligns the normalized ground-truth and predicted embeddings, $\tilde e_{t+1}$ and $\tilde{\hat e}_{t+1}$, via negative cosine similarity:

$$\mathcal L_{\rm sim} = -\frac{1}{T-1} \sum_{t=1}^{T-1} \big\langle \tilde e_{t+1},\, \tilde{\hat e}_{t+1} \big\rangle.$$

A key innovation is the application of stop-gradient to the target embeddings, which is critical to prevent trivial collapse and enforce meaningful dynamics (Xu et al., 18 Dec 2025).
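This objective is compact enough to sketch directly. The PyTorch snippet below is a minimal rendering of the loss above, assuming the tilde denotes $\ell_2$-normalized embeddings and using `detach` for the stop-gradient; it illustrates the mechanism rather than reproducing the authors' implementation.

```python
import torch
import torch.nn.functional as F

def nepa_loss(pred_embeddings, target_embeddings):
    """Negative cosine similarity between predicted and ground-truth
    next-patch embeddings, with stop-gradient applied to the targets.

    pred_embeddings:   (B, T-1, D) predictions of h_theta for patches 2..T
    target_embeddings: (B, T-1, D) encoder embeddings for patches 2..T
    """
    # Stop-gradient on the targets: without it, the model can minimize the
    # loss trivially by collapsing all embeddings toward a constant vector.
    targets = target_embeddings.detach()

    # L2-normalize so the inner product equals cosine similarity
    # (the tilde in the formula above, under the normalization assumption).
    pred = F.normalize(pred_embeddings, dim=-1)
    targets = F.normalize(targets, dim=-1)

    # Average the negative cosine similarity over the T-1 steps and the batch.
    return -(pred * targets).sum(dim=-1).mean()
```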

For text, the GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings) framework leverages LLMs to generate a sequence of “soft” tokens in embedding space, explicitly training every step of this iterative process under a stepwise contrastive objective. Each intermediate embedding $z_k$ is optimized such that successive generations yield monotonically improving representations, enforced by an Iterative Contrastive Refinement (ICR) loss (Tsai et al., 29 Sep 2025).
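A minimal sketch of such a stepwise contrastive objective is given below. The in-batch-negative InfoNCE form, the temperature, and the hinge-style monotonicity regularizer are illustrative assumptions, not the exact ICR formulation.

```python
import torch
import torch.nn.functional as F

def stepwise_contrastive_loss(step_embeddings, positives, temperature=0.05, lam=1.0):
    """Sketch of an iterative-refinement contrastive objective.

    step_embeddings: list of K tensors (B, D), the query embedding z_k after
                     each soft-token generation step k = 1..K.
    positives:       (B, D) embeddings of the paired positive texts.
    """
    positives = F.normalize(positives, dim=-1)
    step_losses = []
    for z_k in step_embeddings:
        z_k = F.normalize(z_k, dim=-1)
        # In-batch InfoNCE: each query's positive sits on the diagonal.
        logits = z_k @ positives.T / temperature
        labels = torch.arange(z_k.size(0), device=z_k.device)
        step_losses.append(F.cross_entropy(logits, labels))

    contrastive = torch.stack(step_losses).mean()
    # Refinement regularizer: penalize any step whose loss is worse than the
    # previous step's, encouraging monotonically improving representations.
    refine = sum(F.relu(step_losses[k] - step_losses[k - 1])
                 for k in range(1, len(step_losses)))
    return contrastive + lam * refine
```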

In audio, pre-trained generative audio encoders are used to map waveforms to embeddings via masked reconstruction, after which a compact encoder denoises these embeddings; a generative vocoder synthesizes waveforms from the clean embeddings (Sun et al., 13 Jun 2025).
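The division of labor in this pipeline can be sketched as follows. Module names and interfaces (`audio_encoder`, `denoiser`, `vocoder`) are placeholders, and the embedding-space MSE supervision is an assumed training signal rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class EmbeddingSpeechEnhancer(nn.Module):
    """Sketch: frozen generative encoder -> trainable embedding denoiser ->
    frozen generative vocoder."""

    def __init__(self, audio_encoder, denoiser, vocoder):
        super().__init__()
        self.audio_encoder = audio_encoder.eval()   # frozen pre-trained encoder
        self.vocoder = vocoder.eval()               # frozen pre-trained vocoder
        self.denoiser = denoiser                    # compact trainable module
        for p in self.audio_encoder.parameters():
            p.requires_grad_(False)
        for p in self.vocoder.parameters():
            p.requires_grad_(False)

    def forward(self, noisy_waveform):
        with torch.no_grad():
            noisy_emb = self.audio_encoder(noisy_waveform)   # waveform -> embeddings
        clean_emb = self.denoiser(noisy_emb)                 # denoise in embedding space
        return self.vocoder(clean_emb)                       # embeddings -> waveform

    def training_loss(self, noisy_waveform, clean_waveform):
        # Assumed supervision: match embeddings of the clean reference.
        with torch.no_grad():
            target_emb = self.audio_encoder(clean_waveform)
            noisy_emb = self.audio_encoder(noisy_waveform)
        return nn.functional.mse_loss(self.denoiser(noisy_emb), target_emb)
```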

2. Model Architectures and Training Methodologies

Generative pretraining from embeddings typically incorporates specialized architecture components depending on modality:

  • Vision: NEPA utilizes a patch-based Conv2D embedder, learnable positional encodings, and a Transformer backbone with causal self-attention. Stabilization is achieved via Rotary Position Embedding (RoPE), LayerScale, SwiGLU activations, and QK-Norm. There is no pixel-space decoder or discrete codebook, minimizing architectural complexity and maximizing scalability (Xu et al., 18 Dec 2025).
  • Text: GIRCSE repurposes causal-decoder LLMs (e.g., Mistral-7B) without architectural modifications, relying on autoregressive generation of soft tokens mapped to embedding space via the shared token embedding matrix. A contrastive loss is applied to intermediate embeddings at each refinement step, with an explicit refinement regularizer to guarantee monotonic progress (Tsai et al., 29 Sep 2025).
  • Audio: Systems such as the one in (Sun et al., 13 Jun 2025) freeze a large generative encoder, train a lightweight ViT-based denoiser in embedding space, and reuse a pre-trained generative vocoder for final waveform regeneration.
  • Multimodal: Emu (Sun et al., 2023) employs an image encoder (e.g., EVA-CLIP), projects images into a small set of continuous visual tokens, and interleaves these with text tokens for causal sequence modeling via a Transformer. Both text and visual “tokens” become next-step prediction targets: classification loss for text, $\ell_2$ regression loss for embeddings (see the sketch after this list).
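The unified next-step objective over such an interleaved sequence can be sketched as follows; tensor shapes and the masking scheme are illustrative assumptions rather than Emu's implementation.

```python
import torch
import torch.nn.functional as F

def unified_next_step_loss(text_logits, text_targets, visual_preds, visual_targets,
                           text_mask, visual_mask, reg_weight=1.0):
    """Sketch of a unified objective over an interleaved sequence:
    classification for text positions, l2 regression for visual-embedding positions.

    text_logits:    (B, T, V) decoder logits at every position
    text_targets:   (B, T)    next-token ids (used where text_mask is True)
    visual_preds:   (B, T, D) projected decoder states as embedding predictions
    visual_targets: (B, T, D) ground-truth visual embeddings
    text_mask / visual_mask: (B, T) boolean masks selecting each target type
    """
    # Cross-entropy on positions whose next element is a text token.
    ce = F.cross_entropy(text_logits[text_mask], text_targets[text_mask])

    # l2 regression on positions whose next element is a visual embedding.
    reg = F.mse_loss(visual_preds[visual_mask], visual_targets[visual_mask])

    return ce + reg_weight * reg
```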

3. Comparative Analysis and Key Distinctions

Generative pretraining from embeddings contrasts sharply with traditional paradigms across modalities:

| Aspect | Generative Embedding Pretraining | Pixel/Token Reconstruction | Contrastive/Discriminative Training |
|---|---|---|---|
| Prediction target | Continuous embeddings | Input pixel values / tokens | Instance discrimination |
| Architecture | Autoregressive, causal Transformer | Often encoder-decoder or autoencoder | Siamese / teacher-student |
| Objective | MSE/cosine loss in embedding space | Pixel/token loss (MSE, cross-entropy) | Contrastive (InfoNCE, triplet) |
| Use of negatives/momentum | None | None | Required |
| Pixel/token-space decoder | No | Yes | No |
| Direct downstream head | No | Task-specific head often required | Often additional projection heads |
| Scalability | High (shorter sequences) | Limited by output length | High |

Vision NEPA remains strictly in embedding space, using a single forward pass and minimal auxiliary losses, whereas classical self-supervised methods require either substantial decoding overhead (masked image modeling) or negative sampling and momentum machinery (contrastive learning) (Xu et al., 18 Dec 2025).

Text embedding methods built on LLMs have traditionally extracted representations from a single encoding forward pass without generation; by contrast, GIRCSE explicitly exploits the LLM's generative capability, improving semantic coverage through iterative generation-refinement steps and exhibiting emergent inference-time scaling properties (Tsai et al., 29 Sep 2025).

Classical NLDR-based generative pipelines suffer from sparse, discrete, non-interpolable embedding geometry, yielding poor generative quality even when neural decoders and diffusion models are attached to operate on the embeddings. This demonstrates that not all embedding spaces are suitable for generative pretraining; they must be structured for both geometric regularity and reconstructibility (Thakare et al., 15 Oct 2025).

4. Empirical Results and Transferability

Empirical evaluations indicate that generative pretraining from embeddings matches or surpasses the transfer performance of more elaborate paradigms:

  • Vision (NEPA): On ImageNet-1K, NEPA-ViT-B attains 83.8% top-1 and NEPA-ViT-L achieves 85.3% after fine-tuning. For semantic segmentation on ADE20K, NEPA-ViT-B + UPerNet yields 48.3% mIoU, and NEPA-ViT-L + UPerNet 54.0%. This demonstrates that pure embedding prediction is on par with, or exceeds, contrastive or masked-reconstruction approaches, with a far simpler setup (Xu et al., 18 Dec 2025).
  • Text (GIRCSE): On the MTEB English v2 suite, GIRCSE achieves an average score of 67.83 (Mistral backbone, 0.2M data), a 0.87-point gain over the best encoder-only baselines under a fair comparison. In instruction-following, the gain is more pronounced (62.97 vs 47.05 for the best causal-EOS baseline). Crucially, increasing the number of soft-token generation steps at inference time monotonically improves embedding quality, a property not observed in encoder-only systems (Tsai et al., 29 Sep 2025).
  • Multimodal (Emu): Emu records zero-shot COCO captioning CIDEr of 112.4 (vs Flamingo-9B’s ~79.4), zero-shot VQAv2 at ~52.0%, four-shot VQAv2 at 58.4%, and zero-shot text-to-image FID of 11.66. This illustrates that a unified embedding-predictive sequence, exchanging both visual and textual tokens in one model, supports strong generalization (Sun et al., 2023).
  • Audio: The embedding-based SE system in (Sun et al., 13 Jun 2025) outperforms discriminative deep-feature and waveform SE systems, with much higher parameter efficiency and improved speaker fidelity.
  • NLDR-based Decoders: Attempts to attach decoders and train diffusion models in fixed t-SNE, Isomap, or LLE manifolds yield reconstructions (MSE ~0.045–0.05) that are inferior to end-to-end autoencoders (MSE ~0.017). Diffusion sampling on such manifolds leads to FID scores >150, compared to ~25 for direct pixel-space models, highlighting critical limitations (Thakare et al., 15 Oct 2025).

5. Applications and Practical Considerations

Generative pretraining from embeddings is highly adaptable across domains:

  • Vision: Used for self-supervised learning where no pixel-level recovery is required, maximizing efficiency and scalability, and enabling simple, robust architectures (Xu et al., 18 Dec 2025).
  • Text: Facilitates leveraging the generative capacity of LLMs for embeddings, producing higher-quality representations and supporting tunable inference-time compute (Tsai et al., 29 Sep 2025).
  • Audio: Enables decoupling feature extraction (frozen, generative pre-trained audio encoders), efficient denoising, and high-quality waveform reconstruction with minimal trainable parameters (Sun et al., 13 Jun 2025).
  • Multimodal: Supports models that handle text, images, and even video as part of a unified sequence, using a shared generative objective over both embedding and token targets (Sun et al., 2023).

Key practical considerations include:

  • Selection of embedding space: Embeddings must support smooth interpolation and reconstructibility. Fixed, geometric-only NLDR embeddings are often inappropriate for generative modeling absent joint optimization (Thakare et al., 15 Oct 2025).
  • Regularization: The application of stop-gradient or explicit refinement objectives is essential to avoid collapse and enforce meaningful learning (Xu et al., 18 Dec 2025, Tsai et al., 29 Sep 2025).
  • Inference scalability: In iterative refinement, computational budget at test time can directly increase downstream performance without additional training (Tsai et al., 29 Sep 2025).

6. Limitations and Open Challenges

Despite empirical successes, generative pretraining from embeddings is subject to certain limitations:

  • Embedding collapse: Without stop-gradient or regularization, models may minimize loss trivially by collapsing all predictions to constants (Xu et al., 18 Dec 2025).
  • Embedding manifold geometry: Classical embeddings from NLDR methods are sparse and ill-suited for generative diffusion, as they lack the required density and continuity for plausible interpolation (Thakare et al., 15 Oct 2025).
  • Task breadth: Not all downstream tasks benefit equally; for instance, tasks that demand precise, low-level reconstruction may still need access to the original input domain or hybrid objectives.
  • Modality jointness: While embedding-based pretraining supports unification across modalities, effective transfer and generalization depend on the expressiveness and compatibility of the chosen embedding representations (Sun et al., 2023).
  • Interpretability: Generated continuous embeddings are less directly interpretable than discrete tokens; how to interpret and scrutinize them remains an open area of research and practice.

A plausible implication is that future advances may combine embedding-based objectives with geometric or reconstruction-aware regularizations, or devise end-to-end differentiable frameworks that unify geometry and generativity (Thakare et al., 15 Oct 2025). End-to-end models that support both efficient pretraining and rich, generative representations aligned to multiple modalities are likely to drive further adoption of generative pretraining from embeddings.
