Embedding-Only Training
- Embedding-only training is a machine learning paradigm that updates model parameters exclusively through losses on fixed-length embeddings from pre-trained encoders.
- Techniques like embedding inversion and noise injection are used to bridge modality gaps, enabling tasks such as zero-shot captioning and retrieval.
- Empirical results demonstrate competitive performance with reduced dependence on paired data, paving the way for advancements in semi- and weakly supervised learning.
Embedding-only training refers to a family of machine learning paradigms in which all learnable model parameters are updated exclusively via losses and objectives that act on vector embeddings (fixed-length representations of data) rather than through any direct pixel-, waveform-, or category-level supervision involving the original raw modalities. These approaches typically leverage powerful pre-trained encoders (vision-language models, audio-language models, or LLMs) whose embedding spaces are assumed to encode the necessary semantic or multimodal correspondences. Supervision is supplied wholly via embeddings, often exploiting modalities with abundant unpaired data, and tasks such as captioning, retrieval, or classification are reframed as mapping to, from, or between these joint embedding spaces. Embedding-only training has enabled competitive, and in some settings state-of-the-art, performance with substantially reduced dependence on paired or labeled multimodal data, and has stimulated new lines of inquiry in semi- and weakly supervised representation learning.
1. Core Principles of Embedding-Only Training
The essential principle of embedding-only training is the decoupling of parameter updates from direct access to paired cross-modal supervision. Instead, models are structured such that losses—typically autoregressive or contrastive—are computed solely on embeddings produced by frozen (pre-trained) modality-specific encoders. In typical use cases:
- All encoders (e.g., CLIP’s image and text encoders, CLAP’s audio and text encoders, LLMs for text) are held fixed.
- The only learned parameters reside in decoders, mapping networks, or additional heads (e.g., a decoder for natural language generation conditioned on embeddings, or a projection head for fixed-dimension representations).
- Training is performed exclusively on one accessible modality (usually text), and embeddings from other modalities (images, audio) are used only at inference/test time.
- The practical implication is that large, unpaired datasets can be fully utilized, as the embedding geometry is already aligned to cross-modal semantics via contrastive pre-training.
This paradigm relies crucially on the joint embedding spaces produced by models such as CLIP, CLAP, and LLM-derived encoders, which are optimized to align semantically corresponding inputs from disparate modalities.
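A minimal PyTorch-style sketch of this setup follows; the encoder, dimensions, and head architecture are illustrative assumptions rather than any specific paper's configuration:

```python
import torch
import torch.nn as nn

def freeze(encoder: nn.Module) -> nn.Module:
    """Freeze a pre-trained encoder so no gradients flow into it."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    return encoder.eval()

class EmbeddingOnlyHead(nn.Module):
    """The only trainable parameters live in this small head; all losses are
    computed on head(frozen_encoder(x)), never on raw pixels/waveforms/labels."""
    def __init__(self, embed_dim: int = 512, proj_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, proj_dim),
            nn.GELU(),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.proj(emb)

# The optimizer only ever sees head.parameters(); the frozen encoder supplies
# embeddings for whichever modality is available at train or test time.
```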
2. Representative Methodologies
A variety of concrete architectures and training schemes instantiate embedding-only training across tasks and modalities. Some canonical approaches include:
a) Embedding-Inversion for Captioning (CapDec, WSAC)
For image captioning (Nukrai et al., 2022) and automated audio captioning (Kouzelis et al., 2023), embedding-only training proceeds as follows:
- Frozen encoders: CLIP (or CLAP) encoders for both modalities (e.g., image and text or audio and text) remain unchanged.
- Decoder learning: A trainable decoder (autoregressive LLM or shallow decoder, such as GPT-2 or a 4-layer transformer) is optimized to reconstruct text, conditioned only on the corresponding text modality embedding.
- Train/test decoupling: At training, the decoder learns to invert text-encoder embeddings (via text-only data); at inference, image or audio embeddings (in the same space) are fed to the decoder, yielding zero-shot cross-modal generation.
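A hedged sketch of this train/test decoupling, assuming a CLIP-like joint space and a GPT-2-style prefix decoder accessed through hypothetical `clip_text_encode`, `clip_image_encode`, `prefix_map`, and `decoder` callables (the actual CapDec/WSAC implementations differ in detail):

```python
import torch
import torch.nn.functional as F

def train_step(decoder, prefix_map, clip_text_encode, captions, token_ids):
    """Training uses text only: the decoder learns to invert text embeddings."""
    with torch.no_grad():
        text_emb = clip_text_encode(captions)          # (B, d), frozen encoder
    prefix = prefix_map(text_emb)                      # (B, k, d_lm), learned mapping
    logits = decoder(prefix=prefix, tokens=token_ids)  # assume logits align with token positions
    return F.cross_entropy(logits[:, :-1].flatten(0, 1),
                           token_ids[:, 1:].flatten())

@torch.no_grad()
def caption_image(decoder, prefix_map, clip_image_encode, image):
    """Inference swaps in an image embedding from the *same* joint space."""
    img_emb = clip_image_encode(image.unsqueeze(0))
    prefix = prefix_map(img_emb)
    return decoder.generate(prefix=prefix)             # zero-shot cross-modal caption
```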
b) Noise and Shift for Modality Gap Mitigation
- Motivation: Encoders’ text and image (or audio) embeddings tend to differ by a "modality gap," causing inference-time failures if not addressed.
- Noise injection: Gaussian noise is added to text embeddings during training to simulate this gap, forcing decoders to generalize beyond the manifold of training embeddings (applied to CLIP embeddings by Nukrai et al., 2022 and to CLAP embeddings by Kouzelis et al., 2023).
- Alternative strategies: Embedding shift (aligning means of modality embeddings) and memory/projection-based inference strategies act as gap-bridging mechanisms.
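A minimal sketch of these two gap-bridging mechanisms; the noise scale `sigma` is assumed to be estimated from embedding statistics rather than fixed a priori, and re-normalization is shown only because CLIP/CLAP-style embeddings typically live on the unit sphere:

```python
import torch
import torch.nn.functional as F

def inject_noise(text_emb: torch.Tensor, sigma: float) -> torch.Tensor:
    """Add isotropic Gaussian noise to text embeddings during training to
    simulate the modality gap; sigma is estimated from embedding statistics."""
    noisy = text_emb + sigma * torch.randn_like(text_emb)
    return F.normalize(noisy, dim=-1)  # optional: keep embeddings unit-norm

def shift_to_target(emb: torch.Tensor, src_mean: torch.Tensor,
                    tgt_mean: torch.Tensor) -> torch.Tensor:
    """Embedding-shift alternative: translate embeddings so the source-modality
    mean coincides with the target-modality mean."""
    return emb - src_mean + tgt_mean
```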
c) Purely Embedding-Based Supervised Learning
Text embedding methods such as Piccolo2 (Huang et al., 11 May 2024), NV-Embed (Lee et al., 27 May 2024), and KaLM-Embedding-V2 (Zhao et al., 26 Jun 2025) adopt a training regime in which all tasks (retrieval, classification, clustering, semantic similarity, etc.) are formulated such that learning signals flow exclusively through a shared embedding pipeline:
- Unified architecture: A shared encoder (BERT, LLM, or similar) plus a projection head. All inputs (e.g., sentence, document, label string) are passed through this pipeline; no auxiliary MLPs or decoders.
- Multi-task objectives: Task labels are mapped to embedding representations (e.g., label-contrastive losses in classification), and task loss functions are always formulated over similarity or contrast in the embedding space.
- Loss functions: InfoNCE, CoSent, and similar contrastive/ranking losses are mixed, but all operate strictly on the embeddings.
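As one concrete illustration, classification can be expressed purely in embedding space by embedding the label strings as text and contrasting them against input embeddings. The sketch below is a generic InfoNCE-style formulation; the exact losses in Piccolo2, NV-Embed, and KaLM-Embedding-V2 differ in detail:

```python
import torch
import torch.nn.functional as F

def label_contrastive_loss(input_emb: torch.Tensor,   # (B, d) embedded inputs
                           label_emb: torch.Tensor,   # (C, d) embedded label strings
                           targets: torch.Tensor,     # (B,) class indices
                           temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE over cosine similarities: the correct label embedding is the
    positive; all other label embeddings act as negatives."""
    input_emb = F.normalize(input_emb, dim=-1)
    label_emb = F.normalize(label_emb, dim=-1)
    logits = input_emb @ label_emb.t() / temperature   # (B, C)
    return F.cross_entropy(logits, targets)
```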
d) Multilayer Factorized Embedding Training
In recommendation systems, multi-layer embedding training (MLET) (Deng et al., 2023) replaces the standard single-layer embedding table with a factorization into two matrices (an inner and an outer factor), trained jointly and later collapsed into a single table for inference.
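A minimal sketch of this factorized-table idea; the names and the inner width `k` are illustrative (MLET reportedly benefits from an inner width larger than the output dimension, but the exact parameterization may differ):

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Embedding table of shape (num_items, d) parameterized during training
    as the product of two trainable factors: (num_items, k) @ (k, d)."""
    def __init__(self, num_items: int, d: int, k: int):
        super().__init__()
        self.inner = nn.Parameter(torch.randn(num_items, k) * 0.01)
        self.outer = nn.Parameter(torch.randn(k, d) * 0.01)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.inner[ids] @ self.outer          # (batch, d)

    def collapse(self) -> nn.Embedding:
        """After training, multiply the factors into a single (num_items, d)
        table so inference size and cost match a standard embedding layer."""
        table = (self.inner @ self.outer).detach()
        return nn.Embedding.from_pretrained(table, freeze=True)
```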
3. Theoretical Foundations and Adaptive Mechanisms
Embedding-only training effectiveness often hinges on the geometry and dynamics of updates within the embedding space:
- Cross-category learning: Factorized embeddings in MLET (Deng et al., 2023) induce dense, adaptive updates across all category embeddings, not just those represented in the current batch, via mechanisms modulated by the singular vectors of the two factor matrices.
- Bridging the modality gap: Injecting Gaussian noise or explicit shifts during training compels the decoder to operate robustly in the union of modality subspaces, thus improving zero-shot generalization (shown by ablation in (Nukrai et al., 2022, Kouzelis et al., 2023)).
- Adaptive weighting: Focal-style loss reweighting as in KaLM-Embedding-V2 (Zhao et al., 26 Jun 2025) increases the gradient emphasis on hard examples, focusing embedding learning where model uncertainty or error is largest.
These mechanisms are typically analyzed using spectral decompositions (for factorized embeddings) or by ablating the form and scale of regularization (for gap closure), with empirical optima correlating with theoretical predictions.
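As a concrete instance of the adaptive weighting described above, the following is a hedged sketch of focal-style reweighting applied to a per-example in-batch contrastive loss; the exact formulation used in KaLM-Embedding-V2 may differ:

```python
import torch
import torch.nn.functional as F

def focal_infonce(query: torch.Tensor, pos: torch.Tensor,
                  temperature: float = 0.05, gamma: float = 2.0) -> torch.Tensor:
    """In-batch InfoNCE where each example's loss is scaled by (1 - p)^gamma,
    p being the softmax probability assigned to its positive: hard examples
    (low p) receive larger gradients, easy examples are down-weighted."""
    q = F.normalize(query, dim=-1)
    k = F.normalize(pos, dim=-1)
    logits = q @ k.t() / temperature                   # (B, B), diagonal = positives
    log_probs = F.log_softmax(logits, dim=-1)
    targets = torch.arange(q.size(0), device=q.device)
    ce = -log_probs[targets, targets]                  # per-example InfoNCE loss
    p_correct = torch.exp(-ce)                         # probability of the positive
    weights = (1.0 - p_correct).pow(gamma)             # focal-style modulation
    return (weights * ce).mean()
```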
4. Typical Training and Optimization Procedures
General patterns in training embedding-only systems are:
- Frozen or minimally parameterized encoders: All large shared encoders (vision-language, audio-language, or LLMs) are kept fixed, substantially reducing compute requirements and domain overfitting risk.
- Mapping networks: Lightweight MLPs or small transformers may be used to process embeddings for compatibility with downstream models (e.g., projection to GPT-2’s context space, prefix/instruction conditioning).
- Optimization hyperparameters: Standard optimization tools predominate (AdamW, moderate to large batch sizes, small fine-tuning-scale learning rates, weight decay). Noise or perturbation scales are estimated directly from embedding statistics on small samples.
- Multi-task batch sampling: When multiple tasks are targeted, as in Piccolo2 (Huang et al., 11 May 2024) or KaLM-Embedding-V2 (Zhao et al., 26 Jun 2025), each training batch is associated with a single loss computed only from embeddings and labels (embedded as text).
- Contrastive, retrieval, and ranking losses: InfoNCE with in-batch or hard negatives, CoSent for STS, and classification-formulated contrastive losses are the norm; these ensure that all learning signals are operationalized via embedding distances or similarities.
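A minimal sketch of the single-loss-per-batch pattern described above; the task names, dictionary interface, and loss signatures are illustrative assumptions rather than any one paper's recipe:

```python
import torch

def training_step(batch: dict, embed, loss_fns: dict) -> torch.Tensor:
    """Each batch carries exactly one task tag; its loss is computed purely
    from embeddings (inputs, and labels embedded as text where applicable)."""
    task = batch["task"]                    # e.g. "retrieval", "sts", "classification"
    emb_a = embed(batch["inputs_a"])        # shared encoder + projection head
    emb_b = embed(batch["inputs_b"])        # passages, paired sentences, or label strings
    return loss_fns[task](emb_a, emb_b, batch.get("targets"))

# Usage sketch: loss_fns = {"retrieval": infonce, "sts": cosent,
#                           "classification": label_contrastive}
# and a sampler that draws each batch from exactly one task-specific dataset.
```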
5. Empirical Performance and Benchmark Comparisons
Embedding-only methods consistently demonstrate strong performance across diverse tasks:
| Task/Domain | Embedding-Only Model | Benchmark / Metric | Trained without paired data | Result vs. Baseline |
|---|---|---|---|---|
| Image Captioning | CapDec (Nukrai et al., 2022) | MS-COCO, Karpathy split (CIDEr) | Yes | 91.8 (CapDec) vs. 34.5 (ZeroCap), 49.3 (MAGIC) |
| Audio Captioning | WSAC (Kouzelis et al., 2023) | AudioCaps, SPIDEr | Yes | 0.403 vs. 0.485 (fully sup.) (~83%) |
| Text Embedding | Piccolo2 (Huang et al., 11 May 2024) | CMTEB (avg) | N/A | 70.95 vs. 69.07 (prev SOTA) |
| Text Embedding | NV-Embed (Lee et al., 27 May 2024) | MTEB (avg, 56 tasks) | N/A | 69.32 (SOTA, outperforming BERT/T5 models) |
| Text Embedding | KaLM-V2 (Zhao et al., 26 Jun 2025) | MTEB Chinese/English | N/A | 68.15/67.47 (mini-instruct) |
| Recommendation Sys. | MLET (Deng et al., 2023) | CTR (AUC) | N/A | 0.804 (DLRM, Criteo) vs. 0.803 (SL), size↓4x |
Results show that, with adequate mitigation of modality gaps and sufficient embedding dimension, embedding-only approaches recover roughly 70–83% of fully supervised performance in zero-shot or weakly supervised regimes, and in text embedding set new benchmark records outright. Notably, these techniques outperform previous unsupervised or weakly supervised baselines by large margins, particularly in cross-domain and style-guided tasks.
6. Limitations, Extensions, and Future Work
Key limitations and potential directions emerging from the literature include:
- Incomplete modality alignment: Current methods rely on high-quality alignment of embedding spaces (e.g., CLIP’s image-text, CLAP’s audio-text). Where this alignment is weak (e.g., for non-standard scripts, low-resource languages), model performance degrades.
- Heuristic gap closure: Noise injection and embedding shifts are effective but not principled. Future work may develop learnable or data-driven mapping layers, especially with limited paired data.
- Negative sampling and hard-mining: Effective contrastive learning in large embedding spaces often requires sophisticated hard-negative strategies, which can be resource-intensive to implement at scale (mitigated in KaLM-V2 by online mixing (Zhao et al., 26 Jun 2025)).
- Cross-task generalization: While embedding-only models can be highly flexible (e.g., style transfer, cross-domain captioning), the degree of transfer depends on the extent of overlapping semantics in the embedding geometries.
Proposed extensions include adapting these paradigms to music-captioning, visual question answering, document OCR (via domain-specific VLMs), and learning negative-sample synthesis or dynamic perturbation processes. Embedding-only training continues to broaden the scope of resource-efficient, generalist machine learning by exploiting the leverage of pre-trained, high-capacity representation spaces.