On-Demand Embedding Generation

Updated 7 January 2026
  • On-demand embedding generation is a method that dynamically computes latent vector representations at inference time, allowing for immediate adaptation to new inputs.
  • Key methodologies include static token lookup, lifelong embedding extension, and multi-modal architectures, each optimized for ultra-low latency and scalable performance.
  • Practical implementations demonstrate significant gains in speed and accuracy, with applications in real-time search, recommendation, and handling out-of-vocabulary data.

On-demand embedding generation refers to the set of methodologies, systems, and architectures that compute latent vector representations—embeddings—for new or evolving inputs dynamically at inference time, as opposed to relying solely on offline, precomputed vectors. Motivated by requirements for ultra-low latency, adaptability to new tokens or entities, and scalability in systems such as real-time search, recommendation, and language understanding, on-demand embedding techniques span static token lookup, lifelong embedding extension, dynamic neural computation, and generative model adaptation. This paradigm supports both real-time and evolving environments, from edge deployment to large-scale multi-user platforms.

1. Static Token Lookup and Real-Time Embedding with Precomputed Tables

Static token lookup enables sub-millisecond embedding generation by replacing computationally expensive deep models, such as transformers, with pointer dereferencing and fast aggregation over a precomputed token embedding matrix. SwiftEmbed exemplifies this approach by storing a compact matrix E \in \mathbb{R}^{|V| \times d} (with |V| \approx 30\,000, d = 384), memory mapping it as a contiguous buffer, and ensuring alignment to CPU cache-line boundaries for maximal SIMD throughput. The embedding computation pipeline consists of tokenization, token-index lookup, optional attention weighting (e.g., TF-IDF), highly optimized mean pooling via AVX2/AVX512, and L2 normalization:

\mathbf{h} = \frac{1}{\sum_{i=1}^m a_i} \sum_{i=1}^m a_i \mathbf{e}_i, \qquad \mathbf{f} = \frac{\mathbf{h}}{\|\mathbf{h}\|_2}
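
The pipeline can be illustrated with a short sketch (a simplified illustration only: the toy vocabulary, whitespace tokenizer, and random matrix below are placeholders, and SwiftEmbed's actual implementation relies on memory-mapped, cache-aligned buffers and AVX2/AVX512 kernels rather than plain NumPy):

```python
import numpy as np

# Toy precomputed token embedding matrix E (|V| x d); random here for illustration.
V = {"on": 0, "demand": 1, "embedding": 2, "generation": 3, "<unk>": 4}
d = 384
E = np.random.randn(len(V), d).astype(np.float32)

def embed(text, idf=None):
    """Tokenize, look up rows of E, apply optional attention weights,
    mean-pool, and L2-normalize (the h and f formulas above)."""
    tokens = text.lower().split()                    # placeholder tokenizer
    ids = [V.get(t, V["<unk>"]) for t in tokens]     # token-index lookup
    a = np.array([idf.get(t, 1.0) if idf else 1.0 for t in tokens],
                 dtype=np.float32)                   # optional TF-IDF weighting
    h = (a[:, None] * E[ids]).sum(axis=0) / a.sum()  # weighted mean pooling
    return h / np.linalg.norm(h)                     # L2 normalization

f = embed("on demand embedding generation")
print(f.shape, round(float(np.linalg.norm(f)), 3))   # (384,) 1.0
```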

SwiftEmbed achieves 1.12 ms median (p50) latency, 50,000 RPS throughput, and approximately 89% of the MTEB quality of contextual models, with negligible (<0.3 ms) serialization overhead in binary IEEE 754 format (Lansiaux, 27 Oct 2025).

This approach is particularly advantageous in latency-sensitive applications such as duplicate detection and semantic similarity, and in domain-specific settings: for scientific or technical domains (characterized by low polysemy), static lookups can outperform contextual transformers (e.g., up to 131% relative performance on scientific tasks). Memory and algorithmic complexity are strictly linear, O(n + d) with n the number of tokens, in contrast to the quadratic self-attention cost of transformers (Lansiaux, 27 Oct 2025).

2. Lifelong Extension and Dynamic Vocabulary Growth

In settings where the entity or token set evolves (e.g., e-commerce catalogs, user IDs), on-demand techniques extend the embedding tables incrementally without retraining from scratch. The algorithmic principle is to preserve previously learned rows in E_{\text{old}} \in \mathbb{R}^{M \times d} and copy them into an extended E_{\text{new}} \in \mathbb{R}^{N \times d}, while initializing embeddings for new entries using strategies such as the unknown token vector, global or category mean, or nearest-neighbor imputation. Formally,

E_{\text{new}}[\text{new\_map}(i), :] = \begin{cases} E_{\text{old}}[\text{old\_map}(i), :] & i \in V_{\text{old}} \\ w_{\text{init}} & \text{otherwise} \end{cases}
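
A minimal sketch of this copy-and-initialize rule, assuming NumPy arrays and toy vocabularies; the "unknown" and global-mean initializations shown correspond to two of the cold-start strategies named above:

```python
import numpy as np

def extend_embeddings(E_old, old_vocab, new_vocab, init="unknown", unk_id=0):
    """Copy rows for entities already present in old_vocab; initialize the rest.
    init="unknown" reuses the unknown-token row; init="mean" uses the global mean."""
    E_new = np.empty((len(new_vocab), E_old.shape[1]), dtype=E_old.dtype)
    w_init = E_old[unk_id] if init == "unknown" else E_old.mean(axis=0)
    for entity, new_idx in new_vocab.items():
        if entity in old_vocab:
            E_new[new_idx] = E_old[old_vocab[entity]]   # preserve learned row
        else:
            E_new[new_idx] = w_init                     # cold-start row
    return E_new

old_vocab = {"<unk>": 0, "item_a": 1, "item_b": 2}
new_vocab = {"<unk>": 0, "item_a": 1, "item_b": 2, "item_c": 3}
E_new = extend_embeddings(np.random.randn(3, 8), old_vocab, new_vocab)  # (4, 8)
```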

An L2-anchoring regularization term can be introduced to prevent drift of the old embeddings during subsequent fine-tuning:

L_{\text{total}} = L_{\text{task}}(X; E_{\text{new}}) + \lambda \sum_{i \in V_{\text{old}}} \left\| E_{\text{new}}[\text{new\_map}(i)] - E_{\text{old}}[\text{old\_map}(i)] \right\|_2^2
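
A hedged PyTorch sketch of the combined objective; the task loss, index mapping, and λ value are placeholders to be supplied by the surrounding training loop:

```python
import torch

def anchored_loss(task_loss, E_new, E_old, new_idx_of_old, lam=1e-3):
    """L_total = L_task + lambda * sum over old entities of the squared L2 drift.
    E_new: (N, d) trainable extended table; E_old: (M, d) frozen old table;
    new_idx_of_old: (M,) tensor giving new_map(i) for each old index i."""
    drift = E_new[new_idx_of_old] - E_old.detach()
    return task_loss + lam * drift.pow(2).sum()
```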

Empirical results demonstrate that this incremental extension not only prevents catastrophic forgetting but also outperforms retraining from scratch, with the best cold-start strategy ("unknown" token initialization) yielding AUC improvements of approximately 0.05 on purchase prediction benchmarks (Gomes et al., 2024).

3. Multi-Modal and Neural On-Demand Embedding Architectures

Recent work leverages LLMs and multi-modal transformers to unify generative and embedding capabilities within a single model. The Multi-Modal Generative Embedding Model (MM-GEM) framework encapsulates contrastive (embedding) and generative (captioning) objectives in one transformer backbone, using shared projection heads and a PoolAggregator module to yield both global and region-level embeddings on demand:

  • Global embedding: mean-pool vision encoder output, project and normalize.
  • Region embedding: RoIAlign-pool on a region, project via a trainable head, and normalize (both pooling paths are sketched below).
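
Both pooling paths can be sketched as follows, assuming the vision-encoder output has been reshaped into a spatial feature map and using torchvision's roi_align as a stand-in for MM-GEM's PoolAggregator (whose exact design is described in the paper):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def global_and_region_embeddings(feat_map, boxes, proj):
    """feat_map: (B, C, H, W) vision-encoder output as a spatial feature map.
    boxes: (K, 5) RoIs as (batch_index, x1, y1, x2, y2) in feature-map coordinates.
    proj: shared trainable projection head mapping C -> embedding dimension."""
    g = feat_map.mean(dim=(2, 3))                              # global mean pooling
    r = roi_align(feat_map, boxes, output_size=1).flatten(1)   # region pooling
    return F.normalize(proj(g), dim=-1), F.normalize(proj(r), dim=-1)
```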

Training proceeds in two stages—global alignment followed by region-level specialization—to maintain competitive retrieval and captioning performance across tasks. For text, appending a special [EMB] token directs the model to produce embeddings from the terminal hidden state, with InfoNCE contrastive loss:

t_i = \frac{h_2(\text{LLM}_\theta(T_i \oplus [\mathrm{EMB}]))}{\left\| h_2(\text{LLM}_\theta(T_i \oplus [\mathrm{EMB}])) \right\|}, \qquad \mathcal{L}_{v \to t},\ \mathcal{L}_{t \to v}
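
The text-side mechanism can be sketched as follows; the decoder interface and projection head here are illustrative assumptions rather than MM-GEM's exact modules:

```python
import torch
import torch.nn.functional as F

class TextEmbedder(torch.nn.Module):
    """Append an [EMB] token and use its projected, L2-normalized terminal
    hidden state as the text embedding t_i."""
    def __init__(self, decoder, hidden_dim, embed_dim, emb_token_id):
        super().__init__()
        self.decoder = decoder                             # returns per-token hidden states
        self.h2 = torch.nn.Linear(hidden_dim, embed_dim)   # projection head h_2
        self.emb_token_id = emb_token_id

    def forward(self, token_ids):
        emb_col = torch.full((token_ids.size(0), 1), self.emb_token_id,
                             dtype=token_ids.dtype, device=token_ids.device)
        ids = torch.cat([token_ids, emb_col], dim=1)       # T_i ⊕ [EMB]
        hidden = self.decoder(ids)                         # (B, L+1, hidden_dim), assumed
        return F.normalize(self.h2(hidden[:, -1]), dim=-1)
```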

On-demand inference is supported for both image and text queries, either globally or at arbitrary regions, with competitive Recall@1 and classification metrics relative to dual-encoder and specialized baselines (Ma et al., 2024).

4. On-The-Fly Embedding Computation for Rare or Out-of-Vocabulary Inputs

For rare tokens or entities, on-the-fly embedding methods dynamically synthesize vectors using auxiliary data such as character sequences or dictionary definitions. The encoder f generates embeddings by mean pooling, linear projection, or (bi)LSTM encoding over the auxiliary representation:

e_d(w) = f(a_{(w)}; \theta), \qquad e_c(w) = \begin{cases} e(w) + W e_d(w), & w \in V_{\text{train}} \\ e_d(w), & \text{otherwise} \end{cases}
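
A minimal sketch, assuming a character-level bi-LSTM as the auxiliary encoder f; mean pooling over dictionary-definition words or a linear projection would plug into the same slot:

```python
import torch

class OnTheFlyEmbedder(torch.nn.Module):
    """e_d(w) = f(chars(w)); e_c(w) = e(w) + W e_d(w) if w is in-vocab, else e_d(w)."""
    def __init__(self, vocab, n_chars, d=128):
        super().__init__()
        self.vocab = vocab                                  # word -> index
        self.e = torch.nn.Embedding(len(vocab), d)          # trained table e(w)
        self.char_emb = torch.nn.Embedding(n_chars, d)
        self.f = torch.nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)
        self.W = torch.nn.Linear(d, d, bias=False)

    def forward(self, word, char_ids):
        out, _ = self.f(self.char_emb(char_ids).unsqueeze(0))
        e_d = out[0, -1]                                    # auxiliary embedding
        if word in self.vocab:                              # combine with e(w)
            return self.e(torch.tensor(self.vocab[word])) + self.W(e_d)
        return e_d                                          # pure OOV path
```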

The downstream model is trained end-to-end with gradients flowing through f, ensuring that the auxiliary encoder is tuned for the task at hand. Empirical comparisons in reading comprehension, textual entailment, and language modeling demonstrate large performance gains from incorporating spelling and dictionary information, especially in data-constrained and OOV settings (Bahdanau et al., 2017).

5. Unified On-Demand Embeddings with Generative LLMs

Generative LLMs can be extended to provide high-quality, on-demand embeddings without bifurcation between encoder and decoder. The GEM approach introduces special summarization tokens (e.g., <EVEC>) and specialized attention masks at the end of the input, forcing the model to compress preceding content into these bottleneck tokens. The embedding is extracted as the (mean or concatenated) hidden state(s) of the special tokens, followed by L2 normalization:

\hat{e} = \frac{e}{\|e\|_2}
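
A hedged sketch of the extraction step; the decoder interface is an assumption, and GEM's specialized attention masks are omitted here, with only the standard causal mask implied:

```python
import torch
import torch.nn.functional as F

def gem_embedding(decoder, token_ids, evec_id, k=1):
    """Append k <EVEC> summarization tokens, run the decoder, and return the
    L2-normalized mean of their hidden states as the text embedding."""
    evec = torch.full((token_ids.size(0), k), evec_id,
                      dtype=token_ids.dtype, device=token_ids.device)
    hidden = decoder(torch.cat([token_ids, evec], dim=1))   # (B, L+k, d), assumed
    e = hidden[:, -k:].mean(dim=1)                          # pool bottleneck tokens
    return F.normalize(e, dim=-1)                           # e / ||e||_2
```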

Training combines next-token prediction and self-supervised contrastive loss, with a progressive blend to preserve language generation ability. GEM enables a single decoder-only LLM to offer both text generation and RAG-compatible embedding services, closing the performance gap to dedicated embedding models (e.g., Llama 3 3B: MTEB 59.06 vs. baseline 21.60) (Zhang et al., 4 Jun 2025).

6. Systemic Architectures for Scalable, Continual, and Versioned Embedding Serving

Production-scale on-demand embedding pipelines, such as Twitter's, orchestrate batch and online steps via DAG-based workflow engines (e.g., Airflow), integrate multiple embedding model types (Word2Vec, ALS, SVD, co-occurrence, autoencoder), and version embeddings in a central Feature Registry. Key features are:

  • Online fold-in: for novel entities, a sparse interaction vector is multiplied with a fold-in matrix to synthesize the new embedding (a minimal sketch follows this list).
  • Versioning: all embeddings are versioned and benchmarked, allowing serving layers to select by task or freshness.
  • Covariate-shift detection: triggers retraining when new data distributions emerge, based on overlap and performance metrics.
  • Evaluation: embedding quality tracked via ROC-AUC on topic or metadata prediction, Jaccard similarity correlation, and age-based staleness curves.
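
As referenced above, a minimal sketch of online fold-in, assuming an SVD-style factorization whose fold-in matrix is V \Sigma^{-1}; the production pipeline applies analogous rules across its other model families:

```python
import numpy as np

# Item-factor matrix V (n_items x k) and singular values sigma (k,) from an
# offline SVD of the interaction matrix; values are random here for illustration.
n_items, k = 10_000, 64
V = np.random.randn(n_items, k).astype(np.float32)
sigma = np.abs(np.random.randn(k)).astype(np.float32) + 1.0
fold_in = V / sigma                       # fold-in matrix V @ diag(1/sigma)

def fold_in_embedding(interactions):
    """Synthesize an embedding for a novel entity from its sparse interaction
    vector {item_id: weight} without retraining the factorization."""
    e = np.zeros(k, dtype=np.float32)
    for item_id, weight in interactions.items():
        e += weight * fold_in[item_id]
    return e

new_user_embedding = fold_in_embedding({42: 1.0, 1337: 0.5, 9001: 2.0})  # (64,)
```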

The architecture guarantees up-to-date, low-latency embeddings for dynamic entities, while supporting incremental retraining or online update rules for continual adaptation (Shiebler et al., 2018).

7. Prompted and Instruction-Based On-Demand Embedding in Multimodal LLMs

Instruction-tuned generative multimodal LLMs can be adapted into universal on-demand embedding engines by leveraging hierarchical prompting. The framework employs a two-level instruction template—system and user levels—to compel the model to output discriminative, compressed representations via its generative hidden state space. Embeddings are extracted as L2-normalized final hidden vectors, optionally via a learned projection. Hard negative mining (self-aware hard negative sampling, SaHa) enhances discriminative power during fine-tuning with InfoNCE loss:

L_i = -\log \left[ \frac{\exp(\cos(e_{q_i}, e_{c_i^+})/\tau)}{\exp(\cos(e_{q_i}, e_{c_i^+})/\tau) + \sum_{j \neq i} \exp(\cos(e_{q_i}, e_{c_j^+})/\tau)} \right]
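
The loss above can be sketched with in-batch negatives plus optional mined hard negatives appended to the denominator; this approximates what SaHa contributes without reproducing its exact selection rule:

```python
import torch
import torch.nn.functional as F

def info_nce(q, c_pos, c_hard=None, tau=0.05):
    """q, c_pos: (B, d) L2-normalized query / positive-candidate embeddings, so the
    dot product equals cosine similarity. c_hard: optional (B, H, d) mined hard
    negatives per query; in-batch negatives come from the other positives."""
    logits = q @ c_pos.t() / tau                             # (B, B), diag = positives
    if c_hard is not None:
        hard = torch.einsum("bd,bhd->bh", q, c_hard) / tau   # (B, H) hard negatives
        logits = torch.cat([logits, hard], dim=1)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```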

Zero-shot and fine-tuned results on the MMEB benchmark show that such hierarchical prompting is essential: zero-shot performance rises from 13.9 to 43.3 (Precision@1) when the prompt is applied, and full fine-tuning with SaHa yields state-of-the-art performance (67.1), closing the gap to contrastively trained encoders without their full data and compute requirements (Ju et al., 1 Aug 2025).
