Embedding-based Retrieval Fundamentals

Updated 14 April 2026
  • Embedding-based Retrieval (EBR) is a method that represents queries and items as dense vectors to enable efficient nearest neighbor search across diverse domains.
  • EBR systems typically use dual-encoder architectures with cosine similarity scoring and fast ANN libraries for sub-linear retrieval over massive corpora.
  • Recent advancements include hybrid models, probabilistic thresholding, and multi-vector approaches to overcome theoretical limitations and improve precision and recall.

Embedding-based retrieval (EBR) refers to retrieval systems that represent queries and items (such as documents, products, or entities) as dense vectors—embeddings—and recast the retrieval problem as a search for nearest neighbors in this embedding space. This approach provides a unified and highly scalable mechanism for a broad spectrum of information retrieval and recommendation tasks, ranging from web, product, and entity search to cross-modal applications. EBR has been widely adopted in large-scale commercial systems due to its effectiveness in bridging the semantic gap between queries and candidates and enabling sub-linear search over massive corpora.

1. Foundations: Architecture, Embedding Formation, and Scoring

Contemporary EBR systems almost universally adopt some variant of the dual-encoder (“two-tower”) paradigm, where separate neural towers encode the query and item independently (Huang et al., 2020, Zhang et al., 2022). These towers may be identically parameterized (as in the Siamese setting) or may differ, especially in cross-modal or multimodal configurations (Liang et al., 30 Jun 2025, He et al., 13 Oct 2025).

Given a user query $q$ and a candidate item $d$, the encoders produce dense vectors $f(q) \in \mathbb{R}^d$ and $g(d) \in \mathbb{R}^d$, and the retrieval score is typically the cosine similarity or dot product:

$$S(q, d) = \frac{f(q)^\top g(d)}{\|f(q)\|\,\|g(d)\|}$$

These representations may be further refined by combining multiple modalities (e.g., text, image, metadata as tokens in a multimodal transformer (He et al., 2023)), personalization signals (location, user history as in Etsy (Jha et al., 2023)), or knowledge graph features (Liu et al., 2019).

Item embeddings are pre-computed and indexed using fast Approximate Nearest Neighbor (ANN) libraries such as FAISS, HNSW, or product quantization variants (Huang et al., 2020, Rossi et al., 2024). At runtime, a query embedding is generated and the index is searched for high-scoring candidates.
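
As a concrete illustration of this offline-index / online-query split, the following minimal sketch (in Python, assuming FAISS is available) indexes placeholder item embeddings and retrieves by cosine similarity via inner product over L2-normalized vectors; `encode_item` and `encode_query` are hypothetical stand-ins for trained towers, not any particular system's encoders.

```python
# Minimal sketch of dual-encoder EBR serving with FAISS (exact inner-product index).
# encode_query / encode_item are placeholders for the trained query and item towers.
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                   # embedding dimension
rng = np.random.default_rng(0)

def encode_item(texts):                   # placeholder for the item tower g(.)
    return rng.standard_normal((len(texts), d)).astype("float32")

def encode_query(text):                   # placeholder for the query tower f(.)
    return rng.standard_normal((1, d)).astype("float32")

# Offline: embed the catalogue, L2-normalize so inner product equals cosine, and index.
items = [f"item {i}" for i in range(10_000)]
item_vecs = encode_item(items)
faiss.normalize_L2(item_vecs)
index = faiss.IndexFlatIP(d)              # swap in an HNSW or IVF-PQ index at scale
index.add(item_vecs)

# Online: embed the query and fetch the top-k candidates by cosine similarity.
q = encode_query("running shoes")
faiss.normalize_L2(q)
scores, ids = index.search(q, 10)
print(ids[0], scores[0])
```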

2. Training Objectives, Loss Functions, and Optimization

EBR systems are predominantly trained with losses that pull relevant query–item pairs together and push irrelevant pairs apart, with several paradigmatic choices:

  • Softmax/contrastive (InfoNCE-style) losses: each query $q_i$ is scored against its positive $d_i^+$ and a set of in-batch or sampled negatives $d_j^-$:

$$L = -\sum_{i=1}^{N} \log \frac{\exp(S(q_i, d_i^+)/\tau)}{\sum_{j=1}^{N} \exp(S(q_i, d_j^-)/\tau)}$$

with $\tau$ a learned or fixed temperature.

  • Margin-based losses: Pairwise hinge or triplet losses are sometimes used, though the mismatch between a margin objective and global top-K inference at serving time can degrade retrieval (Li et al., 2021). Softmax cross-entropy, which matches serving conditions more closely, improves both convergence and recall (Li et al., 2021).
  • Listwise/Ranking and Multi-task objectives: In high-value cases (ads, e-commerce), objectives may combine knowledge distillation from high-precision teachers, CTR/profitability, and click feedback (Zhang et al., 2022, Lin et al., 2024).

Hard negative sampling (including in-batch, in-device, or ANN-mined negatives) is critical for effective training (Huang et al., 2020, Li et al., 2021, Jha et al., 2023). In product and recommendation systems, label noise due to implicit feedback is mitigated with temperature smoothing, careful loss weighting, or explicit human feedback distillation (Li et al., 2021, Lin et al., 2024).
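
For concreteness, here is a minimal PyTorch sketch of the in-batch softmax objective above, assuming each row of the item batch is the positive for the corresponding query and all other rows act as negatives.

```python
# In-batch softmax (InfoNCE-style) contrastive loss for a dual-encoder batch.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, d_emb, tau=0.05):
    # q_emb, d_emb: [B, dim]; row i of d_emb is the positive for query i.
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.t() / tau                  # [B, B] cosine similarities / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)    # -log softmax probability of the positive pair

# Hard negatives (e.g., ANN-mined) can be appended as extra columns of `logits`.
```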

3. Advanced Variants, Filtering, and Calibration

3.1 Precision Filtering and Score Calibration

Dense embedding-based search, unlike lexical matching, can flood downstream ranking with low-relevance or “junk” items due to uncalibrated cosine scores. This is overcome using query-dependent calibration layers (e.g., the Cosine Adapter) that map raw similarities to absolute probabilities, enabling a global threshold $\tau$ to prune irrelevant results while controlling recall (Rossi et al., 2024). Similarly, sigmoid score transforms and segment-specific thresholds in social network search balance junk removal and recall (Wang et al., 2023).
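
A simplified sketch of this calibration idea, not the Cosine Adapter itself (which learns query-dependent parameters): raw cosine scores pass through a fitted sigmoid and are pruned with one global threshold; the parameters `a`, `b`, and `tau` below are purely illustrative.

```python
# Platt-style calibration of cosine scores followed by a single global threshold.
import numpy as np

def calibrate(cosine_scores, a=12.0, b=-6.0):
    # a, b would be fit on labelled (query, item, is_relevant) data; values here are illustrative.
    s = np.asarray(cosine_scores, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-(a * s + b)))

def prune(item_ids, cosine_scores, tau=0.5):
    # Keep only candidates whose calibrated relevance probability clears the global threshold.
    probs = calibrate(cosine_scores)
    return [(i, p) for i, p in zip(item_ids, probs) if p >= tau]

print(prune(["a", "b", "c"], [0.72, 0.55, 0.31]))
```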

3.2 Probabilistic Thresholding

Frequentist EBR methods with a fixed similarity threshold or fixed top-$k$ cutoff over- or under-retrieve because the number of relevant items varies from query to query; probabilistic approaches (pEBR) instead fit a per-query distribution over relevance scores (e.g., via ExpNCE or BetaNCE) and set dynamic thresholds at target quantiles, yielding higher recall for head queries and improved precision for tail queries (Zhang et al., 2024).
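
The mechanics can be sketched as follows; this is illustrative only and not the actual pEBR parameterization (the ExpNCE/BetaNCE losses and the query-conditioned distribution head are more involved). A predicted rate parameter defines a score distribution whose quantile becomes the per-query cutoff.

```python
# Per-query dynamic thresholding from a fitted distribution (illustrative exponential model
# over the distance 1 - cosine); the quantile of that distribution sets the cosine cutoff.
import numpy as np

def dynamic_threshold(rate_lambda, quantile=0.9):
    # Quantile of Exp(lambda) on 1 - cos, converted back to a cosine cutoff.
    dist_cutoff = -np.log(1.0 - quantile) / rate_lambda
    return 1.0 - dist_cutoff

def prune(candidates, rate_lambda, quantile=0.9):
    # candidates: list of (item_id, cosine_score) pairs from the ANN search.
    t = dynamic_threshold(rate_lambda, quantile)
    return [(i, s) for i, s in candidates if s >= t]

# A small predicted lambda loosens the cutoff (more results, e.g., head queries with many
# relevant items); a large lambda tightens it (fewer results, e.g., tail queries).
```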

3.3 Multi-embedding and Granularity

For structured or long documents, as in the legal domain, multi-layer EBR generates embeddings at several granularities (e.g., document, section, paragraph, enumeration) and retrieves at the chunk level most semantically aligned with the query (Lima, 2024). In web retrieval, multi-embedding frameworks select segments based on click frequencies to match diverse query intents (Li et al., 2022).
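
A minimal sketch of the chunk-at-multiple-granularities pattern, with a placeholder `embed` function standing in for any passage encoder; document structure and field names here are illustrative.

```python
# Multi-granularity retrieval: embed and index chunks at document, section, and paragraph
# level together, then return the best-matching chunk regardless of its level.
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):                                # placeholder encoder, L2-normalized output
    v = rng.standard_normal((len(texts), 64)).astype("float32")
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def build_chunks(doc):
    chunks = [("document", doc["text"])]
    for sec in doc.get("sections", []):
        chunks.append(("section", sec["text"]))
        chunks.extend(("paragraph", p) for p in sec.get("paragraphs", []))
    return chunks

def retrieve(query, docs, k=3):
    chunks = [c for d in docs for c in build_chunks(d)]
    scores = embed([text for _, text in chunks]) @ embed([query])[0]
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```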

4. Compression, Efficiency, and Scalability

Industrial-scale EBR requires extreme efficiency in memory and query latency. Binary EBR (BEBR) compresses float32 embeddings into multi-level binary codes using recurrent MLP-based binarization, remains compatible with existing ANN backends, and delivers roughly 30–50% memory savings and about 2× lower retrieval latency with minor accuracy loss (Gan et al., 2023). Product quantization, particularly matching-oriented PQ (MoPQ) jointly optimized with the embedding model (as in Bing’s Uni-Retriever), increases recall and enables billion-scale candidate pools (Zhang et al., 2022).
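
As a toy illustration of the binary route (plain sign hashing rather than BEBR's learned recurrent-MLP binarization), the following sketch packs float embeddings into bit codes and ranks by Hamming distance.

```python
# Sign-binarize float embeddings, pack them into bits, and rank by Hamming distance.
# This only illustrates the memory/latency trade-off (1 bit per dimension vs. 32).
import numpy as np

def binarize(vecs):                                  # float32 [N, d] -> packed uint8 [N, d/8]
    return np.packbits((vecs > 0).astype(np.uint8), axis=1)

def hamming_topk(query_code, item_codes, k=10):
    xor = np.bitwise_xor(item_codes, query_code)     # [N, d/8]
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per item
    top = np.argsort(dists)[:k]
    return top, dists[top]

rng = np.random.default_rng(0)
items = rng.standard_normal((100_000, 256)).astype("float32")
codes = binarize(items)                              # 256 bits = 32 bytes/item vs. 1 KiB in float32
q_code = binarize(rng.standard_normal((1, 256)).astype("float32"))
print(hamming_topk(q_code, codes))
```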

Backward-compatible binarization and smooth deployment across embedding versions are supported via embedding-to-embedding training and auxiliary contrastive losses (Gan et al., 2023).

5. Theoretical Limitations, Expressivity, and Hybrid Models

Single-vector EBR systems are fundamentally limited in the number of distinct top-$k$ result sets they can realize: the count of retrievable subsets is bounded as a function of the embedding dimension $d$ (Weller et al., 28 Aug 2025). As the corpus grows while $d$ stays small, many possible top-$k$ sets cannot be returned by any query embedding, even for small $k$ (e.g., $k = 2$). This is demonstrated empirically by the LIMIT benchmark, where all state-of-the-art single-vector EBR models fail to achieve full recall; only multi-vector or sparse/hybrid models such as ModernColBERT or BM25 overcome these constraints. This limitation becomes acute for instruction-following, logic-based, or attribute-combinatorial queries.

Hybrid systems, blending keyword and embedding-based retrieval, are deployed at scale (e.g., Facebook Group Search), using linear score fusion with a scalar weight tuned via LLM-based evaluation and A/B testing to maximize both precision and diversity (Su et al., 17 Sep 2025). Multi-vector representations, late-interaction models (e.g., ColBERT), and cross-encoder rerankers are active research directions for breaking the expressivity bottleneck (Weller et al., 28 Aug 2025).
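
A sketch of the linear fusion step, with hypothetical helper names and per-query min-max normalization so that lexical and embedding scores are on a comparable scale; in production the weight is tuned via LLM-based evaluation and A/B tests as noted above.

```python
# Linear score fusion of a lexical retriever (e.g., BM25) and an embedding retriever.
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo + 1e-9) for k, v in scores.items()}

def fuse(bm25_scores, dense_scores, alpha=0.6, k=10):
    # bm25_scores, dense_scores: dict item_id -> score from each retriever for one query.
    bm25_n, dense_n = minmax(bm25_scores), minmax(dense_scores)
    ids = set(bm25_n) | set(dense_n)
    fused = {i: alpha * dense_n.get(i, 0.0) + (1 - alpha) * bm25_n.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda x: -x[1])[:k]
```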

6. Application-Specific Extensions

EBR frameworks have been extended to multi-task retrieval (per-cluster task adaptation via prefix-tuning (Zhang et al., 2023)), personalized search (joint query–user encoders (Jha et al., 2023)), multimodal content moderation (vision–text fusion via supervised contrastive learning (Liang et al., 30 Jun 2025)), zero-shot retrieval via synthetic query generation (Liang et al., 2020), and retrieval-augmented generation over complex, hierarchical texts (Lima, 2024). In content-based recommendation and image retrieval, embedding distillation from large teacher models (vLLMs) transfers fine-grained alignment into scalable dual-encoder systems (He et al., 13 Oct 2025).

7. Practical Lessons, Empirical Findings, and Future Directions

Successful deployment hinges on careful negative mining, continual ANN index tuning and refresh, score calibration, label and semi-positive mining, typo- and query-robustness augmentation, and dedicated handling of “integrity” violations (harmful or junk content) (Wang et al., 2023, Rossi et al., 2024, Lin et al., 2024). Human-in-the-loop feedback (as in Walmart and Que2Engage) and explicit multi-task losses outperform naïve label- or click-based optimization.

Despite the flexibility and scalability of EBR, known shortcomings remain: the single-vector paradigm is provably limited, and high recall on truly compositional or instruction-rich queries is unattainable without multi-vector, reranking, hybrid, or sparse methods (Weller et al., 28 Aug 2025, Su et al., 17 Sep 2025). Current research is focused on relaxing these constraints, integrating more expressive architectures, and further automating calibration and adaptation across domains and modalities.
