Embedding-based Retrieval Fundamentals
- Embedding-based Retrieval (EBR) is a method that represents queries and items as dense vectors to enable efficient nearest neighbor search across diverse domains.
- EBR systems typically use dual-encoder architectures with cosine similarity scoring and fast ANN libraries for sub-linear retrieval over massive corpora.
- Recent advancements include hybrid models, probabilistic thresholding, and multi-vector approaches to overcome theoretical limitations and improve precision and recall.
Embedding-based retrieval (EBR) refers to retrieval systems that represent queries and items (such as documents, products, or entities) as dense vectors—embeddings—and recast the retrieval problem as a search for nearest neighbors in this embedding space. This approach provides a unified and highly scalable mechanism for a broad spectrum of information retrieval and recommendation tasks, ranging from web, product, and entity search to cross-modal applications. EBR has been widely adopted in large-scale commercial systems due to its effectiveness in bridging the semantic gap between queries and candidates and enabling sub-linear search over massive corpora.
1. Foundations: Architecture, Embedding Formation, and Scoring
Contemporary EBR systems almost universally adopt some variant of the dual-encoder (“two-tower”) paradigm, where separate neural towers encode the query and item independently (Huang et al., 2020, Zhang et al., 2022). These towers may be identically parameterized (as in the Siamese setting) or may differ, especially in cross-modal or multimodal configurations (Liang et al., 30 Jun 2025, He et al., 13 Oct 2025).
Given a user query $q$ and a candidate item $d$, the encoders produce $n$-dimensional vectors $\mathbf{q}$ and $\mathbf{d}$, and the retrieval score is typically the cosine similarity or dot product:

$$s(q, d) \;=\; \frac{\mathbf{q}^{\top}\mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert} \quad \text{or} \quad s(q, d) \;=\; \mathbf{q}^{\top}\mathbf{d}$$
These representations may be further refined by combining multiple modalities (e.g., text, image, metadata as tokens in a multimodal transformer (He et al., 2023)), personalization signals (location, user history as in Etsy (Jha et al., 2023)), or knowledge graph features (Liu et al., 2019).
Item embeddings are pre-computed and indexed using fast Approximate Nearest Neighbor (ANN) libraries such as FAISS, HNSW, or product quantization variants (Huang et al., 2020, Rossi et al., 2024). At runtime, a query embedding is generated and the index is searched for high-scoring candidates.
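As an illustrative sketch (not code from any cited system), the serving path reduces to cosine scoring of a query vector against a matrix of pre-computed item embeddings; a production system would replace the brute-force scan below with a FAISS or HNSW index:

```python
import numpy as np

def normalize(x):
    # L2-normalize so that a dot product equals cosine similarity
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query_vec, item_matrix, k=3):
    """Exact top-k retrieval by cosine similarity (stand-in for an ANN index)."""
    scores = normalize(item_matrix) @ normalize(query_vec)
    topk = np.argsort(-scores)[:k]
    return topk.tolist(), scores[topk].tolist()

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 64))   # pre-computed, indexed item embeddings
query = rng.normal(size=64)           # query-tower output at runtime
ids, scores = retrieve(query, items, k=3)
print(ids, [round(s, 3) for s in scores])
```

The exact scan is O(corpus size) per query; ANN libraries trade a small recall loss for sub-linear search over billions of items.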
2. Training Objectives, Loss Functions, and Optimization
EBR systems are predominantly trained using losses that push relevant query–item pairs together and irrelevant pairs apart, with several paradigmatic choices:
- Contrastive InfoNCE (softmax) loss: Used by dual encoders in Facebook, Bing, Taobao, and others (Huang et al., 2020, Zhang et al., 2024, Li et al., 2022). For a batch of $B$ query–positive pairs $(q_i, d_i^{+})$ with in-batch negatives:

$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\left(s(q_i, d_i^{+})/\tau\right)}{\sum_{j=1}^{B} \exp\!\left(s(q_i, d_j^{+})/\tau\right)}$$

  with $\tau$ a learned or fixed temperature.
- Margin-based losses: Pairwise hinge or triplet losses are sometimes used, though the misalignment between a margin loss and global top-K inference can degrade retrieval (Li et al., 2021). Softmax cross-entropy, which matches serving conditions, improves both convergence and recall (Li et al., 2021).
- Listwise/Ranking and Multi-task objectives: In high-value cases (ads, e-commerce), objectives may combine knowledge distillation from high-precision teachers, CTR/profitability, and click feedback (Zhang et al., 2022, Lin et al., 2024).
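The in-batch InfoNCE objective can be sketched in a few lines of NumPy (an illustrative re-implementation, not code from any of the cited systems):

```python
import numpy as np

def info_nce_loss(q, d_pos, tau=0.05):
    """In-batch softmax (InfoNCE) loss: row i's positive is d_pos[i];
    every other row in the batch serves as a negative."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d_pos / np.linalg.norm(d_pos, axis=1, keepdims=True)
    logits = (q @ d.T) / tau                      # [B, B] similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # diagonal = positive pairs

rng = np.random.default_rng(1)
q = rng.normal(size=(8, 32))
loss_random = info_nce_loss(q, rng.normal(size=(8, 32)))
loss_aligned = info_nce_loss(q, q.copy())  # perfectly aligned pairs
print(loss_random > loss_aligned)          # aligned pairs give lower loss
```

Lowering the temperature `tau` sharpens the softmax, penalizing hard in-batch negatives more strongly.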
Hard negative sampling (including in-batch, in-device, or ANN-mined negatives) is critical for effective training (Huang et al., 2020, Li et al., 2021, Jha et al., 2023). In product and recommendation systems, label noise due to implicit feedback is mitigated with temperature smoothing, careful loss weighting, or explicit human feedback distillation (Li et al., 2021, Lin et al., 2024).
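A minimal sketch of offline hard-negative mining: take the top-scoring items that are not labeled relevant. Here an exact scan stands in for the ANN query a production pipeline would use; the function names are hypothetical:

```python
import numpy as np

def mine_hard_negatives(query_vec, item_matrix, positive_ids, n_neg=5):
    """Take the highest-scoring items NOT labeled positive: these 'hard'
    negatives are far more informative for training than random samples."""
    pos = set(positive_ids)
    scores = item_matrix @ query_vec
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked if int(i) not in pos][:n_neg]

rng = np.random.default_rng(2)
items = rng.normal(size=(100, 16))
query = items[7] + 0.1 * rng.normal(size=16)  # query that matches item 7
negs = mine_hard_negatives(query, items, positive_ids=[7])
print(negs)
```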
3. Advanced Variants, Filtering, and Calibration
3.1 Precision Filtering and Score Calibration
Dense embedding-based search, unlike lexical matching, can flood downstream ranking with low-relevance or “junk” items due to uncalibrated cosine scores. This is overcome using query-dependent calibration layers (e.g., the Cosine Adapter) that map raw similarities to absolute probabilities, enabling a global threshold to prune irrelevant results while controlling recall (Rossi et al., 2024). Similarly, sigmoid score transforms and segment-specific thresholds in social network search balance junk removal and recall (Wang et al., 2023).
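A toy version of such calibration: a sigmoid with assumed (not learned) sharpness and midpoint parameters maps raw cosine scores to pseudo-probabilities, so a single global threshold can prune low-relevance candidates:

```python
import math

def calibrate(cos_scores, a=8.0, b=0.55):
    """Map raw cosine similarities to pseudo-probabilities with a sigmoid.
    In a real adapter, a (sharpness) and b (midpoint) would be learned,
    potentially per query or query segment; the values here are illustrative."""
    return [1.0 / (1.0 + math.exp(-a * (s - b))) for s in cos_scores]

def prune(scores, threshold=0.5):
    """Keep indices whose calibrated probability clears a global threshold."""
    probs = calibrate(scores)
    return [i for i, p in enumerate(probs) if p >= threshold]

raw = [0.92, 0.61, 0.40, 0.58, 0.75]
kept = prune(raw)
print(kept)  # the 0.40-scored candidate is pruned
```

The point of calibration is that a fixed threshold on raw cosine scores behaves differently across queries, while a threshold on calibrated probabilities has a consistent meaning.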
3.2 Probabilistic Thresholding
Frequentist EBR methods using a fixed cutoff (a fixed top-$K$ or score threshold) exhibit over- or under-retrieval as the number of relevant items per query varies; probabilistic approaches (pEBR) instead fit, per query, a score distribution $p(s \mid q)$ (e.g., via ExpNCE or BetaNCE) that enables dynamic thresholds at target quantiles, providing both higher recall for head queries and improved precision for tails (Zhang et al., 2024).
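One way to sketch the idea: fit a simple distribution to per-query score distances and cut at a target quantile instead of at a fixed K. The exponential fit below is an illustrative choice, not the paper's exact ExpNCE/BetaNCE formulation:

```python
import numpy as np

def dynamic_cutoff(scores, quantile=0.95):
    """Fit an exponential distribution to (1 - score) 'distances' for this
    query and keep items inside the target quantile, instead of a fixed K."""
    dist = 1.0 - np.asarray(scores)
    lam = 1.0 / dist.mean()                # MLE for the exponential rate
    d_max = -np.log(1.0 - quantile) / lam  # inverse CDF at the quantile
    return [i for i, d in enumerate(dist) if d <= d_max]

# Head-like query: several strong matches plus one clear outlier
head = [0.95, 0.94, 0.93, 0.92, 0.50]
print(dynamic_cutoff(head))  # retrieves the cluster, drops the outlier
```

Because the threshold is derived from each query's own score distribution, queries with many relevant items keep more candidates and sparse queries keep fewer.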
3.3 Multi-embedding and Granularity
For structured or long documents, as in the legal domain, multi-layer EBR generates embeddings at several granularities (e.g., document, section, paragraph, enumeration) and retrieves at the chunk level most semantically aligned with the query (Lima, 2024). In web retrieval, multi-embedding frameworks select segments based on click frequencies to match diverse query intents (Li et al., 2022).
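A minimal sketch of granularity-aware retrieval: score chunks at every level against the query and return the best-aligned one. The embeddings and level names here are synthetic placeholders:

```python
import numpy as np

def best_chunk(query_vec, chunks):
    """Score every chunk regardless of granularity and return the key of
    the one most semantically aligned with the query.
    `chunks` maps (level, chunk_id) -> embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(chunks, key=lambda k: cos(query_vec, chunks[k]))

rng = np.random.default_rng(3)
q = rng.normal(size=8)
chunks = {
    ("document", 0): rng.normal(size=8),
    ("section", 0): rng.normal(size=8),
    ("paragraph", 2): q + 0.05 * rng.normal(size=8),  # near the query
}
print(best_chunk(q, chunks))
```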
4. Compression, Efficiency, and Scalability
Industrial-scale EBR requires extreme efficiency in memory and query latency. Binary EBR (BEBR) compresses float32 embeddings into multi-level binary codes using recurrent MLP-based binarization, compatible with existing ANN backends and offering 30–50% memory and 2x latency reductions with minor accuracy loss (Gan et al., 2023). Product quantization and matching-oriented PQ (MoPQ) jointly optimized with the embedding model (as in Bing’s Uni-Retriever) increase recall, enabling billion-scale candidate pools (Zhang et al., 2022).
Backward-compatible binarization and smooth deployment across embedding versions are supported via embedding-to-embedding training and auxiliary contrastive losses (Gan et al., 2023).
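The core of binary EBR (sign binarization plus Hamming-distance ranking) can be sketched with NumPy bit-packing; this is a simplification of the multi-level recurrent binarization in BEBR:

```python
import numpy as np

def binarize(embs):
    """Sign-binarize float embeddings into packed uint8 codes:
    one bit per dimension, i.e. 32x smaller than float32."""
    return np.packbits(embs > 0, axis=1)

def hamming_topk(query_code, item_codes, k=3):
    """Rank items by Hamming distance on the binary codes."""
    dists = np.unpackbits(item_codes ^ query_code, axis=1).sum(axis=1)
    return np.argsort(dists)[:k].tolist()

rng = np.random.default_rng(4)
items = rng.normal(size=(500, 64))
codes = binarize(items)                                   # offline index
q_code = binarize(items[42:43] + 0.1 * rng.normal(size=(1, 64)))
print(hamming_topk(q_code, codes))                        # item 42 ranks first
```

XOR plus popcount on packed codes is what makes binary search fast on commodity hardware; the float model is only needed at encoding time.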
5. Theoretical Limitations, Expressivity, and Hybrid Models
Single-vector EBR systems are fundamentally limited in which top-$k$ subsets they can retrieve, as a function of the embedding dimension $d$: for a binary query–item relevance matrix $A$, realizing all required top-$k$ sets demands a dimension at least on the order of the sign rank of the shifted matrix,

$$\mathrm{rank}_{\pm}(2A - \mathbf{1}) - 1 \;\le\; d$$

With increasing corpus size and small $d$, many possible top-$k$ sets cannot be realized, even for small $k$ (e.g., $k = 2$) (Weller et al., 28 Aug 2025). This is empirically demonstrated by the LIMIT benchmark, where all SOTA single-vector EBR models fail to achieve full recall; only multi-vector or sparse hybrid models like BM25 or ModernColBERT overcome these constraints. This limitation becomes acute for instruction-following, logic-based, or attribute-combinatorial queries.
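The expressivity limit is easy to see in the degenerate case d = 1: the score q * x is monotone in the scalar item embedding, so only two rankings (ascending and descending) exist, and only two of the C(4, 2) = 6 possible top-2 sets over four items can ever be retrieved:

```python
from itertools import combinations

# With d = 1, score(q, x) = q * x, so the item ranking is the ascending or
# descending order of the scalar item embeddings: only two orderings total.
items = [0.1, 0.5, 0.9, 1.3]  # any 4 distinct scalar embeddings
order_desc = sorted(range(4), key=lambda i: -items[i])
order_asc = order_desc[::-1]

realizable = {frozenset(order_desc[:2]), frozenset(order_asc[:2])}
all_pairs = {frozenset(p) for p in combinations(range(4), 2)}
print(len(realizable), "of", len(all_pairs), "top-2 sets are realizable")
```

Higher dimensions relax but never remove the constraint: for any fixed d there are relevance patterns over a large enough corpus that no single-vector arrangement can express.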
Hybrid systems, blending keyword and embedding-based retrieval, are deployed at scale (Facebook Group Search), with linear score fusion (interpolation weight $\alpha$ tuned via LLM-based evaluation and A/B testing) to maximize both precision and diversity (Su et al., 17 Sep 2025). Multi-vector representations, late-interaction models (e.g., ColBERT), and cross-encoder rerankers are active research areas for breaking the expressivity bottleneck (Weller et al., 28 Aug 2025).
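A minimal sketch of linear score fusion, assuming both score streams have already been normalized to [0, 1]; the identifiers and the example weight are hypothetical:

```python
def hybrid_score(lexical, semantic, alpha=0.3):
    """Linear fusion of a normalized lexical (keyword) score and an
    embedding score; alpha is the interpolation weight tuned offline/online."""
    return alpha * lexical + (1 - alpha) * semantic

# candidates: (id, BM25-like score in [0, 1], cosine score in [0, 1])
cands = [("a", 0.9, 0.2), ("b", 0.1, 0.8), ("c", 0.5, 0.6)]
ranked = sorted(cands, key=lambda c: -hybrid_score(c[1], c[2]))
print([c[0] for c in ranked])
```

The practical subtlety is score normalization: BM25 and cosine similarity live on different scales, so fusion weights are only meaningful after both streams are mapped to a common range.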
6. Application-Specific Extensions
EBR frameworks have been extended to multi-task retrieval (per-cluster task adaptation via prefix-tuning (Zhang et al., 2023)), personalized search (joint query-user encoders (Jha et al., 2023)), multimodal content moderation (vision, text fusion via supervised contrastive learning (Liang et al., 30 Jun 2025)), zero-shot retrieval via synthetic query generation (Liang et al., 2020), and retrieval-augmented generation over complex, hierarchical texts (Lima, 2024). In content-based recommendation and image retrieval, embedding distillation from large teacher models (vLLMs) transfers fine-grained alignment into scalable, dual-encoder systems (He et al., 13 Oct 2025).
7. Practical Lessons, Empirical Findings, and Future Directions
Successful deployment hinges on careful negative mining, ever-fresh ANN tuning, score calibration, label and semi-positive mining, typo and query-robust augmentation, and dedicated handling of “integrity” errors (harmful/junky content) (Wang et al., 2023, Rossi et al., 2024, Lin et al., 2024). Human-in-the-loop feedback (as in Walmart and Que2Engage) and explicit multitask losses outperform naïve label or click-based optimization.
Despite the flexibility and scalability of EBR, known shortcomings remain: the single-vector paradigm is provably limited, and high recall on truly compositional or instruction-rich queries is unattainable without multi-vector, reranking, hybrid, or sparse methods (Weller et al., 28 Aug 2025, Su et al., 17 Sep 2025). Current research is focused on relaxing these constraints, integrating more expressive architectures, and further automating calibration and adaptation across domains and modalities.