Embedding-Based Retrieval (EBR)
- Embedding-Based Retrieval (EBR) is a method that learns vector representations in a shared semantic space, powering scalable and context-aware search across diverse applications.
- It employs dual encoder neural architectures along with techniques like quantization, negative mining, and binary embeddings to efficiently match queries and items.
- EBR integrates multimodal and contextual signals—including user behavior and social graph data—to enhance personalization and improve real-world search performance.
Embedding-Based Retrieval (EBR) is a method that learns vector representations for queries and items (such as documents, products, ads, or multimedia) in a shared semantic space, enabling large-scale retrieval via similarity search (typically with approximate nearest neighbor algorithms). Unlike classical text retrieval methods that rely only on keyword matching or manually designed rules, EBR matches queries and items by semantic similarity. It is central to modern search and recommendation systems, powering applications across social networks, e-commerce, sponsored search, and content moderation.
1. Unified Neural Representation and Modeling
At the core of EBR are neural architectures—most notably two-tower (dual encoder) models—which map each query and candidate item to dense embeddings in a shared space. The similarity function, most often cosine similarity, determines the retrieval ranking and makes efficient nearest neighbor search possible.
Distinct from traditional text dual encoders, industry-scale systems like Facebook Search and Etsy Search have extended the embedding approach by incorporating non-textual, contextual, and social graph signals into the embedding process (2006.11632, 2306.04833). For instance, query encoders take into account user profile data, location, social connections, and prior behaviors, while document/item encoders integrate social graph features, entity metadata, and product attributes. This unification allows capturing personalized intent—such as surfacing socially close entities in people search or tailoring product results to user shopping behaviors.
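To make the two-tower setup concrete, the sketch below shows a minimal dual encoder that fuses text tokens with contextual features on each side and scores candidates by cosine similarity. The feature set, dimensions, and fusion choices are illustrative assumptions, not the architecture of any specific production system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One side of a two-tower model: text tokens plus contextual features."""
    def __init__(self, vocab_size: int, num_context_feats: int, dim: int = 64):
        super().__init__()
        self.text_emb = nn.EmbeddingBag(vocab_size, dim)   # mean-pooled token embeddings
        self.context = nn.Linear(num_context_feats, dim)   # e.g. location / social / behavior features
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, token_ids, context_feats):
        fused = torch.cat([self.text_emb(token_ids), self.context(context_feats)], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)       # unit norm: dot product = cosine similarity

query_tower = Tower(vocab_size=30_000, num_context_feats=16)
item_tower = Tower(vocab_size=30_000, num_context_feats=8)

q = query_tower(torch.randint(0, 30_000, (4, 10)), torch.randn(4, 16))    # 4 queries
d = item_tower(torch.randint(0, 30_000, (100, 12)), torch.randn(100, 8))  # 100 candidate items
scores = q @ d.T                          # (4, 100) cosine similarities
topk = scores.topk(k=10, dim=1).indices   # top-K items per query
```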
2. System Architectures and Scalability
Scalable deployment of EBR at web scale hinges on integrating fast, resource-efficient retrieval infrastructure. Facebook Search extended its traditional inverted-index search with quantized document embeddings indexed using methods like IVF and product quantization (PQ), facilitating hybrid Boolean-semantic queries and efficient approximate nearest neighbor (ANN) search (2006.11632). Systems typically compute and store document/item embeddings in batch (offline), while query/user embeddings are generated in real time during requests.
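A minimal sketch of this style of offline IVF+PQ indexing with the faiss library is shown below; the corpus size and index parameters are illustrative, and since the vectors are unit-normalized, L2 ranking matches cosine ranking.

```python
import numpy as np
import faiss

dim, n_docs = 64, 100_000
doc_vecs = np.random.rand(n_docs, dim).astype("float32")
faiss.normalize_L2(doc_vecs)                 # with unit norms, L2 ranking == cosine ranking

quantizer = faiss.IndexFlatL2(dim)           # coarse quantizer over the IVF cells
index = faiss.IndexIVFPQ(quantizer, dim, 1024, 8, 8)   # 1024 cells, 8 sub-vectors, 8 bits each
index.train(doc_vecs)                        # learn centroids and PQ codebooks offline (batch)
index.add(doc_vecs)                          # store compressed document codes

index.nprobe = 16                            # scan breadth: IVF cells visited per query
query = np.random.rand(1, dim).astype("float32")   # query embedding computed at request time
faiss.normalize_L2(query)
distances, ids = index.search(query, 10)     # approximate top-10 document ids
```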
Bi-granular architectures further enhance scalability: at Microsoft, sponsored ads retrieval uses lightweight "sparse" (quantized) embeddings in memory for fast broad candidate selection, followed by on-disk "dense" embeddings for precise re-ranking, thereby fitting billion-item indices in moderate RAM (2201.05409).
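The two-stage idea can be sketched as follows: an int8-quantized in-memory copy handles broad candidate selection, and the full-precision vectors are consulted only for re-ranking the candidates. The quantization scheme, corpus size, and candidate budget are assumptions for illustration, not the production recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_items = 64, 200_000
dense = rng.standard_normal((n_items, dim), dtype=np.float32)   # full-precision, kept "on disk"
scale = np.abs(dense).max() / 127.0
coarse = np.round(dense / scale).astype(np.int8)                 # lightweight quantized copy in RAM

query = rng.standard_normal(dim, dtype=np.float32)

# Stage 1: broad candidate selection on the quantized copy.
coarse_scores = coarse.astype(np.float32) @ query
candidates = np.argpartition(-coarse_scores, 1_000)[:1_000]

# Stage 2: precise re-ranking using full-precision vectors of the candidates only.
exact_scores = dense[candidates] @ query
top10 = candidates[np.argsort(-exact_scores)[:10]]
```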
Recent advancements also include binary embedding engines (e.g., BEBR at Tencent (2302.08714)) where float embeddings are compressed to multi-bit binary codes, reducing index cost by 30–50% while maintaining accuracy.
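A minimal sketch of 1-bit binarization with Hamming-distance scoring is given below; it illustrates the general binarization idea rather than BEBR's specific multi-bit scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_items = 256, 10_000
item_vecs = rng.standard_normal((n_items, dim))
item_codes = np.packbits(item_vecs > 0, axis=1)      # 256 bits -> 32 bytes per item

query_code = np.packbits(rng.standard_normal(dim) > 0)

# Hamming distance: XOR the packed codes and count differing bits (lower = more similar).
diff_bits = np.unpackbits(np.bitwise_xor(item_codes, query_code), axis=1)
hamming = diff_bits.sum(axis=1)
top10 = np.argsort(hamming)[:10]
```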
3. Optimization: Training Objectives, Negative Mining, and Full-Stack Tuning
EBR effectiveness depends critically on both architectural design and the specifics of the training regime.
Training Losses: Modern systems have moved from margin-based triplet or pairwise losses toward softmax cross-entropy objectives on the entire candidate pool, aligning training more closely with inference (global top-K selection) (2106.09297). When user objective hierarchies exist (e.g., relevance → exposure → click → purchase), sequential and hierarchical multi-objective optimization is used, with tailored sample construction and loss weighting (as in MOPPR (2210.04170) and CSMF (2504.12920)).
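A minimal sketch of the softmax cross-entropy objective with in-batch negatives is shown below; the temperature and batch size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(query_emb, item_emb, temperature=0.05):
    """query_emb[i] and item_emb[i] form a positive pair; all other items in the
    batch act as negatives, so each query solves a B-way classification problem."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(item_emb, dim=-1)
    logits = q @ d.T / temperature            # (B, B) similarity matrix
    labels = torch.arange(q.size(0))          # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

queries = torch.randn(32, 64, requires_grad=True)
items = torch.randn(32, 64, requires_grad=True)
loss = in_batch_softmax_loss(queries, items)
loss.backward()
```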
Negative and Hard Negative Mining: Hard mining is essential for discriminative retrieval. Both online batch-based hard negative selection and offline mining (selecting "hard" negatives from high but non-top retrieval ranks) are widely employed, as are "semi-positives" for uncertain cases (2006.11632, 2408.04884). Ensembles over models tuned to different negative strengths (easy vs. hard) improve recall and precision.
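The offline variant can be sketched as follows: score the corpus with the current model, then sample negatives from a high-but-not-top rank window while excluding known positives. The rank window and sample counts here are assumptions for illustration.

```python
import numpy as np

def mine_hard_negatives(query_embs, item_embs, positives, lo=100, hi=500, n_neg=5, seed=0):
    """Sample negatives from a high-but-not-top rank window of each query's results."""
    rng = np.random.default_rng(seed)
    ranked = np.argsort(-(query_embs @ item_embs.T), axis=1)   # items by descending score
    hard_negs = []
    for qi, ranked_items in enumerate(ranked):
        window = [i for i in ranked_items[lo:hi] if i not in positives[qi]]
        hard_negs.append(rng.choice(window, size=n_neg, replace=False))
    return np.array(hard_negs)

rng = np.random.default_rng(1)
queries = rng.standard_normal((8, 32))
items = rng.standard_normal((2_000, 32))
positives = {qi: {qi} for qi in range(8)}                      # toy positive item sets
negatives = mine_hard_negatives(queries, items, positives)     # (8, 5) item ids per query
```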
Full-Stack Optimization: Embedding retrieval features (e.g., cosine similarity, Hadamard product) are propagated into ranking, and feedback loops using human-labeled data close the gap between retrieval and ranking relevance. ANN infrastructure is optimized end-to-end, with parameters (number of clusters, scan breadth) tuned for business objectives and latency.
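As a small illustration of propagating retrieval signals into the ranker, the snippet below derives a cosine-similarity scalar and a Hadamard-product vector from the query and item embeddings; the feature names are hypothetical.

```python
import numpy as np

def ebr_ranking_features(query_emb, item_emb):
    q = query_emb / np.linalg.norm(query_emb)
    d = item_emb / np.linalg.norm(item_emb)
    return {
        "ebr_cosine": float(q @ d),   # scalar match strength for the ranker
        "ebr_hadamard": q * d,        # per-dimension interaction feature vector
    }

features = ebr_ranking_features(np.random.rand(64), np.random.rand(64))
```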
4. Beyond Text: Personalization, Multimodality, and Diversity
EBR has evolved from text-based retrieval toward supporting comprehensive, context-rich, and multimodal scenarios. Personalized retrieval integrates multi-granular user signals—recent searches, purchases, session context—via attention mechanisms or graph embeddings (2306.04833, 2307.04322).
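A minimal sketch of attention-based personalization is shown below: the query embedding attends over embeddings of recent user actions and the pooled behavior summary is fused back in. The additive fusion and dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def personalize_query(query_emb, history_embs):
    """query_emb: (dim,); history_embs: (n_events, dim) of recent searches/purchases."""
    dim = query_emb.shape[0]
    attn = F.softmax(history_embs @ query_emb / dim ** 0.5, dim=0)   # relevance of each past event
    history_summary = attn @ history_embs                            # weighted behavior summary
    return F.normalize(query_emb + history_summary, dim=-1)         # simple additive fusion

personalized = personalize_query(torch.randn(64), torch.randn(12, 64))
```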
In Facebook and Taobao search, model architectures explicitly encode cascading objectives (e.g., exposure → click → purchase), and deployment supports real-time, scenario-driven objective weighting (2210.04170, 2504.12920). In content moderation, EBR supports visual and text modalities with multimodal encoders, leveraging supervised contrastive learning to align by risk, not appearance (2507.01066).
Newer divide-and-conquer frameworks employ clustering over the corpus, enabling parallel retrieval across clusters and controllable diversity/fairness in the final candidate pool (2302.02657). This allows system designers to directly balance user interest coverage and accuracy.
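A toy sketch of this divide-and-conquer pattern is given below: k-means partitions the corpus, each cluster is searched independently (and could be searched in parallel), and a per-cluster quota bounds how much any single cluster contributes to the merged pool. The cluster count and quota are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
items = rng.standard_normal((5_000, 32)).astype(np.float32)
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(items)

def retrieve_diverse(query, k=30, per_cluster_quota=5):
    scores = items @ query
    pool = []
    for c in range(8):                                    # each cluster can be searched in parallel
        members = np.where(clusters == c)[0]
        best = members[np.argsort(-scores[members])[:per_cluster_quota]]
        pool.extend(best.tolist())
    return sorted(pool, key=lambda i: -scores[i])[:k]     # merge clusters, keep global top-k

result = retrieve_diverse(rng.standard_normal(32).astype(np.float32))
```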
5. Probabilistic and Robust Approaches
A limitation of classic EBR is the use of fixed thresholds (e.g., top-K) for all queries, which can under-retrieve for broad ("head") queries and over-retrieve for specific ("tail") queries. Probabilistic EBR (pEBR) models the distribution of similarity scores for each query and sets dynamic, query-specific retrieval thresholds by inverting the learned cumulative distribution function (CDF), resulting in improved recall and precision across the full query spectrum (2410.19349).
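The sketch below illustrates the idea under a simple normal model of the relevant-item score distribution: a per-query inverse-CDF cutoff replaces the fixed top-K, so the number of retrieved items adapts to query breadth. The distributional form, predicted parameters, and coverage target are assumptions, not the paper's exact parameterization.

```python
import numpy as np
from scipy.stats import norm

def dynamic_threshold(pred_mu, pred_sigma, coverage=0.9):
    """Score cutoff keeping ~`coverage` of relevant items under a normal score model."""
    return norm.ppf(1.0 - coverage, loc=pred_mu, scale=pred_sigma)   # inverse CDF

rng = np.random.default_rng(0)
corpus_scores = rng.normal(0.0, 0.2, size=1_000_000)   # toy stand-in for one query's similarity to every item

# A broad ("head") query with many moderately relevant items vs. a narrow ("tail")
# query whose few relevant items score very high: the cutoff, and hence K, adapts.
for name, mu, sigma in [("head", 0.35, 0.15), ("tail", 0.75, 0.05)]:
    cutoff = dynamic_threshold(mu, sigma)
    print(f"{name}: cutoff={cutoff:.2f}, items retrieved={(corpus_scores > cutoff).sum()}")
```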
Industrial systems increasingly use robust self-supervised multitask objectives (e.g., SSMTL), combining the retrieval loss with auxiliary tasks such as CCA and masked autoencoding, yielding improved generalization and notably better performance in cold-start and sparse-user regimes (2409.14682).
6. Evaluation, Metrics, and Impact in Large-Scale Systems
EBR is judged by recall@K, NDCG, precision, RPM (revenue per mille), and task-specific metrics (e.g., new friends made in social graphs, action rates in content moderation). Across Facebook, Microsoft Bing, Taobao, Etsy, Tencent, and Walmart, multi-million to billion-scale industrial deployments report significant improvements:
- Recall gains of up to 18% and significant increases in transactions and revenue in e-commerce search (2006.11632, 2210.04170).
- Up to 17.5% gain in recall@10 on billion-entry ad corpora (2201.05409).
- Uplifts of 5.45% in connection rates for friend recommendation (2409.14682).
- 10%+ increase in moderation actions and >80% reduction in operational trend-handling cost compared to pure classification (2507.01066).
Efficiency and scalability are maintained through judicious use of quantization, binarization, and distributed ANN techniques, with most systems remaining within tight serving latency bounds.
7. Directions and Advanced Topics
Recent work explores the integration of event-centric reasoning for real-time search (e.g., event triplet extraction with a decoder at train-time only (2404.05989)), cascade selective masking for parameter-efficient multi-objective serving (2504.12920), and transformer-based retrieval with next-action prediction and multi-interest extraction (e.g., KuaiFormer (2411.10057)). These developments move EBR beyond static, vector-based exact matching toward a more flexible, adaptive, and context-aware foundation for information retrieval and recommendation at global scale.
Summary Table: EBR Variants and System Impact
| Variant/Technique | Key Feature | Reported Impact/Use Case |
|---|---|---|
| Unified embedding frameworks | Personalized, multi-feature encoder | Facebook, Etsy product and people search |
| Bi-granular/sparse+dense | In-memory sparse + on-disk dense representations | Bing Ads, production web search |
| Cascade/multi-objective tuning | Sequential fine-tuning and selective parameter masking | Taobao, Alibaba, AliExpress, Walmart |
| Hard negative/ensemble mining | Balanced easy/hard negatives, ensembling for quality | Facebook, Etsy, Taobao, Walmart |
| Probabilistic retrieval (pEBR) | Query-specific, CDF-based thresholding | Improved recall/precision at all frequencies |
| Binary embedding engines | Binarization, version compatibility, SIMD acceleration | Tencent Sogou, QQ, Tencent Video |
| Transformer-based next-action | Sequence-awareness, multi-interest representation | Kuaishou, 400M DAUs, watch time uplift |
| Supervised contrastive learning | Risk-aware multimodal embeddings for retrieval | Video moderation, >80% ops cost reduction |
Embedding-Based Retrieval methods have proven fundamental to modern industrial search and recommendation, with ongoing innovation enabling robust, efficient, and adaptive retrieval at unprecedented scale and complexity.