Candidate Retrieval: Dense and Hybrid Search
- Candidate retrieval is the process of selecting a small set of potentially relevant items from a vast collection using methods like dense, sparse, and hybrid techniques.
- Dense retrieval maps queries and documents to continuous vectors for similarity search, while sparse methods rely on term matching and hybrid approaches combine both for improved performance.
- Practical systems fuse candidate lists using techniques such as reciprocal rank fusion and weighted sum scoring to optimize recall, robustness, and efficiency in multi-stage pipelines.
Candidate retrieval is the process of selecting a small set of potentially relevant items, documents, or responses from a large collection in response to a query. This forms the first stage of multi-stage information retrieval pipelines, including web search, recommender systems, and retrieval-augmented generation (RAG) frameworks. Modern candidate retrieval is dominated by dense, sparse, and hybrid retrieval approaches. Dense methods use neural embeddings to compute similarity in a latent space, while sparse methods rely on lexical overlap; hybrid approaches seek to unify these paradigms for superior recall, diversity, and resilience across query and data types.
1. Retrieval Paradigms: Sparse, Dense, and Hybrid Approaches
Sparse Retrieval
Sparse retrieval leverages term-based, inverted index structures. Classical full-text search uses the Okapi BM25 scoring function:

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \log\frac{N - n_t + 0.5}{n_t + 0.5} \cdot \frac{f_{t,d}\,(k_1 + 1)}{f_{t,d} + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}$$

where $N$ is the corpus size, $n_t$ the number of documents containing term $t$, $f_{t,d}$ the frequency of $t$ in $d$, $|d|$ the document length, and $\mathrm{avgdl}$ the mean document length (Mala et al., 28 Feb 2025, Wang et al., 2 Aug 2025).
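The BM25 scoring function can be sketched in a few lines of Python. This is a minimal, illustrative implementation over a toy tokenized corpus; `k1` and `b` are the standard free parameters (typical defaults 1.5 and 0.75), and the IDF term uses the common `+1` smoothing variant (as in Lucene) to keep it nonnegative for very frequent terms.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)  # document frequency of t
        if n_t == 0:
            continue
        # Smoothed IDF (Lucene-style +1 keeps the term nonnegative).
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        f = tf[t]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["dense", "retrieval", "uses", "embeddings"],
          ["sparse", "retrieval", "uses", "terms"],
          ["hybrid", "retrieval", "fuses", "both"]]
scores = [bm25_score(["sparse", "retrieval"], d, corpus) for d in corpus]
best = max(range(len(corpus)), key=scores.__getitem__)  # doc 1 matches "sparse"
```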
Learned sparse models (SPLADE, uniCOIL) output high-dimensional, sparse bag-of-words vectors but can also be indexed via inverted files (Lin et al., 2022).
Dense Retrieval
Dense methods map queries and documents to low-dimensional continuous vectors using dual-encoder Transformer models. Similarity is measured via cosine or inner product:

$$s(q, d) = \frac{E_q(q)^\top E_d(d)}{\lVert E_q(q)\rVert\,\lVert E_d(d)\rVert} \qquad \text{or} \qquad s(q, d) = E_q(q)^\top E_d(d)$$

Approximate nearest neighbor search (ANNS), such as HNSW graphs or IVF-PQ, enables scalable sublinear search over millions of vectors (Ma et al., 2023, Zhang et al., 2024).
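A minimal dense-retrieval sketch with toy embeddings: a real system would encode text with a dual-encoder and replace the brute-force scan below with an ANN index such as HNSW or IVF-PQ; the random vectors here just stand in for learned embeddings.

```python
import numpy as np

def dense_search(query_vec, doc_vecs, top_k=2, metric="cosine"):
    """Brute-force similarity search; ANN indexes replace this scan at scale."""
    if metric == "cosine":
        q = query_vec / np.linalg.norm(query_vec)
        d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        sims = d @ q
    else:  # raw inner product
        sims = doc_vecs @ query_vec
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64)).astype(np.float32)   # stand-in embeddings
query = docs[42] + 0.01 * rng.normal(size=64)           # slightly perturbed doc 42
idx, sims = dense_search(query, docs)                   # idx[0] should be 42
```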
Hybrid Retrieval
Hybrid retrieval integrates signals from both paradigms. Typical techniques:
- Fusion of Candidate Lists: Union, intersection, or reciprocal rank fusion (RRF) merges top-$k$ lists (Mala et al., 28 Feb 2025, Wang et al., 2 Aug 2025).
- Score Integration: Weighted sum, linear combination, or learned rescoring models (e.g., LambdaMART, XGBoost ensembles) combine BM25 and dense scores (Dadas et al., 2024, Sultania et al., 2024).
- Unified Index Structures: Single-index methods like HI² combine cluster and term selectors for one-pass retrieval (Zhang et al., 2022).

Hybrid retrieval consistently yields improved recall, robustness to query type (lexical vs. semantic vs. entity-rich), and resilience to domain shifts.
2. Fusion Mechanisms and Adaptive Combination Strategies
Rank-Based: Reciprocal Rank Fusion (RRF)
Let document $d$ have rank $r_i(d)$ from each retriever $i$. RRF combines them by:

$$\mathrm{RRF}(d) = \sum_i \frac{1}{k + r_i(d)}$$

where $k$ is a smoothing constant (commonly $k = 60$). This method is agnostic to the scale of underlying scores and robust to outliers (Mala et al., 28 Feb 2025, Wang et al., 2 Aug 2025). Dynamically weighted RRF can further adapt to per-query characteristics, e.g., specificity weights computed as the average tf–idf of expanded queries (Mala et al., 28 Feb 2025).
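Reciprocal rank fusion amounts to a few lines of code. The sketch below fuses ranked lists of document ids; $k = 60$ is the smoothing constant commonly used in the literature, and the two toy hit lists are illustrative.

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked doc-id lists via reciprocal rank fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_hits = ["d3", "d1", "d7"]   # e.g., BM25 top-3
dense_hits  = ["d1", "d5", "d3"]   # e.g., embedding top-3
fused = rrf_fuse([sparse_hits, dense_hits])  # d1 wins: ranked high in both lists
```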
Score-Based: Weighted Sum and Linear Models
Hybrid scores can be computed with learnable or manually tuned weights:

$$s_{\mathrm{hybrid}}(q, d) = \alpha\, s_{\mathrm{dense}}(q, d) + (1 - \alpha)\, s_{\mathrm{sparse}}(q, d)$$

The weight $\alpha$ may be set via validation or learned end-to-end as in gradient-boosted regression tree models (Dadas et al., 2024, Sultania et al., 2024).
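A weighted-sum fusion sketch with min-max normalization, which puts BM25 and dense scores on a comparable scale before combining them. The weight `alpha` and the toy score dictionaries here are illustrative; in practice the weight is tuned on a validation set or learned.

```python
def minmax(scores):
    """Rescale a dict of scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_fuse(dense_scores, sparse_scores, alpha=0.6):
    """Normalize each score set, then combine: alpha*dense + (1-alpha)*sparse."""
    dense, sparse = minmax(dense_scores), minmax(sparse_scores)
    docs = set(dense) | set(sparse)
    fused = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

dense = {"d1": 0.92, "d2": 0.85, "d3": 0.10}    # cosine similarities
sparse = {"d2": 14.1, "d3": 11.0, "d4": 2.3}    # raw BM25 scores
ranking = weighted_fuse(dense, sparse)           # d2 scores well in both
```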
Advanced Fusion: Tensor-Based and Token-Level Models
Tensor-based fusion applies late-interaction mechanisms similar to ColBERT or TRF, computing token–token max-sim interactions but only over a reduced candidate pool, balancing the semantic coverage of token-level search with the resource consumption of dense retrieval (Wang et al., 2 Aug 2025).
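The late-interaction (max-sim) score can be sketched with toy token embeddings: for each query token, take the maximum similarity over document tokens, then sum over query tokens. Real systems use ColBERT-style per-token encoders and apply this scoring only to a reduced candidate pool; the random vectors below are stand-ins.

```python
import numpy as np

def maxsim_score(query_toks, doc_toks):
    """ColBERT-style late interaction over token embedding matrices."""
    q = query_toks / np.linalg.norm(query_toks, axis=1, keepdims=True)
    d = doc_toks / np.linalg.norm(doc_toks, axis=1, keepdims=True)
    sim = q @ d.T                      # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best doc match per query token

rng = np.random.default_rng(1)
query = rng.normal(size=(4, 32))                           # 4 query-token vectors
doc_a = np.vstack([query[:2], rng.normal(size=(6, 32))])   # shares 2 query tokens
doc_b = rng.normal(size=(8, 32))                           # unrelated document
# doc_a scores higher: two query tokens find exact matches in it.
```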
3. System Architectures: Index Structures, Efficiency, and Scalability
Separate Indexing and Parallel Querying
Most candidate retrieval stacks index sparse and dense representations separately, using Lucene/Anserini for inverted indexes and Faiss/Lucene-HNSW for dense ANN (Ma et al., 2023). Queries are issued concurrently, and results are merged/fused afterward.
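The separate-index pattern reduces to two concurrent lookups followed by fusion. In this sketch, `sparse_search` and `dense_search` are placeholders standing in for Lucene/Anserini and Faiss calls, and the merge step uses reciprocal rank scoring.

```python
from concurrent.futures import ThreadPoolExecutor

def sparse_search(query, k):      # placeholder for a Lucene/Anserini lookup
    return [("d3", 12.4), ("d1", 9.8)][:k]

def dense_search(query, k):       # placeholder for a Faiss/HNSW lookup
    return [("d1", 0.91), ("d5", 0.88)][:k]

def hybrid_retrieve(query, k=10):
    """Query both indexes concurrently, then merge by reciprocal rank."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sparse_future = pool.submit(sparse_search, query, k)
        dense_future = pool.submit(dense_search, query, k)
        hit_lists = [sparse_future.result(), dense_future.result()]
    fused = {}
    for ranking in hit_lists:
        for rank, (doc_id, _score) in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (60 + rank)
    return sorted(fused, key=fused.get, reverse=True)[:k]

results = hybrid_retrieve("hybrid retrieval")  # d1 appears in both lists
```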
Unified or Hybrid Indices
HI² and recent graph-based hybrid indices enable a joint retrieval of candidates via both clustering (semantic) and term-based (lexical) inverted lists, achieving lossless first-stage recall with low latency (Zhang et al., 2022). Graph-based HNSW can store both dense and sparse (CSR-formatted) vectors, employing two-stage search: coarse dense-only traversal, then hybrid (dense+aligned sparse) scoring in the beam refinement phase (Zhang et al., 2024).
Complexity and Cost
Hybrid search increases candidate set size (up to 2x), memory (multiple index structures), and per-query computation (fusion overhead). Recent systems report:
- Index sizes: a dense index (768-dim float16 vectors, 8.8M passages) occupies 8–28 GB (Ma et al., 2023, Lin et al., 2022).
- Latency: 30–90 ms/query for hybrid RRF or weighted sum; token-level fusion or reranking can add 10–20 ms per query (Wang et al., 2 Aug 2025).
Efficiency enhancements include pruning low-weight sparse terms (Zhang et al., 2024), dynamic candidate truncation (keeping top 200–300 docs for re-ranking) (Macdonald et al., 2021), and parallel retrieval with subsequent deduplication/aggregation (Dadas et al., 2024).
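The deduplication-and-truncation step mentioned above can be sketched as follows; the pool of scored candidates is illustrative, and duplicates (the same id returned by multiple retrieval paths) keep their best score.

```python
def dedup_truncate(candidates, keep=200):
    """Merge duplicate doc ids (keeping the best score), then truncate
    the pool to the top `keep` documents for re-ranking."""
    best = {}
    for doc_id, score in candidates:
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:keep]

# d1 arrives from two retrieval paths with different scores.
pool = [("d1", 0.9), ("d2", 0.4), ("d1", 0.7), ("d3", 0.8)]
top = dedup_truncate(pool, keep=2)
```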
4. Empirical Evaluations: Quality, Robustness, and Downstream Impact
Retrieval Effectiveness
Hybrid retrieval outperforms pure methods on a wide range of metrics:
- On HaluBench, a hybrid system achieved MAP@3 = 0.897, NDCG@3 = 0.915 versus best single retriever (MAP@3 = 0.768, NDCG@3 = 0.783) (Mala et al., 28 Feb 2025).
- On the PIRB Polish benchmark (41 tasks), hybrid methods outperformed BM25 by 17.3 NDCG@10 points and dense methods by up to 9 points after distillation and fine-tuning (Dadas et al., 2024).
- On BEIR, hybrid approaches with LLM-driven feedback achieved the new zero-shot state-of-the-art (e.g., NDCG@10 = 47.0 with ReDE-RF) (Jedidi et al., 2024).
Downstream LLM and QA Impact
Hybrid retrieval directly mitigates LLM hallucinations in RAG: on the HaluBench subset, the hallucination rate dropped from 21.2–28.9% (pure methods) to 9.38% (hybrid), and LLM answer accuracy on previously failing cases rose to 80.4% (Mala et al., 28 Feb 2025). Domain-specialized QA saw additional gains from incorporating metadata-based boosts (e.g., host/domain priors) (Sultania et al., 2024).
Resource and Cost Trade-offs
Hybrid architectures increase recall but incur resource and complexity trade-offs. The "weakest link" effect is significant: poor-quality retrieval from any path can degrade overall accuracy more than it helps, necessitating path-wise validation before including a retriever in the ensemble (Wang et al., 2 Aug 2025).
Diversity, Cold-Start, and Multi-Interest
Hybrid and multi-interest candidate retrieval (e.g., kNN-Embed) delivers not only improved recall but also higher response set diversity by explicitly modeling user intent as a mixture over item clusters (El-Kishky et al., 2022). Hybrid re-ranking (e.g., via LIGER) addresses the cold-start problem in sequential recommendation by enabling coverage of both seen and previously unseen items (Yang et al., 2024).
5. Advanced Design Patterns and Practical Considerations
Query Expansion and Adaptivity
Lightweight query expansion, such as adding top-2 synonyms per term from WordNet, effectively closes the lexical gap and boosts hybrid recall, especially for underspecified queries (Mala et al., 28 Feb 2025). Query specificity (average tf–idf or similar) is used for adaptive weighting between dense and sparse retrievers per query.
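A minimal expansion sketch: the hard-coded synonym table is a stand-in for a WordNet lookup, and the `max_syns=2` cap follows the top-2-synonyms guidance above to avoid diluting the query.

```python
# Stand-in synonym table; a real system would query WordNet synsets.
SYNONYMS = {
    "car": ["automobile", "vehicle", "motorcar"],
    "fix": ["repair", "mend", "patch"],
}

def expand_query(query, max_syns=2):
    """Append up to max_syns synonyms per term to close the lexical gap."""
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, [])[:max_syns])
    return " ".join(expanded)

q = expand_query("fix car")  # original terms first, then capped expansions
```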
Adaptive and Learned Fusion
Fusion weights and even retriever selection can be adapted per-query using classifiers over the query content or retrieval results (e.g., BERT-based selectors for "sparse," "dense," or "hybrid" choice) (Arabzadeh et al., 2021). LambdaMART and XGBoost can learn to post-normalize and combine dense/sparse features for robust hybrid reranking (Dadas et al., 2024).
Interpretability and Transparency
Hybrid models that output interpretable sparse terms (e.g., expansions) support retrieval rationales, critical in high-stakes QA and enterprise settings (Biswas et al., 2024). Search agents in hybrid environments enable explicit, interpretable multi-step query refinements (Huebscher et al., 2022).
Scalability and Throughput
Hybrid ANN methods with adaptive coarse-to-fine search, vector pruning, and index unification achieve order-of-magnitude improvements in QPS compared to naïve two-stage or concatenated approaches, without loss of recall (Zhang et al., 2024).
6. Best Practices and Guidelines
- Combine exact lexical retrieval (for entities, phrase queries, and granular control) with dense semantic search (for paraphrase and synonym generalization).
- Use query expansion and per-query adaptivity (dynamic weighting) to increase hybrid effectiveness, but restrict expansion to a small number (e.g., top-2 synonyms) to avoid dilution (Mala et al., 28 Feb 2025).
- Implement hybrid score fusion with robust methods—prefer RRF for scale-invariant aggregation, or weighted sum/learned models if proper normalization is feasible (Wang et al., 2 Aug 2025).
- Build and maintain both dense and sparse indices in parallel, ensuring each is tuned for per-query latency and recall requirements (Ma et al., 2023, Arabzadeh et al., 2021).
- For production, index both dense and sparse representations offline, perform query formulation and weighting online, and design for sub-100 ms total retrieval+fusion latency on moderately sized hardware (Mala et al., 28 Feb 2025, Sultania et al., 2024).
- For multi-lingual or domain-specialized settings, use retrieval-wise knowledge distillation from strong source-LLMs and apply lightweight hybrid rerankers to bridge domain or language gaps (Dadas et al., 2024).
- Avoid adding weak retrieval paths into the hybrid fusion (the "weakest link" effect) without isolated validation (Wang et al., 2 Aug 2025).
Optimal candidate retrieval relies on careful selection, weighting, and fusion of dense and sparse paths, under rigorous cost–quality constraints, with attention to downstream QA, RAG, or recommendation application requirements. The trend is toward increasingly explicit, learned, and unified hybrid retrieval architectures that fuse semantic generalization with the transparent control of symbolic methods, yielding robust, efficient, and highly effective candidate sets.