
State-of-the-Art Retrieval Models

Updated 10 September 2025
  • State-of-the-Art Retrieval Models are advanced systems that leverage neural networks, contextual embeddings, and LLMs to enhance document retrieval precision.
  • They employ methodologies such as bi-encoder dense retrieval, late-interaction, and neural sparse models, demonstrating empirical performance gains on established benchmarks.
  • These models integrate techniques like query rewriting and cascading architectures to balance accuracy, efficiency, and scalability in complex search applications.

State-of-the-art retrieval models are advanced systems designed to retrieve relevant documents or items from large corpora in response to natural language queries. Over recent years, retrieval modeling has evolved from simple lexical matching schemes (e.g., BM25) to neural architectures that learn contextual semantics, employ multi-stage training and interaction mechanisms, and tightly integrate with LLMs. This entry surveys the core paradigms, principal architectures, recent empirical advances, evaluation results, and unresolved challenges in contemporary retrieval modeling.

1. Foundations and Evolution of Retrieval Modeling

Retrieval models traditionally began with term-matching algorithms, such as the vector space model and BM25. BM25 computes an exact-match relevance score between queries and documents using a normalized bag-of-words representation, balancing term frequency, inverse document frequency, and document length normalization:

$$\text{score}(Q, D) = \sum_{q_i \in Q} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
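To make the formula concrete, here is a minimal self-contained Python sketch of BM25 scoring over a pre-tokenized corpus. The smoothed IDF variant used here is one common choice; the formula above leaves the exact IDF definition open.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """BM25 score of one tokenized document against a tokenized query.
    `corpus` (a list of tokenized documents) supplies IDF statistics and avgdl."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in corpus if q in d)            # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF (one common variant)
        f = tf[q]                                        # term frequency in this document
        norm = k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * (f * (k1 + 1)) / (f + norm)
    return score

corpus = [["neural", "retrieval", "models"],
          ["bm25", "is", "a", "lexical", "baseline"],
          ["dense", "retrieval", "uses", "embeddings"]]
print(bm25_score(["neural", "retrieval"], corpus[0], corpus))
```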

Sparse retrieval approaches, while efficient, are limited by their inability to bridge semantic gaps and are sensitive to vocabulary mismatches. The transition to neural retrieval saw the rise of models employing deep contextualized representations. Early architectures, including DSSM and ARC-I, encoded queries and documents to fixed-length vectors, but their performance lagged behind established baselines such as BM25 on ad-hoc retrieval tasks (Pang et al., 2016).

Advancements in pre-trained language models such as BERT, T5, and ERNIE catalyzed the adoption of dense retrieval, late-interaction, and neural sparse retrieval frameworks. These models are characterized by transformer encoders trained with large-scale, multi-stage objectives and, more recently, by the integration of LLMs for both query rewriting and direct relevance scoring (Liu et al., 2021, Oosterhuis et al., 16 Apr 2025, Wu et al., 8 Apr 2024, Feng et al., 2023, Killingback et al., 8 Sep 2025).

2. Major Neural Retrieval Paradigms

2.1 Bi-Encoder Dense Retrieval

Dense bi-encoder retrieval systems, exemplified by Dense Passage Retrieval (DPR), independently encode queries and documents into contextualized dense vectors using transformer backbones (e.g., BERT):

$$E(q) = \text{Encoder}_Q(q), \qquad E(d) = \text{Encoder}_D(d)$$

$$\text{score}(q, d) = E(q)^\top E(d)$$

Training typically uses a contrastive loss over sampled negatives to encourage discrimination between relevant and non-relevant documents. These systems underpin retrieval pipelines in modern toolkits and search engines, with state-of-the-art effectiveness demonstrated on MS MARCO and BEIR (Sil et al., 2023, Farivar, 25 Aug 2025, Hambarde et al., 2023, Liu et al., 2021).
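The following is a minimal sketch of the bi-encoder scoring pattern. The hash-based features and random projection are toy stand-ins for trained transformer query/document towers, used only to keep the example self-contained; a real system precomputes the document matrix offline and searches it with an ANN index.

```python
import numpy as np

# Stand-in encoder: hashed bag-of-words features plus a fixed random
# projection replace a trained transformer tower for illustration.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 128))  # illustrative shared projection

def encode(text: str) -> np.ndarray:
    feats = np.zeros(1024)
    for tok in text.lower().split():
        feats[hash(tok) % 1024] += 1.0
    v = feats @ W
    return v / (np.linalg.norm(v) + 1e-9)  # unit-normalize so dot product = cosine

docs = ["dense passage retrieval encodes passages offline",
        "bm25 ranks by lexical overlap",
        "late interaction keeps token-level vectors"]
doc_matrix = np.stack([encode(d) for d in docs])  # the precomputed index

query_vec = encode("how does dense passage retrieval work")
scores = doc_matrix @ query_vec  # score(q, d) = E(q)^T E(d)
print(docs[int(np.argmax(scores))])
```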

2.2 Late-Interaction and Multi-Vector Models

Late-interaction architectures (e.g., ColBERT, XTR (Lee et al., 2023)) represent queries and documents at the token level, preserving fine-grained alignment for improved expressiveness:

$$f_{\text{ColBERT}}(Q, D) = \frac{1}{n} \sum_{i=1}^{n} \max_{1 \leq j \leq m} \left( q_i^\top d_j \right)$$

Recent work with XTR shows that optimizing token retrieval directly (rather than scoring over all document tokens) achieves comparable or better performance than ColBERT while reducing inference cost by several orders of magnitude (Lee et al., 2023). These models are particularly effective in domains requiring precise term alignment, such as mathematical retrieval (Zhong et al., 2022).
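A minimal sketch of the MaxSim computation above, with random unit vectors standing in for contextualized token embeddings:

```python
import numpy as np

def maxsim_score(Q: np.ndarray, D: np.ndarray) -> float:
    """Late-interaction (MaxSim) score matching the formula above.
    Q: (n, dim) query token embeddings; D: (m, dim) document token
    embeddings; both assumed unit-normalized."""
    sim = Q @ D.T                          # (n, m) token-to-token similarities
    return float(sim.max(axis=1).mean())   # max over doc tokens, mean over query tokens

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 128));  Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D = rng.normal(size=(50, 128)); D /= np.linalg.norm(D, axis=1, keepdims=True)
print(maxsim_score(Q, D))
```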

2.3 Neural Sparse Retrieval and LSR Frameworks

Learned sparse retrieval (LSR) methods—such as SPLADE, DeepCT, and uniCOIL—train models to induce sparse lexical representations:

$$w_q = f_Q(q), \quad w_d = f_D(d), \quad \text{sim}(q, d) = \sum_{i=1}^{|V|} w_i(q) \cdot w_i(d)$$

These methods offer the efficiency of inverted index-based retrieval while leveraging neural models to assign importance weights to terms. FLOPs regularization (Nguyen et al., 2023) and document/query expansion strategies further optimize effectiveness and efficiency trade-offs.
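A toy sketch of the sparse scoring pattern follows; real LSR models learn the term weighting (and expand to related terms), whereas this stand-in simply counts observed vocabulary terms.

```python
# Toy learned-sparse-retrieval scoring: each text maps to a sparse
# |V|-dimensional weight vector. A trained model (SPLADE, DeepCT, uniCOIL)
# would assign learned weights and expand to unseen-but-related terms.
VOCAB = {"neural", "retrieval", "sparse", "dense", "index", "inverted"}

def sparse_weights(text: str) -> dict:
    w = {}
    for tok in text.lower().split():
        if tok in VOCAB:
            w[tok] = w.get(tok, 0.0) + 1.0
    return w

def sparse_sim(wq: dict, wd: dict) -> float:
    # sim(q, d) = sum_i w_i(q) * w_i(d); only shared nonzero terms matter,
    # which is exactly what an inverted index exploits at query time.
    return sum(wq[t] * wd.get(t, 0.0) for t in wq)

wq = sparse_weights("sparse retrieval")
wd = sparse_weights("neural sparse retrieval with an inverted index")
print(sparse_sim(wq, wd))
```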

SPLADE-based models with full expansion and weighting (designated Group D in the taxonomy of Nguyen et al., 2023) achieve the highest mean reciprocal rank (MRR) on benchmarks such as MS MARCO, but the same work demonstrates that careful ablation (e.g., removing query expansion) can reduce retrieval latency by 74% with negligible impact on effectiveness (Nguyen et al., 2023).

2.4 Cross-Encoder Models and Global Interaction

Cross-encoder paradigms (e.g., MonoT5 (Farivar, 25 Aug 2025)) concatenate query and document as a single input to an encoder-decoder transformer, directly modeling inter-sequence attention:

Input: "Query: q Document: d Relevant:"

Target: "true" / "false"

These models achieve superior ranking accuracy due to rich query-document interaction, but incur prohibitive cost at scale, restricting them to re-ranking scenarios (Farivar, 25 Aug 2025, Abdallah et al., 22 Aug 2025).
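As a sketch of this scoring recipe, the following uses the published castorini/monot5-base-msmarco checkpoint via Hugging Face transformers, reading relevance off the logits of the first decoded token; treat the decoding details here as one reasonable implementation rather than the definitive one.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "castorini/monot5-base-msmarco"  # published MonoT5 checkpoint
tok = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name).eval()

def monot5_score(query: str, doc: str) -> float:
    inp = tok(f"Query: {query} Document: {doc} Relevant:",
              return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model.generate(**inp, max_new_tokens=1,
                             output_scores=True, return_dict_in_generate=True)
    logits = out.scores[0][0]  # vocab logits for the first decoded token
    true_id, false_id = tok.encode("true")[0], tok.encode("false")[0]
    probs = torch.softmax(torch.stack([logits[true_id], logits[false_id]]), dim=0)
    return probs[0].item()     # P("true") as the relevance score

print(monot5_score("what is dense retrieval",
                   "Dense retrieval encodes text into vectors."))
```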

3. LLM Integration and Compound Retrieval Systems

3.1 LLM-Augmented Retrieval

Recent advances integrate LLMs both as generators of synthetic context (e.g., doc-level embeddings, query rewriting) and as direct rerankers or relevance predictors in compound retrieval systems (Oosterhuis et al., 16 Apr 2025, Wu et al., 8 Apr 2024, Abdallah et al., 22 Aug 2025, Killingback et al., 8 Sep 2025). Augmentation strategies include:

  • Generating synthetic queries and titles per document via LLMs, used to enhance doc-level representations for bi-encoder and token-level models; recall@3 and recall@10 improve significantly on LoTTE and BEIR datasets (Wu et al., 8 Apr 2024). A minimal sketch of this strategy follows the list.
  • Iteratively refining input queries and ranking order in frameworks such as InteR, combining retrieval model (RM) and LLM outputs via prompt enrichment and retrieval-augmented demonstrations (Feng et al., 2023).
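The sketch below illustrates the first strategy. The generate function is a deliberately unimplemented placeholder for whatever LLM client is available, and the prompt wording is illustrative, not taken from the cited papers.

```python
# `generate` is a placeholder for an arbitrary LLM client.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM call here")

def augment_document(doc_text: str, n_queries: int = 3) -> str:
    """Enrich a document with LLM-generated queries and a title before
    indexing; the enriched text is what the retriever encodes."""
    synthetic_queries = [
        generate(f"Write a search query this passage answers:\n{doc_text}")
        for _ in range(n_queries)
    ]
    title = generate(f"Write a short title for this passage:\n{doc_text}")
    return "\n".join([title, doc_text] + synthetic_queries)
```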

3.2 Compound and Cascading Architectures

Compound retrieval systems generalize classic cascades by learning selection policies and aggregation functions to combine outputs from multiple predictive models (e.g., BM25, LLM pointwise, and pairwise predictors) (Oosterhuis et al., 16 Apr 2025). These systems:

  • Optimize a loss function that interpolates effectiveness (e.g., nDCG) and computational cost, balancing the number of LLM inference calls with ranking quality.
  • Permit flexible design, where pointwise and pairwise LLM predictions are selectively acquired and aggregated via learned, differentiable functions. Up to an order of magnitude reduction in LLM call cost is achieved for comparable nDCG relative to standard cascades.

Empirical results show that compound systems outperform both pointwise and pairwise cascades on TREC re-ranking tasks, and can be optimized even in self-supervised settings via ranking distillation losses.
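The learned selection policies in these systems go beyond any fixed rule, but the core cost-capping idea can be sketched with a simple budgeted cascade; llm_pointwise_score is a placeholder, and the fixed top-k cutoff is a simplification of the learned policies described above.

```python
# `llm_pointwise_score` is a placeholder for an LLM relevance predictor.
def llm_pointwise_score(query: str, doc: str) -> float:
    raise NotImplementedError("plug in an LLM relevance call here")

def compound_rank(query, docs, cheap_scores, llm_budget=10):
    """Rank documents while spending at most `llm_budget` LLM calls."""
    order = sorted(range(len(docs)), key=lambda i: cheap_scores[i], reverse=True)
    head, tail = order[:llm_budget], order[llm_budget:]
    # Expensive predictions are acquired only for the head of the cheap
    # (e.g., BM25) ranking; the tail keeps its first-stage order.
    head = sorted(head, key=lambda i: llm_pointwise_score(query, docs[i]),
                  reverse=True)
    return head + tail
```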

4. Retrieval for Multimodal and Complex Information Needs

4.1 Multimodal Retrieval

Contemporary systems incorporate multimodal retrieval by aligning text and other modalities, such as images and speech, within a shared embedding space. Examples include CLIP and BLIP for lifelog and image retrieval (Tran et al., 7 Jun 2025), and multi-modal retrieval models for speech recognition via kNN-LM and cross-attention adapters (Kolehmainen et al., 13 Jun 2024). The latter achieves up to a 50% reduction in word error rate on the Spoken-SQuAD dataset over non-retrieval-augmented LMs.
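As a sketch of shared-embedding-space retrieval, the following uses the public openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the image path is hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a person cooking dinner", "a mountain hike at sunrise"]
image = Image.open("lifelog_frame.jpg")  # hypothetical local image file

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
# Text and image embeddings live in one space; logits_per_image holds the
# scaled cosine similarities used to rank the texts against the image.
print(out.logits_per_image.softmax(dim=-1))
```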

4.2 Complex Query Handling and Limitations

Benchmarks designed for complex retrieval tasks, where queries carry multiple constraints, logical operations, or domain-specific requirements, reveal the limitations of current state-of-the-art models (Killingback et al., 8 Sep 2025). On CRUMB, even top-tier models such as GTE Qwen 7B and Promptriever reach a highest nDCG@10 of only 0.346 and an R@100 of 0.587. LLM-based augmentation (query rewriting or expansion) yields mixed outcomes: some gains for weaker models but decreased performance for the strongest, suggesting sensitivity to the distribution shifts introduced by rewriting.

5. Comparative Evaluation, Challenges, and Future Directions

Empirical comparisons demonstrate that no single model family yet dominates all aspects of retrieval. Key comparisons include:

| Model Type | Effectiveness (nDCG/MRR) | Scalability | Robustness/Challenges |
|---|---|---|---|
| Sparse (BM25, TF-IDF) | Robust baseline | Very high | Handles lexical queries; limited on semantic matching |
| Dense bi-encoder (DPR, ERNIE) | Superior to sparse on common queries | High (ANN search) | Struggles on complex/multi-faceted queries |
| Late-interaction (ColBERT, XTR) | High (token-level nuance) | Moderate | Higher memory and compute |
| Neural sparse (SPLADE) | State-of-the-art among efficient models | High (inverted index) | Careful regularization/weighting crucial |
| Cross-encoder (MonoT5) | Highest accuracy (re-ranking) | Low (costly) | Not scalable without cascade/compound systems |
| LLM-based rerankers (RankGPT, PRP, etc.) | Highest on familiar queries | Low to moderate | Limited generalization to novel queries |
| Compound/hybrid (BM25+LLM, InteR, etc.) | Best trade-off | High (if optimized) | Design complexity; cost-metric tuning |

Challenges for current and future research include:

  • Improving generalization to complex, multi-aspect, and low-resource queries, as evidenced by subpar results on CRUMB (Killingback et al., 8 Sep 2025).
  • Addressing scalability in models with expensive interaction mechanisms (ColBERT, Cross-Encoder) or high LLM call costs (pairwise LLMs, listwise rerankers).
  • Enhancing robustness against training data overlap; performance drops by up to 15% are observed in FutureQueryEval for LLM rerankers benchmarked on unseen queries (Abdallah et al., 22 Aug 2025).
  • Integrating and optimizing multimodal signals, event-level coherence, and more effective context utilization (including via advanced doc-level embeddings, adaptive negative sampling, and better field aggregation) (Wu et al., 8 Apr 2024, Tran et al., 7 Jun 2025).
  • Developing new evaluation protocols and open benchmarks specifically tailored for complex, realistic retrieval scenarios (Killingback et al., 8 Sep 2025).

6. Practical Applications and System Integration

State-of-the-art retrieval systems are broadly deployed in web search (e.g., Baidu with ERNIE (Liu et al., 2021)), question answering (PrimeQA (Sil et al., 2023)), cross-modal search (LSC (Tran et al., 7 Jun 2025)), legal and biomedical retrieval (CRUMB (Killingback et al., 8 Sep 2025)), speech recognition (Kolehmainen et al., 13 Jun 2024), and more. System architectures rely on staged retrieval (a fast, high-recall first stage followed by a precision-oriented second stage) and hybridize sparse, dense, and LLM-based models with pipeline management for efficiency and user experience (Oosterhuis et al., 16 Apr 2025, Mozolevskyi et al., 3 May 2024). Emerging frameworks such as compound retrieval and LLM augmentation are providing improved trade-offs between accuracy, responsiveness, cost, and interpretability in large-scale settings.
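One common way to hybridize sparse and dense first-stage rankings before a precision stage is reciprocal rank fusion (RRF). This is a generic technique rather than necessarily what the cited systems use, and k=60 is the constant typical in the RRF literature.

```python
def rrf(rankings, k=60):
    """Fuse several rankings (lists of doc ids, best first) into one."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# e.g., fuse a BM25 ranking with a dense-ANN ranking before re-ranking
print(rrf([["d3", "d1", "d2"],
           ["d1", "d3", "d4"]]))
```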


Contemporary retrieval models encompass a broad spectrum of architectures from classic sparse methods to cross-encoder, token-interaction, and LLM-augmented frameworks. While significant progress has been made in recall, precision, and flexibility, rigorous benchmark evaluations reveal that notable gaps remain for complex and multifaceted queries, especially in generalization and cost-effectiveness. Ongoing research on compound system optimization, multimodal integration, and context-enriched embedding is expected to drive the next wave of innovation in this pivotal area of information access.