Lexical Reranking of Semantic Retrieval (LeSeR)
- The paper introduces LeSeR, a hybrid model that combines BM25’s exact term matching with dense semantic representations to improve both recall and mAP.
- LeSeR employs a two-stage process: first retrieving semantically similar candidates using dense vectors, then reranking them based on BM25 scores for precision.
- Empirical results demonstrate that LeSeR outperforms single-modality methods, achieving up to a 16.3% nDCG improvement in high-stakes domains.
Lexical Reranking of Semantic Retrieval (LeSeR) is a hybrid retrieval paradigm that integrates classical lexical matching (typically BM25) with dense semantic representations to enhance both recall and precision in information retrieval. LeSeR has become central to workflows in both traditional and LLM-based pipelines, particularly where retrieval-augmented generation or domain-sensitive precision is required. By explicitly combining dense vector similarity with exact term-match scores during reranking, LeSeR consistently outperforms single-modality retrieval systems on regulatory, general QA, scientific, and medical reasoning tasks (Purbey et al., 2024, Gao et al., 2020, Kuzi et al., 2020, Seetharaman et al., 17 Jun 2025).
1. Motivation and Core Principles
The limitations of dense semantic retrieval—primarily, its tendency to smooth over exact token matches and potentially "hallucinate" relevance for textually divergent yet conceptually similar passages—create vulnerabilities in high-stakes domains that rely on specific statutory or technical terminology. Conversely, lexical retrieval such as BM25 excels at matching key terms but suffers substantial recall drops for paraphrastic or idiomatic queries. LeSeR targets this complementarity: dense retrieval ensures high recall by retrieving semantically relevant passages, after which BM25 scoring reintroduces precision via lexical reranking, capturing relevance gated on exact term presence (Purbey et al., 2024, Kuzi et al., 2020, Seetharaman et al., 17 Jun 2025).
This mechanism is critical in regulatory and scientific domains, where high mAP and recall must both be achieved, and where naive dense or sparse methods alone fail to meet user or task-level requirements.
2. System Architecture and Algorithmic Framework
LeSeR can be instantiated in both classical reranker architectures and LLM-centric pipelines. A canonical, non-LLM architecture, as described in the RIRAG shared task (Purbey et al., 2024), consists of the following stages:
- Preprocessing & Indexing: Input corpora are segmented into passages, which are indexed with both FAISS (for dense search) and an inverted index (for BM25).
- Semantic Candidate Retrieval: Given a query $q$, a fine-tuned embedder $E$ produces $\mathbf{q} = E(q)$ and retrieves the top-$k$ (e.g., $k = 20$) passages by cosine similarity using the FAISS index.
- Lexical Reranking: For each candidate $d_i$ from the top-$k$ list, compute $s_{\mathrm{BM25}}(q, d_i)$ as the BM25 score. Then assign the final LeSeR score
  $$s_{\mathrm{LeSeR}}(q, d_i) = \alpha \cdot \cos(\mathbf{q}, \mathbf{d}_i) + (1 - \alpha) \cdot s_{\mathrm{BM25}}(q, d_i),$$
  with $\alpha$ tuned on dev data.
- Selection for Downstream QA: Return the top-$m$ (e.g., $m = 10$) reranked passages for next-stage answer extraction or generation.
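The two-stage flow above can be sketched in plain Python. FAISS and a real BM25 index are replaced here by an in-memory cosine routine and precomputed BM25 scores, and the interpolation weight `alpha` is an illustrative placeholder, not the tuned value:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def leser_rerank(q_vec, candidates, alpha=0.7, top_m=10):
    """Stage 2 of LeSeR: blend dense similarity with normalized BM25.

    candidates: dicts with keys 'id', 'vec' (dense embedding), and
    'bm25' (precomputed score). alpha is tuned on dev data in the
    cited systems; 0.7 here is only a placeholder.
    """
    bm25s = [c["bm25"] for c in candidates]
    lo, hi = min(bm25s), max(bm25s)
    span = (hi - lo) or 1.0
    scored = []
    for c in candidates:
        lex = (c["bm25"] - lo) / span      # min-max normalize BM25
        sem = cosine(q_vec, c["vec"])      # dense similarity to the query
        scored.append((alpha * sem + (1 - alpha) * lex, c["id"]))
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:top_m]]
```

Shifting `alpha` toward 1 trusts the dense retriever; shifting it toward 0 defers to exact term overlap.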
This structure is retained, modulo adaptation, for systems that further leverage LLMs for listwise reranking, such as InsertRank (Seetharaman et al., 17 Jun 2025). There, (query, document, BM25) tuples are injected into a listwise prompt, and the LLM outputs a relevance ranking, using BM25 as an explicit lexical cue during its step-by-step rationale.
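A minimal illustration of how such (query, document, BM25) tuples might be assembled into a listwise prompt; the template below is hypothetical, not the exact InsertRank prompt:

```python
def build_insertrank_prompt(query, candidates):
    """Format (doc, BM25) tuples into a listwise reranking prompt.

    The exact InsertRank template is not reproduced here; this is an
    illustrative format that exposes raw BM25 as a lexical cue the
    LLM can attend to during ranking.
    """
    lines = [
        f"Query: {query}",
        "Rank the passages below from most to least relevant.",
        "Each passage carries its BM25 lexical-match score as a hint.",
    ]
    for i, (doc, bm25) in enumerate(candidates, start=1):
        lines.append(f"[{i}] (BM25={bm25:.2f}) {doc}")
    lines.append("Answer with the passage numbers in ranked order.")
    return "\n".join(lines)
```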
3. Semantic and Lexical Scoring Components
LeSeR is predicated on dual scoring:
- Semantic Similarity: Embedding models $E$, selected from architectures such as BGE, Stella, MPNet, or domain-adapted derivatives, are trained via Multiple Negatives Symmetric Ranking (MNSR) loss or analogous contrastive regimes (Purbey et al., 2024, Gao et al., 2020). Embeddings yield the semantic score $s_{\mathrm{sem}}(q, d) = \cos(E(q), E(d))$.
- Lexical Similarity (BM25): BM25 provides a score derived from term frequency, inverse document frequency, length normalization, and the smoothing hyperparameters $k_1$ and $b$. This captures direct string overlap, essential in domains where term presence precisely dictates relevance.
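The BM25 component can be written out directly. This is the textbook Okapi BM25 formula with the common defaults $k_1 = 1.2$, $b = 0.75$; the cited systems may use library implementations with different defaults:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Textbook Okapi BM25 for a single (query, document) pair.

    corpus: list of tokenized documents, used for IDF and average
    document length. k1 controls term-frequency saturation; b controls
    document-length normalization.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    dl = len(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc_terms.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```

A document with no query terms scores exactly zero, which is the failure mode for paraphrastic queries that dense retrieval compensates for.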
When used in an LLM reranking pipeline, the BM25 signal is concatenated to each candidate as a normalized scalar feature, to which the LLM can directly attend during listwise evaluation (Seetharaman et al., 17 Jun 2025).
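Min-max scaling is one simple way to produce such a normalized scalar; the exact normalization used is not specified in the source:

```python
def normalize_bm25(scores):
    """Min-max scale raw BM25 scores to [0, 1] so an LLM (or a linear
    combiner) sees one comparable scalar per candidate."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # identical scores carry no signal
    return [(s - lo) / (hi - lo) for s in scores]
```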
4. Empirical Evaluation and Quantitative Results
Multiple studies benchmark LeSeR and variants of hybrid retrieval on established datasets. In the RIRAG test set for regulatory QA (Purbey et al., 2024):
| System | Recall@10 | mAP@10 |
|---|---|---|
| BM25 baseline | 0.7611 | 0.6237 |
| BGE (off-the-shelf) | 0.7040 | 0.0960 |
| BGE (MNSR-tuned) | 0.8068 | 0.1077 |
| BGE_LeSeR | 0.8201 | 0.6655 |
Fine-tuning BGE on MNSR boosts recall ~10 points (0.704 → 0.807), but mAP remains low until BM25-based reranking is applied, whereupon mAP jumps by >0.55, surpassing both pure semantic and pure lexical baselines.
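The per-query metrics behind the table can be computed as follows (mAP@10 is the mean of AP@10 over all queries); a minimal sketch:

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant passages found in the top-k."""
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / len(relevant)

def average_precision_at_k(ranked, relevant, k=10):
    """AP@k: precision at each relevant hit, averaged over min(|rel|, k)."""
    score, hits = 0.0, 0
    for i, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            score += hits / i  # precision at this cutoff
    return score / min(len(relevant), k) if relevant else 0.0
```

AP rewards placing relevant passages early, which is why reranking lifts mAP even when the candidate set (and hence recall) is unchanged.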
InsertRank (Seetharaman et al., 17 Jun 2025), a listwise LLM-based instantiation of LeSeR, further demonstrates improvements across models and corpora:
| Model | BRIGHT nDCG@10 | R2MED nDCG@10 |
|---|---|---|
| BM25 only | 0.148 | 0.377 |
| LLM (no BM25) | 0.334–0.372 | 0.480–0.525 |
| + LeSeR (raw BM25) | 0.342–0.375 | 0.484–0.537 |
This pattern recurs over both classical listwise reranking and LLM-augmented methods: introducing explicit lexical relevance signals consistently yields relative gains (+2–3% nDCG, up to +16.3% on specific domains).
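nDCG@10, the metric reported in the table above, can be sketched from graded relevance gains:

```python
import math

def ndcg_at_k(ranked_gains, ideal_gains, k=10):
    """nDCG@k: discounted cumulative gain of the produced ranking,
    normalized by the DCG of the ideal (descending-gain) ordering."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0
```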
5. Algorithmic Variants and Design Choices
LeSeR admits several architecturally distinct implementations:
- Post-hoc Linear Interpolation: Lexical and semantic scores combined post-retrieval using fixed or grid-searched weights, as in (Purbey et al., 2024, Kuzi et al., 2020).
- Adaptive or Learned Fusion: Learning-to-rank models applied atop the lexical and semantic scores, or inclusion of additional hand-crafted, reciprocal-rank, or structural features (Kuzi et al., 2020, Purbey et al., 2024).
- Listwise LLM Reranking ("InsertRank"): BM25 scores explicitly injected into LLM prompts. LLM performs holistic, side-by-side ranking with lexical and semantic information visible (Seetharaman et al., 17 Jun 2025).
Ablation studies confirm that reintroducing lexical scores—even as simple normalized scalars—prevents certain failure modes of dense retrieval, e.g., when embedding models under-rank passages with rare but crucial terms.
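For the post-hoc interpolation variant, the grid-searched weight can be selected with a simple sweep over dev queries; all function names here are illustrative, not from the cited systems:

```python
def grid_search_alpha(dev_queries, score_fn, metric_fn, grid=None):
    """Pick the interpolation weight alpha that maximizes a dev metric.

    score_fn(query, alpha)  -> the reranked output for that weight
    metric_fn(query, out)   -> scalar quality score (e.g., AP@10)
    """
    grid = grid or [i / 10 for i in range(11)]
    best_alpha, best_score = None, float("-inf")
    for alpha in grid:
        avg = sum(metric_fn(q, score_fn(q, alpha)) for q in dev_queries) / len(dev_queries)
        if avg > best_score:
            best_alpha, best_score = alpha, avg
    return best_alpha, best_score
```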
6. Limitations, Failure Modes, and Future Research Directions
LeSeR's decoupled approach displays residual limitations:
- Semantic-only Recall versus mAP: Dense retrievers, even when fine-tuned, exhibit low mAP if not reranked lexically (Purbey et al., 2024).
- Over-reliance on BM25: Injecting high BM25 scores can cause rerankers—including LLMs—to over-emphasize boilerplate or repetitive text, especially if domain-specific noise is present (Seetharaman et al., 17 Jun 2025).
- Order Sensitivity: Prompt-based reranking with LLMs is sensitive to candidate order; BM25 injection mitigates, but does not eliminate, this effect (Seetharaman et al., 17 Jun 2025).
- Domain Adaptation: Dense models may require careful domain-specific pretraining/adaptation to avoid semantic drift or loss of coverage for regulatory jargon.
Future directions include dynamic, per-query weighting of signals (potentially LLM-predicted or reinforcement-learned), integration with supplementary features (e.g., structural cues, section headings), end-to-end learning-to-rank atop multimodal signals, and comprehensive error-driven reranker training (Purbey et al., 2024, Seetharaman et al., 17 Jun 2025).
7. Broader Context and Related Hybrid Models
LeSeR exemplifies a broader trend toward tightly integrated hybrid retrieval. Comparable approaches include CLEAR (Gao et al., 2020), which induces residual embeddings to directly complement BM25, and hybrid retrieval via parallel, union, or pseudo-relevance feedback (e.g., RM3). Empirical evidence across newswire, regulatory, scientific, and medical QA benchmarks consistently demonstrates substantial improvements in recall, mAP, and nDCG from these hybrid strategies (Kuzi et al., 2020, Gao et al., 2020, Purbey et al., 2024, Seetharaman et al., 17 Jun 2025).
A plausible implication is that lexical reranking of fixed-size semantic candidate sets (as in LeSeR) offers a practical and robust tradeoff between cost and retrieval effectiveness, enabling both high coverage (via dense models) and high precision (via exact matching). The injection of signals into LLM inference provides an additional avenue for soft-constraint enforcement at scale.