
BM25-Based Few-Shot Retrieval

Updated 19 July 2025
  • BM25-based few-shot retrieval is a methodology that combines robust lexical matching with adaptable neural enhancements for low-resource settings.
  • It employs query augmentation and reweighting strategies to boost retrieval performance when labeled data is scarce.
  • Hybrid techniques, including LLM-driven re-ranking and zero-shot answer scent generation, further improve search accuracy while maintaining efficiency.

BM25-based few-shot retrieval refers to a family of information retrieval methodologies that combine the robust lexical matching and computational efficiency of the BM25 algorithm with strategies for adaptation, augmentation, or enhancement in low-resource or few-shot settings. BM25’s established position as an efficient and effective sparse term-matching method provides a strong baseline—especially where labeled examples are limited. Recent research explores methods that either improve, hybridize, or augment BM25 to leverage limited supervision, integrate neural models, or inject reasoning via LLMs to improve retrieval quality in few-shot scenarios.

1. Fundamentals of BM25 in Few-Shot and Zero-Shot Retrieval

BM25 is a probabilistic term-matching algorithm that scores a document d with respect to a query q through a weighted sum of term overlaps, modulated by parameters reflecting document length, term frequency, and inverse document frequency (IDF). The core scoring function for document d and query q is:

\mathrm{BM25}(q, d) = \sum_{i \in q} \mathrm{IDF}(i) \cdot \frac{\mathrm{tf}(i, d) \cdot (k_1 + 1)}{\mathrm{tf}(i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}

Despite its age, BM25 remains highly competitive, offering strong out-of-domain generalization and low latency: on collections of approximately 1 million documents, typical index sizes are under 0.4 GB and retrieval latencies are minimal (Thakur et al., 2021). Empirical evaluations across heterogeneous benchmarks (e.g., BEIR's 18 datasets) confirm that BM25's lexical matching is robust in zero-shot settings and frequently outperforms dense neural models trained in-domain but evaluated out-of-domain. This robustness stems from the fact that BM25's term statistics are largely insensitive to distributional drift in vocabulary across domains.
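For concreteness, here is a minimal Python sketch of the scoring function above. The IDF variant (Robertson–Spärck Jones, floored at zero) and the defaults k1 = 0.9, b = 0.4 are common Lucene-style choices, not values prescribed by the papers cited here:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avgdl, k1=0.9, b=0.4):
    """Score one document against a query using the BM25 formula above.

    query_terms, doc_terms: token lists; doc_freq maps term -> number of
    documents containing it; n_docs is the collection size; avgdl is the
    average document length in tokens.
    """
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue  # only overlapping terms contribute
        df = doc_freq.get(term, 0)
        # Robertson-Sparck Jones IDF, floored at zero (one common variant)
        idf = max(0.0, math.log((n_docs - df + 0.5) / (df + 0.5)))
        numerator = tf[term] * (k1 + 1)
        denominator = tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * numerator / denominator
    return score
```

In practice this loop runs over an inverted index rather than per-document token lists, which is what makes BM25's latency so low at scale.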

2. Enhancement Strategies: Query Augmentation and Re-Weighting

Recent work proposes learning to improve BM25 by augmenting and reweighting the query's sparse representation via neural models while retaining the speed and compatibility of inverted-index search (Chen et al., 2023). This is achieved by introducing a learned (continuous) augmentation vector a(q) and weighting vector w(q), both generated by a query encoder. The scoring function becomes:

\mathrm{score}(q, d) = \left(w(q) \odot v \odot (\operatorname{bow}(q) + a(q))\right)^\top f(d)

Where:

  • v is the IDF vector,
  • bow(q) is the binary bag-of-words vector for q,
  • f(d) is the BM25 document term-frequency vector,
  • ⊙ denotes element-wise multiplication.

The model is trained end-to-end using a contrastive loss, and efficiency is ensured by controlling the sparsity of a(q). This approach yielded top-5 accuracy gains of over 12 percentage points on Natural Questions (NQ) compared to vanilla BM25 and demonstrated strong transfer: learned augmentations and weights carried over directly to unseen datasets such as TriviaQA and EntityQuestions, with gains of 2–3 percentage points over BM25. The entire augmentation operates exclusively on the query, leaving the document index unmodified, which makes deployment on top of existing BM25-based systems practical.
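A dense NumPy sketch of this scoring function follows. Variable names mirror the notation above; a production system would keep a(q) sparse and score against an inverted index rather than materializing |V|-sized vectors:

```python
import numpy as np

def augmented_score(bow_q, a_q, w_q, idf_v, f_d):
    """Sketch of score(q, d) = (w(q) ⊙ v ⊙ (bow(q) + a(q)))^T f(d).

    bow_q: binary bag-of-words vector for the query, shape (|V|,)
    a_q:   learned augmentation vector from the query encoder, shape (|V|,)
    w_q:   learned per-term reweighting vector, shape (|V|,)
    idf_v: the IDF vector v, shape (|V|,)
    f_d:   BM25 term-frequency vector for document d, shape (|V|,)
    """
    query_repr = w_q * idf_v * (bow_q + a_q)  # element-wise products
    return float(query_repr @ f_d)            # inner product with the document
```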

3. Hybrid and Interpolated Ranking: BM25 with Neural and Contextualized Models

BM25’s lexical matching is often complementary to neural models’ semantic matching, especially in settings where queries and documents have limited or no lexical overlap. Studies in retrieval for Query-by-Example (QBE)—where queries are long and semantically rich—demonstrate that BM25 competes favorably with contextualized term-based neural models like TILDE and TILDEv2 (Abolghasemi et al., 2022). Linear interpolation of BM25 with contextualized model scores, following z-score normalization, results in statistically significant improvements across multiple QBE benchmarks:

s(q, d) = \alpha \cdot s_{\mathrm{BM25}}(q, d) + (1 - \alpha) \cdot s_{\mathrm{contextualized}}(q, d)

The interpolation parameter α is tuned on a validation set. This hybrid strategy reliably outperforms either component alone, highlighting the non-redundant nature of lexical and contextual signals, a key insight for few-shot and low-supervision domains.
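A sketch of the interpolation step, with z-score normalization applied per query so the two score distributions are comparable (function names are illustrative):

```python
import numpy as np

def zscore(scores):
    """Standardize one query's candidate scores to zero mean, unit variance."""
    s = np.asarray(scores, dtype=float)
    std = s.std()
    return (s - s.mean()) / std if std > 0 else s - s.mean()

def interpolate(bm25_scores, neural_scores, alpha):
    """Combine z-normalized BM25 and contextualized scores for one query.

    Both lists must be aligned on the same candidate documents; alpha is
    tuned on a validation set as described above.
    """
    return alpha * zscore(bm25_scores) + (1 - alpha) * zscore(neural_scores)
```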

4. Integration of Relevance Feedback and Meta-Learning

BM25 can serve as the first-stage retriever in a pipeline where relevance feedback (e.g., user clicks or explicit judgments) is available. Methods such as BM25-QE (query expansion using relevant feedback), followed by neural re-ranking (either kNN-based in the embedding space or via cross-encoder fine-tuning), enable effective adaptation from only a handful of feedback documents (Baumgärtner et al., 2022). The process typically involves these steps:

  1. Run BM25 retrieval and collect top-K documents as candidates.
  2. Gather a small set of judged relevant/non-relevant documents (k per query).
  3. Use the relevant set to expand the query lexically or fine-tune the neural re-ranker (often in a parameter-efficient way, such as bias-only updates).
  4. Fuse the reranked list with BM25-QE using rank fusion methods (e.g., Reciprocal Rank Fusion; see the sketch below).

Notable findings include improvements of approximately 5.2% in nDCG@20 over BM25-QE alone, with further gains as more feedback is incorporated. This approach leverages BM25’s strengths in candidate generation and the neural model’s ability to rapidly tailor to new relevance criteria with minimal data.
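Step 4's fusion can be sketched as follows. This is standard Reciprocal Rank Fusion rather than anything specific to the cited pipeline, and k = 60 is the conventional constant from the original RRF formulation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids (best first) with RRF.

    Each document's fused score is the sum over lists of 1 / (k + rank);
    a document missing from a list simply contributes nothing for it.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Example: fuse the BM25-QE list with the neural re-ranker's list.
# final_ranking = reciprocal_rank_fusion([bm25_qe_ranking, reranker_ranking])
```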

5. BM25 and Listwise Reranking via LLMs

Recent work demonstrates that LLMs, when given the explicit BM25 score for each candidate document (in addition to the document texts and the query), can reason more effectively about document relevance (Seetharaman et al., 17 Jun 2025). The InsertRank approach structures the LLM prompt as an ordered list of document–BM25 score pairs, enabling models like GPT-4 or Gemini to ground their step-by-step reasoning in both semantic and lexical relevance. Empirical evaluation on the BRIGHT and R2MED benchmarks shows improvements of up to 16.3% in nDCG@10 for Gemini 2.5 Flash; scaling BM25 scores to a 0–100 range provides marginal additional benefit. Keeping the candidates in the original BM25 order in the prompt (i.e., not shuffling) yields the best effectiveness. Injecting BM25 scores is especially valuable in few-shot settings, anchoring the LLM's ranking decisions even when labeled data is scarce.
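A sketch of how such a prompt might be assembled; the wording is illustrative, not the published InsertRank template, but it preserves the two properties highlighted above: each candidate is paired with its BM25 score, and the original BM25 ordering is kept.

```python
def build_listwise_prompt(query, candidates):
    """Build a listwise reranking prompt from (doc_id, text, bm25_score) tuples.

    candidates must already be in BM25 order; the paper reports that
    shuffling the list degrades effectiveness.
    """
    lines = [
        f"Query: {query}",
        "Rank the documents below by relevance to the query.",
        "Each document is annotated with its BM25 score.",
    ]
    for i, (_doc_id, text, bm25) in enumerate(candidates, start=1):
        lines.append(f"[{i}] (BM25 = {bm25:.1f}) {text}")
    lines.append("Answer with the document numbers in decreasing relevance.")
    return "\n".join(lines)
```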

6. Zero-Shot and Answer Scent Re-Ranking

BM25 candidate retrieval feeds into advanced reranking techniques based on LLMs' ability to generate an "answer scent": a short text capturing the prototypical answer form for a query (Abdallah et al., 25 Jan 2025). ASRank combines the initial BM25 candidate list with a zero-shot generated answer scent and reranks documents using a cross-attention model (e.g., T5-base) that scores the alignment between each candidate's answer and the answer scent. The method formulates the document score s(d_i) as:

s(d_i) = \sum_{t=1}^{|a|} -\log p(a_t \mid a_{<t}, d_i, q, \mathcal{S}(q); \theta_2)

where a is the candidate answer, S(q) is the answer scent generated for query q, and θ₂ parameterizes the scoring model; lower values correspond to answers that are more predictable given the document, query, and scent.

This mechanism substantially improves Top-1 accuracy, nearly doubling BM25’s performance on datasets like Natural Questions (from 22.1% to 47.3%), while requiring no task-specific fine-tuning. The architecture demonstrates the gains possible by integrating lexical retrieval with LLM-based zero-shot semantic reasoning.
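A minimal sketch of this scoring step using Hugging Face's T5, assuming an input format that concatenates the query, answer scent, and candidate document; the exact formatting and the source of the candidate answer are assumptions, not the paper's specification:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

def answer_scent_score(query, doc, scent, answer):
    """Summed negative log-likelihood of the answer tokens given
    (document, query, answer scent), mirroring s(d_i) above.
    Lower is better: the answer is less surprising given this document.
    """
    # Illustrative input format, not the paper's exact template
    source = f"question: {query} scent: {scent} context: {doc}"
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(answer, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**inputs, labels=labels)
    # out.loss is the mean per-token NLL; scale by length to get the sum
    return out.loss.item() * labels.size(1)
```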

7. Practical Implications and Limitations

  • Generalization and Adaptability: BM25 remains highly robust to domain and distribution shifts, making it an excellent backbone in few-shot regimes, especially when out-of-domain queries are frequent (Thakur et al., 2021). Its static term statistics provide advantageous generalization in the absence of dense in-domain training.
  • Efficiency: Sparse, inverted-index retrieval enables low-latency operation, and query-side augmentation does not burden document indexing (Chen et al., 2023).
  • Hybrid Systems: Combining BM25 with neural models (via interpolation, rank fusion, or embedding feedback) yields consistent performance improvements and can be tuned for computational resource budgets.
  • Limitation: Pure BM25 may miss non-lexical (semantic) matches; methods that inject neural signal—via augmentation, feedback, or LLM semantic reasoning—are particularly valuable in semantic, multi-hop, or cross-lingual tasks.

Summary Table: Key BM25-Based Few-Shot Retrieval Methods

| Approach | BM25 Role | Few-Shot Adaptation Mechanism |
| --- | --- | --- |
| Query Augmentation & Reweighting (Chen et al., 2023) | Core scorer with query-side enhancement | Learned augmentation vector a(q) and reweighting w(q) via a neural query encoder |
| Contextual-Hybrid Interpolation (Abolghasemi et al., 2022) | Score fusion component | Linear interpolation with neural contextualized scores |
| Relevance Feedback Re-Ranking (Baumgärtner et al., 2022) | Candidate generator | Query expansion and a neural re-ranker fine-tuned on a few judged documents |
| InsertRank LLM Listwise Reranker (Seetharaman et al., 17 Jun 2025) | Input feature for the LLM | Prompt lists (document, BM25 score) pairs; the LLM reasons over both |
| ASRank Zero-Shot Answer Scent (Abdallah et al., 25 Jan 2025) | First-stage retriever | Re-ranking with an LLM-generated answer scent and cross-attention scoring |

BM25-based few-shot retrieval encompasses a flexible ecosystem, where efficient lexical retrieval anchors the search, and adaptive mechanisms—ranging from neural query augmentation to LLM-driven listwise re-ranking—successfully enhance efficacy in scenarios with limited supervision or rapid domain shift. This evolving methodology demonstrates that classical sparse methods remain foundational, especially when carefully integrated with contemporary neural and meta-learning approaches.
