
BM25-Based Few-Shot Retrieval

Updated 1 July 2025
  • BM25-Based Few-Shot Retrieval is an information retrieval technique that uses the BM25 lexical matching algorithm to effectively guide retrieval with only a few labeled examples.
  • It employs plug-and-play indexing and retrieval-augmented example selection to adapt to tasks like intent classification and query-by-example, achieving significant accuracy improvements.
  • Hybrid approaches integrate BM25 with semantic signals, reducing computational overhead while enhancing performance in out-of-domain and dynamic low-resource environments.

BM25-Based Few-Shot Retrieval refers to information retrieval techniques that leverage the BM25 lexical matching algorithm within few-shot learning scenarios, where only a small number of labeled examples guide retrieval or adaptation to new tasks, domains, or classes. Such approaches are highly relevant in data-constrained, dynamically evolving, or heterogeneous environments where labeled resources are scarce and task orientation is uncertain. The strategy encompasses both classic BM25 pipeline designs and their modern extensions, as well as their role as a robust baseline and component within hybrid systems for few-shot, zero-shot, and rapid adaptation contexts.

1. Core Principles and Methodologies

BM25 is a token-based probabilistic retrieval function that scores documents by combining term frequency, inverse document frequency (IDF), and document-length normalization:

$$\mathrm{BM25}(q, D) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avg}(|D|)}\right)}$$

where $f(q_i, D)$ is the frequency of query term $q_i$ in document $D$, $k_1$ and $b$ are hyperparameters, $|D|$ is the document length, and $\mathrm{avg}(|D|)$ is the average document length in the collection.
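
The scoring function can be written directly from the formula above. The following minimal Python sketch uses the widely used smoothed-IDF variant and illustrative hyperparameter values ($k_1 = 1.2$, $b = 0.75$); the toy corpus and whitespace tokenization are assumptions for demonstration only:

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25.

    `corpus` is the full list of tokenized documents; it supplies the
    document frequencies for IDF and the average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        # document frequency: number of documents containing the term
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        # common smoothed IDF variant (one of several in use)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

corpus = [doc.lower().split() for doc in [
    "book a table for two at an italian restaurant",
    "play some jazz music in the living room",
    "what is the weather like in Berlin tomorrow",
]]
query = "reserve a table at a restaurant tonight".lower().split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)  # highest score for the restaurant-booking utterance
```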

In few-shot retrieval, BM25 is exploited in two primary modes:

  • Plug-and-Play Indexing: New domains, classes, or labels are incorporated by simply adding labeled examples to the BM25 index, without retraining. Both intent classification and slot filling can be adapted through this mechanism, though BM25 is stronger for intent classification due to its reliance on lexical overlap (2104.05763). A minimal sketch of this mode appears after this list.
  • Retrieval-Augmented Example Selection: For few-shot in-context learning or demonstration retrieval, BM25 finds the most similar prior samples (e.g., input-label pairs) to be presented alongside test queries, enhancing model performance in downstream few-shot tasks such as LLM-based extraction (2408.04665).
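
The following sketch illustrates the plug-and-play mode for few-shot intent classification. It assumes the open-source rank_bm25 package; the labeled utterances and label-voting scheme are illustrative choices rather than the exact setup of the cited work:

```python
from collections import Counter
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Few labeled support examples per intent: the "index" is just these pairs.
support = [
    ("book a table for two at seven", "RestaurantReservation"),
    ("reserve a spot at the sushi place", "RestaurantReservation"),
    ("play the new taylor swift album", "PlayMusic"),
    ("put on some relaxing piano music", "PlayMusic"),
    ("will it rain in boston tomorrow", "GetWeather"),
]

tokenized = [utt.split() for utt, _ in support]
index = BM25Okapi(tokenized)  # "training" is just building the lexical index

def classify(query, k=3):
    scores = index.get_scores(query.split())
    top_k = sorted(range(len(support)), key=lambda i: scores[i], reverse=True)[:k]
    # Transfer the label by majority vote over the retrieved examples.
    votes = Counter(support[i][1] for i in top_k)
    return votes.most_common(1)[0][0]

print(classify("can you book me a table tonight"))  # -> RestaurantReservation
```

Supporting a new intent only requires appending its labeled examples to `support` and rebuilding the small index; no model parameters are updated.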

2. Application Domains and Task Adaptability

BM25-based few-shot retrieval techniques have been applied in a variety of settings:

  • Intent Classification and Slot Filling: Indexed labeled utterance spans are retrieved with BM25 to transfer slot or intent labels in few-shot setups. The approach is particularly effective for tasks with low paraphrasing and high surface-form similarity (2104.05763).
  • Query-by-Example (QBE): BM25 delivers strong results in QBE, where the query is a full document (e.g., scientific abstract), often matching or outperforming contextualized neural models like TILDE or TILDEv2, particularly for long or complex queries (2210.05512).
  • Retrieval-Augmented Generation (RAG) with LLMs: BM25 is used in RAG settings to select demonstrations for in-context few-shot prompting. In scientific text extraction (e.g., MOF synthesis conditions), BM25-RAG yields significant improvements in extraction F1 and downstream regression performance ($R^2$) over random or semantic retrieval, favoring tasks with distinct domain vocabulary (2408.04665). A demonstration-selection sketch follows this list.
  • Heterogeneous, Zero-Shot, and Entity-Centric Tasks: BM25 offers exceptional out-of-domain generalization capacity and negligible adaptation overhead in heterogeneous benchmarks (e.g., BEIR), reinforcing its suitability as a baseline in scenarios lacking extensive in-domain annotation (2104.08663).
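
A sketch of BM25-based demonstration selection for few-shot prompting follows. The (paragraph, extraction) pool, the prompt template, and the use of rank_bm25 are illustrative assumptions, not the exact pipeline of the cited MOF-extraction study (2408.04665):

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy pool of previously labeled (paragraph, extraction) demonstrations.
demo_pool = [
    ("The MOF was synthesized at 120 C for 24 h in DMF.",
     '{"temperature_C": 120, "time_h": 24, "solvent": "DMF"}'),
    ("Crystals formed after heating at 85 C for 72 h in water.",
     '{"temperature_C": 85, "time_h": 72, "solvent": "water"}'),
    ("The reaction mixture was stirred at room temperature overnight.",
     '{"temperature_C": 25, "time_h": 16, "solvent": null}'),
]

index = BM25Okapi([para.lower().split() for para, _ in demo_pool])

def build_prompt(test_paragraph, k=2):
    # Retrieve the k lexically closest demonstrations for the test input.
    scores = index.get_scores(test_paragraph.lower().split())
    top = sorted(range(len(demo_pool)), key=lambda i: scores[i], reverse=True)[:k]
    demos = "\n\n".join(
        f"Paragraph: {demo_pool[i][0]}\nConditions: {demo_pool[i][1]}" for i in top
    )
    return f"{demos}\n\nParagraph: {test_paragraph}\nConditions:"

prompt = build_prompt("The product was obtained after 48 h at 100 C in DMF.")
print(prompt)  # feed this few-shot prompt to an LLM for extraction
```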

3. Performance Benchmarks and Empirical Evidence

Across multiple studies and settings, BM25-based few-shot retrieval exhibits the following empirical characteristics:

  • Zero-/Few-Shot Robustness: BM25 serves as a consistently strong baseline with high OOD generalization. In BEIR, BM25 is often outperformed only by computationally intensive cross-encoders or late-interaction models, while exceeding many dense retrievers on unseen tasks (2104.08663).
  • Competitive QBE Results: In SciDocs QBE tasks, BM25 rivals or surpasses contextualized models for long queries. Interpolating BM25 with contextual relevance signals yields statistically significant gains, evidencing the complementarity between surface and semantic match (2210.05512).
  • Demonstration Selection Gains: When selecting demonstrations for LLM few-shot prompting, BM25-based RAG increases extraction F1 by up to +14.8% and downstream inference $R^2$ by 29.4%, outperforming both random selection and dense embedding retrieval for highly domain-specific paragraphs (2408.04665).
  • Slot Filling Limitations: For tasks requiring fine-grained contextual disambiguation (e.g., paraphrased slot values), BM25 underperforms semantic retrieval methods, as observed by substantial F1 gaps on SNIPS slot filling and similar datasets (2104.05763).
| Setting | BM25 Baseline | Enhanced/Hybrid | Gain over BM25 |
| --- | --- | --- | --- |
| Few-shot slot filling (F1) | Low 50s | Span-level semantic retrieval | +20 F1 (SNIPS, 5-shot) |
| QBE (MAP, SciDocs) | ~80–82% | BM25 + TILDE interpolation | +2–3% absolute, statistically significant |
| RAG extraction (F1, MOFs) | 0.81 (0-shot) | 0.93 (BM25-RAG, 4-shot) | +14.8% relative |

4. Hybrid and Interpolation Techniques

BM25 often serves as a backbone in hybrid retrieval models for few-shot scenarios:

  • Hybrid Scoring: Score interpolation between BM25 and neural/contextual signals captures both surface-form and latent semantic relevance, yielding statistically significant improvements across QBE and entity-centric retrieval (2210.05512); a minimal interpolation sketch follows this list.
  • Feedback Integration: BM25 is combined with few-shot feedback-adapted neural rerankers (e.g., parameter-efficient fine-tuned cross-encoders or kNN-based scoring) to integrate explicit user or system feedback, delivering gains of 5.2 nDCG@20 points over purely lexical feedback expansion (2210.10695).
  • Retrieval-Augmented Prompting: BM25 is used for adaptive demonstration selection in LLM prompting, and its performance is further enhanced when integrated with more advanced retrieval or reranking modules (e.g., via Reciprocal Rank Fusion or meta-learned interpolation).
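
A minimal sketch of score interpolation over a shared candidate set follows. The min-max normalization, the weight alpha, and the assumption that `dense_scores` come from some embedding-based retriever are illustrative choices, not the specific method of (2210.05512):

```python
import numpy as np

def interpolate(bm25_scores, dense_scores, alpha=0.5):
    """Linearly combine two score lists over the same candidate documents."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        # Normalize each signal to [0, 1] so the weight alpha is meaningful.
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    return alpha * minmax(bm25_scores) + (1 - alpha) * minmax(dense_scores)

# Toy candidate set of 4 documents, scored by both retrievers.
bm25_scores = [12.3, 7.1, 0.0, 3.4]
dense_scores = [0.62, 0.71, 0.55, 0.20]
fused = interpolate(bm25_scores, dense_scores, alpha=0.6)
ranking = np.argsort(-fused)
print(ranking)  # document indices re-ranked by the fused score
```

Rank-based fusion such as Reciprocal Rank Fusion can be substituted for the score interpolation above when the two retrievers' score scales are hard to calibrate.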

5. Adaptability and Computational Efficiency

BM25-based few-shot retrieval approaches offer key benefits for real-world deployment:

  • Non-parametric Adaptation: Both classic and hybrid BM25 systems only require adding or updating indexed examples for new domains or classes, enabling instant adaptation without retraining (2104.05763); see the sketch after this list.
  • Computational Economy: BM25 retrieval is highly efficient (e.g., ~20ms/query for 1M docs, index size ~0.4GB), contrasting with the heavier compute and memory requirements of dense retrievers, especially when cross-domain or real-time adaptation is critical (2104.08663).
  • Transparency and Simplicity: The model's operation is interpretable and its performance reliable across domains, supporting use as both a robust baseline and first-stage retrieval mechanism in hybrid pipelines.
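
A sketch of retrain-free adaptation, again assuming the rank_bm25 package and toy data: supporting a previously unseen class only requires rebuilding the small lexical index over the extended support set, which is near-instant compared with retraining a neural retriever:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

support = [
    ("play the new taylor swift album", "PlayMusic"),
    ("will it rain in boston tomorrow", "GetWeather"),
]
index = BM25Okapi([utt.split() for utt, _ in support])

# A new domain ships two labeled examples of a previously unseen intent.
support += [
    ("transfer fifty dollars to my savings account", "TransferMoney"),
    ("send money to alex for dinner", "TransferMoney"),
]
index = BM25Okapi([utt.split() for utt, _ in support])  # near-instant rebuild

query = "move a hundred dollars to savings".split()
scores = index.get_scores(query)
best = int(scores.argmax())
print(support[best][1])  # -> TransferMoney, with no model retraining
```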

6. Limitations and Directions for Further Research

BM25-based few-shot retrieval, while strong and adaptable, faces several limitations:

  • Contextual Inexpressiveness: BM25 cannot capture paraphrase, polysemy, or deeper semantic similarity, making it suboptimal for tasks with high linguistic variability or those requiring reasoning (2104.05763, 2210.05512).
  • Structured Output and Span Labeling: In slot filling or other structured prediction settings, BM25 is challenged by the need for span-level contextualization and non-overlapping constraint handling, as addressed by semantic span-level retrieval and batch-softmax training (2104.05763).
  • Lexical Bias: Datasets built or annotated using BM25-based pools may overstate its effectiveness due to corpus-specific term bias (2104.08663). This suggests that hybrid or semantic approaches are often necessary for fair generalization assessments in few-shot settings.
  • Document Expansion and Reranking: Augmenting BM25 retrieval pipelines with synthetic queries (e.g., docT5query) or integrating cross-encoders as rerankers further enhances performance with only a limited increase in resource requirements (2104.08663); a document-expansion sketch appears below.
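
A sketch of document expansion prior to BM25 indexing, in the spirit of docT5query. Here `generate_synthetic_queries` is a hypothetical placeholder for a trained query-generation model; it returns canned strings so the indexing logic stays runnable:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def generate_synthetic_queries(doc, n=3):
    # Placeholder: a real system would sample queries from a seq2seq model
    # (e.g., docT5query); this stub only expands the restaurant document.
    return ["how to reserve a restaurant table"] * n if "table" in doc else []

docs = [
    "book a table for two at an italian restaurant",
    "play some jazz music in the living room",
    "what is the weather like in berlin tomorrow",
]

# Append generated queries to each document, then index the expanded text.
expanded = [doc + " " + " ".join(generate_synthetic_queries(doc)) for doc in docs]
index = BM25Okapi([d.split() for d in expanded])

scores = index.get_scores("reserve a restaurant".split())
print(scores)  # the expanded first document now matches "reserve" lexically
```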

7. Summary Table: BM25-Based Few-Shot Retrieval Strategies

| Aspect | BM25 Classic | Hybrid/Semantic BM25 | BM25 Strength/Role |
| --- | --- | --- | --- |
| Query modality | Token/lexical | Token + learned/semantic | Explicit lexical match |
| Adaptation | Add-to-index | Add-to-index, hybrid rerank | Rapid, retrain-free |
| Training required | None | Minimal (only for augmented) | Low/none |
| Contextual capacity | Limited | Moderate–high (hybrid) | Add via span/embedding |
| Benchmark role | Baseline, first-stage | Interpolation/fusion anchor | Reference + improvement |
| Performance | Strong OOD baseline | Near-SOTA when hybridized | Few-shot, cross-domain |

References

  • Dian Yu et al., "Few-shot Intent Classification and Slot Filling with Retrieved Examples" (2104.05763)
  • Nandan Thakur et al., "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models" (2104.08663)
  • Ahmad M. Rashid et al., "On the Interpolation of Contextualized Term-based Ranking with BM25 for Query-by-Example Retrieval" (2210.05512)
  • Thomas Baumgärtner et al., "Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking" (2210.10695)
  • Zizheng Lin et al., "LLM-based MOFs Synthesis Condition Extraction using Few-Shot Demonstrations" (2408.04665)