BM25 Ranking Function Overview

Updated 21 March 2026

BM25 is a probabilistic ranking function that computes a relevance score using term frequency, inverse document frequency, and document-length normalization.
It employs parameters k1 and b to control the impact of term frequency saturation and length normalization, ensuring robust performance across various IR tasks.
BM25 serves as a strong baseline in legal, biomedical, and neural IR applications and is often integrated with learning-to-rank and semantic models for enhanced retrieval.

BM25 is a probabilistically motivated, term-weighting-based document ranking function that occupies a central role in modern information retrieval (IR) systems. It operationalizes relevance as a function of exact query–document term overlap, subject to non-linear term-frequency scaling, document-length normalization, and global rarity of terms. BM25 continues to set a strong baseline for both classical and neural IR architectures, exhibiting robust retrieval performance across a wide range of domains and datasets (Rosa et al., 2021).

1. Mathematical Formulation and Core Mechanics

BM25 assigns to each document–query pair a scalar relevance score via a weighted sum over all query terms present in the document. The canonical formula is: $\mathrm{BM25}(D,Q) = \sum_{i=1}^{n} \mathrm{idf}(q_i) \frac{\,\mathrm{tf}(q_i,D)\,(k_{1}+1)}{\mathrm{tf}(q_i,D) + k_{1}\left(1 - b + b\,\frac{|D|}{\mathit{avgdl}}\right)}$ where:

$Q = \{q_1, ..., q_n\}$ : query terms,
$D$ : candidate document,
$\mathrm{tf}(q_i,D)$ : frequency of $q_i$ in $D$ ,
$|D|$ : document length (tokens),
$\mathit{avgdl}$ : average document length in the collection,
$k_1$ : term-frequency scaling parameter (controls TF saturation),
$b$ : document-length normalization parameter (interpolates between no normalization, $b = 0$, and strict normalization, $b = 1$),
$\mathrm{idf}(q_i)$ : inverse document frequency, commonly defined as:

$\mathrm{idf}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$

with $N$ the total number of documents and $n(q_i)$ the document frequency of $q_i$, i.e., the number of documents containing $q_i$ (Kim et al., 2016, Rosa et al., 2021, Askari et al., 2023).

Typically, $k_1 \in [1.2, 2.0]$ and $b = 0.75$ are used as robust defaults, but other values have been validated for specific domains (e.g., biomedical abstracts) (Kim et al., 2016).
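The formulation above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `bm25_scores`, the whitespace tokenization, the toy corpus, and the choice of the unsmoothed $\log\frac{N - n + 0.5}{n + 0.5}$ IDF are all assumptions made for the example. Production toolkits typically use inverted indexes and a slightly different IDF smoothing.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization factor: 1 - b + b * |D| / avgdl.
        norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # only query terms present in the document contribute
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "stock markets fell sharply today".split(),
    "the weather is mild today".split(),
]
scores = bm25_scores("cat mat".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Note that with this IDF form a term occurring in more than half of the documents receives a negative weight, which is one reason some implementations add 1 inside the logarithm.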

2. Parameterization, Model Behavior, and Example

The parameters $k_1$ and $b$ fundamentally control how BM25 interpolates between raw count-based scoring and more nuanced "pivoted normalization" accounting for within-corpus variations in verbosity and length (Rosa et al., 2021, Askari et al., 2023):

$k_1$ (TF scaling): Larger $k_1$ makes term-frequency scaling more linear, giving more weight to repeated terms; smaller $k_1$ saturates faster, reducing the marginal gain of repeated term matches.
$b$ (length normalization): $b = 0$ ignores length; $b = 1$ fully normalizes by the document length relative to $\mathit{avgdl}$. Longer-than-average documents are penalized when $b > 0$.
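The two effects described above can be observed directly by isolating the term-frequency factor of the score. The sketch below is illustrative only; the helper name `tf_component` and the specific lengths and parameter grids are assumptions chosen for the demonstration.

```python
def tf_component(tf, k1, b, doc_len, avgdl):
    """Saturating TF factor of BM25 (everything except the IDF weight)."""
    norm = 1 - b + b * doc_len / avgdl
    return tf * (k1 + 1) / (tf + k1 * norm)

# Saturation: for an average-length document, each extra occurrence adds less,
# and the factor never exceeds k1 + 1.
for k1 in (0.5, 1.2, 2.0):
    values = [round(tf_component(tf, k1, 0.75, 100, 100), 3) for tf in (1, 2, 4, 8)]
    print(f"k1={k1}: {values}")

# Length normalization: same tf, one short and one long document.
for b in (0.0, 0.75, 1.0):
    short_doc = tf_component(2, 1.2, b, 50, 100)
    long_doc = tf_component(2, 1.2, b, 200, 100)
    print(f"b={b}: short={short_doc:.3f} long={long_doc:.3f}")
```

With $b = 0$ the short and long documents receive the same factor; as $b$ grows, the longer document is penalized more strongly relative to the shorter one.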

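For example, given concrete values for the term frequency $\mathrm{tf}(q_i,D)$, the document length $|D|$, the average length $\mathit{avgdl}$, the parameters $k_1$ and $b$, the collection size, and the document frequency of $q_i$, the score contribution of a single query term is computed in three steps: evaluate $\mathrm{idf}(q_i)$, evaluate the length-normalization factor $1 - b + b\,\frac{|D|}{\mathit{avgdl}}$, and substitute both into the saturating term-frequency ratio of the canonical formula (Askari et al., 2023).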
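As an illustration of this calculation, the following worked arithmetic uses assumed values that are not taken from the cited work: $\mathrm{tf} = 3$, $|D| = 120$, $\mathit{avgdl} = 100$, $k_1 = 1.2$, $b = 0.75$, a collection of $1000$ documents, and a document frequency of $50$. It also assumes the natural logarithm and the $\frac{N - n + 0.5}{n + 0.5}$ form of the IDF.

```latex
\begin{align*}
\mathrm{idf}(q_i) &= \ln\frac{1000 - 50 + 0.5}{50 + 0.5} = \ln\frac{950.5}{50.5} \approx 2.935 \\
1 - b + b\,\frac{|D|}{\mathit{avgdl}} &= 1 - 0.75 + 0.75 \cdot \frac{120}{100} = 1.15 \\
\frac{\mathrm{tf}\,(k_1 + 1)}{\mathrm{tf} + k_1 \cdot 1.15} &= \frac{3 \cdot 2.2}{3 + 1.38} = \frac{6.6}{4.38} \approx 1.507 \\
\text{contribution} &\approx 2.935 \times 1.507 \approx 4.42
\end{align*}
```

Because the document is 20% longer than average, the denominator grows from $3 + 1.2$ to $3 + 1.38$, which slightly lowers the contribution relative to an average-length document with the same term frequency.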

3. Variants and Extensions: Multi-Field, Proximity, Query-Dependent Normalization

BM25 forms the core of numerous adaptive scoring schemes:

BM25F (Multi-field):

Aggregates term frequencies across structured fields (title, abstract, body) using per-field weights and normalization:

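$\mathrm{BM25F}(D,Q) = \sum_{i=1}^{n} \mathrm{idf}(q_i)\,\frac{\widetilde{\mathrm{tf}}(q_i,D)}{k_1 + \widetilde{\mathrm{tf}}(q_i,D)}$

where $\widetilde{\mathrm{tf}}(q_i,D)$ consists of field-specific boosted and normalized counts (Manabe et al., 2017).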
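A minimal sketch of the multi-field aggregation follows, assuming one commonly used form of BM25F in which each field's term frequency is divided by its own length-normalization factor, multiplied by a field weight, summed into a pseudo-frequency, and then saturated once. The function name, the field names, and the weights are hypothetical, and the exact form used in the cited work may differ.

```python
def bm25f_term_score(field_tfs, field_lens, avg_field_lens, weights, b_fields, idf, k1=1.2):
    """Score contribution of one query term under a simple BM25F variant."""
    pseudo_tf = 0.0
    for field, tf in field_tfs.items():
        b = b_fields[field]
        # Per-field length normalization, analogous to the single-field case.
        norm = 1 - b + b * field_lens[field] / avg_field_lens[field]
        # Boost the normalized count by the field weight.
        pseudo_tf += weights[field] * tf / norm
    # Saturation is applied once, after combining the fields.
    return idf * pseudo_tf / (k1 + pseudo_tf)

# Hypothetical configuration: title matches count three times as much as body matches.
weights = {"title": 3.0, "body": 1.0}
b_fields = {"title": 0.75, "body": 0.75}
lens = {"title": 8, "body": 200}
avg_lens = {"title": 8, "body": 200}

title_hit = bm25f_term_score({"title": 1, "body": 0}, lens, avg_lens, weights, b_fields, idf=1.0)
body_hit = bm25f_term_score({"title": 0, "body": 1}, lens, avg_lens, weights, b_fields, idf=1.0)
print(title_hit, body_hit)
```

Saturating after the fields are combined, rather than scoring each field separately and summing, avoids over-rewarding a term that appears once in each of several fields.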

Proximity-based BM25:

Incorporates proximity heuristics via "Expanded Span" methods, extracting spans of near-occurrence query terms and replacing term frequency by a relevance-weighted sum over such spans. This approach rewards documents where query terms appear in close textual proximity, even across different fields (Manabe et al., 2017).

Query-Dependent Length Normalization:

Standard BM25 normalizes only by document length ($|D|/\mathit{avgdl}$), ignoring query length. Proximity-matching generalizations replace this with a two-variable factor that depends on both document length and query length, designed to down-weight verbosity mismatches and up-weight document–query pairs with similar lengths.

This modification can yield substantial gains in applications where matching verbosity is correlated with relevance (e.g., penpal recommendation: 52% MRR improvement over BM25) (Agrawal, 2017).

4. Empirical Benchmarks and Applications

BM25 serves as a robust baseline and strong IR performer across diverse tasks and domains:

Legal Case Retrieval: Achieved F1=0.0937 (second place) on COLIEE 2021 legal case retrieval with default Pyserini settings ($k_1 = 0.9$, $b = 0.4$) and minimal preprocessing (default tokenization, stopword removal, stemming, and sliding-window document segmentation) (Rosa et al., 2021).
Biomedical Data: On PubMed, BM25 provided state-of-the-art performance for short queries over large collections (162,259 PubMed abstracts; mean average precision 0.2463–0.3136 on TREC Genomics). In user log data (28K PubMed queries; 27M docs), BM25 outperformed baseline IR models and, when combined with semantic matching in LambdaMART, increased NDCG@20 by 23% (Kim et al., 2016).
Zero-shot, Passage, and Semantic IR Benchmarks: Remains highly competitive on MS MARCO and TREC-DL, where it is often used as the initial candidate generator for neural rerankers (Askari et al., 2023, Lu et al., 2025).

5. Integration with Neural and Learning-to-Rank Models

BM25 forms the foundation of multi-stage pipelines, where its exact-match bias and interpretability are leveraged alongside modern neural models:

Hybrid Feature Fusion: Combining BM25 scores with semantic similarity (e.g., embedding-based methods) in learning-to-rank frameworks (e.g., LambdaMART) yields additive gains. On PubMed, such combinations increased NDCG@10 by up to 25% (Kim et al., 2016).
Cross-Encoder Signals: Injecting normalized BM25 scores as integer tokens into BERT-based cross-encoder rerankers outperforms linear or non-linear score interpolation, with injected tokens shown (via feature attribution) to be among the model's most influential ranking features (Askari et al., 2023).
Semantic BM25 in Transformers: Cross-encoders (e.g., MiniLM) intrinsically learn a "semantic BM25" circuit, with attention heads approximating soft term frequency (with built-in saturation and length normalization) and embedding singular vectors encoding a corpus IDF analogue. This learned variant not only recovers the effects of classical BM25 but also extends them semantically, registering paraphrase and synonymy as soft TF (Lu et al., 2025).
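The score-injection idea from the cross-encoder item above can be sketched as plain string construction, without committing to any particular model library. Everything here is an assumption for illustration: the helper names, per-query min-max normalization, the scale of 100, and the `[SEP]`-delimited layout. The cited work's exact normalization and input format may differ.

```python
def minmax_to_int(scores, scale=100):
    """Min-max normalize one query's BM25 scores and map them to integers in [0, scale]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates tie; no ranking signal to inject.
        return [0 for _ in scores]
    return [round(scale * (s - lo) / (hi - lo)) for s in scores]

def build_reranker_inputs(query, passages, bm25_scores, sep=" [SEP] "):
    """Place the integer BM25 token between the query and each candidate passage."""
    tokens = minmax_to_int(bm25_scores)
    return [f"{query}{sep}{t}{sep}{p}" for t, p in zip(tokens, passages)]

# Hypothetical first-stage candidates and their BM25 scores.
inputs = build_reranker_inputs(
    "what is bm25",
    ["BM25 is a ranking function.", "Cats are mammals."],
    [7.3, 1.1],
)
for text in inputs:
    print(text)
```

The resulting strings would then be tokenized and scored by the cross-encoder in the usual way, so the reranker can attend to the lexical score as ordinary input text.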

6. Practical Implementation and Tuning

BM25 is widely implemented in IR toolkits (Lucene, Pyserini/Anserini) with toolkit-specific defaults (e.g., $k_1 = 1.2$, $b = 0.75$ in Lucene; $k_1 = 0.9$, $b = 0.4$ in Anserini/Pyserini) and is robust under minimal preprocessing (tokenization, stemming, stopword removal). Application-specific adjustments may include:

Passage/Window Segmentation: Improves effectiveness in long-document domains by aligning retrieval units with local contexts (Rosa et al., 2021).
Field Importance and Proximity Tuning: Boosted field weights, proximity span sizes, and length-normalization parameters are best fit via grid search or, increasingly, as part of a differentiable learning-to-rank optimization (Manabe et al., 2017).
Parameter Sensitivities: BM25 is generally robust to moderate variation in $k_1$ and $b$, but domain adaptation (e.g., biomedical vs. legal) can benefit from targeted tuning (Kim et al., 2016, Rosa et al., 2021).
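The passage/window segmentation mentioned above can be sketched as a simple sliding window over tokens. The function name and the window and stride sizes are assumptions; real pipelines often segment by sentences rather than raw tokens.

```python
def sliding_windows(tokens, window, stride):
    """Split a token list into overlapping segments of at most `window` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not tokens:
        return []
    segments = []
    start = 0
    while True:
        segments.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last segment already reaches the end of the document
        start += stride
    return segments

# Hypothetical long document represented as token ids 0..11.
segments = sliding_windows(list(range(12)), window=5, stride=3)
for seg in segments:
    print(seg)
```

Each segment is then indexed and scored as its own retrieval unit. A common aggregation choice, assumed here rather than taken from the cited work, is to rank a document by the maximum score among its segments.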

7. Limitations, Enhancements, and Future Research

BM25's bag-of-words architecture makes it reliant on exact (or near-exact) term overlap, which limits its ability to capture semantic similarity when the vocabulary of the query and the document does not match directly (Kim et al., 2016). Adaptive extensions address some weaknesses:

Semantic Enhancements: Neural models extend BM25 via learned soft-counts and embedding-based IDF proxies, providing semantic generalization beyond token identity (Lu et al., 2025).
Proximity and Field-Weighting: Further development of proximity features (spans), query-dependent normalization, and learnable parameters offers domain- and application-specific gains (Manabe et al., 2017, Agrawal, 2017).
Hybrid and Interpretable Models: Recent research highlights the natural emergence of BM25 analogues in neural rankers, raising prospects for hybrid designs with explicit, transparent control over $k_1$ and $b$ and interpretable IDF weighting (Lu et al., 2025).

In empirical IR practice and methodological research, BM25 persists as the paradigmatic lexical scoring approach—benchmarking systems from legal to biomedical retrieval, and providing a mechanistically transparent scaffold upon which both efficient sparse and state-of-the-art neural systems are built.