Anserini BM25: Robust Lexical Ranking

Updated 14 May 2026

Anserini BM25 is a lexical ranking algorithm that implements the Okapi BM25 model atop Apache Lucene for term-based relevance scoring.
It supports both the traditional bag-of-words query model and a query-side BM25 weighting for long, reasoning-intensive queries.
Tuning the k1 and b parameters can substantially improve retrieval recall and accuracy in deep-retrieval, retrieval-augmented language model pipelines.

Anserini BM25 is a lexical ranking algorithm and retrieval backend implemented atop the Apache Lucene search library and widely adopted in academic information retrieval (IR) research. The implementation exposes faithful, parameterizable variants of the Okapi BM25 probabilistic retrieval model, which is a standard for scoring term-based (bag-of-words) document relevance and forms the baseline in a broad spectrum of IR and retrieval-augmented LLM pipelines. Anserini provides a suite of tools, exposing both default and advanced BM25 retrieval options, now including recent extensions for handling long, reasoning-intensive queries.

1. Mathematical Foundations and Implementation

BM25 in Anserini (via Lucene’s BM25Similarity) implements the Robertson–Zaragoza probabilistic scoring function:

$\mathrm{score}(q, d) = \sum_{t\in q} \mathrm{idf}(t) \cdot \frac{(k_1+1)\,\mathrm{tf}_{t,d}}{\mathrm{tf}_{t,d} + k_1(1 - b + b\,\frac{|d|}{\mathit{avgdl}})}$

where:

$\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$ ,
$|d|$ is the document length in tokens,
$\mathit{avgdl}$ is the average document length in the collection,
$k_1 > 0$ controls term-frequency saturation,
$b \in [0,1]$ controls length normalization,
$\mathrm{idf}(t) = \log \frac{N-\mathrm{df}(t)+0.5}{\mathrm{df}(t)+0.5}$ with $N$ as the number of indexed documents, and $\mathrm{df}(t)$ as the count of documents containing $t$.
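The scoring function above can be sketched in a few lines of plain Python. This is an illustrative toy, not Lucene's or Anserini's code: the function name `bm25_score`, the whitespace tokenization, and the five-document corpus are all assumptions made for the example, and statistics are recomputed on every call rather than read from an index.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=0.9, b=0.4):
    """Score one tokenized document against a tokenized query.

    `corpus` is a list of tokenized documents used only to derive
    N, df(t), and avgdl (hypothetical in-memory stand-in for an index).
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    # Length normalization factor: 1 - b + b * |d| / avgdl
    norm = 1 - b + b * len(doc_terms) / avgdl
    score = 0.0
    for t in query_terms:  # a repeated query term contributes once per occurrence
        if tf[t] == 0:
            continue
        df = sum(1 for d in corpus if t in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        score += idf * (k1 + 1) * tf[t] / (tf[t] + k1 * norm)
    return score


corpus = [
    "lucene is a search library".split(),
    "anserini builds on lucene".split(),
    "bm25 ranks documents by term statistics".split(),
    "dense retrievers embed text".split(),
    "query length can matter".split(),
]
query = "bm25 ranks".split()
scores = [bm25_score(query, d, corpus) for d in corpus]
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Note that with the idf form given above, a term occurring in more than half of the documents receives a negative weight; the toy corpus avoids that case.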

Lucene’s BM25Similarity exposes these computations directly to Anserini. Term and document statistics (such as frequency, length) are obtained from the Lucene index; no special indexing hooks are needed, as length normalization is handled at query time (0911.5046). BM25F, a multi-field generalization, is also described, but is not directly available in standard Anserini; approximation via field concatenation or per-field boosting is common.

Default Lucene BM25 parameters are $k_1 = 1.2$, $b = 0.75$, while Anserini’s own defaults (often used in QA or passage retrieval) are $k_1 = 0.9$, $b = 0.4$ (0911.5046, Yang et al., 2019). All parameters can be tuned by the end-user via driver flags.

2. Query Construction and Query-Side BM25

Anserini’s default approach for query formulation is the “bag-of-words” (BoW) linear model: for a query $q$, the per-term weight is the raw frequency $\mathrm{tf}_{t,q}$; length normalization and TF saturation are applied only on the document side during scoring (Ge et al., 2 Sep 2025). This suffices for short queries typically seen in benchmarks (e.g., TREC, BEIR).

Recent studies have identified limitations of this model for long, reasoning-intensive queries generated by retrieval-augmented LLMs or benchmarks like BRIGHT, where query lengths commonly exceed 55–200 tokens (Ge et al., 2 Sep 2025). In such scenarios, repeated terms can dominate, and raw-count BoW representation leads to over-weighting.

BRIGHT introduces a "query-side BM25" (BM25Q) weighting, in which query terms themselves are scored using the BM25 formula:

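$w(t, q) = \frac{(k_1+1)\,\mathrm{tf}_{t,q}}{\mathrm{tf}_{t,q} + k_1(1 - b + b\,\frac{|q|}{\mathit{avgql}})}$

where $\mathrm{tf}_{t,q}$ is the frequency of term $t$ in query $q$, and $|q|$ and $\mathit{avgql}$ are the length and average length of queries, respectively. The overall matching score is then a true BM25-style inner product over both query and document BM25 vectors:

$\mathrm{score}(q, d) = \sum_{t\in q} w(t, q) \cdot w(t, d), \quad w(t, d) = \mathrm{idf}(t) \cdot \frac{(k_1+1)\,\mathrm{tf}_{t,d}}{\mathrm{tf}_{t,d} + k_1(1 - b + b\,\frac{|d|}{\mathit{avgdl}})}$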

Empirical evaluation demonstrates that this approach yields consistent nDCG@10 improvements across reasoning-heavy tasks, especially for medium and long queries, by saturating term frequencies and normalizing for query length (Ge et al., 2 Sep 2025). Anserini and Pyserini now provide first-class support for query-side BM25 via a new QueryGenerator and Python API flags.
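To make the contrast between raw-count and query-side weighting concrete, the sketch below computes both for a query with one heavily repeated term. It assumes the query-side weight is the BM25 saturation term with query-length normalization (no idf on the query side); that form, the helper names `saturate` and `query_weights`, and the average query length of 8 are assumptions for illustration, not the Anserini or Pyserini API.

```python
from collections import Counter


def saturate(tf, length, avg_length, k1, b):
    """BM25 term-frequency saturation with length normalization."""
    return (k1 + 1) * tf / (tf + k1 * (1 - b + b * length / avg_length))


def query_weights(query_terms, avg_query_len, k1=0.9, b=0.4, query_side=True):
    """Per-term query weights: raw counts (BoW) or saturated (query-side)."""
    tf = Counter(query_terms)
    if not query_side:
        return dict(tf)  # bag-of-words: weight is the raw term count
    qlen = len(query_terms)
    return {t: saturate(c, qlen, avg_query_len, k1, b) for t, c in tf.items()}


# Hypothetical verbose query in which "graph" appears six times.
query = ["graph"] * 6 + ["cycle", "proof"]
bow = query_weights(query, avg_query_len=8, query_side=False)
bm25q = query_weights(query, avg_query_len=8)

print("BoW ratio graph/cycle:  ", bow["graph"] / bow["cycle"])
print("BM25Q ratio graph/cycle:", round(bm25q["graph"] / bm25q["cycle"], 3))
```

Under these assumptions the repeated term still outweighs the single-occurrence terms, but by well under a factor of two rather than a factor of six, and no query weight can exceed $k_1 + 1$.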

3. Parameterization and Tuning Practices

BM25 parameters $k_1$ (controls term frequency scaling) and $b$ (controls document length normalization) have significant impacts depending on corpus and task. While $k_1 = 1.2$, $b = 0.75$ are “generic” retrieval defaults (0911.5046), for question answering and long-passage retrieval, $k_1 = 0.9$, $b = 0.4$ are commonly adopted in Anserini (Yang et al., 2019, Ge et al., 2 Sep 2025).

Systematic tuning over larger, noisy, or long-document corpora, for example in agentic pipelines using LLMs, shows that increasing $k_1$ and $b$ (e.g., up to $(k_1, b) = (25, 1.0)$ as in Pi-Serini) substantially increases surfaced recall and accuracy: in BrowseComp-Plus, accuracy improves from 64% to 82% and surfaced-evidence recall from 84.6% to 95.7% when tuning from $(k_1, b) = (0.9, 0.4)$ to $(k_1, b) = (25, 1.0)$ (Hsu et al., 11 May 2026). Performance should thus be empirically optimized for downstream needs.

System	$k_1$	$b$	Surfaced-Evidence Recall	Accuracy
Default	0.9	0.4	84.6%	64.0%
Pi-Serini	25	1.0	95.7%	82.0%
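The two parameter settings in the table behave very differently as term frequency grows. The sketch below evaluates only the term-frequency component of BM25 (idf omitted) for a document of average length; the function name `tf_component` and the document length of 100 tokens are assumptions chosen for illustration.

```python
def tf_component(tf, k1, b, doc_len, avgdl):
    """Term-frequency part of BM25: (k1+1)*tf / (tf + k1*(1 - b + b*|d|/avgdl))."""
    norm = 1 - b + b * doc_len / avgdl
    return (k1 + 1) * tf / (tf + k1 * norm)


settings = {"default": (0.9, 0.4), "high": (25.0, 1.0)}
growth = {}
for name, (k1, b) in settings.items():
    # Document of exactly average length, so the normalization factor is 1.
    one = tf_component(1, k1, b, doc_len=100, avgdl=100)
    twenty = tf_component(20, k1, b, doc_len=100, avgdl=100)
    growth[name] = twenty / one
    print(f"{name}: tf=1 -> {one:.3f}, tf=20 -> {twenty:.3f}, ratio {growth[name]:.2f}")
```

With $k_1 = 0.9$ the contribution of a term saturates almost immediately, so twenty occurrences count for less than twice one occurrence. With $k_1 = 25$ the same twenty occurrences count for more than ten times as much, which moves the function much closer to raw term counting.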

4. System Integration and Downstream Pipelines

Anserini BM25 supplies robust candidate retrieval for a wide range of downstream applications, including open-domain question answering, RAG, and LLM-based research agents. In the BERTserini pipeline, Wikipedia is indexed at document, paragraph, and sentence granularities; BM25 retrieves top-$k$ spans, and each is individually scored by a fine-tuned BERT reader (Yang et al., 2019). Final answer extraction hinges on the linear interpolation:

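$S = (1 - \mu) \cdot S_{\mathrm{Anserini}} + \mu \cdot S_{\mathrm{BERT}}$

with the interpolation weight $\mu \in [0, 1]$ tuned empirically (the optimal value is reported in Yang et al., 2019).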
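A linear interpolation between a retriever score and a reader score can be sketched as follows. The weight name `mu`, the form `(1 - mu) * s_retriever + mu * s_reader`, the helper names, and the candidate scores are all hypothetical values chosen for illustration, not figures from BERTserini.

```python
def interpolate(s_retriever, s_reader, mu):
    """Blend a retriever score and a reader score with weight mu in [0, 1]."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return (1 - mu) * s_retriever + mu * s_reader


def best_span(candidates, mu):
    """Return the label of the candidate with the highest blended score."""
    return max(candidates, key=lambda c: interpolate(c[1], c[2], mu))[0]


# (label, retriever score, reader score): made-up numbers.
candidates = [
    ("span A", 12.0, 3.0),
    ("span B", 9.0, 8.0),
    ("span C", 11.0, 5.0),
]

for mu in (0.0, 0.5, 1.0):
    print(mu, best_span(candidates, mu))
```

At `mu = 0` the ranking follows the retriever alone, at `mu = 1` the reader alone; intermediate values trade the two off, which is why the weight is tuned on held-out data in practice.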

BM25 recall generally exceeds end-to-end accuracy; in BERTserini, paragraph-level recall reaches 85.8% at the evaluated retrieval depth $k$, while exact-match (EM) accuracy caps at 38.6%. This suggests that retrieval is not the primary bottleneck; failures are more often due to extraction and scoring or aggregation steps.

In retrieval-augmented LLM pipelines such as Pi-Serini (Hsu et al., 11 May 2026), retrieval depth contextualizes agentic performance: evidence recall rises sharply up to $k = 1000$, underscoring that shallow retrieval settings may obscure the capacity of well-tuned lexical BM25 retrieval.

5. Empirical Performance and Observed Behaviors

Key findings regarding BM25 within Anserini-based systems include:

Lexical BM25, with proper parameter tuning and deep retrieval ($k = 1000$), can match or exceed dense embedding baselines in agentic LLM workflows (Hsu et al., 11 May 2026).
Query-side BM25 (BM25Q) is most beneficial for medium-to-long queries, common in recent RAG and prompt-driven research settings (Ge et al., 2 Sep 2025).
In BERTserini, paragraph-level retrieval provides a balance between contextual sufficiency and avoidance of distractors; both document-level (too broad) and sentence-level (too narrow) retrieval perform worse (Yang et al., 2019).
Increasing retrieval depth confers substantial gains in surfaced evidence: from 70.48% at $k = 5$ to 95.78% at $k = 1000$; these improvements accrue until saturation, after which further increases yield diminishing returns (Hsu et al., 11 May 2026).

Retrieval depth $k$	Surfaced Recall (%) (tuned BM25)
5	70.48
1000	95.78
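Surfaced recall at a given depth can be computed as the fraction of known evidence documents that appear in the top-$k$ results. The sketch below is a generic recall-at-depth computation under that assumption; the function name `surfaced_recall`, the document identifiers, and the ranking are hypothetical, and the exact metric definition used by Pi-Serini may differ.

```python
def surfaced_recall(ranked_ids, evidence_ids, k):
    """Fraction of evidence documents found among the top-k ranked ids."""
    evidence = set(evidence_ids)
    if not evidence:
        return 0.0
    return len(set(ranked_ids[:k]) & evidence) / len(evidence)


# Hypothetical ranking for one query and its gold evidence set.
ranking = ["d7", "d2", "d9", "d4", "d1", "d3", "d8"]
evidence = {"d2", "d3", "d8"}

for k in (2, 5, 7):
    print(k, round(surfaced_recall(ranking, evidence, k), 3))
```

Recall computed this way can only stay flat or rise as $k$ grows, which is why deeper pools surface more evidence until the remaining evidence documents are exhausted.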

6. Practical and System-Level Considerations

Lucene’s BM25Similarity enables Anserini to provide high-performance, numerically stable BM25 ranking with minimal index-time requirements (0911.5046). All relevant statistics (term and document frequency, doc lengths) are maintained at index time and queried at retrieval. No additional hooks are required to gather average lengths. Parameter overrides are available via both the Java interface and command-line flags.

For multi-field documents, native BM25F is not included in Lucene. Alternatives involve concatenating fields, using per-field boosts via MultiFieldQueryParser, or external implementations such as Pérez-Iglesias’s models.jar (0911.5046). Correct field-level idf computation is a subtlety: accurate document-level $\mathrm{idf}(t)$ for rare terms may require a catch-all field in fielded collections.

The integration of query-side BM25 in both Anserini and Pyserini allows researchers to toggle between BoW and BM25Q representations. Command-line flags and API switches select the appropriate generator and determine whether to use Lucene’s approximate (bucketized) or accurate length norms, with the latter recommended for highest reproducibility (Ge et al., 2 Sep 2025).

7. Implications and Recommendations for Future Use

The evolution of complex, long-form queries—especially in retrieval-augmented LLM settings—highlights the necessity for treating both query and document sides with appropriate term-frequency saturation and length normalization. Empirical evidence affirms the payoff of query-side BM25 for long and reasoning-intensive queries (Ge et al., 2 Sep 2025).

Optimal BM25 setup in practice involves:

Tuning $k_1$ and $b$ for the underlying collection and task (higher values favor long documents and high noise),
Adopting deep retrieval pools (up to $k = 1000$) to maximize surfaced recall in agentic scenarios,
Using query-side BM25 when queries are long or otherwise verbose,
Consulting system-specific error analyses to locate bottlenecks (retrieval, extraction, or ranking)—with recent evidence suggesting retrieval is no longer the limiting factor in well-tuned pipelines (Yang et al., 2019, Hsu et al., 11 May 2026).

The cumulative findings across BERTserini, Pi-Serini, and BRIGHT benchmarks substantiate the continued relevance and performance competitiveness of BM25 in Anserini—especially as retrieval tasks shift toward ever longer and more complex prompts (Yang et al., 2019, Ge et al., 2 Sep 2025, Hsu et al., 11 May 2026).