Query-side BM25: Enhanced Query Normalization

Updated 21 November 2025

Query-side BM25 is an extension of BM25 that treats queries as pseudo-documents, applying saturation and length normalization to improve retrieval for long queries.
It is integrated into major IR toolkits like Anserini and Pyserini, enabling reproducible experiments on reasoning-intensive and long-query benchmarks.
Empirical results on the BRIGHT benchmark show that query-side BM25 raises average nDCG@10 by about 8% (from 0.137 to 0.148), a gain that is particularly relevant to retrieval-augmented generation pipelines.

Query-side BM25 is an extension of the established BM25 family of sparse retrieval scoring functions. It applies the document-side BM25 saturation and length normalization to the query representation, rather than relying on a pure bag-of-words vector. This modification is particularly relevant for retrieval tasks involving long or complex queries, such as those in reasoning-intensive datasets and retrieval-augmented generation (RAG) pipelines, where standard BM25 weights query terms by their raw counts and can therefore be suboptimal. Query-side BM25 was recently integrated into widely used toolkits, including Anserini and Pyserini, to support consistent, reproducible experimentation across modern retrieval benchmarks (Ge et al., 2 Sep 2025).

1. Mathematical Definition and Principal Differences

Standard BM25 computes the score for a document $d$ with respect to a query $q$ as follows: $\mathrm{score}_{\mathrm{BM25}}(d,q) = \sum_{t\in q} \mathrm{idf}(t) \cdot \frac{f_{t,d}(k_1+1)}{f_{t,d} + k_1(1-b + b |d|/\overline{d})} \cdot \frac{f_{t,q}(k_3+1)}{f_{t,q} + k_3}$, where $f_{t,d}$ and $f_{t,q}$ are the frequencies of term $t$ in the document and in the query, $|d|$ is the document length, $\overline{d}$ is the average document length in the collection, and $k_1$, $b$, and $k_3$ are free parameters. Typically $k_3 \to \infty$, in which case the last factor reduces to the raw query term frequency $f_{t,q}$, with no saturation or length normalization on the query side. The idf term is typically defined as $\mathrm{idf}(t)=\log \frac{D-n_t+0.5}{n_t+0.5}$, with $n_t$ the number of documents containing term $t$ out of collection size $D$.
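The standard formula can be sketched in a few lines of Python. This is an illustrative implementation of the equation as written above, not code from any toolkit; the function names, the dictionary-based inputs, and the parameter defaults (`k1=0.9`, `b=0.4`) are assumptions chosen for the example. Passing `k3=None` models the limit $k_3 \to \infty$, where the query factor is the raw count.

```python
import math


def idf(n_t: int, num_docs: int) -> float:
    """idf(t) = log((D - n_t + 0.5) / (n_t + 0.5)), as defined in the text."""
    return math.log((num_docs - n_t + 0.5) / (n_t + 0.5))


def doc_factor(f_td: float, doc_len: float, avg_doc_len: float,
               k1: float, b: float) -> float:
    """Document-side term weight with saturation (k1) and length normalization (b)."""
    return f_td * (k1 + 1) / (f_td + k1 * (1 - b + b * doc_len / avg_doc_len))


def query_factor(f_tq: float, k3=None) -> float:
    """Query-side factor of standard BM25; k3=None stands for k3 -> infinity."""
    if k3 is None:
        return float(f_tq)  # raw query term frequency, no saturation
    return f_tq * (k3 + 1) / (f_tq + k3)


def bm25_score(query_tf: dict, doc_tf: dict, doc_len: float, avg_doc_len: float,
               doc_freq: dict, num_docs: int,
               k1: float = 0.9, b: float = 0.4, k3=None) -> float:
    """Sum idf * document factor * query factor over query terms found in the document."""
    score = 0.0
    for term, f_tq in query_tf.items():
        f_td = doc_tf.get(term, 0)
        if f_td == 0:
            continue  # a term absent from the document contributes nothing
        score += (idf(doc_freq[term], num_docs)
                  * doc_factor(f_td, doc_len, avg_doc_len, k1, b)
                  * query_factor(f_tq, k3))
    return score


# Toy example: the repeated query term "bm25" is counted twice at full weight.
print(bm25_score(query_tf={"bm25": 2, "query": 1},
                 doc_tf={"bm25": 3, "ranking": 1},
                 doc_len=4, avg_doc_len=4.0,
                 doc_freq={"bm25": 5, "query": 20}, num_docs=100))
```

Because the query factor is linear in $f_{t,q}$ in this limit, a term repeated ten times in a long prompt contributes ten times as much as a term that appears once.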

Query-side BM25 ("BM25Q") modifies this by subjecting the query vector to the same saturation and length normalization as the document side: $w_{t,q} = \frac{f_{t,q}(k_1+1)}{f_{t,q} + k_1(1-b + b|q|/\overline{q})}$, where $|q|$ is the query length and $\overline{q}$ is the average query length. The final score is then computed with the idf factor applied once rather than once per side: $\mathrm{score}_{\mathrm{BM25Q}}(d,q) = \sum_{t\in q} \mathrm{idf}(t) \; w_{t,d} \; w_{t,q}$, where $w_{t,d} = \frac{f_{t,d}(k_1+1)}{f_{t,d} + k_1(1-b + b |d|/\overline{d})}$ is the document-side factor of standard BM25 and $w_{t,q}$ carries the query-side saturation and length penalty (Ge et al., 2 Sep 2025).
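To make the two effects of $w_{t,q}$ concrete, the following minimal sketch evaluates the query-side weight for a few term frequencies and query lengths. The function name, the parameter defaults, and the choice to pass the average query length as an argument are assumptions for illustration; the source does not specify how $\overline{q}$ is estimated.

```python
def query_weight(f_tq: float, query_len: float, avg_query_len: float,
                 k1: float = 0.9, b: float = 0.4) -> float:
    """Query-side BM25 weight w_{t,q}: saturating in f_tq, penalized for long queries."""
    length_norm = 1 - b + b * query_len / avg_query_len
    return f_tq * (k1 + 1) / (f_tq + k1 * length_norm)


# Saturation: each extra repetition adds less, and the weight stays below k1 + 1.
for f in (1, 2, 4, 8):
    print("f_tq =", f, "->", round(query_weight(f, query_len=20, avg_query_len=20.0), 3))

# Length normalization: the same single occurrence weighs less in a longer query.
for qlen in (10, 20, 80):
    print("|q| =", qlen, "->", round(query_weight(1, query_len=qlen, avg_query_len=20.0), 3))
```

With these formulas a term that occurs once in a query of exactly average length gets weight 1, which matches its raw count under standard BM25; the two schemes diverge only as terms repeat or as the query length departs from the average.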

2. Construction and Implementation in Retrieval Toolkits

Traditional BM25 implementations (e.g., Lucene, Anserini, Pyserini) treat the query vector as a simple bag-of-words, emitting count-weighted tokens for each term without regard to query-side saturation. The query-side BM25 generator constructs a token-weighted query vector using the $w_{t,q}$ function, ensuring that repeated query terms face diminishing returns and that long queries are normalized by their length, mirroring document-side characteristics. Anserini and Pyserini provide interfaces to switch between standard and query-side BM25 through a dedicated query generator class and explicit API flags, enabling drop-in comparison and integration on benchmarks such as BRIGHT and BEIR (Ge et al., 2 Sep 2025).
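The generator pattern described above can be illustrated independently of any toolkit. In the sketch below, two interchangeable functions turn a token list into a weighted query vector, and a single scoring function consumes either one. All names here (`bag_of_words_query`, `make_bm25q_query`, `score`) are hypothetical and do not correspond to Anserini or Pyserini classes or flags; the precomputed `doc_weights` and `idf` dictionaries stand in for what an index would supply.

```python
from collections import Counter
from typing import Callable, Dict, List

QueryVector = Dict[str, float]


def bag_of_words_query(tokens: List[str]) -> QueryVector:
    """Standard behaviour: each query term is weighted by its raw count."""
    return {term: float(count) for term, count in Counter(tokens).items()}


def make_bm25q_query(avg_query_len: float, k1: float = 0.9,
                     b: float = 0.4) -> Callable[[List[str]], QueryVector]:
    """Build a generator that weights query terms with w_{t,q}."""
    def generate(tokens: List[str]) -> QueryVector:
        length_norm = 1 - b + b * len(tokens) / avg_query_len
        return {term: f * (k1 + 1) / (f + k1 * length_norm)
                for term, f in Counter(tokens).items()}
    return generate


def score(query_vector: QueryVector, doc_weights: Dict[str, float],
          idf: Dict[str, float]) -> float:
    """Sum idf(t) * w_{t,d} * w_{t,q} over terms shared by query and document."""
    return sum(idf.get(term, 0.0) * doc_weights[term] * w_q
               for term, w_q in query_vector.items() if term in doc_weights)


tokens = "the the the the proof of the theorem".split()
bm25q_query = make_bm25q_query(avg_query_len=8.0)
print(bag_of_words_query(tokens))  # "the" carries its raw count of 5
print(bm25q_query(tokens))         # "the" is saturated below k1 + 1
```

Keeping the generator separate from the scorer is what allows a drop-in comparison: the document side and the idf values are untouched, and only the query vector changes.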

3. Impact on Long-Query and Reasoning-Intensive Retrieval

In settings where queries contain medium to large numbers of tokens, ranging from standard web search up to multi-sentence queries, LLM prompts, or whole-document queries (as in query-by-example (QBE) retrieval and RAG), query-side BM25 is more robust than the standard formulation. It down-weights boilerplate or repeated context and emphasizes less frequent, discriminative terms. On the BRIGHT benchmark, which is designed for reasoning-intensive queries, query-side BM25 increases average nDCG@10 from 0.137 (standard BM25) to 0.148, a relative gain of about 8%, with notable improvement on tasks such as TheoremQA and Pony that feature prompts of 16–256 tokens (Ge et al., 2 Sep 2025). For very short queries (<16 tokens), both variants perform almost identically; the differences become pronounced as the query length increases.
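As a quick check on the reported figure, the relative gain follows directly from the two averages quoted above:

$\frac{0.148 - 0.137}{0.137} = \frac{0.011}{0.137} \approx 0.080$

That is, an absolute gain of 0.011 in average nDCG@10 corresponds to a relative gain of roughly 8%.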

4. Integration with RAG and LLM-Driven Expansion

Modern retrieval pipelines involving LLMs commonly produce long and varied queries, either through agentic multi-step reasoning or query expansion with generative models. Query-side BM25 directly accommodates such settings by penalizing verbosity and repeated tokens and by rewarding coverage of distinctive vocabulary. The authors explicitly recommend using BM25Q in retrieval-augmented generation (RAG) tasks, where prompt lengths regularly exceed classic search engine norms. Moreover, the interface in Pyserini facilitates experimentation and deployment in these emerging applications without the need for custom code (Ge et al., 2 Sep 2025).

5. Complementarity with LM-driven and Contextual Models

Applications such as query-by-example (QBE) retrieval and fusion with neural rerankers benefit from maintaining a robust term-based signal. For instance, the interpolation of BM25 and contextualized term-based models (e.g., TILDE, TILDEv2) leads to statistically significant gains on long-query retrieval tasks; BM25’s term-frequency, saturation, and length-normalization interact complementarily with learned context-aware term weights (Abolghasemi et al., 2022). In practical two-stage pipelines, keeping a strong query-side BM25 component enables effective integration with reranking and expansion strategies.
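One common way to interpolate a lexical score with a learned one is a weighted sum of normalized scores. The sketch below assumes min-max normalization per query and a single mixing weight `alpha`; both choices, and all names in the code, are illustrative assumptions and are not taken from the cited work, which may normalize or tune differently.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one query's scores to [0, 1] so the two rankers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}  # no spread, so no ranking signal
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def interpolate(bm25_scores: Dict[str, float], model_scores: Dict[str, float],
                alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Rank documents by alpha * BM25 + (1 - alpha) * model score, after normalization."""
    lexical, learned = min_max(bm25_scores), min_max(model_scores)
    docs = set(lexical) | set(learned)
    # A document missing from one ranker's list is treated as scoring 0 there.
    fused = {doc: alpha * lexical.get(doc, 0.0) + (1 - alpha) * learned.get(doc, 0.0)
             for doc in docs}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


bm25 = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
model = {"d1": 0.2, "d2": 0.9, "d3": 0.1}
print(interpolate(bm25, model, alpha=0.5))
```

In practice `alpha` would be tuned on held-out queries; setting it to 1.0 recovers the BM25 ranking and setting it to 0.0 recovers the learned model's ranking.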

6. Empirical Evaluation and Adoption

Empirical studies demonstrate that query-side BM25 is a low-cost modification yielding robust gains. On the BRIGHT benchmark, improvements in nDCG@10 and task coverage have been verified across Anserini/Pyserini implementations, confirming that BRIGHT originally used a query-side variant. Analysis reveals substantial impact in medium-length and reasoning-centric settings; for extremely long queries, gains taper as length normalization dominates but remain non-negative (Ge et al., 2 Sep 2025). The transparent integration in open-source IR toolkits enables reproducibility and further research.
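For readers unfamiliar with the metric reported above, the following is a minimal sketch of nDCG@10 for a single query. It assumes linear gain (the relevance grade itself) with a base-2 logarithmic discount; some evaluations use an exponential gain instead, and the source does not state which variant was used. The function names and the `qrels` dictionary layout are illustrative.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, float], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of an ideal ordering of the judged documents."""
    gains = [qrels.get(doc, 0) for doc in ranked_doc_ids]  # unjudged counts as 0
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


qrels = {"d1": 1, "d2": 1}
print(ndcg_at_k(["d1", "d2", "d3"], qrels))  # both relevant documents ranked first
print(ndcg_at_k(["d3", "d1", "d2"], qrels))  # a non-relevant document ranked first
```

Benchmark-level figures such as the BRIGHT averages are then obtained by averaging this per-query value over queries and tasks.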

7. Future Directions and Recommendations

Given the increasing prevalence of long and complex queries in contemporary IR, particularly in retrieval-augmented LLM workflows, query-side BM25 is poised to become a standard choice. Future work may consider combining BM25Q with agentic multi-step LLM expansion, reinforcement learning for query specification (e.g., QUESTER (Satouf et al., 7 Nov 2025)), or neural reranking to further enhance performance on complex tasks. Community adoption in open-source toolkits streamlines experimentation and application, with recommendations to include accurate length normalization and saturation on both query and document vectors in future retrieval systems.

In summary, query-side BM25 generalizes the classic scoring function by treating queries as “pseudo-documents,” offering improved normalization and robustness for long, verbose, or expansion-enriched queries. Its integration into major IR libraries and consistent empirical gains make it a compelling baseline for modern retrieval scenarios involving long prompts, LLMs, and reasoning-intensive pipelines (Ge et al., 2 Sep 2025, Abolghasemi et al., 2022, Satouf et al., 7 Nov 2025).