BEIR Benchmark for IR Models
- BEIR is a zero-shot evaluation benchmark that aggregates diverse IR tasks to measure cross-domain generalization, covering domains such as biomedical literature, finance, news, Wikipedia, and scientific text.
- It offers a modular data format and an open-source codebase, enabling rapid extension to multilingual and specialized information retrieval applications.
- Evaluations use metrics like nDCG@k and Recall@k, with BM25 often serving as a robust baseline to assess out-of-distribution retrieval performance.
BEIR (Benchmarking Information Retrieval) is a heterogeneous, zero-shot evaluation benchmark designed to rigorously assess the out-of-distribution generalization of information retrieval (IR) models across a broad spectrum of retrieval tasks, domains, and data modalities. Conceived as a response to the limitations of IR model evaluation in narrowly defined, in-domain settings (e.g., MS MARCO), BEIR synthesizes a suite of diverse tasks to stress-test retrieval systems' cross-domain robustness. Since its introduction, BEIR has catalyzed advancements in dense, sparse, cross-encoder, and hybrid retrieval methodologies and has become the de facto standard for reporting zero-shot retrieval effectiveness in both academic and applied contexts (Thakur et al., 2021).
1. Motivation and Design Principles
BEIR was created to address two fundamental deficits in IR model evaluation: (i) the lack of a unified benchmark for heterogeneous, out-of-distribution (OOD) generalization, and (ii) the resulting fragmentation of empirical comparisons across IR research. Traditional IR models evaluated on single-domain corpora (notably, web passage datasets like MS MARCO) often failed to expose brittleness in out-of-domain or task-shifted scenarios, such as biomedical retrieval, argument mining, or fact verification. BEIR confronts these gaps by aggregating 18 (later, more) public retrieval datasets covering nine principal retrieval task classes, including open-domain QA, fact-checking, argument retrieval, entity search, duplicate question detection, and citation prediction. Its modular, task-agnostic data format (corpus, queries, qrels) and open-source codebase enable rapid extension to new datasets and architectures (Thakur et al., 2021, Kamalloo et al., 2023).
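To make the data format concrete, the following is a minimal Python sketch of the (corpus, queries, qrels) triple that each BEIR dataset exposes; the identifiers and texts are made up, and the commented loader call reflects typical usage of the open-source beir package rather than a prescribed API.

```python
# Illustrative sketch of BEIR's task-agnostic data format. Each dataset ships a
# corpus (doc_id -> title/text), a set of queries (query_id -> text), and qrels
# (query_id -> doc_id -> relevance grade). Identifiers below are invented.
corpus = {
    "doc1": {"title": "SARS-CoV-2 transmission", "text": "Evidence on aerosol spread ..."},
    "doc2": {"title": "BM25 ranking", "text": "A probabilistic lexical ranking model ..."},
}
queries = {
    "q1": "how does the coronavirus spread indoors?",
}
qrels = {
    "q1": {"doc1": 2, "doc2": 0},  # graded relevance judgments
}

# With the open-source beir package, the same triple is typically loaded as:
#   from beir.datasets.data_loader import GenericDataLoader
#   corpus, queries, qrels = GenericDataLoader("path/to/dataset").load(split="test")
# (shown as a comment only; the exact API may differ across beir versions)
```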
2. Benchmark Composition and Dataset Taxonomy
The canonical BEIR suite comprises the following datasets, each reflecting unique retrieval challenges in terms of domain, corpus size, query formulation, and relevance judgments:
| Dataset | Domain | Task |
|---|---|---|
| TREC-COVID | Biomedical | IR |
| NFCorpus | Biomedical | IR |
| BioASQ | Biomedical | QA |
| NaturalQuestions | Wikipedia | QA |
| HotpotQA | Wikipedia | Multi-hop QA |
| FiQA-2018 | Finance | QA |
| Signal-1M (RT) | Twitter | Tweet retrieval |
| TREC-NEWS | News | News retrieval |
| Robust04 | News | News retrieval |
| ArguAna | Miscellaneous | Argument retrieval |
| Touché-2020 | Miscellaneous | Argument retrieval |
| CQADupStack | StackExchange | Duplicate QA |
| Quora | Quora | Duplicate QA |
| DBPedia | Wikipedia | Entity retrieval |
| SCIDOCS | Scientific | Citation prediction |
| FEVER | Wikipedia | Fact checking |
| Climate-FEVER | Wikipedia | Fact checking |
| SciFact | Scientific | Fact checking |
Corpus sizes range from several thousand (e.g., SciFact) to millions of documents (e.g., BioASQ, DBPedia), and query types span from terse keyword queries to full sentences and multi-hop question formulations. BEIR has been extended to several languages (e.g., BEIR-PL for Polish (Wojtasik et al., 2023), BEIR-NL for Dutch (Banar et al., 11 Dec 2024), and Hindi-BEIR for Hindi (Acharya et al., 18 Aug 2024, Acharya et al., 9 Sep 2024)), primarily via machine translation and/or curation of native-language IR datasets.
3. Zero-Shot Evaluation Protocols and Metrics
BEIR mandates that models be evaluated in a zero-shot fashion: systems may be pre-trained or fine-tuned on massive, generic or cross-domain corpora (typically MS MARCO or Wikipedia), but no adaptation to any BEIR dataset is permitted prior to testing. This protocol is intended to mirror real-world deployment where in-domain supervision is unavailable.
Core metrics include:
- Normalized Discounted Cumulative Gain at rank $k$ (nDCG@k):
  $$\mathrm{nDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}, \qquad \mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)},$$
  where $rel_i$ denotes the graded relevance at rank $i$ and $\mathrm{IDCG@k}$ is the ideal DCG used for normalization. BEIR tasks use either binary or multi-graded qrels; a computational sketch of nDCG@k and Recall@k follows this list.
- Recall@k: Fraction of a query's relevant documents that are retrieved in the top $k$ results, averaged over queries.
- Mean Average Precision (MAP@k): Mean of average precision at cutoff $k$ across queries.
- hole@10: For some analyses (e.g., Touché-2020), the fraction of unjudged results in the top 10 per query.
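As a concrete illustration of the metrics above, here is a minimal pure-Python sketch of nDCG@k and Recall@k over BEIR-style qrels and a ranked run. Reference implementations typically delegate metric computation to a trec_eval-style library, so treat this as illustrative rather than canonical.

```python
# Hedged sketch: nDCG@k and Recall@k from qrels (query_id -> {doc_id: grade})
# and a run (query_id -> list of doc_ids in descending score order).
import math

def ndcg_at_k(qrels, run, k=10):
    scores = []
    for qid, ranking in run.items():
        rels = qrels.get(qid, {})
        # DCG over the top-k retrieved documents (gain 2^rel - 1, log2 discount)
        dcg = sum((2 ** rels.get(d, 0) - 1) / math.log2(i + 2)
                  for i, d in enumerate(ranking[:k]))
        # Ideal DCG: judged documents sorted by grade
        ideal = sorted(rels.values(), reverse=True)[:k]
        idcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ideal))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(scores) / max(len(scores), 1)

def recall_at_k(qrels, run, k=100):
    scores = []
    for qid, ranking in run.items():
        relevant = {d for d, r in qrels.get(qid, {}).items() if r > 0}
        if not relevant:
            continue  # skip queries with no positive judgments
        scores.append(len(relevant & set(ranking[:k])) / len(relevant))
    return sum(scores) / max(len(scores), 1)

# Toy example: one query with graded qrels and a three-document ranking.
qrels = {"q1": {"d1": 2, "d2": 1}}
run = {"q1": ["d3", "d1", "d2"]}
print(ndcg_at_k(qrels, run, k=3), recall_at_k(qrels, run, k=2))
```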
4. Baseline Models and System Families
BEIR evaluates a diversity of retrieval paradigms, encompassing:
- Lexical: BM25 (TF–IDF term matching with length normalization), often serving as a robust baseline. Elasticsearch or Anserini (Lucene-based) indexes are standard (Thakur et al., 2021, Kamalloo et al., 2023).
- Sparse Neural Retrieval: Methods such as uniCOIL, DeepImpact, SPARTA, TILDEv2, and SPLADEv2 (Thakur et al., 2023). These systems generate high-dimensional, sparse token-weighted vectors, often leveraging expansion tokens (e.g., SPLADEv2), which have shown superior domain generalization by bridging vocabulary gaps.
- Dense Bi-Encoders: Dual-encoder architectures (TAS-B, DPR, ANCE, Contriever (Izacard et al., 2021), E5 (Wang et al., 2022), SGPT (Muennighoff, 2022)) map both queries and documents to dense, fixed-length vectors, indexed via FAISS for efficient Maximum Inner Product Search (MIPS).
- Late-Interaction Models: ColBERT and variants process each document at the token level, retaining a matrix of embeddings to enable granular MaxSim matching. This approach achieves strong accuracy (and parameter efficiency in new languages, e.g., Turkish (Ezerceli et al., 20 Nov 2025)) albeit at increased indexing and query costs.
- Cross-Encoder Re-Ranking: Reranking the top-k candidates (retrieved by BM25 or a dense retriever) with a full query-document cross-attention model (e.g., MiniLM, SGPT-CE), achieving the highest empirical nDCG@10 on most tasks but limited by computational cost.
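The sketch below combines two of the stages just described: a dense bi-encoder whose corpus embeddings are indexed with FAISS for exact maximum inner product search, followed by cross-encoder re-ranking of the retrieved candidates. The toy corpus (flattened to id-to-text strings for brevity), the use of sentence-transformers, and the specific checkpoint names are illustrative assumptions, not part of the benchmark itself.

```python
# Minimal zero-shot dense retrieval + cross-encoder re-ranking sketch.
# Assumes the sentence-transformers and faiss packages; model names are
# illustrative MS MARCO-trained checkpoints, not prescribed by BEIR.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer, CrossEncoder

corpus = {  # flat id -> text corpus for brevity
    "d1": "BM25 is a lexical ranking function based on term frequency.",
    "d2": "Dense bi-encoders map queries and documents to fixed-length vectors.",
    "d3": "The capital of France is Paris.",
}
queries = {"q1": "how do dual-encoder retrievers represent documents?"}

# 1) Dense bi-encoder: encode the corpus once, index with FAISS for MIPS.
bi_encoder = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")  # assumed checkpoint
doc_ids = list(corpus)
doc_emb = bi_encoder.encode([corpus[d] for d in doc_ids], convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatIP(doc_emb.shape[1])  # exact inner-product search
index.add(doc_emb)

# 2) Retrieve top-k candidates per query via maximum inner product search.
q_emb = bi_encoder.encode(list(queries.values()), convert_to_numpy=True).astype("float32")
scores, idx = index.search(q_emb, 3)

# 3) Re-score the candidates with a cross-encoder (full query-document attention).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint
for qi, qid in enumerate(queries):
    cand = [doc_ids[j] for j in idx[qi]]
    ce_scores = reranker.predict([(queries[qid], corpus[d]) for d in cand])
    reranked = [d for _, d in sorted(zip(ce_scores, cand), reverse=True)]
    print(qid, reranked)
```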
5. Key Findings and Empirical Results
- BM25 as a strong baseline: Across the 18 BEIR datasets, BM25 remains a highly competitive zero-shot method (Thakur et al., 2021, Kamalloo et al., 2023), outperforming most neural/sparse models in OOD scenarios absent domain-specific adaptation.
- Neural methods: Dense bi-encoders lagged behind BM25 in early experiments, but advances such as E5 (Wang et al., 2022) and SGPT (Muennighoff, 2022) have yielded dense models surpassing BM25 in aggregate nDCG@10, particularly on semantically challenging domains (FiQA, SciFact, Quora). SPLADEv2 outperforms BM25 on average by leveraging internal expansion tokens (Thakur et al., 2023).
- Domain/task sensitivity: Neural models often underperform on tasks characterized by high term ambiguity, argument retrieval (Touché-2020), or domains with limited tokenization/coverage in the training corpus. Conversely, rerankers and sparse expansion methods show superior robustness in such scenarios.
- Argument retrieval anomaly: On Touché-2020, all neural models tested, including advanced multi-vector systems (CITADEL+), underperform BM25 (BM25 nDCG@10 of 0.367 remains unbeaten). Detailed black-box axiomatic analyses reveal that neural retrievers heavily violate length-normalization constraints (LNC2), manifesting as a bias toward ultra-short, overlap-rich but semantically empty passages, which are prioritized over substantive arguments due to fixed-size embedding architectures and the lack of explicit length normalization (Thakur et al., 10 Jul 2024).
- Data artifacts: The presence of noisy, very short documents and shallow relevance judgments in certain datasets (again, Touché-2020) impedes fair model comparison. Length-based denoising (e.g., excluding <20-word documents) and post-hoc re-judging can dramatically improve neural retriever scores (+0.52 nDCG@10 for TAS-B), though BM25 remains the top performer post-cleaning (Thakur et al., 10 Jul 2024).
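A minimal sketch of the length-based denoising mentioned above: dropping documents shorter than 20 whitespace-separated words from a BEIR-style corpus before indexing. Whether titles count toward the length, and the exact threshold, are choices made here for illustration.

```python
def denoise_corpus(corpus, min_words=20):
    """Return a copy of a BEIR-style corpus without ultra-short documents.

    Counts whitespace-separated words over title + text (a choice made here,
    not prescribed by the cited analysis).
    """
    return {
        doc_id: doc
        for doc_id, doc in corpus.items()
        if len((doc.get("title", "") + " " + doc.get("text", "")).split()) >= min_words
    }
```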
6. Multilingual Extensions and Cross-Lingual Challenges
BEIR’s architecture has facilitated porting to multiple languages, primarily by translation and curation:
- Polish (BEIR-PL): 13 translated datasets, revealing a 5–11 point drop in BM25 nDCG@10 vs. English (due to high inflection and morphology). Neural rerankers (plT5-large, HerBERT-large) recover or exceed English state-of-the-art via sequence-to-sequence re-ranking and late-interaction architectures (Wojtasik et al., 2023).
- Dutch (BEIR-NL): Dutch translation of 14 datasets, with back-translation experiments quantifying up to 2–3 points nDCG@10 loss due to translation artifacts. Retrieval-tuned dense models (e.g., e5-large-instruct, gte-multilingual-base) consistently outperform BM25, but domain/local content issues persist (Banar et al., 11 Dec 2024).
- Hindi (Hindi-BEIR): 15 datasets including both translated and native Hindi/cross-lingual tasks, spanning 27M documents and ~200K queries over eight tasks (Acharya et al., 18 Aug 2024, Acharya et al., 9 Sep 2024). Multilingual dense retrievers (BGE-M3, mE5) and the NLLB-E5 stacking approach outperform BM25, which fails catastrophically on cross-script tasks (e.g., CC News Retrieval). However, average nDCG@10 performance remains 15–30 points below analogous English tasks, highlighting challenges of script/vocabulary mismatch and distributional shift.
7. Implications, Limitations, and Future Directions
- Generalization remains the central challenge: Despite progress in model architecture and pretraining strategies (e.g., weakly-supervised contrastive learning, internal/external expansion, length normalization), OOD generalization on semantically or morphologically complex tasks is unresolved (Thakur et al., 2021, Wang et al., 2022, Izacard et al., 2021).
- Evaluation artifacts: Shallow or incomplete judgment pools, especially for open-ended or argumentative retrieval sets, can confound assessment of model progress, necessitating rigorous augmentation and error analysis (Thakur et al., 10 Jul 2024).
- Scaling with compute: Retrieval capacity scales smoothly with pretraining FLOPs, following power-law relationships similar to those for cross-entropy loss in LLMs; small models heavily over-trained can rival the performance of much larger under-trained models (Portes et al., 24 Aug 2025).
- Multilingual evaluation: Translation-based benchmarks are subject to semantic drift and loss of fidelity, placing a practical upper bound on cross-lingual evaluation signal. The need for large, natively-curated datasets in underrepresented languages and more robust cross-lingual alignment strategies remains acute (Wojtasik et al., 2023, Banar et al., 11 Dec 2024, Acharya et al., 18 Aug 2024).
- Reproducibility and reporting: Turnkey reference implementations (e.g., Pyserini, SPRINT (Thakur et al., 2023)), unified submission leaderboards, and artifact badges are critical for reproducibility, transparency, and fair comparison across research groups (Kamalloo et al., 2023).
- Open research avenues: Incorporating classic IR axioms (length normalization, term frequency constraints) as architectural biases or regularizers; hybrid sparse/dense models; task-specific distillation for low-resource domains; and robust statistical significance testing across heterogeneous tasks (Thakur et al., 10 Jul 2024, Kamalloo et al., 2023).
BEIR, together with its emerging multilingual counterparts, is a cornerstone for the rigorous evaluation of IR model generalization and continues to expose domain and architectural limitations fundamental to the development of robust, widely deployable retrieval systems.