TREC-DL & BEIR Subset Benchmarks
- TREC-DL and BEIR Subset benchmarks are standardized collections for evaluating IR models with extensive annotations and diverse datasets.
- They enable the comparison of retrieval paradigms—from lexical and sparse to dense and prompt-based—using metrics like nDCG@10.
- Innovative pooling methods and hyperparameter optimizations in these benchmarks drive reproducible research and performance gains in IR tasks.
TREC-DL (the Text REtrieval Conference Deep Learning Track) and the BEIR (Benchmarking Information Retrieval) subset are foundational benchmarks for evaluating information retrieval models, especially those leveraging neural and LLM architectures. TREC-DL provides large-scale, rigorously annotated test collections for ad hoc passage and document retrieval in the large-data regime, while BEIR introduces heterogeneity and a focus on zero-shot generalization across 18 diverse datasets spanning traditional and specialized IR tasks. Systematic studies of these benchmarks reveal both the strengths and the limits of contemporary lexical, sparse, dense, late-interaction, and non-parametric neural ranking paradigms, shaping current evaluation protocols and research directions.
1. Historical Context and Benchmark Definitions
TREC-DL was inaugurated to address ad hoc ranking tasks with unprecedented scale and rigor, making hundreds of thousands of human-labeled query–document pairs available for training and evaluation (Craswell et al., 2020, Craswell et al., 2021, Craswell et al., 10 Jul 2025). The passage and document ranking tasks use the MS MARCO v2 corpora, distinguishing between real human queries and synthetic queries generated by models such as T5 or GPT-4 (Craswell et al., 10 Jul 2025). Evaluation is conducted using test queries strictly held out from corpus construction, yielding challenging and reusable test sets.
BEIR emerged to stress-test retrieval models in zero-shot settings, collating 18 datasets from disparate domains and IR tasks (biomedical, news, QA, argument mining, citation prediction). Its unified format and standardized evaluation using nDCG@10 facilitate fair, cross-domain model comparison and highlight out-of-distribution generalization challenges (Thakur et al., 2021, Kamalloo et al., 2023). Several BEIR collections—TREC-COVID, TREC-NEWS, Robust04, and Touché (argument retrieval)—are themselves derived from historical TREC efforts, cementing inter-benchmark continuity.
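For orientation, each BEIR dataset ships as a document corpus, a query set, and graded qrels in one common layout. The loader below is a minimal sketch assuming the customary corpus.jsonl / queries.jsonl / tab-separated qrels files with a header row (these file names and columns are assumptions based on typical BEIR releases, not a normative specification, and the official beir toolkit should be preferred in practice).

```python
import csv
import json

def load_beir_dataset(corpus_path, queries_path, qrels_path):
    """Load a BEIR-style dataset: corpus, queries, and graded qrels.

    Assumes the common BEIR layout: corpus.jsonl / queries.jsonl with one
    JSON object per line, and a tab-separated qrels file whose header is
    query-id, corpus-id, score.
    """
    corpus, queries, qrels = {}, {}, {}

    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            corpus[doc["_id"]] = {"title": doc.get("title", ""), "text": doc["text"]}

    with open(queries_path, encoding="utf-8") as f:
        for line in f:
            q = json.loads(line)
            queries[q["_id"]] = q["text"]

    with open(qrels_path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])

    return corpus, queries, qrels
```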
2. Evaluation Metrics and Pooling Methodologies
Both TREC-DL and BEIR utilize graded relevance judgments and report effectiveness primarily through nDCG@10, with auxiliary metrics such as MAP, Recall@100, and Precision@10 for finer analysis (Craswell et al., 2020, Craswell et al., 2021, Craswell et al., 10 Jul 2025). nDCG@k is defined for a ranked list as

$$\text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}, \qquad \text{DCG@}k = \sum_{i=1}^{k} \frac{g(\text{rel}_i)}{\log_2(i+1)},$$

where $\text{rel}_i$ is the graded relevance label of the document at rank $i$, $g(\cdot)$ is the gain function (the label itself, or $2^{\text{rel}_i}-1$ in the exponential variant), and $\text{IDCG@}k$ is the DCG@k of the ideal ranking.
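A minimal reference implementation of nDCG@k using the linear-gain convention above (swap in 2**rel - 1 for the exponential variant) could look as follows; the reported benchmark numbers are typically computed with trec_eval/pytrec_eval rather than ad hoc scripts.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (1-indexed ranks)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: system ranking, best first.
    qrels: dict mapping doc_id -> graded relevance label.
    Uses the linear gain g(rel) = rel; replace with 2**rel - 1 for the
    exponential-gain variant.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

Averaging this value over all test queries yields the headline nDCG@10 figure.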
Pooling for relevance judgment in TREC-DL is performed with initial shallow pools expanded by classifier-based sampling (e.g., HiCAL), maximizing coverage while minimizing annotation bias (Craswell et al., 2021). BEIR converts all datasets into a unified structure (document corpus, query list, qrels), allowing plug-and-play evaluation of competing methods, and leverages pooling approaches tailored to each collection’s judgments (Thakur et al., 2021).
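As a simplified illustration (depth-k pooling only; the classifier-guided expansion with HiCAL is deliberately not modeled), a judgment pool can be seeded by taking the union of the top-k documents contributed by each submitted run:

```python
def depth_k_pool(runs, k=10):
    """Union of the top-k documents per query across submitted runs.

    runs: list of dicts, each mapping query_id -> ranked list of doc_ids.
    Returns query_id -> set of doc_ids to send to assessors. This sketches
    only the initial shallow pool; TREC-DL then expands it with
    classifier-guided sampling (e.g., HiCAL), which is not shown here.
    """
    pool = {}
    for run in runs:
        for qid, ranking in run.items():
            pool.setdefault(qid, set()).update(ranking[:k])
    return pool
```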
3. Retrieval Paradigms and Model Families
TREC-DL and BEIR foster the comparison of lexical, sparse, dense, late-interaction, re-ranking, and prompt-based retrieval families (Thakur et al., 2021, Kamalloo et al., 2023, Craswell et al., 10 Jul 2025):
| Paradigm | Representative Models | Strengths/Weaknesses |
|---|---|---|
| Lexical | BM25, Query Likelihood Model | Robust baseline; excels in short-text, lexically matched settings |
| Sparse | SPLADE, DocT5query, DeepCT, SPARTA | Efficient; effective in-domain and robust to imperfect signals, especially at scale |
| Dense | DPR, ANCE, TAS-B, GenQ | Bi-encoders for semantic matching; strong in-domain, struggle with domain shift |
| Late-Interaction | ColBERT, ColBERT-v2, ConstBERT | Token-level interactions; high quality but costly in storage and latency |
| Re-ranking | monoELECTRA, monoT5, BM25+CE | Best effectiveness but slower inference; used atop first-stage retrieval |
| Prompt-based | LLMs (GPT-4, T5), few-shot PRP | Recent advances show effectiveness surpassing supervised models (Sinhababu et al., 26 Sep 2024, Craswell et al., 10 Jul 2025) |
Zero-shot evaluations in BEIR reveal BM25 to be surprisingly robust across datasets; however, prompt-based LLMs and hybrid approaches are rapidly closing the gap (Sinhababu et al., 26 Sep 2024, Craswell et al., 10 Jul 2025).
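To ground the paradigm contrast, the lexical baseline in the table can be sketched in a few lines. The snippet below is a simplified BM25 scorer (whitespace tokenization and the classic k1/b defaults are assumptions; this is not a production index). Dense bi-encoders instead score by an inner product between learned query and document embeddings, which is what buys semantic matching but also what degrades under domain shift.

```python
import math
from collections import Counter

class SimpleBM25:
    """Minimal BM25 scorer over a whitespace-tokenized corpus (illustrative only)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [doc.lower().split() for doc in docs]
        self.doc_len = [len(d) for d in self.docs]
        self.avgdl = sum(self.doc_len) / len(self.docs)
        self.tfs = [Counter(d) for d in self.docs]
        # Document frequency per term, used for the IDF component.
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, doc_idx):
        tf, dl = self.tfs[doc_idx], self.doc_len[doc_idx]
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            f = tf[term]
            # Saturating term-frequency component with length normalization.
            norm = f * (self.k1 + 1) / (f + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
            s += self.idf.get(term, 0.0) * norm
        return s

    def rank(self, query, top_k=10):
        scores = [(self.score(query, i), i) for i in range(len(self.docs))]
        return sorted(scores, reverse=True)[:top_k]
```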
4. Methodological Innovations and Practical Applications
Recent studies highlight several methodological advances:
- Hyperparameter Optimization: Bayesian optimization frameworks (e.g., BOIR) automatically tune classical IR models to competitive performance (Gysel et al., 2018), with the surrogate-model and acquisition-function machinery equally applicable to neural architectures (see the first sketch after this list).
- Few-shot Non-parametric Models: Pairwise ranking via in-context prompting (few-shot PRP) nearly matches state-of-the-art supervised cross-encoders on TREC-DL and BEIR without any parametric training: lightweight LLMs compare candidate documents pairwise and the preferences are aggregated across the candidate list (Sinhababu et al., 26 Sep 2024) (see the second sketch after this list).
- Sparse Retrieval Scaling: Large decoder-only LLMs trained with a combination of contrastive loss (CL) and knowledge distillation (KD) show scaling benefits. Sparse retrieval paradigms, particularly Lion-SP-8B, outperform dense methods in most settings (Zeng et al., 21 Feb 2025).
- Efficient Multi-vector Retrieval: Fixed-size document representations (ConstBERT) compress variable-length token-level embeddings into a constant number of vectors, halving storage and latency penalties without sacrificing effectiveness on MS MARCO and BEIR (MacAvaney et al., 2 Apr 2025) (see the third sketch after this list).
- Listwise-to-Graph (L2G): Document relations are induced from listwise reranker logs, avoiding quadratic graph-construction costs while matching oracle-based graph methods in effectiveness; online induction allows efficient handling of dynamic corpora (Yoon et al., 1 Oct 2025).
- Generative Feedback and Adaptive Re-ranking: LLM-based generative query expansion (Gen-QR, Gen-PRF) and adaptive re-ranking over lexical corpus graphs (GAR) with cross-encoders (e.g., monoELECTRA) bring retrieval performance close to that of state-of-the-art sparse models even in zero-shot regimes (Parry et al., 2 May 2024).
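First sketch: a hedged illustration of BOIR-style hyperparameter tuning, here using scikit-optimize's gp_minimize over BM25's k1 and b to maximize dev-set nDCG@10. The evaluate_ndcg10 helper is hypothetical, and BOIR's actual surrogate and acquisition choices may differ from this stand-in.

```python
from skopt import gp_minimize
from skopt.space import Real

def evaluate_ndcg10(k1, b):
    """Hypothetical helper: index/search with BM25(k1, b) and return mean dev-set nDCG@10."""
    raise NotImplementedError

def objective(params):
    k1, b = params
    return -evaluate_ndcg10(k1, b)  # gp_minimize minimizes, so negate the metric

result = gp_minimize(
    objective,
    dimensions=[Real(0.1, 3.0, name="k1"), Real(0.0, 1.0, name="b")],
    n_calls=30,        # surrogate-model + acquisition-function loop
    random_state=42,
)
best_k1, best_b = result.x
```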
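Second sketch: the few-shot PRP idea reduces to a pairwise comparison oracle plus an aggregation rule. llm_prefers below is a hypothetical stub for a prompted LLM call carrying a few in-context examples; the all-pairs win count is one simple aggregation, and the exact prompting and aggregation in Sinhababu et al. may differ.

```python
from itertools import combinations

def llm_prefers(query, doc_a, doc_b):
    """Hypothetical stub: returns True if a few-shot prompted LLM judges
    doc_a more relevant to query than doc_b."""
    raise NotImplementedError

def prp_rerank(query, candidates):
    """Rerank a candidate list by aggregating pairwise LLM preferences.

    candidates: list of (doc_id, text) pairs from a first-stage retriever.
    Each document's score is its number of pairwise wins; sliding-window or
    comparator-sort variants trade accuracy for fewer LLM calls.
    """
    wins = {doc_id: 0 for doc_id, _ in candidates}
    for (id_a, text_a), (id_b, text_b) in combinations(candidates, 2):
        if llm_prefers(query, text_a, text_b):
            wins[id_a] += 1
        else:
            wins[id_b] += 1
    return sorted(candidates, key=lambda c: wins[c[0]], reverse=True)
```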
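Third sketch: one plausible way to realize constant-size multi-vector representations is a learned attention-pooling layer that maps a variable number of token embeddings to a fixed number of output vectors. This mechanism is an illustrative assumption, not necessarily the exact ConstBERT projection.

```python
import torch
import torch.nn as nn

class ConstantSizePooler(nn.Module):
    """Pool a variable-length sequence of token embeddings into a fixed
    number of document vectors via learned attention slots (illustrative
    sketch of constant-size multi-vector retrieval representations)."""

    def __init__(self, dim=768, num_vectors=32):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_vectors, dim) * 0.02)

    def forward(self, token_embs):                               # (seq_len, dim)
        attn = torch.softmax(self.slots @ token_embs.T, dim=-1)  # (k, seq_len)
        return attn @ token_embs                                 # (k, dim), constant per document
```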
5. Data Quality, Annotation Practices, and Robustness
Test collection reusability and annotation practices are critical for credible benchmarking. TREC-DL emphasizes strict dev–test separation and advises one-shot evaluation to minimize selection and iteration bias (Craswell et al., 2021). Robustness analyses reveal that:
- Short, noisy, and unjudged passages hinder neural retrieval models on argument retrieval tasks (the Touché 2020 subset). Filtering out non-argumentative documents shorter than 20 words and augmenting missing relevance judgments substantially improves neural models' nDCG@10, by up to 0.52, though BM25 remains more effective (Thakur et al., 10 Jul 2024); a filtering sketch follows this list.
- Pooling bias in biomedical datasets (e.g., TREC-COVID) complicates direct comparisons and may favor token-overlap models (Thakur et al., 2021).
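A minimal version of the document-quality filter described above (whitespace word counting is an assumption; the 20-word threshold follows Thakur et al., 10 Jul 2024):

```python
def filter_short_documents(corpus, min_words=20):
    """Drop documents shorter than min_words before indexing/evaluation.

    corpus: dict doc_id -> {"title": ..., "text": ...} (BEIR-style).
    Removing such short, often non-argumentative Touché passages (together
    with augmenting missing judgments) is reported to substantially improve
    neural models' nDCG@10.
    """
    return {
        doc_id: doc
        for doc_id, doc in corpus.items()
        if len(doc["text"].split()) >= min_words
    }
```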
6. Comparative Performance and Key Results
| Task/Data | Best Lexical | Representative Neural | Hybrid/Prompt | Notes |
|---|---|---|---|---|
| TREC-DL (MS MARCO) | BM25 | nnlm/BERT, SPLADE, Lion-SP-8B | PRP few-shot, LLM-prompt | LLM prompting now outperforms nnlm (Craswell et al., 10 Jul 2025) |
| BEIR (zero-shot) | BM25 | TAS-B, SPLADE, ColBERT | PRP few-shot, L2G | BM25 robust; hybrid sparse+dense rising |
| Touché 2020 | BM25 (0.367 nDCG@10) | TAS-B, SPLADE (<0.25) | PRP, L2G | BM25 excels due to length normalization (Thakur et al., 10 Jul 2024, Yoon et al., 1 Oct 2025) |
Prompt-based and listwise reranking techniques are driving new state-of-the-art results, particularly when they exploit LLM context and dynamic graph induction effectively.
7. Future Directions and Open Questions
Benchmark evolution is guided by several axes:
- Expansion to multilingual, long-document, and multi-field retrieval tasks (Thakur et al., 2021).
- Addressing annotation selection bias and enriching pooling strategies for fair comparison across retrieval paradigms.
- Adaptive example selection for in-context prompting and advancing scalable reranking/reranker-to-graph methods.
- Storage and indexing optimization in large models via constant-size vector representations (MacAvaney et al., 2 Apr 2025).
- Integration of generative relevance feedback as a core retrieval stage, with implications for data-scarce domains (Parry et al., 2 May 2024).
BEIR and TREC-DL, through methodological rigor and openness to hybrid neural, lexical, and non-parametric models, continue to set the benchmark for IR system evaluation and inspire best practices for reproducible, scalable research. As the field transitions toward ever larger LLMs and prompt-driven ranking, the nuanced interplay between data annotation, retrieval model selection, and evaluation protocols remains at the center of progress.