Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard (2306.07471v1)

Published 13 Jun 2023 in cs.IR and cs.CL

Abstract: BEIR is a benchmark dataset for zero-shot evaluation of information retrieval models across 18 different domain/task combinations. In recent years, we have witnessed the growing popularity of a representation learning approach to building retrieval models, typically using pretrained transformers in a supervised setting. This naturally begs the question: How effective are these models when presented with queries and documents that differ from the training data? Examples include searching in different domains (e.g., medical or legal text) and with different types of queries (e.g., keywords vs. well-formed questions). While BEIR was designed to answer these questions, our work addresses two shortcomings that prevent the benchmark from achieving its full potential: First, the sophistication of modern neural methods and the complexity of current software infrastructure create barriers to entry for newcomers. To this end, we provide reproducible reference implementations that cover the two main classes of approaches: learned dense and sparse models. Second, there does not exist a single authoritative nexus for reporting the effectiveness of different models on BEIR, which has led to difficulty in comparing different methods. To remedy this, we present an official self-service BEIR leaderboard that provides fair and consistent comparisons of retrieval models. By addressing both shortcomings, our work facilitates future explorations in a range of interesting research questions that BEIR enables.

Overview of "Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard"

This paper addresses two critical shortcomings in the BEIR benchmark ecosystem, which evaluates zero-shot retrieval models across diverse tasks and domains. The authors provide reproducible reference implementations for retrieval methods and introduce an official leaderboard, significantly improving accessibility and enabling fair, consistent comparisons among models.

Key Contributions

  1. Reproducible Implementations: The authors provide reproducible reference implementations for five popular retrieval models within the Pyserini toolkit, covering both dense and sparse approaches, so that researchers can conduct end-to-end retrieval runs with minimal setup effort (a minimal usage sketch follows this list).
  2. Official Leaderboard: An official, community-driven leaderboard for BEIR is introduced, replacing previous informal and static methods for result sharing. Hosted on the EvalAI platform, it allows for consistent and accurate comparisons across various models and datasets.
  3. Visualization Methodology: The paper also introduces radar charts for visually comparing models' effectiveness across datasets, highlighting gains and losses at a glance.
  4. Analysis of Model Variants: Experiments explore the effect of various approaches, such as multi-field indexing, sentence-based document segmentation, and hybrid model fusion, on model performance.

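To make the reproducibility claim concrete, the sketch below shows what an end-to-end zero-shot BM25 run over a single BEIR corpus looks like with Pyserini's prebuilt indexes. The prebuilt index identifier and the example query are illustrative assumptions; consult Pyserini's documentation for the exact resource names of each BEIR dataset.

```python
# Minimal sketch: zero-shot BM25 retrieval over one BEIR corpus via Pyserini.
# The prebuilt index name below is an assumption; check Pyserini's list of
# prebuilt BEIR indexes for the exact identifier of each dataset.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('beir-v1.0.0-scifact.flat')
hits = searcher.search('what is the role of p53 in cancer', k=10)

for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2d} {hit.docid:30s} {hit.score:.4f}')
```
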
BEIR Benchmark

BEIR includes 18 datasets covering a wide range of tasks and domains, such as ad hoc retrieval, question answering, and fact-checking. The benchmark is designed to test out-of-distribution generalization, which is crucial for assessing how well retrieval models adapt to queries and documents that differ from their training data.

Model Implementations

  • Dense Models: TAS-B and Contriever, both bi-encoders built on BERT-family transformers, are used to examine the effectiveness of dense semantic representations in retrieval tasks.
  • Sparse Models: SPLADE and uniCOIL explore learned sparse lexical representations produced by transformer encoders (a conceptual scoring sketch for both families follows this list).
  • BM25 Baseline: A strong lexical baseline using multi-field indexing for comparative analysis.

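The two families differ mainly in how query-document similarity is computed once texts are encoded. The snippet below is a conceptual sketch of that difference, not the paper's implementation: dense bi-encoders compare single vectors, while learned sparse models compare per-term weights that can be served from an inverted index.

```python
# Conceptual sketch only (not the paper's code): scoring a query-document
# pair under a dense bi-encoder versus a learned sparse model.
import numpy as np

def dense_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    # Dense models (e.g., TAS-B, Contriever) encode each text as one vector
    # and score with an inner product (or cosine similarity).
    return float(np.dot(q_vec, d_vec))

def sparse_score(q_terms: dict[str, float], d_terms: dict[str, float]) -> float:
    # Learned sparse models (e.g., SPLADE, uniCOIL) emit per-term impact
    # weights; the score is a sum over vocabulary terms shared by query and
    # document, which a classical inverted index can compute efficiently.
    return sum(w * d_terms[t] for t, w in q_terms.items() if t in d_terms)
```
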
Main Findings

The paper shows that dense and sparse models exhibit inconsistent zero-shot effectiveness across the BEIR datasets. SPLADE performs strongly on many of them, but challenges remain, particularly on domain-specific collections such as BioASQ and TREC-COVID.

The radar chart visualizations highlight these discrepancies, showing at a glance where each model gains or loses relative to the BM25 baseline.
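
As an illustration of this style of visualization, the sketch below draws a radar chart of per-dataset nDCG@10 for a model against BM25. The dataset names are real BEIR corpora, but the scores are placeholders rather than numbers reported in the paper.

```python
# Sketch: radar chart comparing per-dataset nDCG@10 against a BM25 baseline.
# Scores below are placeholders, not results from the paper.
import numpy as np
import matplotlib.pyplot as plt

datasets = ['SciFact', 'NFCorpus', 'FiQA', 'ArguAna', 'TREC-COVID', 'BioASQ']
bm25_scores  = [0.68, 0.33, 0.24, 0.40, 0.60, 0.52]   # placeholder values
model_scores = [0.70, 0.35, 0.34, 0.47, 0.69, 0.50]   # placeholder values

angles = np.linspace(0, 2 * np.pi, len(datasets), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={'polar': True})
for label, scores in [('BM25', bm25_scores), ('Model', model_scores)]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(datasets)
ax.legend(loc='upper right')
plt.show()
```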

Hybrid Models

Hybrid fusion of dense and sparse representations, particularly combining Contriever and SPLADE, shows promise in achieving robust performance across varied datasets. This approach capitalizes on the strengths of each model type, yielding consistent improvements over standalone models.
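
For intuition, the sketch below fuses a dense run and a sparse run with reciprocal rank fusion (RRF), one common fusion recipe; the paper's exact fusion method may differ (for example, weighted interpolation of normalized scores). The document identifiers and scores are hypothetical.

```python
# Sketch: reciprocal rank fusion (RRF) of per-query rankings from a dense
# and a sparse retriever. RRF is one common recipe and is not necessarily
# the exact fusion method used in the paper.
from collections import defaultdict

def rrf_fuse(runs: list[dict[str, float]], k: int = 60) -> dict[str, float]:
    """Fuse several {docid: score} rankings into a single fused ranking."""
    fused = defaultdict(float)
    for run in runs:
        ranked = sorted(run, key=run.get, reverse=True)
        for rank, docid in enumerate(ranked, start=1):
            fused[docid] += 1.0 / (k + rank)
    return dict(fused)

# Hypothetical scores for one query from each retriever.
dense_run  = {'doc1': 12.3, 'doc2': 11.8, 'doc5': 10.1}
sparse_run = {'doc2': 8.4, 'doc3': 7.9, 'doc1': 7.2}
fused = rrf_fuse([dense_run, sparse_run])
print(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))
```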

Conclusion and Future Work

The paper advances the utility of the BEIR benchmark by ensuring reproducibility and consistent result sharing across the research community. Open challenges include standardized significance testing when results are aggregated across datasets and a closer examination of systematic biases in BEIR's relevance judgments.

Overall, these efforts encourage transparent and rigorous evaluations of retrieval models, fostering advancements in the field of information retrieval.

Authors (6)
  1. Ehsan Kamalloo (17 papers)
  2. Nandan Thakur (24 papers)
  3. Carlos Lassance (35 papers)
  4. Xueguang Ma (36 papers)
  5. Jheng-Hong Yang (14 papers)
  6. Jimmy Lin (208 papers)