
BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives (2402.14151v2)

Published 21 Feb 2024 in cs.IR, cs.AI, cs.CL, and cs.LG

Abstract: We present the Benchmark of Information Retrieval (IR) tasks with Complex Objectives (BIRCO). BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives. The benchmark's complexity and compact size make it suitable for evaluating LLM-based information retrieval systems. We present a modular framework for investigating factors that may influence LLM performance on retrieval tasks, and identify a simple baseline model which matches or outperforms existing approaches and more complex alternatives. No approach achieves satisfactory performance on all benchmark tasks, suggesting that stronger models and new retrieval protocols are necessary to address complex user needs.
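The abstract describes a modular framework for evaluating retrieval systems against multi-faceted objectives, with ranking quality measured per task. As a concrete illustration only (not the paper's actual protocol), below is a minimal sketch of such an evaluation harness: it reranks each query's candidate pool with a pluggable scorer and reports mean nDCG@10. The names (`Scorer`, `evaluate`, `overlap_scorer`), the toy data, and the lexical-overlap stand-in scorer are all illustrative assumptions; an LLM-based baseline would replace `overlap_scorer` with a prompted relevance judgment.

```python
import math
from typing import Callable

# Relevance scorer: (query objective, document text) -> score.
# Hypothetical stand-in for an LLM prompt such as
# "Rate 0-10 how well the document satisfies the stated objective."
Scorer = Callable[[str, str], float]

def ndcg_at_k(gains: list[float], k: int = 10) -> float:
    """nDCG@k given graded relevance gains in the system's ranked order."""
    def dcg(gs: list[float]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def evaluate(scorer: Scorer,
             queries: dict[str, str],
             pools: dict[str, list[tuple[str, str]]],  # qid -> [(doc_id, text)]
             qrels: dict[str, dict[str, float]],       # qid -> {doc_id: gain}
             k: int = 10) -> float:
    """Rerank each query's candidate pool with `scorer`; return mean nDCG@k."""
    per_query = []
    for qid, objective in queries.items():
        ranked = sorted(pools[qid],
                        key=lambda pair: scorer(objective, pair[1]),
                        reverse=True)
        gains = [qrels[qid].get(doc_id, 0.0) for doc_id, _ in ranked]
        per_query.append(ndcg_at_k(gains, k))
    return sum(per_query) / len(per_query)

if __name__ == "__main__":
    # Toy lexical-overlap scorer standing in for an LLM relevance call.
    def overlap_scorer(objective: str, doc: str) -> float:
        q, d = set(objective.lower().split()), set(doc.lower().split())
        return len(q & d) / max(len(q), 1)

    queries = {"q1": "find a trial for elderly patients with type 2 diabetes"}
    pools = {"q1": [("d1", "A trial recruiting elderly type 2 diabetes patients."),
                    ("d2", "A pediatric asthma treatment study.")]}
    qrels = {"q1": {"d1": 2.0, "d2": 0.0}}
    print(f"mean nDCG@10: {evaluate(overlap_scorer, queries, pools, qrels):.3f}")
```

Keeping the scorer pluggable mirrors the modular setup the abstract describes: the same harness can compare a lexical baseline, a dense retriever's similarity scores, or an LLM reranker under identical pooling and metrics.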
