Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration (2405.16546v2)

Published 26 May 2024 in cs.IR and cs.CL

Abstract: The proliferation of LLMs has led to an influx of AI-generated content (AIGC) on the internet, transforming the corpus of Information Retrieval (IR) systems from solely human-written to a coexistence with LLM-generated content. The impact of this surge in AIGC on IR systems remains an open question, with the primary challenge being the lack of a dedicated benchmark for researchers. In this paper, we introduce Cocktail, a comprehensive benchmark tailored for evaluating IR models in this mixed-sourced data landscape of the LLM era. Cocktail consists of 16 diverse datasets with mixed human-written and LLM-generated corpora across various text retrieval tasks and domains. Additionally, to avoid the potential bias from previously included dataset information in LLMs, we also introduce an up-to-date dataset, named NQ-UTD, with queries derived from recent events. Through conducting over 1,000 experiments to assess state-of-the-art retrieval models against the benchmarked datasets in Cocktail, we uncover a clear trade-off between ranking performance and source bias in neural retrieval models, highlighting the necessity for a balanced approach in designing future IR systems. We hope Cocktail can serve as a foundational resource for IR research in the LLM era, with all data and code publicly available at https://github.com/KID-22/Cocktail.
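
The abstract's central finding is a trade-off between ranking effectiveness and source bias, i.e., whether a retriever systematically ranks LLM-generated documents above equally relevant human-written ones. As a rough illustration of how such a comparison can be set up (a minimal sketch, not the official Cocktail evaluation code; the identifiers, toy data, and the relative-difference bias formula below are illustrative assumptions), one can score each query's ranking twice, once against the human-written relevant documents and once against their LLM-generated counterparts, and compare the resulting nDCG@10 values:

```python
# Minimal sketch (not the official Cocktail code): given a ranked list for a query
# and relevance labels split by document source (human-written vs. LLM-generated),
# compute nDCG@10 per source and a simple relative-difference "source bias" score.
# Document IDs, toy judgments, and the bias formula are illustrative assumptions.
import math
from typing import Dict, List

def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query given graded relevance judgments."""
    dcg = sum(
        (2 ** qrels.get(doc_id, 0) - 1) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
    )
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / math.log2(r + 2) for r, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

def source_bias(ndcg_human: float, ndcg_llm: float) -> float:
    """Relative difference in percent; positive values favor human-written documents."""
    mean = (ndcg_human + ndcg_llm) / 2
    return (ndcg_human - ndcg_llm) / mean * 100 if mean > 0 else 0.0

# Toy example: each relevant passage exists in a human-written and an LLM-rewritten
# variant (the "_h" / "_l" suffixes are made-up identifiers).
ranking = ["d1_l", "d1_h", "d2_l", "d3_h", "d2_h"]
qrels_human = {"d1_h": 2, "d2_h": 1}   # judgments restricted to human-written docs
qrels_llm   = {"d1_l": 2, "d2_l": 1}   # same judgments mapped to LLM-generated docs

n_h = ndcg_at_k(ranking, qrels_human)
n_l = ndcg_at_k(ranking, qrels_llm)
print(f"nDCG@10 human={n_h:.3f}  llm={n_l:.3f}  bias={source_bias(n_h, n_l):+.1f}%")
```

In this toy run the LLM-generated variants are ranked higher, so the bias score comes out negative, mirroring the kind of preference for LLM-generated content that the benchmark is designed to surface.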

Authors (9)
  1. Sunhao Dai (22 papers)
  2. Weihao Liu (19 papers)
  3. Yuqi Zhou (31 papers)
  4. Liang Pang (94 papers)
  5. Rongju Ruan (5 papers)
  6. Gang Wang (407 papers)
  7. Zhenhua Dong (76 papers)
  8. Jun Xu (398 papers)
  9. Ji-Rong Wen (299 papers)
Citations (4)
