
Resources and Evaluations for Multi-Distribution Dense Information Retrieval (2306.12601v1)

Published 21 Jun 2023 in cs.IR and cs.AI

Abstract: We introduce and define the novel problem of multi-distribution information retrieval (IR), in which, given a query, systems must retrieve passages from multiple collections, each drawn from a different distribution. Some of these collections and distributions might not be available at training time. To evaluate methods for multi-distribution retrieval, we design three benchmarks for this task from existing single-distribution datasets: one based on question answering and two based on entity matching. We propose simple methods for this task that allocate the fixed retrieval budget (top-k passages) strategically across domains, preventing the known domains from consuming most of the budget. We show that our methods improve Recall@100 by an average of 3.8+ points, and by up to 8.0 points, across the datasets, and that the improvements are consistent when fine-tuning different base retrieval models. Our benchmarks are made publicly available.
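The core idea in the abstract, allocating a fixed top-k budget across collections rather than taking a single global top-k, can be sketched as follows. This is a minimal illustration, not the paper's actual method: the equal-split quota, the `runs` data layout (per-collection lists of `(score, doc_id)` pairs), and the function names are all assumptions made here for clarity.

```python
import heapq

def global_topk(runs, k):
    """Baseline: merge every (score, doc_id) pair and keep the global top-k.
    A known collection with systematically higher scores can consume
    the entire budget, starving the other distributions."""
    merged = [item for run in runs.values() for item in run]
    return heapq.nlargest(k, merged)

def allocated_topk(runs, k, min_per_domain=None):
    """Sketch of strategic allocation: reserve a minimum quota of the
    budget for each collection, then fill the remaining slots with the
    best leftover candidates across all collections."""
    domains = list(runs)
    if min_per_domain is None:
        # Equal split is an illustrative default; the paper's actual
        # allocation strategies are not specified in the abstract.
        min_per_domain = k // len(domains)
    picked, leftovers = [], []
    for d in domains:
        ranked = sorted(runs[d], reverse=True)
        picked.extend(ranked[:min_per_domain])    # guaranteed quota
        leftovers.extend(ranked[min_per_domain:])  # compete for the rest
    picked.extend(heapq.nlargest(k - len(picked), leftovers))
    return picked
```

With a dominant "known" collection, `global_topk` returns only its passages, while `allocated_topk` guarantees each collection at least its quota, which is what lets recall improve on the under-represented distributions.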

