Resources and Evaluations for Multi-Distribution Dense Information Retrieval (2306.12601v1)
Abstract: We introduce and define the novel problem of multi-distribution information retrieval (IR), where, given a query, systems need to retrieve passages from within multiple collections, each drawn from a different distribution. Some of these collections and distributions might not be available at training time. To evaluate methods for multi-distribution retrieval, we design three benchmarks for this task from existing single-distribution datasets: one based on question answering and two based on entity matching. We propose simple methods for this task that allocate the fixed retrieval budget (top-k passages) strategically across domains to prevent the known domains from consuming most of the budget. We show that our methods improve Recall@100 by an average of 3.8+ points and up to 8.0 points across the datasets, and that the improvements are consistent when fine-tuning different base retrieval models. Our benchmarks are made publicly available.
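A minimal sketch of the budget-allocation idea described in the abstract follows. The paper only characterizes its methods as simple allocation strategies, so the equal per-collection quota, the score-based fill of the leftover budget, and the dot-product scoring below are illustrative assumptions standing in for whatever dense retriever and allocation rule are actually used.

```python
# Illustrative sketch (not the authors' exact method): spread a fixed top-k
# retrieval budget across multiple collections instead of ranking all passages
# in one global pool, so a single well-known collection cannot consume it all.
import numpy as np


def retrieve_multi_distribution(query_emb, collections, k=100):
    """query_emb: (d,) array; collections: dict name -> (n_i, d) passage embedding matrix.
    Returns up to k tuples of (collection_name, passage_index, score)."""
    # Score every passage in every collection with a dot product (DPR-style scoring).
    scored = {name: passages @ query_emb for name, passages in collections.items()}

    # Reserve an equal share of the budget for each collection (assumed strategy).
    quota = k // len(collections)
    results, leftovers = [], []
    for name, scores in scored.items():
        order = np.argsort(-scores)
        for rank, idx in enumerate(order):
            item = (name, int(idx), float(scores[idx]))
            (results if rank < quota else leftovers).append(item)

    # Fill any remaining budget by global score across collections.
    leftovers.sort(key=lambda item: -item[2])
    results.extend(leftovers[: k - len(results)])
    return sorted(results, key=lambda item: -item[2])


# Usage with random vectors standing in for encoded passages and queries.
rng = np.random.default_rng(0)
collections = {
    "wikipedia": rng.normal(size=(500, 768)),
    "private_docs": rng.normal(size=(50, 768)),
}
query = rng.normal(size=768)
top100 = retrieve_multi_distribution(query, collections, k=100)
```

The key design point the sketch illustrates is that a pure global top-k over both collections would let the larger, better-matched collection dominate the budget, whereas a per-collection quota guarantees the smaller or unseen-at-training-time collection some representation in the retrieved set.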