Link, Synthesize, Retrieve: Universal Document Linking for Zero-Shot Information Retrieval
Abstract: Despite recent advances in information retrieval (IR), zero-shot IR remains a significant challenge, especially for new domains, languages, and newly released use cases that lack historical query traffic from existing users. In such cases, a common remedy is query augmentation followed by fine-tuning pre-trained models on document data paired with synthetic queries. In this work, we propose a novel Universal Document Linking (UDL) algorithm, which links similar documents to enhance synthetic query generation across multiple datasets with differing characteristics. UDL leverages entropy to choose among similarity models and uses named entity recognition (NER) to decide, based on similarity scores, which documents to link. Our empirical studies demonstrate the effectiveness and universality of UDL across diverse datasets and IR models, surpassing state-of-the-art methods in zero-shot settings. Code for reproducibility is available at https://github.com/eoduself/UDL
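The abstract describes three ingredients: a similarity model chosen via entropy, similarity scores between documents, and an NER-based decision on whether to link a pair. The following is a minimal sketch of that pipeline, not the authors' implementation (which is in the linked repository). All helpers here are simplified stand-ins: Jaccard token overlap replaces a real similarity model, and capitalized-token extraction is a placeholder for a real NER system.

```python
import math
from itertools import combinations

def token_entropy(text):
    # Shannon entropy of the document's token distribution; in UDL,
    # entropy informs which similarity model to apply (assumption here:
    # low entropy favors a lexical model, high entropy a semantic one).
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def jaccard(a, b):
    # Lexical similarity stand-in: token-set overlap.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def entities(text):
    # Placeholder NER: capitalized tokens stand in for named entities.
    return {t for t in text.split() if t[:1].isupper()}

def link_documents(docs, sim_threshold=0.3):
    """Link document pairs whose similarity clears a threshold and
    whose entity sets overlap (the NER-based link decision)."""
    # Only the lexical branch is sketched; a semantic model would be
    # substituted when corpus entropy suggests it.
    links = []
    for i, j in combinations(range(len(docs)), 2):
        score = jaccard(docs[i], docs[j])
        if score >= sim_threshold and entities(docs[i]) & entities(docs[j]):
            links.append((i, j, round(score, 2)))
    return links

docs = [
    "Tokyo released new climate policy guidelines for Tokyo residents",
    "Tokyo climate policy guidelines cover residents and businesses",
    "A recipe for sourdough bread with a long fermentation",
]
print(link_documents(docs))  # the two Tokyo documents get linked
```

Linked pairs would then share synthetic queries during generation, which is how linking enriches the fine-tuning data for zero-shot IR.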