
Link, Synthesize, Retrieve: Universal Document Linking for Zero-Shot Information Retrieval

Published 24 Oct 2024 in cs.AI, cs.IR, and cs.LG | arXiv:2410.18385v2

Abstract: Despite the recent advancements in information retrieval (IR), zero-shot IR remains a significant challenge, especially when dealing with new domains, languages, and newly released use cases that lack historical query traffic from existing users. For such cases, it is common to use query augmentations followed by fine-tuning pre-trained models on the document data paired with synthetic queries. In this work, we propose a novel Universal Document Linking (UDL) algorithm, which links similar documents to enhance synthetic query generation across multiple datasets with different characteristics. UDL leverages entropy for the choice of similarity models and named entity recognition (NER) for the link decision of documents using similarity scores. Our empirical studies demonstrate the effectiveness and universality of UDL across diverse datasets and IR models, surpassing state-of-the-art methods in zero-shot cases. The code for reproducibility is available at https://github.com/eoduself/UDL

Summary

  • The paper introduces UDL, which uses term entropy to select a similarity model and named entity recognition (NER) to decide which documents to link in zero-shot IR scenarios.
  • It dynamically chooses between TF-IDF and pre-trained language models per dataset, improving synthetic query generation across diverse domains.
  • Empirical results show that UDL outperforms state-of-the-art methods while remaining computationally and resource efficient, including in multilingual settings.

Overview of Universal Document Linking for Zero-Shot Information Retrieval

The paper "Link, Synthesize, Retrieve: Universal Document Linking for Zero-Shot Information Retrieval" introduces a novel methodology to tackle the challenges inherent in zero-shot information retrieval (IR), especially when handling novel domains and languages with minimal associated query data. It presents the Universal Document Linking (UDL) algorithm, which aims to enhance synthetic query generation across diverse datasets by linking similar documents. This approach leverages entropy-based similarity models and named entity recognition (NER) for effective document linking.

Problem Statement

Zero-shot IR presents a notable challenge due to the lack of historical query data, particularly when transitioning across languages and domains. Traditional approaches have relied heavily on fine-tuning pre-trained dense retrieval (DR) models using synthetic queries. Nonetheless, these methods often suffer from significant performance degradation in zero-shot scenarios due to insufficient adaptation to new contexts.
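To make this pipeline concrete, here is a minimal sketch of the synthetic-query step referenced above, assuming a publicly available docT5query-style checkpoint. The model name and sampling settings below are assumptions for illustration, not the generator used in the paper:

```python
# Sketch of synthetic query generation for zero-shot IR fine-tuning.
# Assumes an off-the-shelf BeIR QGen checkpoint; not the paper's exact setup.
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL = "BeIR/query-gen-msmarco-t5-base-v1"  # assumed docT5query-style generator
tokenizer = T5Tokenizer.from_pretrained(MODEL)
model = T5ForConditionalGeneration.from_pretrained(MODEL)

def synthesize_queries(document: str, n: int = 3) -> list[str]:
    """Sample n synthetic queries for one document; the resulting
    (document, query) pairs can fine-tune a dense retriever."""
    inputs = tokenizer(document, truncation=True, max_length=384, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,        # sampling yields diverse queries
        top_p=0.95,
        max_length=64,
        num_return_sequences=n,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```

UDL's contribution sits in front of this step: by linking similar documents before generation, the synthetic queries can reflect content shared across related documents rather than a single document in isolation.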

Proposed Method

UDL proposes a dual strategy for improving zero-shot IR:

  1. Selection of Similarity Model: UDL selects an appropriate similarity model for each dataset based on term entropy, choosing between TF-IDF and a pre-trained language model (LM). The entropy calculation indicates whether lexical (TF-IDF) or semantic (pre-trained LM) similarity is more applicable.
  2. Document Linking via NER: UDL uses NER models to inform link decisions based on the similarity scores, balancing general and specialized NER models so that the extracted keywords cover both general and domain-specific contexts. A combined sketch of both steps follows this list.
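The sketch below illustrates both steps under stated assumptions: the entropy formula, its decision direction, the thresholds, and the model names (`all-mpnet-base-v2`, `en_core_web_sm`) are placeholders chosen for illustration, not the paper's settings.

```python
# Illustrative sketch of UDL's two steps; thresholds and models are placeholders.
import numpy as np
import spacy
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def term_entropy(docs: list[str]) -> float:
    """Shannon entropy of the corpus term distribution, estimated from
    TF-IDF mass (one plausible reading of the paper's entropy criterion)."""
    tfidf = TfidfVectorizer().fit_transform(docs)
    p = np.asarray(tfidf.sum(axis=0)).ravel()
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def similarity_matrix(docs: list[str], entropy_threshold: float = 7.0):
    """Step 1: pick lexical vs. semantic similarity from corpus entropy.
    The threshold and the direction of the decision are illustrative."""
    if term_entropy(docs) < entropy_threshold:
        vecs = TfidfVectorizer().fit_transform(docs)                  # lexical route
    else:
        vecs = SentenceTransformer("all-mpnet-base-v2").encode(docs)  # semantic route
    return cosine_similarity(vecs)

def should_link(doc_a: str, doc_b: str, sim: float, nlp,
                sim_threshold: float = 0.8) -> bool:
    """Step 2: link two documents only if similarity clears the threshold AND
    they share a named entity (a stand-in for UDL's NER-based decision)."""
    if sim < sim_threshold:
        return False
    ents_a = {ent.text.lower() for ent in nlp(doc_a).ents}
    ents_b = {ent.text.lower() for ent in nlp(doc_b).ents}
    return bool(ents_a & ents_b)

# Usage: nlp = spacy.load("en_core_web_sm"); S = similarity_matrix(docs)
# then link documents i, j when should_link(docs[i], docs[j], S[i, j], nlp) is True.
```

Note that the paper balances a general NER model against a specialized one, so a faithful implementation would merge entity sets from both models before the overlap test.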

Strong Numerical Results

The empirical evaluation demonstrates UDL's capability to outperform state-of-the-art methods in zero-shot IR scenarios across multiple datasets. In the paper's query-generation comparison, UDL+QGen yields superior retrieval performance relative to other query augmentation strategies. The gains are particularly notable with parameter-efficient models, indicating both computational and resource efficiency.

Implications and Speculations

The introduction of UDL has potential implications for both practical IR applications and theoretical advancements in AI:

  • Practical Impact: UDL provides a framework for effectively deploying retrieval systems in new languages and domains without extensive pre-adaptation. It allows smaller LMs to achieve competitive results, potentially democratizing access to high-performance IR systems.
  • Theoretical Implications: This work highlights the importance of document similarity beyond single-document-query pairings in zero-shot contexts. It opens avenues for further exploration into more flexible document linking and query generation methodologies, potentially enhancing the adaptability of retrieval models.
  • Future Developments: Refinements of UDL could include dynamic selection of hyperparameters such as the entropy threshold, or integration of more advanced prompting techniques for LMs. Expanding UDL's applicability to a broader range of languages and scaling to larger datasets are also promising directions.

Conclusion

The Universal Document Linking algorithm represents a significant advancement in addressing the limitations of zero-shot IR. By leveraging document linking strategies and sophisticated query generation, UDL enhances the adaptability and performance of IR models across varied contexts. Future work will likely build on these foundations to further refine and extend the applicability of UDL in diverse and challenging environments.
