I^3 Retriever: Incorporating Implicit Interaction in Pre-trained Language Models for Passage Retrieval (2306.02371v3)
Abstract: Passage retrieval is a fundamental task in many information systems, such as web search and question answering, where both efficiency and effectiveness are critical. In recent years, neural retrievers based on pre-trained language models (PLMs), such as dual-encoders, have achieved great success. However, studies have found that the performance of dual-encoders is often limited because they neglect the interaction between queries and candidate passages. Various interaction paradigms have therefore been proposed to improve on vanilla dual-encoders; in particular, recent state-of-the-art methods often introduce late interaction during model inference. Such late-interaction methods, however, incur substantial computation and storage costs on large corpora, and despite their effectiveness, these efficiency and space-footprint concerns still limit the application of interaction-based neural retrieval models. To tackle this issue, we incorporate implicit interaction into dual-encoders and propose the I^3 retriever. Our implicit interaction paradigm leverages generated pseudo-queries to simulate query-passage interaction, and is jointly optimized with the query and passage encoders in an end-to-end manner. It can be fully pre-computed and cached, and its inference only involves a simple dot product of the query and passage vectors, making it as efficient as a vanilla dual-encoder. Comprehensive experiments on the MS MARCO and TREC 2019 Deep Learning datasets demonstrate the I^3 retriever's superiority in both effectiveness and efficiency. Moreover, the proposed implicit interaction is compatible with retrieval-oriented pre-training and knowledge distillation for passage retrieval, yielding new state-of-the-art performance.
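The efficiency claim above rests on a simple property: all interaction happens offline, so online scoring is just a dot product against cached vectors. The sketch below illustrates that pipeline shape only; it is not the paper's actual model. The `encode` function and the mean-pool fusion of pseudo-query embeddings are hypothetical stand-ins for the learned PLM encoders and the learned implicit-interaction module.

```python
import numpy as np

DIM = 8  # toy embedding dimension

def encode(text_seed: int) -> np.ndarray:
    # Stand-in for a PLM encoder: returns an L2-normalized embedding.
    # A real system would encode text with a trained query/passage encoder.
    v = np.random.default_rng(text_seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

def fuse_with_pseudo_queries(passage_seed: int, pseudo_query_seeds: list) -> np.ndarray:
    # Offline step: fold the embeddings of generated pseudo-queries into the
    # passage embedding (mean pooling here is a placeholder for the learned,
    # jointly optimized interaction). The result is cached in the index.
    vecs = [encode(passage_seed)] + [encode(s) for s in pseudo_query_seeds]
    fused = np.mean(vecs, axis=0)
    return fused / np.linalg.norm(fused)

# Build a tiny corpus index entirely offline.
index = np.stack([
    fuse_with_pseudo_queries(p, [p + 100, p + 200])
    for p in range(1, 4)
])

# Online step: one matrix-vector dot product against the cached vectors --
# the same inference cost as a vanilla dual-encoder.
query_vec = encode(42)
scores = index @ query_vec
best = int(np.argmax(scores))
```

Because the pseudo-query interaction is baked into the cached passage vectors, the online path never touches the interaction module, which is what keeps latency identical to a plain dual-encoder.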
- Qian Dong
- Yiding Liu
- Qingyao Ai
- Haitao Li
- Shuaiqiang Wang
- Yiqun Liu
- Dawei Yin
- Shaoping Ma