
Dense Text Retrieval based on Pretrained Language Models: A Survey (2211.14876v1)

Published 27 Nov 2022 in cs.IR

Abstract: Text retrieval is a long-standing research topic in information seeking, where a system is required to return relevant information resources in response to users' queries in natural language. From classic retrieval methods to learning-based ranking functions, the underlying retrieval models have continually evolved with ongoing technical innovation. To design effective retrieval models, a key point lies in how to learn text representations and model relevance matching. The recent success of pretrained language models (PLMs) sheds light on developing more capable text retrieval approaches that leverage the excellent modeling capacity of PLMs. With powerful PLMs, we can effectively learn the representations of queries and texts in a latent representation space, and further construct a semantic matching function between the dense vectors for relevance modeling. Such a retrieval approach is referred to as dense retrieval, since it employs dense vectors (a.k.a., embeddings) to represent the texts. Considering the rapid progress on dense retrieval, in this survey, we systematically review the recent advances in PLM-based dense retrieval. Different from previous surveys on dense retrieval, we take a new perspective to organize the related work by four major aspects, including architecture, training, indexing, and integration, and summarize the mainstream techniques for each aspect. We thoroughly survey the literature and include 300+ related reference papers on dense retrieval. To support our survey, we create a website providing useful resources, and release a code repository and toolkit for implementing dense retrieval models. This survey aims to provide a comprehensive, practical reference focused on the major progress in dense text retrieval.

An Overview of Dense Text Retrieval based on Pretrained Language Models

The rapid evolution of dense text retrieval models, primarily driven by recent advances in pretrained language models (PLMs), marks a significant development in the field of information retrieval. The paper "Dense Text Retrieval based on Pretrained Language Models: A Survey" by Wayne Xin Zhao et al. comprehensively examines the state-of-the-art techniques for dense retrieval, focusing on the integration of PLMs and discussing their profound impact on retrieval system architectures, training methodologies, indexing mechanisms, and integration with reranking pipelines.

Key Contributions and Framework

This survey takes a structured approach by dissecting the dense retrieval problem into four fundamental aspects: architecture, training, indexing, and integration. This allows for a detailed exploration of the significant progress made across each dimension, reflecting the intricate developments in PLM-based retrieval systems.

  1. Architectural Advances: Extensive emphasis is placed on the architectural evolution of dense retrieval models. The discussion elaborates on two predominant approaches, bi-encoder and cross-encoder architectures, detailing how dense retrieval has leveraged these PLM-based frameworks to capture semantic interactions and improve efficiency. The distinction between single-representation and multi-representation models, as well as phrase-level representation, illustrates the nuanced approaches taken to enhance retrieval performance.
  2. Training Strategies: The paper illustrates the challenges and solutions in training dense retrieval models. It explores various strategies such as negative sampling, data augmentation through knowledge distillation, and task-adaptive pretraining. Methods to overcome the challenges of large-scale candidate space and limited relevance judgments are reviewed, highlighting sophisticated techniques like dynamic hard negative sampling and representation-enhanced pretraining.
  3. Indexing Mechanisms: Traditional sparse retrieval methods rely heavily on inverted indexes, but dense retrieval necessitates more advanced indexing strategies to manage dense vector spaces. The paper discusses Approximate Nearest Neighbor Search (ANNS) for efficient retrieval and introduces product quantization techniques to reduce computational overheads, underscoring the shift from term-based indexing to vector-based retrieval.
  4. Integration with Reranking: In an operational retrieval system, the coordination between retrieval and reranking stages is crucial. The survey evaluates various integration strategies, including pipeline, adaptive, and joint training methodologies, emphasizing the intricate balance between retrieval effectiveness and efficiency in real-world applications.
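
The bi-encoder design highlighted in the architectural discussion can be sketched in a few lines. The `toy_encode` function below is a hypothetical stand-in for a real PLM encoder (which would produce, e.g., 768-dimensional BERT vectors); the key structural point is that queries and documents are encoded independently, so document vectors can be precomputed and indexed offline:

```python
import numpy as np

DIM = 64  # toy embedding size; a real PLM encoder would give e.g. 768-d vectors

def toy_encode(text: str) -> np.ndarray:
    """Stand-in for a PLM encoder: hash each token to a pseudo-random
    vector and mean-pool. Illustrative only, not a real language model."""
    vecs = []
    for tok in text.lower().split():
        rng = np.random.default_rng(abs(hash(tok)) % (2**32))
        vecs.append(rng.standard_normal(DIM))
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)  # unit-normalize so dot product = cosine

def bi_encoder_score(query: str, doc: str) -> float:
    # Bi-encoder: query and document are encoded *independently*.
    # A cross-encoder would instead feed the concatenated pair through
    # the PLM jointly, which is more accurate but cannot be pre-indexed.
    return float(toy_encode(query) @ toy_encode(doc))

score = bi_encoder_score("dense retrieval survey", "a survey of dense retrieval")
```

Because document encoding does not depend on the query, the corpus can be embedded once and served from a vector index, which is what makes the bi-encoder the default first-stage retriever.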

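The product quantization technique mentioned under indexing can be illustrated with a minimal numpy sketch. For simplicity the "codebook training" below just samples centroids from the data, whereas real systems run k-means (e.g., via a library such as FAISS); the sizes are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M, K = 1000, 32, 4, 16   # N vectors, dim D, M subspaces, K centroids each
SUB = D // M                   # dimensionality of each subspace (here 8)

data = rng.standard_normal((N, D)).astype(np.float32)

# "Train" codebooks: sample K centroids per subspace from the data itself
# (a real system would run k-means in each subspace).
codebooks = np.stack([data[rng.choice(N, K, replace=False), m*SUB:(m+1)*SUB]
                      for m in range(M)])          # shape (M, K, SUB)

def pq_encode(x):
    """Compress a D-dim vector into M one-byte codes (nearest centroid per subspace)."""
    codes = np.empty(M, dtype=np.uint8)
    for m in range(M):
        diffs = codebooks[m] - x[m*SUB:(m+1)*SUB]  # (K, SUB)
        codes[m] = np.argmin((diffs**2).sum(axis=1))
    return codes

codes = np.stack([pq_encode(x) for x in data])     # (N, M): 4 bytes per vector

def pq_search(query, topk=5):
    """Asymmetric distance computation: precompute per-query lookup tables,
    then score each database vector with M table lookups instead of D multiplies."""
    tables = np.stack([((codebooks[m] - query[m*SUB:(m+1)*SUB])**2).sum(axis=1)
                       for m in range(M)])         # (M, K)
    dists = tables[np.arange(M), codes].sum(axis=1)  # (N,) approximate distances
    return np.argsort(dists)[:topk]

hits = pq_search(data[0])
```

The memory saving is the point: each 32-float vector (128 bytes) is stored as 4 code bytes, at the cost of approximate rather than exact distances.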
Empirical Evaluation and Implications

The empirical results presented in the paper, particularly using the RocketQA framework, underscore the quantitative impact of different retrieval optimizations. Techniques such as cross-batch negatives, denoised hard negatives, and advanced training schemes demonstrate significant improvements in performance metrics, proving their merit in practical deployments.
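
The cross-batch negative scheme used in RocketQA can be sketched as a contrastive loss over a query-passage similarity matrix. The arrays below are random stand-ins for PLM embeddings, and the "other device" batch is simulated in-process; in a real multi-GPU setup the extra passages would be gathered across devices:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 16   # batch size and embedding dim (toy values)

q = rng.standard_normal((B, D))            # query embeddings (from the PLM)
p = q + 0.1 * rng.standard_normal((B, D))  # positive passages, near their queries

# Cross-batch negatives: passages gathered from another device's batch
# enlarge each query's candidate pool from B to 2B.
p_other = rng.standard_normal((B, D))
candidates = np.vstack([p, p_other])       # (2B, D)

sims = q @ candidates.T                    # (B, 2B) similarity logits
# Query i's positive is passage i; all other candidates act as negatives.
logits = sims - sims.max(axis=1, keepdims=True)   # numerical stability
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(B), np.arange(B)].mean()
```

Denoised hard negatives extend this idea by filtering the negative pool with a cross-encoder before training, so that false negatives (unlabeled positives) do not dominate the loss.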

Practical and Theoretical Implications

Beyond its strong empirical results, the paper prompts reflection on the theoretical understanding of dense retrieval systems. It raises pertinent questions about the capacity and limitations of dense retrieval, inviting further exploration of theoretical frameworks and axiomatic analyses of PLM-based models.

Future Directions and Research Opportunities

The survey identifies several promising avenues for future research:

  • Enhanced Zero-shot Retrieval: The pursuit of improving retrieval across diverse and unseen domains remains an area ripe for innovation. Strategies such as cross-lingual retrieval and domain adaptation present ongoing challenges.
  • Combining Sparse and Dense Methods: Exploring hybrid models that effectively integrate the strengths of both retrieval paradigms can yield more robust retrieval solutions.
  • Theoretical Foundations: Further exploration into the theoretical underpinnings of dense retrieval can bridge the existing knowledge gaps regarding model behavior and performance.
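
One simple, widely used way to combine a sparse run (e.g., BM25) with a dense run, shown here as an illustration rather than a method prescribed by the survey, is reciprocal rank fusion, which sidesteps the problem that raw sparse and dense scores live on incompatible scales:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranking contributes 1/(k + rank) per doc;
    rankings is a list of ranked doc-id lists, result is doc ids by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse_run = ["d3", "d1", "d7", "d2"]   # hypothetical BM25 top-4
dense_run  = ["d1", "d3", "d5", "d2"]   # hypothetical dense-retriever top-4
fused = rrf_fuse([sparse_run, dense_run])
```

Documents ranked highly by both systems ("d1", "d3") rise to the top, which is the robustness property hybrid retrieval aims for.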

Conclusion

Overall, this survey captures the essence of advances in dense retrieval systems, providing a valuable resource for researchers in the field. By addressing the diverse challenges and solutions across the architectural, training, indexing, and integration facets, the paper lays a comprehensive foundation for continued research and development in dense text retrieval using PLMs. The field is poised to deliver increasingly impactful applications for understanding vast collections of textual data.

Authors (4)
  1. Wayne Xin Zhao (196 papers)
  2. Jing Liu (525 papers)
  3. Ruiyang Ren (18 papers)
  4. Ji-Rong Wen (299 papers)
Citations (132)