
Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval (2108.05540v1)

Published 12 Aug 2021 in cs.IR and cs.CL

Abstract: Recent research demonstrates the effectiveness of using fine-tuned language models (LM) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential. In this paper, we identify and address two underlying problems of dense retrievers: i) fragility to training data noise and ii) requiring large batches to robustly learn the embedding space. We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. Retrieval experiments on MS-MARCO, Natural Questions, and Trivia QA datasets show that coCondenser removes the need for heavy data engineering such as augmentation, synthesis, or filtering, as well as the need for large batch training. It shows comparable performance to RocketQA, a state-of-the-art, heavily engineered system, using simple small batch fine-tuning.

Unsupervised Corpus-Aware Pre-Training for Dense Passage Retrieval

This work by Gao and Callan addresses two persistent challenges in training dense passage retrievers: fragility to training-data noise and dependence on large batch sizes. The paper introduces coCondenser, a method that improves the robustness and efficiency of dense retrieval models through unsupervised, corpus-aware pre-training.

The authors build on the Condenser pre-training architecture to address the models' fragility to training-data noise. Condenser restructures LM pre-training so that a short head must perform masked-token prediction from the final-layer CLS representation together with early-layer token representations, forcing the model to condense the input passage into the CLS dense vector. A key limitation of Condenser, however, is that this pre-training alone does not give the embedding space a semantically meaningful structure: CLS vectors of related passages are not necessarily close to one another.
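
The exact split point and head depth are design choices of the Condenser architecture; the sketch below only illustrates the general shape of such a head under assumed settings (a BERT-base backbone, early token states taken from layer 6, a two-layer head) and omits the masked-token cross-entropy loss. It is an illustrative sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class CondenserSketch(nn.Module):
    """Sketch of a Condenser-style pre-training head (assumed settings:
    BERT-base backbone, early token states from layer 6, 2-layer head)."""

    def __init__(self, model_name="bert-base-uncased",
                 early_layer=6, n_head_layers=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        d = self.backbone.config.hidden_size
        self.early_layer = early_layer
        # Small Transformer head sitting on top of the backbone.
        self.head = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=12,
                                       dim_feedforward=4 * d,
                                       batch_first=True),
            num_layers=n_head_layers)
        self.mlm = nn.Linear(d, self.backbone.config.vocab_size)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        hidden = out.hidden_states                   # (embeddings, layer 1, ..., layer N)
        late_cls = hidden[-1][:, :1]                 # final-layer [CLS] vector
        early_tok = hidden[self.early_layer][:, 1:]  # early-layer token states
        # The head sees only early token states plus the late CLS, so an MLM
        # loss computed on its output pushes the CLS vector to "condense"
        # the whole passage.
        head_in = torch.cat([late_cls, early_tok], dim=1)
        pad_mask = attention_mask == 0               # True at padded positions
        head_out = self.head(head_in, src_key_padding_mask=pad_mask)
        return self.mlm(head_out), late_cls.squeeze(1)
```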

To overcome this limitation, the authors add a corpus-level contrastive learning objective. coCondenser samples pairs of text spans from each document in the target corpus and trains the model so that spans from the same document yield similar CLS embeddings while spans from different documents are pushed apart. This unsupervised step warms up the passage embedding space and thereby removes the need for the heavy engineering that dense retrievers such as RocketQA rely on, which includes large-batch training, denoising, and data augmentation.
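
A minimal sketch of such a corpus-level contrastive term is given below. It treats the two spans drawn from the same document as a positive pair and every other span in the batch as a negative; the symmetric cross-entropy form and the temperature parameter are illustrative simplifications, and the full coCondenser objective also retains the Condenser masked-language-modeling loss.

```python
import torch
import torch.nn.functional as F

def corpus_contrastive_loss(span_a_cls, span_b_cls, temperature=1.0):
    """Contrastive warm-up over CLS embeddings of document spans.

    span_a_cls, span_b_cls: (B, d) tensors; row i of each holds the CLS
    embedding of a span sampled from document i, so (a_i, b_i) is a positive
    pair and all other rows act as in-batch negatives.
    """
    # Pairwise dot-product similarities between the two span views: (B, B).
    logits = span_a_cls @ span_b_cls.t() / temperature
    targets = torch.arange(span_a_cls.size(0), device=span_a_cls.device)
    # Symmetric InfoNCE-style loss: match a->b and b->a on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```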

In experiments on the MS-MARCO, Natural Questions, and Trivia QA datasets, coCondenser matches and sometimes exceeds RocketQA without extensive fine-tuning resources or sophisticated engineering. Fine-tuning coCondenser requires only small-batch training, which substantially reduces computational overhead.
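
For context, the fine-tuning objective itself is the standard contrastive (DPR-style) loss over a small batch: each query is scored against its positive passage, a handful of hard negatives, and the other queries' positives acting as in-batch negatives. The sketch below is that generic loss, not code from the authors' release; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieval_finetune_loss(q_emb, pos_emb, neg_emb):
    """DPR-style negative log-likelihood of the positive passage.

    q_emb:   (B, d) query embeddings
    pos_emb: (B, d) positive passage embeddings (row i belongs to query i)
    neg_emb: (B * k, d) hard-negative passage embeddings
    """
    passages = torch.cat([pos_emb, neg_emb], dim=0)   # (B + B*k, d)
    scores = q_emb @ passages.t()                     # (B, B + B*k)
    # The positive for query i sits at column i of the concatenated passages;
    # all other columns serve as in-batch or hard negatives.
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, targets)
```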

Key Results

The empirical evaluation of coCondenser yields strong results across the evaluated datasets. On MS-MARCO, coCondenser achieved an MRR@10 of 38.2 and an R@1000 of 98.4, outperforming many existing systems, including RocketQA and DPR-PAQ. On Natural Questions, it reached an R@100 of 89.0. These findings underline the model's effectiveness in dense passage retrieval.

Methodological Contributions

The paper's methodological contribution is a pre-training paradigm that gives the passage embedding space a useful structure before fine-tuning begins. By removing the need for large-batch training and data-augmentation pipelines, the approach lets resource-constrained academic groups build effective retrieval systems. The authors also employ a memory-efficient gradient caching technique that emulates large-batch contrastive pre-training on limited GPU setups, further lowering the hardware barrier to entry.
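
Gradient caching makes this possible by splitting the large contrastive batch into micro-batches: representations are first computed without building the autograd graph, the loss gradient is taken with respect to those cached representations, and each micro-batch is then re-encoded so the cached gradients can be backpropagated into the encoder. The sketch below illustrates that two-pass pattern under simplifying assumptions (illustrative function names, no replay of dropout RNG state as in the released GradCache library).

```python
import torch

def grad_cache_step(encoder, micro_batches, loss_fn, optimizer):
    """One optimizer step over a large contrastive batch on a small GPU.

    micro_batches: list of inputs whose representations should all serve as
    in-batch negatives for one another; loss_fn maps the stacked
    representations to a scalar contrastive loss.
    """
    # Pass 1: encode every micro-batch without tracking gradients.
    with torch.no_grad():
        reps = [encoder(b) for b in micro_batches]
    reps = [r.detach().requires_grad_() for r in reps]

    # Full-batch loss on the cached representations; this only produces
    # d(loss)/d(representation), which is cheap to hold in memory.
    loss = loss_fn(torch.cat(reps, dim=0))
    loss.backward()
    rep_grads = [r.grad for r in reps]

    # Pass 2: re-encode each micro-batch with autograd enabled and inject the
    # cached representation gradients, accumulating parameter gradients.
    for b, g in zip(micro_batches, rep_grads):
        encoder(b).backward(gradient=g)

    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```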

Implications and Future Directions

With coCondenser, the authors lay the groundwork for more accessible and efficient dense retrieval without compromising performance. Because the corpus-aware pre-training is unsupervised and decoupled from task-specific fine-tuning, a single pre-trained checkpoint can be adapted to different datasets and query types with only light, small-batch fine-tuning.

Looking ahead, the coCondenser approach could inspire further research into unsupervised or semi-supervised methods that refine embedding-space coherence without significant resource demands. It may also prompt investigations into combining this unsupervised paradigm with other state-of-the-art techniques to further improve retrieval efficiency and accuracy.

In conclusion, the paper presents a substantial advance in unsupervised pre-training for dense passage retrieval: corpus-level contrastive learning establishes a semantically structured embedding space, simplifying fine-tuning while maintaining competitive performance with heavily engineered systems.

Authors (2)
  1. Luyu Gao (26 papers)
  2. Jamie Callan (43 papers)
Citations (305)