
SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval (2207.02578v2)

Published 6 Jul 2022 in cs.IR

Abstract: In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, which is inspired by ELECTRA, to improve the sample efficiency and reduce the mismatch of the input distribution between pre-training and fine-tuning. SimLM only requires access to an unlabeled corpus, and is more broadly applicable when there are no labeled data or queries. We conduct experiments on several large-scale passage retrieval datasets, and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2, which incurs significantly more storage cost. Our code and model checkpoints are available at https://github.com/microsoft/unilm/tree/master/simlm .

Overview of SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval

The paper introduces SimLM, a pre-training method designed to improve dense passage retrieval. Dense retrieval has become a key component of information retrieval systems because it maps queries and passages into a low-dimensional vector space where they can be compared semantically. SimLM contributes a straightforward yet effective pre-training technique built around a representation bottleneck.
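To make the retrieval setup concrete, here is a minimal sketch of dual-encoder scoring: queries and passages are encoded to single vectors and ranked by dot product. The model name (bert-base-uncased) and [CLS] pooling are stand-in assumptions for illustration, not the released SimLM checkpoints.

```python
# Minimal dual-encoder retrieval sketch: encode query and passages into single
# vectors ([CLS] pooling) and rank passages by dot-product similarity.
# "bert-base-uncased" is a stand-in encoder, not a SimLM checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # one [CLS] vector per text

query_vec = encode(["how do dense retrievers work"])
passage_vecs = encode(["Dense retrievers map text to vectors for semantic search.",
                       "BM25 is a sparse lexical ranking function."])
scores = query_vec @ passage_vecs.T               # dot-product similarity
ranking = scores.argsort(dim=-1, descending=True)  # best passage first
print(scores, ranking)
```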

Key Contributions

SimLM's architecture pairs a deep encoder with a shallow decoder, connected through a representation bottleneck: the encoder's [CLS] vector. Because this single vector is the only channel between the two components, the encoder is forced to compress the essential passage information into it, which is the same representation later used during retrieval fine-tuning.
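The sketch below shows one way this bottleneck wiring can look. Layer counts, the choice of nn.TransformerEncoder modules, and the way the [CLS] vector is fed to the decoder are simplifying assumptions, not the exact SimLM implementation.

```python
# Schematic of a deep-encoder / shallow-decoder bottleneck (assumed layout).
import torch
import torch.nn as nn

class BottleneckPretrainer(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, enc_layers=12, dec_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        make_layer = lambda: nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(make_layer(), num_layers=enc_layers)  # deep
        self.decoder = nn.TransformerEncoder(make_layer(), num_layers=dec_layers)  # shallow
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, enc_ids, dec_ids):
        enc_hidden = self.encoder(self.embed(enc_ids))
        cls_vec = enc_hidden[:, :1]          # bottleneck: only the [CLS] vector
        # The decoder sees the bottleneck vector plus its own (corrupted) token
        # embeddings; no skip connections from intermediate encoder layers.
        dec_input = torch.cat([cls_vec, self.embed(dec_ids)], dim=1)
        dec_hidden = self.decoder(dec_input)
        return self.lm_head(enc_hidden), self.lm_head(dec_hidden[:, 1:])
```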

  • Replaced Language Modeling Objective: SimLM uses a replaced language modeling objective inspired by ELECTRA's replaced token detection, which improves sample efficiency and helps bridge the gap between pre-training and fine-tuning inputs, a common challenge in dense retrieval (see the sketch after this list).
  • Self-Supervised Pre-Training: The method does not rely on labeled data or queries, widening its applicability across various scenarios where labeled data is unavailable.
  • Performance Metrics: The paper reports substantial improvements over existing strong baselines like BM25 and multi-vector approaches such as ColBERTv2, across datasets including MS-MARCO and Natural Questions (NQ).
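The following sketch shows how a replaced-language-modeling training example can be built: a generator proposes substitutes at some positions, and the model is trained to recover the original tokens from the corrupted input. The random "generator" here is a deliberate simplification; SimLM trains a small masked-LM generator and applies the objective through both the encoder and the bottleneck decoder.

```python
# Sketch of constructing a replaced-LM example (generator simplified to random
# sampling; a real generator would be a small masked language model).
import torch

def make_replaced_lm_example(token_ids, vocab_size, replace_prob=0.3):
    corrupted = token_ids.clone()
    is_replaced = torch.rand_like(token_ids, dtype=torch.float) < replace_prob
    random_tokens = torch.randint(0, vocab_size, token_ids.shape)
    corrupted[is_replaced] = random_tokens[is_replaced]
    targets = token_ids  # objective: predict the ORIGINAL token at every position
    return corrupted, targets, is_replaced

ids = torch.randint(0, 30522, (2, 16))
corrupted, targets, mask = make_replaced_lm_example(ids, vocab_size=30522)
print(mask.float().mean())  # fraction of positions that were replaced
```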

Experimental Results

The experiments show notable gains. On the MS-MARCO passage ranking dataset, SimLM reaches an MRR@10 of 41.1, outperforming models such as ColBERTv2 that incur significantly higher storage costs, indicating that SimLM retains the semantic information needed for retrieval while using far less storage. On the NQ dataset, SimLM achieves R@20 of 85.2 and R@100 of 89.7.
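For readers unfamiliar with the headline metric, the snippet below computes MRR@10 in the standard way: the reciprocal rank of the first relevant passage within the top 10, averaged over queries. The rankings in the example are toy data, not results from the paper.

```python
# MRR@10: mean reciprocal rank of the first relevant passage in the top 10.
def mrr_at_10(ranked_passage_ids, relevant_ids):
    """ranked_passage_ids: per-query list of retrieved ids, best first."""
    total = 0.0
    for ranking, relevant in zip(ranked_passage_ids, relevant_ids):
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_passage_ids)

# Toy example: hits at rank 1 and rank 4, one miss -> (1 + 0.25 + 0) / 3 ≈ 0.417
print(mrr_at_10([[7, 2, 9], [5, 1, 8, 3], [4, 6]],
                [{7}, {3}, {0}]))
```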

Comparison with Existing Methods

SimLM contrasts with other pre-training methods such as Condenser and coCondenser by omitting skip connections between encoder and decoder layers. Removing these connections prevents token-level information from bypassing the bottleneck, compelling the [CLS] vector to encode all of the vital passage information. SimLM's replaced language modeling objective also offers better gradient propagation than typical masked language modeling, since predictions are made for every input position rather than only a small masked subset.
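A schematic contrast of the two designs, under the simplifying assumptions above (this is not the exact Condenser or SimLM code, only the information flow into the shallow decoder):

```python
# What the shallow decoder is allowed to see in each design (schematic).
import torch

def condenser_style_decoder_input(cls_vec, early_encoder_hidden):
    # Condenser-style heads feed intermediate encoder states to the decoder,
    # so token-level detail can bypass the [CLS] bottleneck.
    return torch.cat([cls_vec, early_encoder_hidden], dim=1)

def simlm_style_decoder_input(cls_vec, replaced_token_embeddings):
    # SimLM-style: no skip connection from encoder layers; only the [CLS]
    # vector plus the corrupted tokens' embeddings reach the decoder, so the
    # passage semantics must pass through the bottleneck.
    return torch.cat([cls_vec, replaced_token_embeddings], dim=1)
```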

Implications and Future Directions

The introduction of SimLM expands the potential for developing efficient dense retrieval systems, especially where query-labeled data is sparse. Its architecture can be seamlessly integrated into current retrieval pipelines without extensive modifications, suggesting broad applicability. The compact representation from the bottleneck leads to lower computational and storage costs, providing a practical edge in real-world applications.

Potential future work could explore scaling the model size and corpus to further leverage the capabilities of SimLM. Additionally, evaluating multilingual retrieval and zero-shot capabilities could open up new research avenues, given the method's inherent flexibility.

Conclusion

SimLM represents an advance in pre-training techniques for dense passage retrieval. It delivers improved retrieval quality and storage efficiency, offering substantial value to information retrieval systems. While certain limitations remain, including reliance on fine-tuning for optimal performance, SimLM sets a foundation for effective retrieval models across diverse environments.

Authors (8)
  1. Liang Wang (512 papers)
  2. Nan Yang (182 papers)
  3. Xiaolong Huang (29 papers)
  4. Binxing Jiao (18 papers)
  5. Linjun Yang (16 papers)
  6. Daxin Jiang (138 papers)
  7. Rangan Majumder (12 papers)
  8. Furu Wei (291 papers)
Citations (91)