Challenging Decoder helps in Masked Auto-Encoder Pre-training for Dense Passage Retrieval (2305.13197v1)

Published 22 May 2023 in cs.IR and cs.CL

Abstract: Recently, various studies have explored dense passage retrieval techniques built on pre-trained language models, among which the masked auto-encoder (MAE) pre-training architecture has emerged as the most promising. The conventional MAE framework relies on the decoder's passage reconstruction task to bolster the encoder's text representation ability, thereby enhancing the performance of the resulting dense retrieval system. Since the encoder's representation ability is built through the decoder's passage reconstruction, it is reasonable to postulate that a "more demanding" decoder will require a correspondingly stronger encoder. To this end, we propose a novel token importance aware masking strategy based on pointwise mutual information (PMI) to intensify the challenge posed to the decoder. Importantly, our approach can be implemented in an unsupervised manner, without adding extra cost to the pre-training phase. Our experiments verify that the proposed method is both effective and robust on large-scale supervised passage retrieval datasets and out-of-domain zero-shot retrieval benchmarks.
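The abstract does not give the exact scoring formula, but a common way to realize PMI-based token importance is to compare a token's in-passage frequency against its corpus frequency and mask the highest-scoring tokens on the decoder side. The sketch below is a minimal illustration under that assumption; the function names, the mask ratio, and the PMI formulation are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def pmi_token_importance(passage_tokens, corpus_token_freq, corpus_total_tokens):
    """Score each token by PMI between the token and the passage:
    log( P(token | passage) / P(token) ). Higher scores indicate tokens
    that are informative for this specific passage (assumed proxy, not
    necessarily the paper's exact definition)."""
    passage_counts = Counter(passage_tokens)
    passage_len = len(passage_tokens)
    scores = {}
    for tok, count in passage_counts.items():
        p_tok_given_passage = count / passage_len
        p_tok = corpus_token_freq.get(tok, 1) / corpus_total_tokens
        scores[tok] = math.log(p_tok_given_passage / p_tok)
    return scores

def importance_aware_mask(passage_tokens, scores, mask_ratio=0.5, mask_token="[MASK]"):
    """Mask the highest-PMI tokens on the decoder side, so that
    reconstructing them demands a stronger encoder representation."""
    n_mask = max(1, int(len(passage_tokens) * mask_ratio))
    ranked = sorted(range(len(passage_tokens)),
                    key=lambda i: scores[passage_tokens[i]], reverse=True)
    to_mask = set(ranked[:n_mask])
    return [mask_token if i in to_mask else tok
            for i, tok in enumerate(passage_tokens)]
```

Because the scores come from simple corpus statistics, this masking can be computed offline without any labels, which matches the abstract's claim that the approach is unsupervised and adds no extra cost to pre-training.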

Authors (4)
  1. Zehan Li (26 papers)
  2. Yanzhao Zhang (18 papers)
  3. Dingkun Long (23 papers)
  4. Pengjun Xie (85 papers)
Citations (3)