LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval (2208.14754v2)

Published 31 Aug 2022 in cs.IR

Abstract: In large-scale retrieval, the lexicon-weighting paradigm, which learns weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Although it deeply exploits the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval -- the former prefers certain, low-entropy words whereas the latter favors pivot, high-entropy words -- and this gap is the main barrier to lexicon-weighting performance in large-scale retrieval. To bridge this gap, we propose a brand-new pre-training framework, the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we place a lexicon-bottlenecked module between a normal language modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to lexicon-weighting retrieval via fine-tuning. On the ad-hoc retrieval benchmark MS-Marco, it achieves 42.6% MRR@10 with 45.8 QPS on the passage dataset and 44.4% MRR@100 with 134.8 QPS on the document dataset, on a CPU machine. LexMAE also shows state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12 datasets.
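To make the described architecture concrete, below is a minimal, illustrative PyTorch sketch of a lexicon-bottlenecked masked autoencoder: an encoder predicts per-token vocabulary logits, those logits are pooled into a single lexicon-importance distribution (the continuous bag-of-words bottleneck), and a deliberately weak, shallow decoder must reconstruct the passage through that bottleneck. All module names, layer sizes, the max-pooling choice, and the way the bottleneck vector is injected into the decoder are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a lexicon-bottlenecked masked autoencoder.
# Hyperparameters, pooling, and bottleneck injection are assumptions,
# not the LexMAE reference implementation.
import torch
import torch.nn as nn


class LexiconBottleneckMAE(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256,
                 n_enc_layers=4, n_dec_layers=1, n_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)  # "normal" LM encoder
        self.lm_head = nn.Linear(d_model, vocab_size)                  # per-token vocab logits
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_dec_layers)  # weakened (shallow) decoder
        self.dec_head = nn.Linear(d_model, vocab_size)

    def forward(self, enc_input_ids, dec_input_ids):
        # 1) Encode the (masked) input and predict vocabulary logits per token.
        h = self.encoder(self.tok_emb(enc_input_ids))
        token_logits = self.lm_head(h)                                 # (B, L, V)

        # 2) Lexicon bottleneck: pool token logits over the sequence into one
        #    distribution over the vocabulary (a continuous bag of words).
        lex_weights = torch.softmax(token_logits.max(dim=1).values, dim=-1)  # (B, V)

        # 3) Compress the lexicon distribution into a dense bottleneck vector
        #    (a weighted sum of vocabulary embeddings) and prepend it to the
        #    decoder input, so reconstruction must flow through the bottleneck.
        bottleneck = lex_weights @ self.tok_emb.weight                 # (B, d_model)
        dec_in = torch.cat([bottleneck.unsqueeze(1),
                            self.tok_emb(dec_input_ids)], dim=1)
        dec_out = self.decoder(dec_in)[:, 1:, :]                       # drop bottleneck slot
        recon_logits = self.dec_head(dec_out)                          # (B, L, V)
        return token_logits, lex_weights, recon_logits


# Toy usage: encoder and decoder each see a (differently) masked copy of the
# same passage; masked-token losses on both heads would drive pre-training.
model = LexiconBottleneckMAE()
enc_ids = torch.randint(0, 30522, (2, 16))
dec_ids = torch.randint(0, 30522, (2, 16))
tok_logits, lex_w, recon = model(enc_ids, dec_ids)
print(tok_logits.shape, lex_w.shape, recon.shape)
```

After pre-training, the pooled `lex_weights` vector plays the role of the weighted sparse lexicon representation that is fine-tuned for retrieval.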

Authors (8)
  1. Tao Shen (87 papers)
  2. Xiubo Geng (36 papers)
  3. Chongyang Tao (61 papers)
  4. Can Xu (98 papers)
  5. Xiaolong Huang (29 papers)
  6. Binxing Jiao (18 papers)
  7. Linjun Yang (16 papers)
  8. Daxin Jiang (138 papers)
Citations (32)
