LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval (2208.14754v2)

Published 31 Aug 2022 in cs.IR

Abstract: In large-scale retrieval, the lexicon-weighting paradigm, which learns weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Although it deeply exploits the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval -- the former prefers certain, low-entropy words whereas the latter favors pivot, high-entropy words -- and this gap is the main barrier to lexicon-weighting performance in large-scale retrieval. To bridge this gap, we propose a brand-new pre-training framework, the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we place a lexicon-bottlenecked module between a normal language modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to lexicon-weighting retrieval via fine-tuning. On the ad-hoc retrieval benchmark MS-Marco, it achieves 42.6% MRR@10 with 45.8 QPS on the passage dataset and 44.4% MRR@100 with 134.8 QPS on the document dataset, on a CPU machine. LexMAE also shows state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12 datasets.
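To make the described architecture concrete, below is a minimal, illustrative PyTorch sketch of a lexicon-bottlenecked masked autoencoder: an encoder predicts per-token vocabulary logits, those logits are pooled into a single lexicon-importance distribution (the continuous bag-of-words bottleneck), and a deliberately weak, shallow decoder must reconstruct the passage through that bottleneck. All module names, layer sizes, the max-pooling choice, and the way the bottleneck vector is injected into the decoder are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a lexicon-bottlenecked masked autoencoder.
# Hyperparameters, pooling, and bottleneck injection are assumptions,
# not the LexMAE reference implementation.
import torch
import torch.nn as nn


class LexiconBottleneckMAE(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256,
                 n_enc_layers=4, n_dec_layers=1, n_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)  # "normal" LM encoder
        self.lm_head = nn.Linear(d_model, vocab_size)                  # per-token vocab logits
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_dec_layers)  # weakened (shallow) decoder
        self.dec_head = nn.Linear(d_model, vocab_size)

    def forward(self, enc_input_ids, dec_input_ids):
        # 1) Encode the (masked) input and predict vocabulary logits per token.
        h = self.encoder(self.tok_emb(enc_input_ids))
        token_logits = self.lm_head(h)                                 # (B, L, V)

        # 2) Lexicon bottleneck: pool token logits over the sequence into one
        #    distribution over the vocabulary (a continuous bag of words).
        lex_weights = torch.softmax(token_logits.max(dim=1).values, dim=-1)  # (B, V)

        # 3) Compress the lexicon distribution into a dense bottleneck vector
        #    (a weighted sum of vocabulary embeddings) and prepend it to the
        #    decoder input, so reconstruction must flow through the bottleneck.
        bottleneck = lex_weights @ self.tok_emb.weight                 # (B, d_model)
        dec_in = torch.cat([bottleneck.unsqueeze(1),
                            self.tok_emb(dec_input_ids)], dim=1)
        dec_out = self.decoder(dec_in)[:, 1:, :]                       # drop bottleneck slot
        recon_logits = self.dec_head(dec_out)                          # (B, L, V)
        return token_logits, lex_weights, recon_logits


# Toy usage: encoder and decoder each see a (differently) masked copy of the
# same passage; masked-token losses on both heads would drive pre-training.
model = LexiconBottleneckMAE()
enc_ids = torch.randint(0, 30522, (2, 16))
dec_ids = torch.randint(0, 30522, (2, 16))
tok_logits, lex_w, recon = model(enc_ids, dec_ids)
print(tok_logits.shape, lex_w.shape, recon.shape)
```

After pre-training, the pooled `lex_weights` vector plays the role of the weighted sparse lexicon representation that is fine-tuned for retrieval.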

Authors (8)
  1. Tao Shen (87 papers)
  2. Xiubo Geng (36 papers)
  3. Chongyang Tao (61 papers)
  4. Can Xu (98 papers)
  5. Xiaolong Huang (29 papers)
  6. Binxing Jiao (18 papers)
  7. Linjun Yang (16 papers)
  8. Daxin Jiang (138 papers)
Citations (32)
