
ConTextual Masked Auto-Encoder for Dense Passage Retrieval (2208.07670v3)

Published 16 Aug 2022 in cs.CL and cs.AI

Abstract: Dense passage retrieval aims to retrieve the relevant passages of a query from a large corpus based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained LLMs to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple yet effective generative pre-training method for dense passage retrieval. CoT-MAE employs an asymmetric encoder-decoder architecture that learns to compress the sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding. Precisely, self-supervised masked auto-encoding learns to model the semantics of the tokens inside a text span, and context-supervised masked auto-encoding learns to model the semantical correlation between the text spans. We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines, demonstrating the high efficiency of CoT-MAE. Our code is available at https://github.com/caskcsg/ir/tree/main/cotmae.

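To make the abstract's description concrete, below is a minimal PyTorch sketch of a CoT-MAE-style asymmetric encoder-decoder: a deep encoder compresses one text span into a dense [CLS] vector, and a shallow decoder reconstructs the masked tokens of a contextual span conditioned on that vector (the context-supervised masked auto-encoding objective). This is an illustration inferred from the abstract, not the authors' implementation (see the linked GitHub repository for that); all layer counts, sizes, and names are assumptions.

```python
# Minimal sketch of a CoT-MAE-style asymmetric encoder-decoder.
# Illustrative only; sizes, layer counts, and names are assumptions,
# not the authors' code (https://github.com/caskcsg/ir/tree/main/cotmae).

import torch
import torch.nn as nn


class TinyTransformerStack(nn.Module):
    """A small stack of Transformer encoder layers over token embeddings."""

    def __init__(self, hidden: int, layers: int, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
            batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x, key_padding_mask=None):
        return self.blocks(x, src_key_padding_mask=key_padding_mask)


class CoTMAESketch(nn.Module):
    """Deep encoder compresses span A into a dense vector; a shallow decoder
    reconstructs masked tokens of the contextual span B from that vector."""

    def __init__(self, vocab_size: int = 30522, hidden: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = TinyTransformerStack(hidden, layers=12)   # deep side
        self.decoder = TinyTransformerStack(hidden, layers=2)    # shallow side
        self.mlm_head = nn.Linear(hidden, vocab_size)

    def forward(self, span_a_ids, masked_span_b_ids, span_b_labels):
        # Encode span A; position 0 plays the role of [CLS].
        enc_out = self.encoder(self.embed(span_a_ids))
        cls_vec = enc_out[:, :1, :]                              # (B, 1, H)

        # Prepend the dense vector to span B's embeddings so the shallow
        # decoder must lean on it to recover the masked tokens.
        dec_in = torch.cat([cls_vec, self.embed(masked_span_b_ids)], dim=1)
        dec_out = self.decoder(dec_in)[:, 1:, :]                 # drop CLS slot

        # Masked-LM loss on span B; label -100 marks unmasked positions.
        logits = self.mlm_head(dec_out)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            span_b_labels.reshape(-1),
            ignore_index=-100)
```

In this reading of the method, the self-supervised objective is ordinary masked auto-encoding within a single span, while the context-supervised objective forces the dense vector of span A to carry enough semantics to help reconstruct its neighboring span B, which is what makes the vector useful for dense retrieval.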
Authors (6)
  1. Xing Wu (69 papers)
  2. Guangyuan Ma (14 papers)
  3. Meng Lin (4 papers)
  4. Zijia Lin (43 papers)
  5. Zhongyuan Wang (105 papers)
  6. Songlin Hu (80 papers)
Citations (22)
