Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval (2206.03281v1)

Published 7 Jun 2022 in cs.IR

Abstract: Recent research demonstrates the effectiveness of using pretrained language models (PLM) to improve dense retrieval and multilingual dense retrieval. In this work, we present a simple but effective monolingual pretraining task called contrastive context prediction (CCP), which learns sentence representations by modeling sentence-level contextual relations. By pulling the embeddings of sentences within a local context closer together and pushing random negative samples away, different languages form isomorphic structures, so that sentence pairs in two different languages become automatically aligned. Our experiments show that model collapse and information leakage can easily occur during contrastive training of language models, but a language-specific memory bank and an asymmetric batch normalization operation play essential roles in preventing collapse and information leakage, respectively. In addition, a post-processing step for sentence embeddings is very effective for achieving better retrieval performance. On the multilingual sentence retrieval task Tatoeba, our model achieves new SOTA results among methods that do not use bilingual data, and it shows a larger gain on Tatoeba when transferring between non-English pairs. On two multilingual query-passage retrieval tasks, XOR Retrieve and Mr.TYDI, our model achieves SOTA results in both the zero-shot and supervised settings among all pretraining models using bilingual data.
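
The CCP objective described in the abstract is essentially an InfoNCE-style contrastive loss in which a sentence's positive is a neighbouring sentence from its local context and the negatives come from a language-specific memory bank. The sketch below illustrates such a loss in PyTorch; the function name, tensor layout, and temperature value are assumptions for illustration, not the paper's actual implementation (which additionally uses asymmetric batch normalization and embedding post-processing, omitted here).

```python
import torch
import torch.nn.functional as F

def ccp_loss(anchor_emb, context_emb, memory_bank, temperature=0.05):
    """Sketch of a contrastive context prediction (CCP) loss.

    Pulls each anchor sentence toward a sentence from its local context and
    pushes it away from random negatives drawn from a language-specific
    memory bank (layout assumed for illustration).

    anchor_emb:  (B, D) embeddings of anchor sentences
    context_emb: (B, D) embeddings of neighbouring (context) sentences
    memory_bank: (K, D) embeddings of previously seen sentences in the
                 same language, used as negatives
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(context_emb, dim=-1)
    negatives = F.normalize(memory_bank, dim=-1)

    # Cosine similarity to the positive (context) sentence: (B, 1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)
    # Cosine similarity to memory-bank negatives: (B, K)
    neg_sim = anchor @ negatives.t()

    # InfoNCE: the positive sits at index 0 of the logits
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

Keeping the negatives in a per-language memory bank (rather than mixing languages in one queue) is what the abstract credits with preventing model collapse during contrastive training.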

Authors (7)
  1. Ning Wu (63 papers)
  2. Yaobo Liang (29 papers)
  3. Houxing Ren (16 papers)
  4. Linjun Shou (53 papers)
  5. Nan Duan (172 papers)
  6. Ming Gong (246 papers)
  7. Daxin Jiang (138 papers)
Citations (7)
