Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 72 tok/s

Gemini 2.5 Pro 45 tok/s Pro

GPT-5 Medium 33 tok/s Pro

GPT-5 High 29 tok/s Pro

GPT-4o 93 tok/s Pro

Kimi K2 211 tok/s Pro

GPT OSS 120B 442 tok/s Pro

Claude Sonnet 4.5 35 tok/s Pro

2000 character limit reached

Revela: Dense Retriever Learning via Language Modeling (2506.16552v1)

Published 19 Jun 2025 in cs.IR and cs.CL

Abstract: Dense retrievers play a vital role in accessing external and specialized knowledge to augment LMs. Training dense retrievers typically requires annotated query-document pairs, which are costly and hard to obtain in specialized domains such as code-motivating growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next-token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of LLMing to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via LLMing. Revela models semantic dependencies among documents by conditioning next-token prediction on both local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of LLMing. We evaluate Revela on both general-domain (BEIR) and domain-specific (CoIR) benchmarks across various retriever backbones. At a comparable parameter scale, Revela outperforms the previous best method with absolute improvements of 5.2 % (18.3 % relative) and 5.6 % (14.4 % relative) on NDCG@10, respectively, underscoring its effectiveness. Performance increases with model size, highlighting both the scalability of our approach and its promise for self-supervised retriever learning.