
Large Dual Encoders Are Generalizable Retrievers (2112.07899v1)

Published 15 Dec 2021 in cs.IR and cs.CL

Abstract: It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot-product between a query vector and a passage vector, is too limited to make dual encoders an effective retrieval model for out-of-domain generalization. In this paper, we challenge this belief by scaling up the size of the dual encoder model while keeping the bottleneck embedding size fixed. With multi-stage training, surprisingly, scaling up the model size brings significant improvement on a variety of retrieval tasks, especially for out-of-domain generalization. Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), outperform existing sparse and dense retrievers on the BEIR dataset (Thakur et al., 2021) significantly. Most surprisingly, our ablation study finds that GTR is very data efficient, as it only needs 10% of MS Marco supervised data to achieve the best out-of-domain performance. All the GTR models are released at https://tfhub.dev/google/collections/gtr/1.

The title "Large Dual Encoders Are Generalizable Retrievers" refers to the use of dual-encoder architectures in information retrieval systems and, in particular, to how large versions of these models generalize effectively across different retrieval tasks. The sections below break this claim down and illustrate its significance for modern retrieval systems.

Dual Encoder Architecture

A dual encoder system consists of two separate neural networks (encoders) that independently encode queries and documents (or other items to be retrieved) into fixed-size embeddings. These embeddings are then compared (often via vector similarity measures like cosine similarity) to find the most relevant documents for a given query.

Key Features:

  1. Independence: Queries and documents are encoded independently, which allows for pre-computation of document embeddings, significantly speeding up the retrieval process.
  2. Scalability: This architecture is highly scalable because it simplifies the matching process to a series of vector operations.
  3. Flexibility: Dual encoders can be applied to various types of retrieval tasks, from text-to-text to cross-modal retrieval (e.g., text-to-image).
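
To make the independent-encoding and similarity-scoring idea concrete, here is a minimal sketch that assumes the publicly released GTR checkpoint exposed through the sentence-transformers library ("sentence-transformers/gtr-t5-base"); the query and passages are made-up examples, and any bi-encoder embedding model could be substituted.

```python
# Minimal sketch of dual-encoder retrieval scoring, assuming the
# sentence-transformers port of a GTR checkpoint. Not the paper's exact setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/gtr-t5-base")

query = "what causes tides in the ocean"
passages = [
    "Tides are caused by the gravitational pull of the moon and the sun on Earth's oceans.",
    "The stock market closed higher on Friday after a volatile week of trading.",
]

# Queries and passages are encoded independently into fixed-size vectors.
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
passage_embs = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# The relevance score is simply a similarity between the two vectors.
scores = util.cos_sim(query_emb, passage_embs)  # shape: (1, num_passages)
best = scores.argmax().item()
print(f"Top passage: {passages[best]} (score={scores[0, best].item():.3f})")
```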

Large Dual Encoders in Retrieval

1. Generalization Capability

Large dual encoders, particularly those built on transformer models such as BERT or T5, have demonstrated strong generalization: after training on large-scale, task-agnostic data, they perform well across diverse datasets and retrieval scenarios. The GTR results suggest this ability stems from the representational capacity of the larger encoder itself, since the gains hold even when the bottleneck embedding size is kept fixed.

2. Training and Fine-Tuning

Training large dual encoders typically involves pre-training on massive corpora with an objective such as masked language modeling (for text) or contrastive learning (for cross-modal tasks). Fine-tuning is then performed on specific retrieval datasets to adapt the encoders to the nuances of the retrieval task at hand.
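
As a rough illustration of the fine-tuning stage, the sketch below tunes a bi-encoder on (query, relevant passage) pairs using the classic sentence-transformers fit API and its in-batch-negative loss (discussed in the next subsection); the checkpoint name and the tiny toy dataset are placeholders, not the paper's actual recipe.

```python
# Hedged sketch of retrieval fine-tuning with the sentence-transformers fit API.
# The base checkpoint and toy training pairs below are placeholder assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/gtr-t5-base")  # assumed checkpoint name

# Each example is a (query, relevant passage) pair; other passages in the batch
# serve as negatives via the loss below.
train_examples = [
    InputExample(texts=["what causes tides", "Tides are caused by the moon's gravity."]),
    InputExample(texts=["capital of france", "Paris is the capital of France."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss implements the in-batch-negative contrastive objective.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```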

3. In-batch Negatives and Hard Negatives

Techniques such as using in-batch negatives (utilizing other samples in the batch as negative examples) and hard negatives (carefully selected challenging negative samples) during training have been vital in improving the performance of dual encoder models. These techniques optimize the models to better distinguish between closely related queries and documents, enhancing their generalization and retrieval accuracy.
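
The core of the in-batch-negative objective can be written in a few lines of PyTorch: each query is trained to score its own passage higher than every other passage in the batch. The following is a minimal sketch of the standard contrastive formulation, not the paper's exact loss.

```python
# Minimal PyTorch sketch of an in-batch-negative contrastive loss for a dual encoder.
# q_emb and p_emb are L2-normalized query and passage embeddings of shape (B, D),
# where row i of p_emb is the positive passage for row i of q_emb.
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb, p_emb, temperature=0.05):
    # Similarity matrix: entry (i, j) scores query i against passage j.
    scores = q_emb @ p_emb.T / temperature          # (B, B)
    # The positive passage for query i sits on the diagonal, so the target is i.
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, labels)

# Toy usage with random normalized embeddings standing in for encoder outputs.
q = F.normalize(torch.randn(8, 768), dim=-1)
p = F.normalize(torch.randn(8, 768), dim=-1)
print(in_batch_negative_loss(q, p))
```

Hard negatives are typically appended as extra columns of the score matrix, forcing the model to separate the positive passage from passages that are superficially similar to the query.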

Practical Implications

Dual encoders offer several practical benefits in retrieval tasks:

  • Efficiency: Queries and documents are encoded independently, enabling the use of efficient search structures like Approximate Nearest Neighbor (ANN) indices.
  • Pre-computation: Document embeddings can be pre-computed and stored, allowing for real-time retrieval by simply encoding the query and performing a fast similarity search over pre-computed embeddings.
  • Robustness: Large dual encoders often exhibit robustness to variations in query and document phrasing, making them effective across different datasets and domains.
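
A minimal serving sketch of the offline/online split: passage embeddings are computed once offline, and at query time only the query is encoded before a fast similarity search. Exact brute-force search is shown here with NumPy for clarity; the encoder is a made-up placeholder, and at scale the final step would be handled by an ANN library such as FAISS or ScaNN.

```python
# Sketch of pre-computation plus similarity search for dual-encoder retrieval.
# encode() is a placeholder for a real embedding model (e.g., a GTR checkpoint).
import numpy as np

def encode(texts):
    # Placeholder encoder: returns random unit vectors; replace with a real model.
    rng = np.random.default_rng(0)
    embs = rng.standard_normal((len(texts), 768)).astype(np.float32)
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)

# Offline: pre-compute and store document embeddings once.
corpus = ["doc one ...", "doc two ...", "doc three ..."]
doc_embs = encode(corpus)                      # (N, D), cached in an index

# Online: encode only the query, then run a fast similarity search.
query_emb = encode(["example query"])[0]       # (D,)
scores = doc_embs @ query_emb                  # cosine similarity (vectors are normalized)
top_k = np.argsort(-scores)[:2]                # brute-force top-k; ANN index in practice
print([(corpus[i], float(scores[i])) for i in top_k])
```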

Example: NV-Embed

An illustrative example of more recent progress in this area is the NV-Embed model. By leveraging a large decoder-only transformer as its embedding backbone and introducing elements such as a latent attention pooling layer and a specialized training regime, NV-Embed achieves state-of-the-art performance on various retrieval benchmarks (Lee et al., 2024). Although its backbone differs from the T5 encoders used in GTR, it is deployed in the same bi-encoder fashion, exemplifying how large-scale embedding models can set new standards in retrieval through architectural and training refinements.

Conclusion

The statement "Large Dual Encoders Are Generalizable Retrievers" encapsulates a significant trend in modern retrieval systems. Large dual encoders, with their powerful representational capabilities and efficient retrieval process, demonstrate substantial generalization across various retrieval tasks. Their independent encoding mechanism, coupled with advanced training techniques, allows them to offer high performance and scalability, making them a preferred choice in contemporary information retrieval applications.

These models push the boundaries of what's possible in terms of efficient, scalable, and generalizable retrieval systems, heralding a new era of advancements in the field of information retrieval.

Authors (11)
  1. Jianmo Ni (31 papers)
  2. Chen Qu (37 papers)
  3. Jing Lu (158 papers)
  4. Zhuyun Dai (26 papers)
  5. Ji Ma (72 papers)
  6. Vincent Y. Zhao (8 papers)
  7. Yi Luan (25 papers)
  8. Keith B. Hall (3 papers)
  9. Ming-Wei Chang (44 papers)
  10. Yinfei Yang (73 papers)
  11. Gustavo Hernández Ábrego (5 papers)
Citations (380)