REALM: Retrieval-Augmented Language Model Pre-Training (2002.08909v1)

Published 10 Feb 2020 in cs.CL and cs.LG

Abstract: Language model pre-training has been shown to capture a surprising amount of world knowledge, crucial for NLP tasks such as question answering. However, this knowledge is stored implicitly in the parameters of a neural network, requiring ever-larger networks to cover more facts. To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner, using masked language modeling as the learning signal and backpropagating through a retrieval step that considers millions of documents. We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as interpretability and modularity.

Exploration of Retrieval-Augmented Language Model Pre-Training (REALM)

Introduction

The paper presents a novel framework, Retrieval-Augmented Language Model Pre-Training (REALM), that augments language model pre-training with a learned textual knowledge retriever. It introduces an unsupervised method for pre-training the knowledge retriever jointly with the language model. This contrasts with traditional pre-trained language models such as BERT, RoBERTa, and T5, which encapsulate knowledge implicitly within their parameters. REALM instead makes knowledge storage modular and interpretable by retrieving external documents at prediction time. The framework delivers superior performance on Open-domain Question Answering (Open-QA) benchmarks, evidence of its capacity to incorporate and exploit external world knowledge effectively.

Background

The motivation behind REALM arises from the limits of storing knowledge in the network parameters of current language models. Because facts are encoded implicitly in those parameters, covering more of them requires ever-larger networks, and the stored information is difficult to inspect or update. The paper argues for a more scalable and explicit mechanism for storing and recalling knowledge.

Approach

REALM decomposes the prediction of an output y given an input x into two steps: retrieve, then predict. A neural knowledge retriever selects relevant documents from a large corpus such as Wikipedia, and a knowledge-augmented encoder then predicts the output conditioned on both the input and the retrieved documents. The model is trained by maximizing the marginal likelihood of this generative process, which requires backpropagating through both the retriever and the encoder. The central challenge, and a key contribution of the paper, is making the retrieval step over a corpus of millions of documents tractable during training. This is handled by approximating the marginalization with the top-k retrieved documents, found via Maximum Inner Product Search (MIPS) over a pre-computed document index that is refreshed asynchronously during pre-training.
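Concretely, the paper's generative process marginalizes over a latent retrieved document z drawn from the knowledge corpus Z, with the retriever scoring documents by a dense inner product between learned embeddings of the input and of each document:

```latex
% Marginal likelihood over latent retrieved documents z from the corpus \mathcal{Z}
p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x)

% Retrieval distribution: softmax over dense inner-product relevance scores
p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')},
\qquad
f(x, z) = \mathrm{Embed}_{\mathrm{input}}(x)^{\top} \mathrm{Embed}_{\mathrm{doc}}(z)
```

The sketch below shows how top-k MIPS retrieval and the softmax over relevance scores fit together. It is a minimal brute-force illustration in NumPy with made-up array sizes, not the asynchronous approximate-MIPS index used in the paper's actual implementation.

```python
import numpy as np

# Hypothetical sizes for illustration; the real system indexes millions of
# Wikipedia blocks embedded by BERT-style encoders.
EMBED_DIM = 128
NUM_DOCS = 10_000

# Pre-computed document embeddings, Embed_doc(z) for every block z in the corpus.
# In REALM these are refreshed asynchronously every few hundred training steps.
doc_embeddings = np.random.randn(NUM_DOCS, EMBED_DIM).astype(np.float32)


def retrieve_top_k(query_embedding: np.ndarray, k: int = 5):
    """Brute-force maximum inner product search (MIPS).

    Returns the indices and relevance scores f(x, z) of the k documents whose
    embeddings have the largest inner product with the query embedding.
    Exhaustive search is shown only to make the scoring rule concrete; the
    paper relies on an approximate MIPS index for scalability.
    """
    scores = doc_embeddings @ query_embedding          # f(x, z) for all z
    top_k = np.argpartition(-scores, k)[:k]            # unordered top-k indices
    top_k = top_k[np.argsort(-scores[top_k])]          # sort by descending score
    return top_k, scores[top_k]


def retrieval_distribution(query_embedding: np.ndarray, k: int = 5):
    """Softmax over the top-k scores, approximating p(z | x).

    Marginalizing over only the top-k documents is what keeps the sum over
    millions of candidates tractable during training.
    """
    idx, scores = retrieve_top_k(query_embedding, k)
    probs = np.exp(scores - scores.max())
    return idx, probs / probs.sum()


if __name__ == "__main__":
    query = np.random.randn(EMBED_DIM).astype(np.float32)  # stand-in for Embed_input(x)
    idx, probs = retrieval_distribution(query, k=5)
    print(list(zip(idx.tolist(), probs.round(3).tolist())))
```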

Experiments and Results

Fine-tuned on Open-QA tasks, REALM surpasses state-of-the-art models on three popular benchmarks (NaturalQuestions-Open, WebQuestions, and CuratedTrec), with gains of 4-16% in absolute accuracy. These results indicate that REALM incorporates and exploits external knowledge more effectively than prior approaches based on either explicit or implicit knowledge storage.

Implications and Future Directions

The demonstrated ability of REALM to use external documents during language model pre-training suggests several promising directions for future research. The modular treatment of knowledge opens up the possibility of knowledge bases that can be updated without retraining the model from scratch, improving adaptability to new information. Furthermore, integrating retrieval not only at inference time but also during pre-training paves the way for extensions to other domains such as structured knowledge bases, multimedia data, and multilingual corpora.

Another intriguing aspect is the unsupervised alignment the model learns between the pre-training corpus and the knowledge corpus. These learned alignments offer a new lens for analyzing how learned representations interact with external knowledge sources.

Summary

In sum, Retrieval-Augmented Language Model Pre-Training (REALM) marks a significant step forward in unsupervised pre-training. By combining the strengths of neural retrievers with the representational power of modern language models, REALM not only pushes the boundaries of what is achievable in Open-QA but also opens new avenues for research in knowledge-intensive applications of AI. Its ability to draw on updatable and diverse external knowledge offers a robust approach to the scalability and adaptability challenges of storing knowledge in neural networks.

Authors (5)
  1. Kelvin Guu (26 papers)
  2. Kenton Lee (40 papers)
  3. Zora Tung (4 papers)
  4. Panupong Pasupat (27 papers)
  5. Ming-Wei Chang (44 papers)
Citations (1,751)