
Retrieval-Augmented Generation (RAG)

Updated 24 June 2025

Retrieval-Augmented Generation (RAG) is an architectural paradigm in artificial intelligence that combines neural generative models—typically LLMs—with non-parametric retrieval systems that access external knowledge repositories at inference time. Unlike classical generative models whose factual knowledge and reasoning capacity are limited to their parametric memory, RAG systems can dynamically condition their outputs on retrieved, relevant, and up-to-date evidence, enabling both superior accuracy and provenance in knowledge-intensive NLP tasks.

1. Hybrid Model Architecture

RAG models are fundamentally hybrid, comprising two major components: a parametric generator (e.g., a pre-trained sequence-to-sequence transformer) and a non-parametric retriever. In the original formulation (Lewis et al., 2020), the architecture consists of:

  • Generator: A neural sequence-to-sequence model (BART-large, ~400M parameters) that performs the language generation, acting as a parametric memory.
  • Retriever: A Dense Passage Retriever (DPR), constructed from two BERT-based encoders:
    • A learnable query encoder (mapping the input question or prompt to a dense vector).
    • A fixed document encoder (mapping split passages from an external corpus—e.g., Wikipedia—to vectors).
  • Integration: At inference (and during fine-tuning), the retriever computes the similarity between the encoded query and a pre-indexed set of passage embeddings via maximum inner product search (often implemented with FAISS), scoring passages as:

p(z|x) \propto \exp\left(d(z)^\top q(x)\right)

where $d(z)$ is the document embedding and $q(x)$ the query embedding.

The generator then conditions its output on the top-K retrieved passages, concatenating them with the input. During training, the query encoder and generator are fine-tuned jointly, while the document encoder (and hence the pre-built index) is kept fixed for stability and efficiency.
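As a concrete illustration of this retrieval step, the following minimal sketch builds a FAISS inner-product index over stand-in passage embeddings and scores a query against it. The random vectors and the softmax over the top-k scores are illustrative assumptions, standing in for real DPR encoder outputs.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim, num_passages, top_k = 768, 10_000, 5

# d(z): passage embeddings, computed once by the (fixed) document encoder
# and pre-indexed. Random vectors stand in for real DPR outputs here.
doc_embeddings = np.random.randn(num_passages, dim).astype("float32")
index = faiss.IndexFlatIP(dim)  # exact maximum inner product search
index.add(doc_embeddings)

# q(x): the query embedding, produced at inference time by the query encoder.
query = np.random.randn(1, dim).astype("float32")

scores, passage_ids = index.search(query, top_k)  # top-k inner products

# p(z|x) ∝ exp(d(z)^T q(x)), renormalized over the retrieved top-k set.
probs = np.exp(scores[0] - scores[0].max())
probs /= probs.sum()
print(passage_ids[0], probs)
```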

2. Retrieval and Marginalization Strategies

RAG introduces and empirically evaluates two retrieval-marginalization variants:

2.1 RAG-Sequence

In the RAG-Sequence model, the entire output sequence $y$ is assumed to be conditioned on a single retrieved document $z$.

  • For an input $x$, the retriever returns the top-$K$ passages; each passage is paired with the input, and the generator models $p(y|x, z)$.
  • The model’s output probability distribution marginalizes over the retrieved set:

p(y|x) \approx \sum_{z \in \text{top-}k} p(z|x)\, p(y|x, z)

with each per-document likelihood $p(y|x, z)$ factorized autoregressively over the output tokens.
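A minimal numerical sketch of this marginalization, with invented retriever scores and per-document sequence likelihoods standing in for real model outputs:

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical values for one input x and one candidate output y, with k = 3
# retrieved passages. In a real system these come from the retriever and
# from k forward passes of the generator, one per passage.
log_p_z = np.log(np.array([0.5, 0.3, 0.2]))      # log p(z|x)
log_p_y_given_xz = np.array([-4.0, -6.5, -9.0])  # log p(y|x, z) per passage

# RAG-Sequence marginal: p(y|x) ≈ Σ_z p(z|x) p(y|x, z), computed in log
# space with logsumexp for numerical stability.
log_p_y = logsumexp(log_p_z + log_p_y_given_xz)
print(np.exp(log_p_y))  # ≈ 0.0096
```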

2.2 RAG-Token

In RAG-Token, each output token $y_i$ may be generated conditioned on a different latent document $z_i$, thus supporting token-level provenance and evidence blending:

  • For each token,

p(y|x) \approx \prod_{i=1}^{N} \left[ \sum_{z \in \text{top-}k} p(z|x)\, p(y_i|x, z, y_{<i}) \right]

This formulation allows integration of information from multiple documents within a single response, increasing factual density and diversity.
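The following sketch illustrates the token-level mixture with made-up numbers: three retrieved documents and a four-token output, where the per-token probabilities would normally come from the generator decoding against each document.

```python
import numpy as np

# Hypothetical inputs: k = 3 retrieved docs, N = 4 output tokens.
p_z = np.array([0.5, 0.3, 0.2])        # p(z|x) from the retriever
p_tok = np.array([                     # p(y_i|x, z, y_<i), shape (k, N)
    [0.60, 0.40, 0.70, 0.50],          # token probs under doc 1
    [0.20, 0.50, 0.30, 0.40],          # token probs under doc 2
    [0.10, 0.30, 0.20, 0.60],          # token probs under doc 3
])

# RAG-Token: marginalize over documents independently at each position,
# then multiply across the sequence: p(y|x) ≈ Π_i Σ_z p(z|x) p(y_i|x,z,y_<i).
per_token = p_z @ p_tok                # shape (N,), one mixture per token
log_p_y = np.log(per_token).sum()
print(per_token, np.exp(log_p_y))
```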

3. Empirical Evaluation and Performance

The original RAG model was evaluated extensively on a suite of knowledge-intensive tasks:

  • Open-Domain QA: TriviaQA, Natural Questions, WebQuestions, CuratedTrec.
  • Abstractive QA: MS-MARCO (NLG).
  • Fact Verification: FEVER.
  • Question Generation: Jeopardy questions.

RAG models consistently set state-of-the-art results on open-domain QA:

| Model        | NQ   | TQA  | WQ   | CT   |
|--------------|------|------|------|------|
| DPR          | 41.5 | 57.9 | 41.1 | 50.6 |
| RAG-Token    | 44.1 | 66.1 | 45.5 | 50.0 |
| RAG-Sequence | 44.5 | 68.0 | 45.2 | 52.2 |

For natural language generation on MS-MARCO, RAG-Sequence outperformed a BART baseline by 2.6 BLEU and 2.6 ROUGE-L points.

Qualitatively, RAG generations were found to be more factually grounded, specific, and diverse than closed-book models (BART, T5), and preferred in human studies for answer accuracy and specificity.

4. Comparative Advantages

RAG provides several notable advantages over prior modeling approaches:

  • Factuality and Hallucination Mitigation: By conditioning outputs on external evidence, RAG reduces model hallucination and improves answer correctness for out-of-domain or fresh knowledge.
  • Updateability: The knowledge corpus (e.g., Wikipedia) can be updated independently of model retraining, permitting rapid adaptation to new facts.
  • Interpretability and Attribution: Retrieved passages offer context evidence, supporting output provenance and transparency.
  • Unified Architecture: RAG subsumes both extractive and abstractive tasks in a single, general-purpose pipeline, contrasting with task-specific retrieve-then-extract pipelines (e.g., REALM), and supporting evidence blending across multiple documents.
  • Efficient Scalability: Dense retrieval with efficient libraries (e.g., FAISS) enables practical use at massive scale (~21 million passages).

5. Mathematical Formulation

The generative process for RAG can be formalized as:

  • Retriever Scoring:

p(z|x) \propto \exp\left(d(z)^\top q(x)\right)

  • RAG-Sequence:

p(y|x) \approx \sum_{z \in \text{top-}k} p(z|x) \prod_{i=1}^{N} p(y_i|x, z, y_{<i})

  • RAG-Token:

p(y|x) \approx \prod_{i=1}^{N} \left[ \sum_{z \in \text{top-}k} p(z|x)\, p(y_i|x, z, y_{<i}) \right]

  • Negative Marginal Log-Likelihood (NLL) Objective:

\mathcal{L} = -\sum_j \log p(y_j|x_j)
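To connect the objective to the marginals above, here is a toy sketch of the batch NLL using the RAG-Sequence marginal; all scores are invented for illustration:

```python
import numpy as np
from scipy.special import logsumexp

# Invented mini-batch of two training pairs (x_j, y_j), k = 3 passages each.
log_p_z = np.log(np.array([[0.6, 0.3, 0.1],       # log p(z|x_j) per example
                           [0.4, 0.4, 0.2]]))
log_p_y_given_xz = np.array([[-3.0, -5.0, -8.0],  # log p(y_j|x_j, z)
                             [-2.5, -4.0, -6.0]])

# L = -Σ_j log p(y_j|x_j), each marginal computed as in RAG-Sequence.
# Gradients of this loss flow to both the generator and the query encoder;
# the document encoder and its index stay fixed.
nll = -logsumexp(log_p_z + log_p_y_given_xz, axis=1).sum()
print(nll)
```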

6. Broader Implications and Research Context

RAG marked a paradigm shift in knowledge-intensive NLP, enabling neural models to produce grounded, up-to-date, and verifiable outputs in open-domain and specialty tasks. Its modularity facilitates easy swapping or updating of retrieval corpora, making it well-suited for applications requiring knowledge agility, traceability, and transparency. RAG’s hybrid design has spurred wide adoption across research and industry, forming the foundation for subsequent work on multi-hop reasoning, multimodal retrieval, and efficient grounding in large language systems.

The ongoing research landscape builds on the RAG approach, investigating advances such as adaptive retrieval triggering, end-to-end retriever-generator training, scalable multi-hop reasoning, and integrating structured and unstructured external data—all extensions of the principle that augmenting generative models with non-parametric memory leads to more robust, factual, and useful AI systems.