
Retrieval-Augmented Generation (RAG)

Updated 24 June 2025

Retrieval-Augmented Generation (RAG) is an architectural paradigm in artificial intelligence that combines neural generative models—typically LLMs—with non-parametric retrieval systems that access external knowledge repositories at inference time. Unlike classical generative models whose factual knowledge and reasoning capacity are limited to their parametric memory, RAG systems can dynamically condition their outputs on retrieved, relevant, and up-to-date evidence, enabling both superior accuracy and provenance in knowledge-intensive NLP tasks.

1. Hybrid Model Architecture

RAG models are fundamentally hybrid, comprising two major components: a parametric generator (e.g., a pre-trained sequence-to-sequence transformer) and a non-parametric retriever. In the original formulation (Lewis et al., 2020), the architecture consists of:

  • Generator: A neural sequence-to-sequence model (BART-large, ~400M parameters) that performs the language generation, acting as a parametric memory.
  • Retriever: A Dense Passage Retriever (DPR), constructed from two BERT-based encoders:
    • A learnable query encoder (mapping the input question or prompt to a dense vector).
    • A fixed document encoder (mapping split passages from an external corpus—e.g., Wikipedia—to vectors).
  • Integration: At inference (and during fine-tuning), the retriever computes the similarity between the encoded query and a pre-indexed set of passage embeddings via maximum inner product search (often implemented with FAISS), scoring passages as:

p(z|x) \propto \exp\left(d(z)^\top q(x)\right)

where $d(z)$ is the document embedding and $q(x)$ the query embedding.

The generator then conditions its output on the top-K retrieved passages, concatenating them with the input. During training, the query encoder and generator are fine-tuned jointly, while the document encoder (and hence the pre-built index) is kept fixed for stability and efficiency.
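As a concrete illustration of this retrieval step, the following minimal sketch builds a FAISS inner-product index over stand-in passage embeddings and scores a query against it. The random vectors and the softmax over the top-k scores are illustrative assumptions, standing in for real DPR encoder outputs.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim, num_passages, top_k = 768, 10_000, 5

# d(z): passage embeddings, computed once by the (fixed) document encoder
# and pre-indexed. Random vectors stand in for real DPR outputs here.
doc_embeddings = np.random.randn(num_passages, dim).astype("float32")
index = faiss.IndexFlatIP(dim)  # exact maximum inner product search
index.add(doc_embeddings)

# q(x): the query embedding, produced at inference time by the query encoder.
query = np.random.randn(1, dim).astype("float32")

scores, passage_ids = index.search(query, top_k)  # top-k inner products

# p(z|x) ∝ exp(d(z)^T q(x)), renormalized over the retrieved top-k set.
probs = np.exp(scores[0] - scores[0].max())
probs /= probs.sum()
print(passage_ids[0], probs)
```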

2. Retrieval and Marginalization Strategies

RAG introduces and empirically evaluates two retrieval-marginalization variants:

2.1 RAG-Sequence

In the RAG-Sequence model, the entire output sequence $y$ is assumed to be conditioned on a single retrieved document $z$.

  • For an input $x$, the retriever returns the top-$K$ passages; each passage is paired with the input, and the generator models $p(y|x, z)$.
  • The model’s output probability distribution marginalizes over the retrieved set:

p(y|x) \approx \sum_{z \in \text{top-}k} p(z|x)\, p(y|x, z)

with each per-document likelihood $p(y|x, z)$ factorized autoregressively over the output tokens.
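A minimal numerical sketch of this marginalization, with invented retriever scores and per-document sequence likelihoods standing in for real model outputs:

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical values for one input x and one candidate output y, with k = 3
# retrieved passages. In a real system these come from the retriever and
# from k forward passes of the generator, one per passage.
log_p_z = np.log(np.array([0.5, 0.3, 0.2]))      # log p(z|x)
log_p_y_given_xz = np.array([-4.0, -6.5, -9.0])  # log p(y|x, z) per passage

# RAG-Sequence marginal: p(y|x) ≈ Σ_z p(z|x) p(y|x, z), computed in log
# space with logsumexp for numerical stability.
log_p_y = logsumexp(log_p_z + log_p_y_given_xz)
print(np.exp(log_p_y))  # ≈ 0.0096
```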

2.2 RAG-Token

In RAG-Token, each output token $y_i$ may be generated conditioned on a different latent document $z_i$, thus supporting token-level provenance and evidence blending:

  • For each token,

p(y|x) \approx \prod_{i=1}^{N} \left[ \sum_{z \in \text{top-}k} p(z|x)\, p(y_i|x, z, y_{<i}) \right]

This formulation allows integration of information from multiple documents within a single response, increasing factual density and diversity.
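The following sketch illustrates the token-level mixture with made-up numbers: three retrieved documents and a four-token output, where the per-token probabilities would normally come from the generator decoding against each document.

```python
import numpy as np

# Hypothetical inputs: k = 3 retrieved docs, N = 4 output tokens.
p_z = np.array([0.5, 0.3, 0.2])        # p(z|x) from the retriever
p_tok = np.array([                     # p(y_i|x, z, y_<i), shape (k, N)
    [0.60, 0.40, 0.70, 0.50],          # token probs under doc 1
    [0.20, 0.50, 0.30, 0.40],          # token probs under doc 2
    [0.10, 0.30, 0.20, 0.60],          # token probs under doc 3
])

# RAG-Token: marginalize over documents independently at each position,
# then multiply across the sequence: p(y|x) ≈ Π_i Σ_z p(z|x) p(y_i|x,z,y_<i).
per_token = p_z @ p_tok                # shape (N,), one mixture per token
log_p_y = np.log(per_token).sum()
print(per_token, np.exp(log_p_y))
```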

3. Empirical Evaluation and Performance

The original RAG model was evaluated extensively on a suite of knowledge-intensive tasks:

  • Open-Domain QA: TriviaQA, Natural Questions, WebQuestions, CuratedTrec.
  • Abstractive QA: MS-MARCO (NLG).
  • Fact Verification: FEVER.
  • Question Generation: Jeopardy questions.

RAG models consistently set state-of-the-art results on open-domain QA:

| Model        | NQ   | TQA  | WQ   | CT   |
|--------------|------|------|------|------|
| DPR          | 41.5 | 57.9 | 41.1 | 50.6 |
| RAG-Token    | 44.1 | 66.1 | 45.5 | 50.0 |
| RAG-Sequence | 44.5 | 68.0 | 45.2 | 52.2 |

For natural language generation on MS-MARCO, RAG-Sequence outperformed a BART baseline by 2.6 BLEU and 2.6 ROUGE-L points.

Qualitatively, RAG generations were found to be more factually grounded, specific, and diverse than closed-book models (BART, T5), and preferred in human studies for answer accuracy and specificity.

4. Comparative Advantages

RAG provides several notable advantages over prior modeling approaches:

  • Factuality and Hallucination Mitigation: By conditioning outputs on external evidence, RAG reduces model hallucination and improves answer correctness for out-of-domain or fresh knowledge.
  • Updateability: The knowledge corpus (e.g., Wikipedia) can be updated independently of model retraining, permitting rapid adaptation to new facts.
  • Interpretability and Attribution: Retrieved passages offer context evidence, supporting output provenance and transparency.
  • Unified Architecture: RAG subsumes both extractive and abstractive tasks in a single, general-purpose pipeline, contrasting with task-specific retrieve-then-extract pipelines (e.g., REALM), and supporting evidence blending across multiple documents.
  • Efficient Scalability: Dense retrieval with efficient libraries (e.g., FAISS) enables practical use at massive scale (~21 million passages).

5. Mathematical Formulation

The generative process for RAG can be formalized as:

  • Retriever Scoring:

p(z|x) \propto \exp\left(d(z)^\top q(x)\right)

  • RAG-Sequence:

p(y|x) \approx \sum_{z \in \text{top-}k} p(z|x) \prod_{i=1}^{N} p(y_i|x, z, y_{<i})

  • RAG-Token:

p(y|x) \approx \prod_{i=1}^{N} \left[ \sum_{z \in \text{top-}k} p(z|x)\, p(y_i|x, z, y_{<i}) \right]

  • Negative Marginal Log-Likelihood (NLL) Objective:

\mathcal{L} = -\sum_j \log p(y_j|x_j)
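To connect the objective to the marginals above, here is a toy sketch of the batch NLL using the RAG-Sequence marginal; all scores are invented for illustration:

```python
import numpy as np
from scipy.special import logsumexp

# Invented mini-batch of two training pairs (x_j, y_j), k = 3 passages each.
log_p_z = np.log(np.array([[0.6, 0.3, 0.1],       # log p(z|x_j) per example
                           [0.4, 0.4, 0.2]]))
log_p_y_given_xz = np.array([[-3.0, -5.0, -8.0],  # log p(y_j|x_j, z)
                             [-2.5, -4.0, -6.0]])

# L = -Σ_j log p(y_j|x_j), each marginal computed as in RAG-Sequence.
# Gradients of this loss flow to both the generator and the query encoder;
# the document encoder and its index stay fixed.
nll = -logsumexp(log_p_z + log_p_y_given_xz, axis=1).sum()
print(nll)
```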

6. Broader Implications and Research Context

RAG marked a paradigm shift in knowledge-intensive NLP, enabling neural models to produce grounded, up-to-date, and verifiable outputs in open-domain and specialty tasks. Its modularity facilitates easy swapping or updating of retrieval corpora, making it well-suited for applications requiring knowledge agility, traceability, and transparency. RAG’s hybrid design has spurred wide adoption across research and industry, forming the foundation for subsequent work on multi-hop reasoning, multimodal retrieval, and efficient grounding in large language systems.

The ongoing research landscape builds on the RAG approach, investigating advances such as adaptive retrieval triggering, end-to-end retriever-generator training, scalable multi-hop reasoning, and integrating structured and unstructured external data—all extensions of the principle that augmenting generative models with non-parametric memory leads to more robust, factual, and useful AI systems.