Retrieval Augmented Fine Tuning (RAFT)
- RAFT is a paradigm that fine-tunes retrievers and generators using synthetic and curated distractor data to boost retrieval fidelity and answer grounding.
- It employs dual-model fusion with contrastive or supervised losses to balance specialized domain adaptation with general performance.
- RAFT improves robustness and low-resource retrieval by simulating open-book QA scenarios with explicit relevance and chain-of-thought training.
Retrieval Augmented Fine Tuning (RAFT) is a paradigm designed to enhance the performance of information retrieval and retrieval-augmented LLMs, especially in contexts where labeled data is scarce, domains are specialized, or robustness to noisy context is required. RAFT extends the core ideas of Retrieval-Augmented Generation (RAG) by explicitly fine-tuning either the retriever, the generator, or both, in the presence of synthetic or real distractors, with the central goal of improving both in-domain retrieval fidelity and downstream answer grounding.
1. Core Definition and Rationale
RAFT is a post-pretraining adaptation strategy for retrieval-augmented neural pipelines (such as RAG), which consists of fine-tuning the retriever or generator models (or both) using supervised or contrastive losses, on data that reflects the open-book QA scenario encountered during inference. RAFT addresses two principal shortcomings of pretraining and naive retrieval methods:
- Domain shift: Pretrained models often fail to capture domain-specific terminology or relationships.
- Robustness: Real-world retrieval is imperfect; irrelevant or misleading ("distractor") documents are often returned.
RAFT resolves these via two complementary mechanisms:
- Targeted fine-tuning using synthetic or curated data that simulates the actual domain retrieval challenge, including distractors.
- Explicit supervision for relevance selection or chain-of-thought (CoT) explanations that teach models to focus on answer-supporting context and ignore noise.
In RAFT, the fine-tuning regime ensures models can (i) withstand noisy, incomplete, or adversarial retrieval, (ii) adapt to low-resource domains, and (iii) support explainability via explicit citation and reasoning.
2. Key Methodological Components
2.1 Synthetic Data Augmentation
Many RAFT approaches leverage synthetic data generation to overcome the scarcity of human-annotated relevance labels or QA pairs in new domains (Gupta et al., 16 Oct 2024). A typical pipeline includes:
- Query Generation: For each domain document $d$, an LLM generates synthetic queries $q$, forming positive query-document pairs $(q, d^+)$.
- Contrastive Triplet Formation: For each query, hard negatives $d^-$ (semantically similar yet non-relevant documents) are selected using a base embedding model, yielding triplets $(q, d^+, d^-)$.
- Chain-of-Thought Data: For QA, answers are often generated in a multi-step CoT format, with explicit citations or quotes from the supporting document(s) (Zhang et al., 15 Mar 2024, Zhao et al., 22 Jul 2024).
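The query-generation and hard-negative-mining steps above can be sketched in a few lines. This is a toy illustration, not the pipeline from the cited papers: the hashed bag-of-words "embedding" stands in for a real base embedding model, and the corpus, query, and helper names are all invented for the example.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy stand-in for a base embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mine_hard_negative(query: str, positive: str, corpus: list[str]) -> str:
    # Hard negative: the most query-similar document that is NOT the positive.
    candidates = [d for d in corpus if d != positive]
    return max(candidates, key=lambda d: cosine(embed(query), embed(d)))

corpus = [
    "raft fine tunes models with distractor documents",
    "raft rivers are navigated with wooden rafts",
    "bananas are rich in potassium",
]
query = "how does raft training use distractor documents"
positive = corpus[0]
negative = mine_hard_negative(query, positive, corpus)
triplet = (query, positive, negative)  # (q, d+, d-) contrastive triplet
```

With a real embedding model the same logic applies: the hard negative is the non-relevant document the retriever finds hardest to distinguish from the positive, which is exactly what makes the contrastive signal informative.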
2.2 Model Fusion and Fine-Tuning
Standard fine-tuning on domain data risks catastrophic forgetting of general capabilities. To counter this, model fusion strategies are introduced (Gupta et al., 16 Oct 2024):
- Dual-Model Fusion: Combine the embeddings from a fine-tuned model and a frozen base model by linear interpolation:
$$e_{\text{fused}} = \lambda\, e_{\text{ft}} + (1-\lambda)\, e_{\text{base}},$$
where $\lambda \in [0,1]$ controls the blend between the new (domain) and old (general) distributions.
- Contrastive or Supervised Loss: A contrastive InfoNCE loss or sequence-to-sequence NLL is computed over the fused embedding or generator output.
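The fusion and contrastive-loss components above can be sketched together. This is a minimal numpy illustration under assumed conventions: random vectors stand in for model embeddings, and the `fuse`/`info_nce` names and the temperature value are choices made for the example, not APIs from the cited work.

```python
import numpy as np

def fuse(e_ft: np.ndarray, e_base: np.ndarray, lam: float = 0.5) -> np.ndarray:
    # Dual-model fusion: linear interpolation of fine-tuned and frozen base
    # embeddings; lam=1 recovers the fine-tuned model, lam=0 the base model.
    return lam * e_ft + (1.0 - lam) * e_base

def info_nce(e_q, e_pos, e_negs, tau=0.07):
    # Contrastive InfoNCE loss over cosine similarities: the positive
    # document competes against the mined hard negatives.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = np.array([cos(e_q, e_pos)] + [cos(e_q, n) for n in e_negs]) / tau
    sims -= sims.max()  # stabilize before exponentiating
    return float(-np.log(np.exp(sims[0]) / np.exp(sims).sum()))

rng = np.random.default_rng(0)
e_q_ft, e_q_base = rng.normal(size=8), rng.normal(size=8)
e_q = fuse(e_q_ft, e_q_base, lam=0.7)            # fused query embedding
e_pos = e_q + 0.1 * rng.normal(size=8)           # aligned positive document
e_negs = [rng.normal(size=8) for _ in range(4)]  # mined hard negatives
loss = info_nce(e_q, e_pos, e_negs)
```

The loss is small when the query embedding is close to the positive and far from the negatives; during fine-tuning only the domain model's parameters move, while the frozen base term anchors the fused embedding near the general distribution.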
2.3 Distractor-Aware Training
A core innovation is training with a mixture of "oracle" (answer-containing) documents and distractor (irrelevant or noisy) documents. For a fraction $1-P$ of training examples, the oracle is omitted to emulate the real-world absence of relevant retrieved context (Zhang et al., 15 Mar 2024).
- Objective: The model learns to (i) extract answers only from relevant context when it is available, and (ii) abstain or fall back on parametric knowledge when it is not.
- Explicit Citation: Targets often include verbatim context quotes and stepwise reasoning to force correct grounding and justification.
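A distractor-aware training example can be assembled as follows. This is a sketch under stated assumptions: the `##Quote`/`##Answer` target template is illustrative (the cited papers use their own citation markup), and the function and variable names are invented for the example.

```python
import random

def build_raft_example(question, oracle_doc, distractor_pool, k=3,
                       p_oracle=0.8, rng=None):
    # With probability p_oracle, mix the oracle document in with k
    # distractors; otherwise provide distractors only, so the target
    # teaches the model to abstain (or fall back on parametric knowledge).
    rng = rng or random.Random()
    distractors = rng.sample(distractor_pool, k)
    if rng.random() < p_oracle:
        context = distractors + [oracle_doc]
        rng.shuffle(context)
        # Illustrative target template: quote the oracle, then reason.
        target = f'##Quote: "{oracle_doc}" ##Answer: ...'
    else:
        context = distractors
        target = "##Answer: insufficient context"
    return {"question": question, "context": context, "target": target}

pool = [f"distractor {i}" for i in range(10)]
with_oracle = build_raft_example("What is RAFT?", "oracle doc", pool,
                                 p_oracle=1.0, rng=random.Random(0))
without = build_raft_example("What is RAFT?", "oracle doc", pool,
                             p_oracle=0.0, rng=random.Random(0))
```

Setting `p_oracle` below 1 is what forces the model to decide, per example, whether the context actually supports an answer rather than always extracting something.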
2.4 Retrievers and Joint vs. Independent Fine-Tuning
RAFT encompasses both retriever-side and generator-side adaptation. There are several strategies (Lawton et al., 2 Oct 2025):
- Independent Fine-Tuning: The retriever and generator are each tuned separately, often using contrastive ranking loss (embedding) and NLL (generation).
- Joint Fine-Tuning: End-to-end differentiable objectives update both modules simultaneously, usually with RAG-Token/RAG-Sequence-style marginal likelihood losses.
- Model Fusion: Additional module merging (e.g., "cocktail" or mixup) may be applied to prevent domain overfitting.
The optimal strategy depends on the availability of (question, context) labels, computational constraints, and the tolerance for hyperparameter search (e.g., separate learning-rate grids for each module).
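The joint-fine-tuning objective above marginalizes the answer likelihood over retrieved documents. The following numpy sketch computes a RAG-Sequence-style marginal NLL from made-up retriever scores and generator log-probabilities; the function name and numbers are assumptions for illustration, not values from the cited work.

```python
import numpy as np

def rag_sequence_nll(doc_scores, answer_logprobs_given_doc):
    # RAG-Sequence-style marginal NLL: p(y|x) = sum_d p(d|x) * p(y|x,d).
    # doc_scores: unnormalized retriever scores for each retrieved document.
    # answer_logprobs_given_doc: log p(y | x, d) from the generator, per doc.
    doc_scores = np.asarray(doc_scores, dtype=float)
    p_doc = np.exp(doc_scores - doc_scores.max())
    p_doc /= p_doc.sum()                         # softmax over retrieved docs
    p_answer = np.exp(np.asarray(answer_logprobs_given_doc))
    marginal = float((p_doc * p_answer).sum())   # marginalize over documents
    return -np.log(marginal)

scores = [2.0, 0.5, -1.0]    # retriever favors the first document
gen_lp = [-0.2, -3.0, -5.0]  # generator answers well from that document
nll = rag_sequence_nll(scores, gen_lp)
```

Because both the document posterior and the per-document answer likelihood appear in one differentiable expression, gradients flow into the retriever and the generator simultaneously, which is what distinguishes joint from independent fine-tuning.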
3. Mathematical Formulation
The following mathematical framework captures the RAFT paradigm (Gupta et al., 16 Oct 2024, Zhang et al., 15 Mar 2024):
- Embedding Fusion:
$$e_{\text{fused}} = \lambda\, e_{\text{ft}} + (1-\lambda)\, e_{\text{base}}$$
- Contrastive Loss (InfoNCE):
$$\mathcal{L}_{\text{con}} = -\log \frac{\exp\left(\mathrm{sim}(e_q, e_{d^+})/\tau\right)}{\sum_{d \in \{d^+\} \cup \mathcal{N}(q)} \exp\left(\mathrm{sim}(e_q, e_d)/\tau\right)}$$
where $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\mathcal{N}(q)$ is the set of hard negatives for query $q$, and $\tau$ is a temperature hyperparameter.
- Seq2Seq SFT Objective:
$$\mathcal{L}_{\text{SFT}} = -\sum_{t} \log p_\theta\left(y_t \mid y_{<t}, x\right)$$
for chain-of-thought answer targets $y$ given retrieval-augmented input $x$ (question plus retrieved documents).
- Training Examples: For a fraction $P$ of the training set:
$$\left(q,\; d^{*},\; d_1, \dots, d_k\right) \longrightarrow y^{*}$$
where the oracle document $d^{*}$ appears alongside $k$ distractors. For the remaining $1-P$:
$$\left(q,\; d_1, \dots, d_k\right) \longrightarrow y^{*}$$
where only distractors are provided, so the model may choose to abstain.
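As a concrete check of the SFT objective, the per-token form reduces to negating the sum of log-probabilities under teacher forcing. The token probabilities below are made up for a three-token target:

```python
import math

def seq2seq_nll(token_logprobs):
    # L_SFT = -sum_t log p(y_t | y_<t, x): negated sum of per-token
    # log-probabilities under teacher forcing.
    return -sum(token_logprobs)

# Hypothetical per-token log-probs for a short chain-of-thought target,
# conditioned on the retrieval-augmented input x.
lp = [math.log(0.9), math.log(0.8), math.log(0.95)]
nll = seq2seq_nll(lp)
```

A confident model assigns probabilities near 1 to every target token, driving the per-token terms, and hence the total loss, toward zero.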
4. Empirical Results and Impact
RAFT and its variants have demonstrated consistent, often substantial improvements over both prompt-based RAG and classic supervised fine-tuning, across a diverse range of domains and tasks:
- Synthetic Data Scarcity: Even with very limited fine-tuning data, synthetic augmentation enables fine-tuned models to outperform vanilla pretrained alternatives in domain-specific retrieval (Gupta et al., 16 Oct 2024).
- Model Fusion: Blending domain and general embeddings preserves out-of-domain recall, mitigating catastrophic forgetting (Gupta et al., 16 Oct 2024).
- QA/CoT Performance: Full RAFT, incorporating chain-of-thought and distractor training, yields strong improvements in F1, EM, and recall across PubMed, HotpotQA, SQUAD, and proprietary domains; e.g., up to +6.58% Recall@3 on SQUAD, +35.3% F1 on HotpotQA, with gains persisting in both in-domain and transfer settings (Gupta et al., 16 Oct 2024, Zhang et al., 15 Mar 2024, Zhao et al., 22 Jul 2024).
- Robustness: Models trained with distractors generalize to imperfect, adversarial, and out-of-distribution retrieval settings; gains hold for both open-domain and highly structured tabular datasets.
| Dataset | Baseline Recall@3 | REFINE Recall@3 | Relative Gain |
|---|---|---|---|
| TOURISM | 0.884 | 0.937 | +5.79% |
| SQUAD | 0.866 | 0.923 | +6.58% |
| RAG-12000 | 0.937 | 0.940 | +0.32% |
5. Broader Implications, Generalization, and Trade-offs
RAFT has established itself as a robust, generalizable approach for retrieval adaptation:
- Low-Resource Adaptation: Synthetic data and fusion allow adaptation even with minimal labeled data, which is typical in proprietary enterprise and technical domains (Gupta et al., 16 Oct 2024).
- Out-of-Domain Preservation: Model fusion strategies are specifically designed to prevent overfitting and maintain or enhance recall in "zero-shot" or transfer scenarios.
- Workflow Integration: RAFT is model-agnostic—it can be applied to BERT-derived retrievers, transformer-based generators, and is compatible with vector databases, BM25 retrievers, and advanced fusion schemes.
- Limitations: Effectiveness depends on the quality of synthetic QA/data and the careful balancing of fusion and distractor ratios. In settings where full gold context is always provided, naive fine-tuning may suffice, but in real-world retrieval, RAFT's design offers substantial practical advantages.
6. Relation to Other Variants and Ongoing Research
Variants and related concepts informed by RAFT principles include:
- ALoFTRAG: Automated local RAFT, using only in-domain unlabeled data and LoRA fine-tuning, optimizing for privacy/security (Devine, 21 Jan 2025).
- CRAFT: Compute-efficient RAFT, combining LoRA with RAFT for rapid, storage-efficient adapter switching and low-resource deployment (Chung et al., 26 Sep 2024).
- RbFT: Robust Fine-Tuning, further extending RAFT to address adversarial or misleading (counterfactual) retrieval, via explicit detection and utility extraction tasks (Tu et al., 30 Jan 2025).
- GraphRAFT: Application of RAFT for knowledge graphs, fine-tuning LLMs to generate provably correct Cypher/SPARQL queries for complex, multi-hop graph queries (Clemedtson et al., 7 Apr 2025).
- SLIM-RAFT: A simplified RAFT for low-resource, cross-linguistic taxonomic tasks, demonstrating substantial enhancement with minimal model size (Oliveira et al., 7 Aug 2024).
RAFT has become a foundational methodology for domain adaptation and robustness in retrieval-augmented workflows, forming the methodological substrate for current best practices in both academia and industry.