This paper introduces Self-Reflective Retrieval-Augmented Generation (Self-RAG), a framework designed to enhance the factual accuracy and overall quality of LLM generations without sacrificing their versatility. The core problem addressed is that standard LLMs often produce factual errors, and traditional Retrieval-Augmented Generation (RAG) methods, while helpful, retrieve information indiscriminately, which can hinder performance on tasks not requiring factual grounding or lead to outputs inconsistent with retrieved evidence.
Self-RAG trains an LLM to adaptively retrieve relevant text passages on-demand and to self-reflect on its own generated output using special "reflection tokens". These tokens are integrated into the generation process and fall into two categories:
- Retrieval Tokens (`[Retrieval]`): These tokens signal whether retrieving external information would be beneficial for generating the next segment of text. Values include `Yes`, `No`, or `Continue` (to reuse previously retrieved evidence).
- Critique Tokens (`[Critique]`): These tokens evaluate the quality of the generation process. There are three types:
  - Is Relevant (`[IsRel]`): Assesses whether a retrieved passage is relevant to the query (`Relevant`, `Irrelevant`).
  - Is Supported (`[IsSup]`): Evaluates whether the generated text segment is fully supported, partially supported, or not supported by the retrieved passage (`Fully supported`, `Partially supported`, `No support`).
  - Is Useful (`[IsUse]`): Judges the overall usefulness or quality of the generated response segment on a scale (e.g., 1-5).
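For concreteness, the reflection-token vocabulary can be represented as a small lookup table. The sketch below is purely illustrative; the exact surface forms of the special tokens added to the tokenizer are an implementation detail and may differ from the released models.

```python
# Illustrative encoding of the reflection-token vocabulary described above.
# The exact token strings are assumptions for exposition.
REFLECTION_TOKENS = {
    # Retrieval tokens: should the model fetch external evidence next?
    "Retrieval": ["Yes", "No", "Continue"],  # Continue = reuse prior evidence
    # Critique tokens: grade the retrieved passage and the generated segment.
    "IsRel": ["Relevant", "Irrelevant"],
    "IsSup": ["Fully supported", "Partially supported", "No support"],
    "IsUse": ["5", "4", "3", "2", "1"],      # overall usefulness rating
}
```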
Inference Process:
- Given an input prompt and preceding text, the Self-RAG model first predicts a `[Retrieval]` token.
- If `[Retrieval]=No`, it generates the next text segment like a standard LLM.
- If `[Retrieval]=Yes`, it calls a retriever (R) to fetch relevant passages and then processes multiple retrieved passages in parallel:
  - Predicts the relevance (`[IsRel]`) of each passage.
  - Generates a candidate output segment conditioned on that passage.
  - Predicts the support level (`[IsSup]`) of the candidate segment given the passage.
  - Predicts the overall usefulness (`[IsUse]`) of the candidate segment.
- A segment-level beam search ranks the candidate segments by a weighted score combining the probabilities of the desired critique tokens (`Relevant`, `Fully supported`, a high `[IsUse]` rating, etc.). This allows selecting the best-supported, most relevant, and most useful continuation.
- The process repeats for subsequent segments.
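A minimal sketch of this decision procedure is shown below. The callables `predict_retrieval`, `retrieve`, `generate_with_critique`, and `generate_plain` are hypothetical stand-ins for the trained model M and the retriever R (not the paper's API), and for simplicity the sketch keeps only the single best candidate per segment rather than maintaining a full beam. Following the paper, each candidate's rank combines the segment's likelihood with weighted critique-token probabilities.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    """One candidate continuation generated from one retrieved passage."""
    passage: str
    segment: str
    p_isrel: float   # P([IsRel] = Relevant)
    p_issup: float   # P([IsSup] = Fully supported)
    p_isuse: float   # normalized usefulness score from [IsUse]
    logprob: float   # log-likelihood of the segment tokens

def score(c: Candidate, w_rel: float = 1.0, w_sup: float = 1.0, w_use: float = 0.5) -> float:
    # Weighted critique score used to rank candidates; the weights can be
    # changed at test time to trade off, e.g., support against usefulness.
    return c.logprob + w_rel * c.p_isrel + w_sup * c.p_issup + w_use * c.p_isuse

def generate_segment(
    x: str,
    prior: str,
    predict_retrieval: Callable[[str, str], bool],
    retrieve: Callable[[str, str], List[str]],
    generate_with_critique: Callable[[str, str, str], Candidate],
    generate_plain: Callable[[str, str], str],
) -> str:
    # 1. Decide whether external evidence is needed for the next segment.
    if not predict_retrieval(x, prior):              # [Retrieval] = No
        return generate_plain(x, prior)
    # 2. [Retrieval] = Yes: expand one candidate per retrieved passage.
    candidates = [generate_with_critique(x, prior, d) for d in retrieve(x, prior)]
    # 3. Keep the candidate with the best combined likelihood + critique score.
    return max(candidates, key=score).segment
```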
Training Process:
Self-RAG involves training two main components: a Critic model and the final Generator model (M).
- Critic Training:
  - A dataset is created by prompting a powerful LLM (like GPT-4) with specific instructions to generate the desired reflection tokens for various inputs, outputs, and retrieved passages.
  - A smaller LM (e.g., Llama2-7B) is fine-tuned on this dataset to act as the Critic model, learning to predict appropriate reflection tokens (both training stages are sketched in code after this list).
- Generator Training:
  - The trained Critic model and a retriever (R) are used offline to augment a diverse instruction-following dataset. For each instance, the Critic inserts retrieval and critique tokens, along with relevant passages where needed, into the target output sequence.
  - The Generator model (M) (e.g., Llama2-7B/13B) is then trained on this augmented corpus using a standard next-token prediction objective. The vocabulary is expanded to include the reflection tokens, and the loss is masked over the retrieved passage text. This teaches the Generator model to generate both the task output and the reflection tokens itself, eliminating the need for the separate Critic model during inference.
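The sketch below illustrates, under simplifying assumptions, how the two stages fit together: collecting critic labels from a strong LLM, then assembling loss-masked training sequences for the Generator. All function names, prompt wording, and token strings (e.g., `[Retrieval=Yes]`, `<p>...</p>`) are illustrative stand-ins, not the paper's exact data format or released code.

```python
# Stage 1: collect critic supervision by asking a strong LLM (the paper uses
# GPT-4) to emit a reflection token for a given context, then fine-tune a
# smaller critic LM on the resulting (prompt, label) pairs.
def build_critic_example(instruction: str, evidence: str, output: str, teacher_label: str) -> dict:
    prompt = (
        "Decide whether the output is supported by the evidence. Answer with one of: "
        "Fully supported, Partially supported, No support.\n"
        f"Instruction: {instruction}\nEvidence: {evidence}\nOutput: {output}\nLabel:"
    )
    return {"prompt": prompt, "target": teacher_label}

# Stage 2: the trained critic and the retriever interleave reflection tokens
# and passages into each target output offline; the generator is then trained
# with next-token prediction, masking the loss over retrieved passage text.
def build_generator_example(prompt: str, segments: list) -> tuple:
    """`segments` holds (passage_or_None, segment_text, critique_tokens)
    triples produced by the critic and retriever."""
    tokens, loss_mask = [prompt], [0]                  # no loss on the input
    for passage, text, critique in segments:
        if passage is None:
            tokens += ["[Retrieval=No]", text]
            loss_mask += [1, 1]
        else:
            # The passage is visible as context but excluded from the loss.
            tokens += ["[Retrieval=Yes]", f"<p>{passage}</p>", critique, text]
            loss_mask += [1, 0, 1, 1]
    return tokens, loss_mask
```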
Key Features and Benefits:
- Adaptive Retrieval: Retrieves information only when deemed necessary, preserving the LLM's abilities on tasks not requiring external knowledge.
- Self-Correction/Critique: Explicitly evaluates relevance, factual grounding (support), and usefulness during generation.
- Controllability: The inference process can be customized by adjusting weights for different critique aspects (e.g., prioritizing factuality vs. fluency) or by setting thresholds for retrieval frequency, without retraining (a small configuration sketch follows this list).
- Improved Factuality and Citation: Generates outputs more faithful to retrieved evidence and provides better attribution through the `[IsSup]` token.
- Efficiency: Training involves standard LM objectives after offline data augmentation, avoiding the complexities and costs of online reinforcement learning (like RLHF/PPO).
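As a rough illustration of this controllability, different decode-time settings can steer the same checkpoint toward different behaviors; the configuration fields below are assumptions that reuse the weights from the inference sketch above.

```python
# Hypothetical decode-time configurations; the same trained checkpoint
# serves both behaviors without retraining.

# Citation-heavy long-form answers: weight factual support strongly and
# retrieve eagerly.
factual_cfg = {"w_rel": 1.0, "w_sup": 2.0, "w_use": 0.5, "retrieval_threshold": 0.2}

# Open-ended writing: retrieve sparingly and favor overall usefulness.
fluent_cfg = {"w_rel": 0.5, "w_sup": 0.5, "w_use": 1.5, "retrieval_threshold": 0.8}

def should_retrieve(p_retrieval_yes: float, cfg: dict) -> bool:
    # Retrieval fires only when the model's own probability of
    # [Retrieval]=Yes exceeds the configured threshold.
    return p_retrieval_yes > cfg["retrieval_threshold"]
```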
Experiments and Results:
Self-RAG (using Llama2 7B and 13B) was evaluated on various tasks, including open-domain QA (PopQA, TriviaQA), closed-set QA/reasoning (PubHealth, ARC-Challenge), and long-form generation with citation (Biography generation with FactScore, ALCE-ASQA).
- Self-RAG significantly outperformed standard Llama2 and Alpaca models, as well as conventional RAG approaches applied to these models.
- It outperformed ChatGPT and retrieval-augmented Llama2-chat on several tasks, particularly in factuality and citation accuracy on long-form generation.
- Ablation studies confirmed the benefits of adaptive retrieval, critique tokens, and the segment-level beam search guided by critiques.
- The framework demonstrated effective test-time customization by adjusting critique weights and retrieval thresholds.
In conclusion, Self-RAG presents a novel method for training LLMs to leverage retrieval more effectively and reflect on their outputs, leading to improved factuality, quality, and controllability compared to existing approaches.