- The paper introduces sufficient context as a lens to determine if RAG errors stem from LLM limitations or from an inadequately informative context.
- The paper presents an LLM-based autorater that achieves up to 93% accuracy with Gemini 1.5 Pro, offering a scalable method for labeling RAG benchmark data.
- The paper shows that combining the sufficient context signal with self-rated confidence improves the accuracy-coverage trade-off, while fine-tuning alone does not eliminate hallucinations.
This paper "Sufficient Context: A New Lens on Retrieval Augmented Generation Systems" (2411.06037) investigates a core challenge in Retrieval Augmented Generation (RAG) systems: understanding whether errors arise from the LLM's (LLM) inability to utilize provided context or from the context itself being insufficient to answer the query. The authors introduce the concept of "sufficient context" as a new lens to analyze RAG system performance and develop practical methods leveraging this concept.
The paper defines a query-context pair (Q, C) as having sufficient context if there exists a plausible answer A′ to Q that can be derived using only the information in C. This definition is distinct from entailment, which checks whether a specific given answer A is supported by the context; crucially, sufficient context does not require knowing the ground truth answer beforehand. The definition accommodates multi-hop reasoning, ambiguous queries (provided the context disambiguates them), and ambiguous contexts (provided they can be disambiguated).
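A minimal formal sketch of this contrast (the notation below is ours, not taken verbatim from the paper):

```latex
% Sufficient context asks whether *some* plausible answer is derivable from C alone,
% whereas entailment checks support for a *specific* candidate answer A.
\[
\textsc{SufficientContext}(Q, C) \iff \exists\, A' \;:\; \mathrm{Plausible}(A', Q) \,\wedge\, \mathrm{Derivable}(A' \mid C)
\]
\[
\textsc{Entailment}(A, C) \iff \mathrm{Derivable}(A \mid C) \quad \text{for a fixed candidate answer } A
\]
```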
To practically apply this concept, the authors develop a "sufficient context autorater" using LLMs to classify query-context pairs. They evaluate several methods, finding that Gemini 1.5 Pro (1-shot) achieves high accuracy (93%) on a challenging human-labeled dataset without requiring a ground truth answer. FLAMe, a smaller model, is identified as a cheaper alternative with good performance (89.2% F1). This autorater allows for scalable labeling of large datasets, enabling detailed analysis.
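As a rough illustration of how such an autorater can be wired up, here is a minimal prompt-based sketch; the prompt wording, the 1-shot example, and the `call_llm` helper are illustrative assumptions rather than the paper's exact prompt or implementation:

```python
# Minimal sketch of a prompt-based sufficient-context autorater.
# `call_llm` is any function mapping a prompt string to a text completion
# (e.g., a wrapper around Gemini 1.5 Pro or FLAMe).

ONE_SHOT_EXAMPLE = """Question: Who wrote "Pride and Prejudice"?
Context: Pride and Prejudice is an 1813 novel by the English author Jane Austen.
Sufficient: Yes"""

PROMPT_TEMPLATE = """You are given a question and a retrieved context.
Answer "Yes" if the context alone provides enough information to give a
plausible answer to the question, and "No" otherwise. Do not rely on outside
knowledge and do not assume access to a reference answer.

{example}

Question: {question}
Context: {context}
Sufficient:"""


def rate_sufficiency(question: str, context: str, call_llm) -> bool:
    """Return True if the autorater judges the context sufficient for the question."""
    prompt = PROMPT_TEMPLATE.format(example=ONE_SHOT_EXAMPLE,
                                    question=question, context=context)
    return call_llm(prompt).strip().lower().startswith("yes")
```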
Using the sufficient context autorater, the paper analyzes performance on standard RAG benchmarks: FreshQA, Musique-Ans, and HotPotQA. A key finding is that these datasets contain a significant fraction of instances with insufficient context (ranging from 22.6% to 55.4%, depending on the dataset and context length). By stratifying model performance based on context sufficiency (see the sketch after the list of findings below), the authors uncover nuanced model behaviors:
- Adding RAG paradoxically reduces model abstention rates compared to closed-book settings.
- Proprietary models (Gemini, GPT, Claude) perform much better with sufficient context (82.5% - 89.1% correct) but still hallucinate (12.7% - 14.3%).
- With insufficient context, proprietary models still hallucinate a substantial fraction of the time (15.4% - 40.4%) rather than always abstaining (abstention rates of 50.0% - 61.5%), with the split varying by model and dataset.
- Open-source models (Gemma) show much higher hallucination rates, especially with insufficient context.
- Surprisingly, models can still achieve notable correctness rates (35% - 62% in some cases) even with insufficient context, likely leveraging parametric knowledge or utilizing partial information from the context.
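This kind of stratified analysis can be reproduced, at a sketch level, by bucketing per-example outcomes by the autorater's label; the record fields below are illustrative assumptions, not the paper's actual data schema:

```python
from collections import Counter


def stratify_by_sufficiency(records):
    """Aggregate outcome rates separately for sufficient vs. insufficient context.

    Each record is assumed to be a dict with:
      - "sufficient": bool label produced by the sufficient-context autorater
      - "outcome":    one of "correct", "hallucinate", "abstain"
    """
    buckets = {True: Counter(), False: Counter()}
    for record in records:
        buckets[record["sufficient"]][record["outcome"]] += 1

    rates = {}
    for sufficient, counts in buckets.items():
        total = sum(counts.values())
        key = "sufficient" if sufficient else "insufficient"
        rates[key] = {outcome: (counts[outcome] / total if total else 0.0)
                      for outcome in ("correct", "hallucinate", "abstain")}
    return rates
```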
The paper qualitatively analyzes cases where models are correct despite insufficient context. These include Yes/No or limited-choice questions, cases where the context provides fragments or partial information that the model combines with parametric knowledge, instances requiring many reasoning hops, ambiguous queries where the correct interpretation is guessed, rater errors, or simply questions already known from pre-training.
Building on these insights, the authors explore methods to reduce hallucinations using the sufficient context signal.
- Selective Generation: They propose combining the binary sufficient context label with model self-rated confidence scores (P(True) or P(Correct)) in a simple logistic regression model that predicts hallucination risk; its output is thresholded for selective abstention (a minimal sketch appears below).
- Implementation involves using FLAMe for the sufficient context label and model-specific methods for confidence scores.
- The approach allows for a controllable trade-off between coverage (proportion of questions answered) and selective accuracy (accuracy among answered questions).
- Results show that combining sufficient context with confidence generally improves the accuracy-coverage trade-off compared to using confidence alone, with gains up to 10% in accuracy for some models/datasets at high coverage levels.
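A minimal sketch of this selective-generation recipe, assuming a held-out calibration set with per-example sufficiency labels, confidence scores, and correctness labels (scikit-learn is used here for convenience; the paper's exact features, prediction target, and calibration protocol may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_abstention_model(sufficient, confidence, is_correct):
    """Fit a two-feature logistic model on calibration data.

    sufficient: 0/1 autorater labels (e.g., from FLAMe)
    confidence: self-rated confidence scores (e.g., P(True) or P(Correct))
    is_correct: 0/1 labels for whether the model's answer was correct
    """
    X = np.column_stack([sufficient, confidence])
    return LogisticRegression().fit(X, np.asarray(is_correct))


def coverage_and_selective_accuracy(model, sufficient, confidence, is_correct, threshold):
    """Answer only when the predicted correctness probability clears the threshold."""
    X = np.column_stack([sufficient, confidence])
    p_correct = model.predict_proba(X)[:, 1]
    answered = p_correct >= threshold
    coverage = answered.mean()
    selective_acc = np.asarray(is_correct)[answered].mean() if answered.any() else float("nan")
    return coverage, selective_acc
```

Sweeping the threshold traces out the accuracy-coverage curve used to compare against a confidence-only baseline.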
- Fine-tuning: They fine-tune Llama 3.1 8B and Mistral 7B v0.3 using LoRA on different data mixtures (see the data-mixture sketch after these bullets). Some mixtures replace the ground truth answer with "I don't know", either for a random subset of instances or specifically for instances the autorater labels as having insufficient context.
- The goal is to steer the model towards abstaining when uncertain or when context is insufficient.
- Results show that including "I don't know" in the training data increases abstention rates compared to standard fine-tuning. However, these fine-tuned models often still exhibit high hallucination rates (at least 31%) and do not consistently outperform vanilla RAG or closed-book baselines, either in reducing hallucinations overall or in meaningfully improving the correct-to-hallucination ratio when context is insufficient.
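A sketch of how such a training mixture could be constructed from autorater-labeled data; the field names, the abstention string handling, and the 20% default for the random strategy are illustrative assumptions:

```python
import random

ABSTAIN_TARGET = "I don't know"


def build_finetuning_mixture(examples, strategy="insufficient", idk_fraction=0.2, seed=0):
    """Replace some ground-truth answers with an abstention target.

    Each example is assumed to be a dict with "question", "context", "answer",
    and a bool "sufficient" label from the autorater.
      strategy="insufficient": relabel exactly the autorater-flagged insufficient instances
      strategy="random":       relabel a random `idk_fraction` of all instances
    """
    rng = random.Random(seed)
    mixture = []
    for ex in examples:
        target = ex["answer"]
        if strategy == "insufficient" and not ex["sufficient"]:
            target = ABSTAIN_TARGET
        elif strategy == "random" and rng.random() < idk_fraction:
            target = ABSTAIN_TARGET
        mixture.append({"question": ex["question"],
                        "context": ex["context"],
                        "target": target})
    return mixture
```

The resulting (question, context, target) triples would then feed a standard LoRA fine-tuning setup.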
In summary, the paper introduces "sufficient context" as a valuable concept for analyzing RAG performance. It demonstrates how an LLM-based autorater can classify context sufficiency scalably. Their analysis reveals that current models struggle with hallucinations even with sufficient context and often fail to abstain when context is insufficient. Leveraging the sufficient context signal in a selective generation framework offers a practical method to improve the accuracy-coverage trade-off in RAG systems. Fine-tuning experiments suggest that simply labeling insufficient context examples with "I don't know" is not a silver bullet for reducing hallucinations in smaller models.