- The paper introduces sufficient context as a lens to determine if RAG errors stem from LLM limitations or from an inadequately informative context.
- The paper presents an LLM-based autorater that achieves up to 93% accuracy with Gemini 1.5 Pro, offering a scalable method for labeling RAG benchmark data.
- The paper shows that combining the sufficient context signal with self-rated confidence improves the accuracy-coverage trade-off, while fine-tuning alone does not eliminate hallucinations.
This paper "Sufficient Context: A New Lens on Retrieval Augmented Generation Systems" (2411.06037) investigates a core challenge in Retrieval Augmented Generation (RAG) systems: understanding whether errors arise from the LLM's (LLM) inability to utilize provided context or from the context itself being insufficient to answer the query. The authors introduce the concept of "sufficient context" as a new lens to analyze RAG system performance and develop practical methods leveraging this concept.
The paper defines a query-context pair (Q, C) as having sufficient context if there exists a plausible answer A′ to Q that can be derived using only the information in C. This definition is distinct from entailment, which checks whether a specific given answer A is supported by the context; crucially, sufficient context does not require knowing the ground truth answer beforehand. The definition accommodates multi-hop reasoning, ambiguous queries (provided the context disambiguates them), and ambiguous contexts (provided they can be disambiguated).
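A minimal formal sketch of this contrast (the notation below is ours, not taken verbatim from the paper):

```latex
% Sufficient context asks whether *some* plausible answer is derivable from C alone,
% whereas entailment checks support for a *specific* candidate answer A.
\[
\textsc{SufficientContext}(Q, C) \iff \exists\, A' \;:\; \mathrm{Plausible}(A', Q) \,\wedge\, \mathrm{Derivable}(A' \mid C)
\]
\[
\textsc{Entailment}(A, C) \iff \mathrm{Derivable}(A \mid C) \quad \text{for a fixed candidate answer } A
\]
```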
To practically apply this concept, the authors develop a "sufficient context autorater" using LLMs to classify query-context pairs. They evaluate several methods, finding that Gemini 1.5 Pro (1-shot) achieves high accuracy (93%) on a challenging human-labeled dataset without requiring a ground truth answer. FLAMe, a smaller model, is identified as a cheaper alternative with good performance (89.2% F1). This autorater allows for scalable labeling of large datasets, enabling detailed analysis.
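As a rough illustration of how such an autorater can be wired up, here is a minimal prompt-based sketch; the prompt wording, the 1-shot example, and the `call_llm` helper are illustrative assumptions rather than the paper's exact prompt or implementation:

```python
# Minimal sketch of a prompt-based sufficient-context autorater.
# `call_llm` is any function mapping a prompt string to a text completion
# (e.g., a wrapper around Gemini 1.5 Pro or FLAMe).

ONE_SHOT_EXAMPLE = """Question: Who wrote "Pride and Prejudice"?
Context: Pride and Prejudice is an 1813 novel by the English author Jane Austen.
Sufficient: Yes"""

PROMPT_TEMPLATE = """You are given a question and a retrieved context.
Answer "Yes" if the context alone provides enough information to give a
plausible answer to the question, and "No" otherwise. Do not rely on outside
knowledge and do not assume access to a reference answer.

{example}

Question: {question}
Context: {context}
Sufficient:"""


def rate_sufficiency(question: str, context: str, call_llm) -> bool:
    """Return True if the autorater judges the context sufficient for the question."""
    prompt = PROMPT_TEMPLATE.format(example=ONE_SHOT_EXAMPLE,
                                    question=question, context=context)
    return call_llm(prompt).strip().lower().startswith("yes")
```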
Using the sufficient context autorater, the paper analyzes performance on standard RAG benchmarks: FreshQA, Musique-Ans, and HotPotQA. A key finding is that these datasets contain a significant fraction of instances with insufficient context (ranging from 22.6% to 55.4%, depending on the dataset and context length). By stratifying model performance based on context sufficiency (see the sketch after the list of findings below), the authors uncover nuanced model behaviors:
- Adding RAG paradoxically reduces model abstention rates compared to closed-book settings.
- Proprietary models (Gemini, GPT, Claude) perform much better with sufficient context (82.5% - 89.1% correct) but still hallucinate (12.7% - 14.3%).
- With insufficient context, proprietary models still hallucinate a substantial fraction of the time (15.4% - 40.4%) rather than always abstaining (abstention rates of 50.0% - 61.5%), with the split varying by model and dataset.
- Open-source models (Gemma) show much higher hallucination rates, especially with insufficient context.
- Surprisingly, models can still achieve notable correctness rates (35% - 62% in some cases) even with insufficient context, likely leveraging parametric knowledge or utilizing partial information from the context.
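This kind of stratified analysis can be reproduced, at a sketch level, by bucketing per-example outcomes by the autorater's label; the record fields below are illustrative assumptions, not the paper's actual data schema:

```python
from collections import Counter


def stratify_by_sufficiency(records):
    """Aggregate outcome rates separately for sufficient vs. insufficient context.

    Each record is assumed to be a dict with:
      - "sufficient": bool label produced by the sufficient-context autorater
      - "outcome":    one of "correct", "hallucinate", "abstain"
    """
    buckets = {True: Counter(), False: Counter()}
    for record in records:
        buckets[record["sufficient"]][record["outcome"]] += 1

    rates = {}
    for sufficient, counts in buckets.items():
        total = sum(counts.values())
        key = "sufficient" if sufficient else "insufficient"
        rates[key] = {outcome: (counts[outcome] / total if total else 0.0)
                      for outcome in ("correct", "hallucinate", "abstain")}
    return rates
```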
The paper qualitatively analyzes cases where models are correct despite insufficient context. These include Yes/No or limited-choice questions, cases where the context provides fragments or partial information that the model combines with parametric knowledge, instances requiring many reasoning hops, ambiguous queries where the correct interpretation is guessed, rater errors, or simply questions already known from pre-training.
Building on these insights, the authors explore methods to reduce hallucinations using the sufficient context signal.
- Selective Generation: They propose combining the binary sufficient context label with model self-rated confidence scores (P(True) or P(Correct)) in a simple logistic regression model that predicts hallucination risk; its output is thresholded for selective abstention (a minimal sketch appears below).
- Implementation involves using FLAMe for the sufficient context label and model-specific methods for confidence scores.
- The approach allows for a controllable trade-off between coverage (proportion of questions answered) and selective accuracy (accuracy among answered questions).
- Results show that combining sufficient context with confidence generally improves the accuracy-coverage trade-off compared to using confidence alone, with gains up to 10% in accuracy for some models/datasets at high coverage levels.
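A minimal sketch of this selective-generation recipe, assuming a held-out calibration set with per-example sufficiency labels, confidence scores, and correctness labels (scikit-learn is used here for convenience; the paper's exact features, prediction target, and calibration protocol may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_abstention_model(sufficient, confidence, is_correct):
    """Fit a two-feature logistic model on calibration data.

    sufficient: 0/1 autorater labels (e.g., from FLAMe)
    confidence: self-rated confidence scores (e.g., P(True) or P(Correct))
    is_correct: 0/1 labels for whether the model's answer was correct
    """
    X = np.column_stack([sufficient, confidence])
    return LogisticRegression().fit(X, np.asarray(is_correct))


def coverage_and_selective_accuracy(model, sufficient, confidence, is_correct, threshold):
    """Answer only when the predicted correctness probability clears the threshold."""
    X = np.column_stack([sufficient, confidence])
    p_correct = model.predict_proba(X)[:, 1]
    answered = p_correct >= threshold
    coverage = answered.mean()
    selective_acc = np.asarray(is_correct)[answered].mean() if answered.any() else float("nan")
    return coverage, selective_acc
```

Sweeping the threshold traces out the accuracy-coverage curve used to compare against a confidence-only baseline.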
- Fine-tuning: They fine-tune Llama 3.1 8B and Mistral 7B v0.3 using LoRA on different data mixtures (see the data-mixture sketch after these bullets). Some mixtures replace the ground truth answer with "I don't know", either for a random subset of instances or specifically for instances the autorater labels as having insufficient context.
- The goal is to steer the model towards abstaining when uncertain or when context is insufficient.
- Results show that including "I don't know" in the training data increases abstention rates compared to standard fine-tuning. However, these fine-tuned models often still exhibit high hallucination rates (at least 31%) and do not consistently outperform vanilla RAG or closed-book baselines, either in reducing hallucinations overall or in meaningfully improving the correct-to-hallucination ratio when context is insufficient.
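A sketch of how such a training mixture could be constructed from autorater-labeled data; the field names, the abstention string handling, and the 20% default for the random strategy are illustrative assumptions:

```python
import random

ABSTAIN_TARGET = "I don't know"


def build_finetuning_mixture(examples, strategy="insufficient", idk_fraction=0.2, seed=0):
    """Replace some ground-truth answers with an abstention target.

    Each example is assumed to be a dict with "question", "context", "answer",
    and a bool "sufficient" label from the autorater.
      strategy="insufficient": relabel exactly the autorater-flagged insufficient instances
      strategy="random":       relabel a random `idk_fraction` of all instances
    """
    rng = random.Random(seed)
    mixture = []
    for ex in examples:
        target = ex["answer"]
        if strategy == "insufficient" and not ex["sufficient"]:
            target = ABSTAIN_TARGET
        elif strategy == "random" and rng.random() < idk_fraction:
            target = ABSTAIN_TARGET
        mixture.append({"question": ex["question"],
                        "context": ex["context"],
                        "target": target})
    return mixture
```

The resulting (question, context, target) triples would then feed a standard LoRA fine-tuning setup.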
In summary, the paper introduces "sufficient context" as a valuable concept for analyzing RAG performance. It demonstrates how an LLM-based autorater can classify context sufficiency scalably. Their analysis reveals that current models struggle with hallucinations even with sufficient context and often fail to abstain when context is insufficient. Leveraging the sufficient context signal in a selective generation framework offers a practical method to improve the accuracy-coverage trade-off in RAG systems. Fine-tuning experiments suggest that simply labeling insufficient context examples with "I don't know" is not a silver bullet for reducing hallucinations in smaller models.