Enhancing Robustness of Retrieval-Augmented LLMs with In-Context Learning
Overview
The paper addresses the robustness of Retrieval-Augmented LLMs (RALMs) in open-domain question answering (QA), particularly when they are confronted with unanswerable queries and conflicting information. RALMs improve QA performance by leveraging external knowledge, but they remain vulnerable when the retrieved information is incorrect or contradictory. The paper proposes an in-context learning approach that improves the robustness of RALMs without additional fine-tuning: Machine Reading Comprehension (MRC) demonstrations are incorporated into the LLM's input to strengthen its reasoning.
Methodology
The proposed approach augments RALMs through in-context learning by providing MRC examples, referred to as cases, alongside the retrieved context. The primary aims are:
- Enhancing the model’s ability to identify unanswerable queries.
- Detecting inconsistencies among retrieved contexts.

The approach leverages the in-context learning capabilities of LLMs, which are known to improve performance from only a few demonstrations, to enable robust reasoning when retrieval is imperfect.
Case Generation and Selection:
The authors create two types of cases:
- QA Cases: Extracted from the SQuAD dataset, providing simple context-question-answer triples.
- Conflict Cases: Crafted to simulate contradictions by substituting entities in the answer contexts; spaCy performs Named Entity Recognition (NER) to locate the entities, and Llama3-70B-Instruct generates the conflicting passages.
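To make the conflict-case construction concrete, below is a minimal sketch of the entity-substitution step, assuming spaCy's en_core_web_sm pipeline and a naive string replacement; in the paper, Llama3-70B-Instruct generates the full conflicting passage, which this sketch does not reproduce.

```python
import spacy

# Load a small English pipeline for NER (assumption: en_core_web_sm is installed).
nlp = spacy.load("en_core_web_sm")

def make_conflict_passage(context: str, answer: str, substitute: str) -> str:
    """Replace the answer entity in a passage with a different entity,
    producing a passage that contradicts the original context."""
    doc = nlp(context)
    # Find an entity span whose text matches the gold answer.
    for ent in doc.ents:
        if ent.text == answer:
            # Naive substitution; the paper instead prompts Llama3-70B-Instruct
            # to generate the conflicting passage around the substituted entity.
            return context[:ent.start_char] + substitute + context[ent.end_char:]
    return context  # no matching entity found; leave the passage unchanged

# Example: create a passage that conflicts with "Paris" as the answer.
ctx = "The Eiffel Tower is located in Paris, the capital of France."
print(make_conflict_passage(ctx, "Paris", "Lyon"))
```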
Cases are retrieved with a case-based reasoning method that selects the examples most similar to the query according to sentence-embedding similarity.
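A minimal sketch of this similarity-based case retrieval, assuming the sentence-transformers library and the all-MiniLM-L6-v2 embedding model; the paper's exact embedding model and similarity measure may differ, so treat both as placeholders.

```python
from sentence_transformers import SentenceTransformer, util

# Assumption: a generic sentence-embedding model stands in for the paper's encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_cases(query: str, case_pool: list[dict], k: int = 2) -> list[dict]:
    """Return the k cases whose questions are most similar to the query,
    in the spirit of case-based reasoning selection."""
    case_questions = [c["question"] for c in case_pool]
    q_emb = encoder.encode(query, convert_to_tensor=True)
    c_emb = encoder.encode(case_questions, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]      # cosine similarity to every case
    top_idx = scores.topk(k).indices.tolist()   # indices of the k closest cases
    return [case_pool[i] for i in top_idx]
```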
Experimental Setup
Datasets:
The paper utilizes the Natural Questions (NQ) and Web Questions (WebQ) datasets, with scenarios engineered to include unanswerable and conflicting contexts.
Prompts:
Custom instructions are used to query the models, asking them to provide a direct answer, state that the question is unanswerable, or flag conflicting information.
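The paper's exact prompt wording is not reproduced in this summary; the sketch below illustrates one way the instruction, the retrieved cases, and the query context might be assembled into a single input, with all wording chosen for illustration only.

```python
# Illustrative prompt assembly; the instruction text below is an assumption,
# not the paper's exact prompt.
INSTRUCTION = (
    "Answer the question using the given passage. "
    "If the passage does not contain the answer, reply 'unanswerable'. "
    "If the passages contradict each other, reply 'conflict'."
)

def build_prompt(cases: list[dict], context: str, question: str) -> str:
    # Each case is a context-question-answer MRC demonstration.
    demo_blocks = [
        f"Passage: {c['context']}\nQuestion: {c['question']}\nAnswer: {c['answer']}"
        for c in cases
    ]
    demos = "\n\n".join(demo_blocks)
    return f"{INSTRUCTION}\n\n{demos}\n\nPassage: {context}\nQuestion: {question}\nAnswer:"
```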
Metrics:
Accuracy is measured separately for answerable and unanswerable examples, with a cap on response length so that verbose outputs do not distort the scores.
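As a rough illustration of such scoring, the sketch below counts a response as correct when the gold label appears in it and rejects responses over a word limit; the paper's actual length cap and matching rule may differ.

```python
def score_response(response: str, gold: str, max_words: int = 30) -> bool:
    """Count a prediction as correct if the gold label appears in the response;
    overly long responses are rejected so verbosity cannot inflate accuracy.
    (Simplified scoring; the paper's exact limit and matching rule may differ.)"""
    if len(response.split()) > max_words:
        return False
    return gold.lower() in response.lower()

def accuracy(responses: list[str], golds: list[str]) -> float:
    correct = sum(score_response(r, g) for r, g in zip(responses, golds))
    return correct / len(golds)

# Unanswerable examples use a label such as "unanswerable" as the gold string.
print(accuracy(["The answer is unanswerable."], ["unanswerable"]))  # 1.0
```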
Models:
Three LLMs are evaluated: Llama3-70B-Instruct, Qwen-1.5-chat-72B, and GPT-3.5-turbo-0125.
Findings
Unanswerable Scenarios:
Adding QA cases consistently improved the models' accuracy in identifying unanswerable queries:
- For instance, ChatGPT's accuracy on unanswerable examples improved by 21.74 points on NQ and by 25.67 points on WebQ when QA cases were used.
- Llama3 improved steadily as more QA cases were provided, demonstrating added robustness in handling unanswerable scenarios.
Conflict Scenarios:
Including conflict cases alongside QA cases enhanced the models' ability to detect conflicts:
- Qwen-1.5-chat-72B achieved the best conflict-detection performance when two QA cases and one conflict case (2Q+1C) were provided.
- The results indicate that explicit conflict examples are essential for guiding LLMs to detect and resolve contradictions effectively through in-context learning alone.
Case Retrieval:
On NQ's unanswerable set, similarity-based case retrieval achieved higher overall accuracy than random case selection:
- The similarity-based selection was up to 6 points more accurate on answerable examples.
Implications
This research demonstrates that in-context learning with carefully selected MRC examples can substantially enhance the robustness of RALMs in open-domain QA. The improvements in handling unanswerable and conflicting contexts point to real-world applications where retrieved data is incomplete or inconsistent.
Future Developments
This methodology could be extended to long-form QA and combined with prompting techniques such as Chain-of-Thought, potentially enabling more nuanced and complex reasoning. Expanding the types of cases and covering a broader range of context variations might further improve robustness across applications.
Conclusion
The paper illustrates that well-designed simple examples can substantially bolster an LLM's reasoning capabilities without additional fine-tuning. This indicates that in-context learning can be an effective mechanism for enhancing the robustness of RALMs in open-domain QA tasks. However, further research is necessary to explore the broader applications and potential limitations of this method.