A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems (2406.14972v1)

Published 21 Jun 2024 in cs.CL and cs.IR

Abstract: Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence, combining a retrieval phase with a generative phase, with the latter typically being powered by LLMs. The current common practices in RAG involve using "instructed" LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, "Seldom is a glance at the statistics enough to understand the meaning of the figures".

Comparative Analysis of Base and Instruct LLMs in Retrieval Augmented Generation Systems

The paper "A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems" explores the comparative performance of base LLMs against their instruct variants within Retrieval-Augmented Generation (RAG) systems. The paper challenges a prevalent assumption in NLP—that instruct models, fine-tuned using supervised training and aligned with human preferences, outperform their base model counterparts in RAG application scenarios.

Overview of the Research

Retrieval-Augmented Generation (RAG) is an AI technique that combines retrieval operations with generative LLMs to improve the accuracy and contextual relevance of responses. RAG systems retrieve relevant documents from a pre-existing corpus to inform the generative process of LLMs, enhancing the coherence and factual accuracy of the generated text. With the growing reliance on RAG for applications like conversational AI and automated content generation, understanding the relative efficacy of different LLM variants becomes critical.
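
To make the two-phase structure concrete, here is a minimal sketch of a RAG loop in Python. The term-overlap retriever, the `Document` type, and the `generate` callable are illustrative assumptions rather than the paper's implementation; a real system would use BM25 or a dense retriever and an actual LLM call.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

def retrieve(query: str, corpus: list[Document], k: int = 3) -> list[Document]:
    # Toy lexical retriever: rank documents by query-term overlap.
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda d: len(q_terms & set(d.text.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, docs: list[Document]) -> str:
    # Concatenate retrieved passages ahead of the question.
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def rag_answer(query: str, corpus: list[Document], generate) -> str:
    # `generate` stands in for any LLM call; the paper's comparison
    # concerns whether a base or an instruct model fills this slot.
    docs = retrieve(query, corpus)
    return generate(build_prompt(query, docs))
```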

Key Findings

The pivotal finding of this paper is that base models, contrary to common expectations, outperform their instruct variants by 20% on average in a RAG setting. This result is substantiated through experiments comparing base and instruct models on datasets such as NQ-open and TriviaQA-unfiltered. The reasons behind the gap fall into three categories:

  1. Accuracy: Across the experimental settings, base models demonstrated higher average accuracy. For instance, Llama 2's base model outperformed its instruct counterpart by 9.23 points on NQ and 17.88 points on TriviaQA on average. This trend held across multiple model architectures, with a few exceptions such as Mistral 7B on NQ.
  2. Negative Rejection Rate: The paper reveals that base models rarely adhere to the instruction to respond with "NO-RES" when an answer is not present in the retrieved documents. By contrast, instruct models displayed higher negative rejection rates, adhering more robustly to instructions that curb hallucinated content (a prompt sketch illustrating this setup follows the list).
  3. Task-Specific Instructions: When models were tested with additional requirements, such as providing proofs for their answers, base models consistently outperformed or equaled their instruct counterparts by grounding their responses more accurately in the retrieved documents. This further underscores the strength of base models at generating contextually grounded answers.
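
As a rough illustration of the negative-rejection setup from point 2, the hypothetical prompt builder below tells the model to abstain with a NO-RES token when the retrieved documents lack the answer; the exact wording and abstention token in the paper's prompts may differ.

```python
NO_RES_TOKEN = "NO-RES"  # abstention token, per the paper's description

def build_rejection_prompt(question: str, passages: list[str]) -> str:
    # Answer only from the retrieved documents, or abstain with NO-RES.
    # The phrasing here is illustrative, not the paper's template.
    context = "\n".join(f"Document {i + 1}: {p}" for i, p in enumerate(passages))
    return (
        f"{context}\n\n"
        f"Answer the question using only the documents above. "
        f"If the documents do not contain the answer, reply exactly with "
        f"{NO_RES_TOKEN}.\n"
        f"Question: {question}\n"
        f"Answer:"
    )
```

The finding is that instruct models follow this abstention instruction far more reliably, while base models tend to produce an answer regardless, drawing on parametric memory when the documents fall short.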

Methodology

The paper details the stages through which LLMs are trained and refined:

  • Pre-training: During pre-training, LLMs are trained on next-token prediction tasks, learning language patterns from vast text corpora.
  • Instruction Fine-Tuning: This phase involves training models on datasets paired with explicit instructions to enhance their usability in real-world contexts.
  • Alignment with Human Preferences: Techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are used to align model outputs with human expectations and preferences (the DPO objective is sketched after this list).
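
As background on the last bullet (this formulation comes from the DPO literature, not from the summarized paper), DPO replaces the RLHF reward-model loop with a direct loss over preference pairs: for a prompt x with preferred response y_w and dispreferred response y_l, a frozen reference policy pi_ref, and a scaling parameter beta,

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
  \right]
```

Minimizing this loss widens the likelihood margin of preferred over dispreferred responses relative to the reference model, without training a separate reward model.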

In practice, these refinement stages are critical for making models user-friendly, yet the paper shows that such alignment does not necessarily translate into superior performance on RAG tasks.

Implications and Future Directions

The findings highlight a tradeoff between trustworthiness and performance in RAG systems. While instruct models adhere better to specific prompts and minimize hallucinations by following instructions like rejecting unanswerable questions, base models demonstrate superior accuracy by leveraging their parametric knowledge more effectively.

Practical Implications:

  1. RAG Pipeline Optimization: RAG systems might benefit from employing base models despite their intrinsic tendency to rely on parametric memory. Combining the raw effectiveness of base models with strategic prompt engineering could lead to strong overall performance (a few-shot prompting sketch follows this list).
  2. Balance Between Trust and Function: Deploying LLMs in sensitive or high-stakes environments requires weighing this balance explicitly. Developing methods that improve base models' adherence to specific instructions could bridge the gap.
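
One concrete form such prompt engineering can take with base models, which lack a chat template, is a few-shot scaffold that demonstrates the desired answer format in context. A minimal, hypothetical sketch (the demonstration content is invented for illustration):

```python
# Base models have no chat template, so the expected answer format is
# shown via an in-context demonstration instead of an instruction.
FEW_SHOT_PREFIX = (
    "Document: The Eiffel Tower was completed in 1889.\n"
    "Question: In what year was the Eiffel Tower completed?\n"
    "Answer: 1889\n\n"
)

def base_model_prompt(passages: list[str], question: str) -> str:
    # Demonstration first, then the retrieved passages and the question.
    context = "\n".join(f"Document: {p}" for p in passages)
    return f"{FEW_SHOT_PREFIX}{context}\nQuestion: {question}\nAnswer:"
```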

Theoretical Implications:

  1. Rethinking Training Objectives: The paper suggests re-evaluating the additional alignment stages (such as RLHF), weighing practical performance metrics against nominal adherence to instructions.
  2. Evaluation Metrics: New evaluation metrics that better capture the nuances of model performance in RAG settings are needed. For instance, metrics that jointly assess retrieval effectiveness, generative accuracy, and instruction adherence could provide a more comprehensive picture (a toy scorer is sketched after this list).
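
As a sketch of what such a composite metric could look like (an illustrative toy, not the paper's evaluation protocol), the scorer below reports answer accuracy on answerable questions alongside rejection adherence on unanswerable ones:

```python
def rag_scores(predictions: list[str], gold_answers: list[str],
               answerable: list[bool], no_res: str = "NO-RES") -> dict:
    # Hypothetical composite scorer: substring-match accuracy where an
    # answer exists, exact-match abstention (NO-RES) where none does.
    acc_hits = acc_total = rej_hits = rej_total = 0
    for pred, gold, has_answer in zip(predictions, gold_answers, answerable):
        if has_answer:
            acc_total += 1
            acc_hits += int(gold.lower() in pred.lower())
        else:
            rej_total += 1
            rej_hits += int(pred.strip() == no_res)
    return {
        "accuracy": acc_hits / max(acc_total, 1),
        "rejection_adherence": rej_hits / max(rej_total, 1),
    }
```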

Future Work:

Future research could involve:

  • Evaluating models with larger parameter counts to test whether similar trends hold at scale.
  • Applying the findings to other datasets and to modalities beyond text, such as multi-modal RAG systems that integrate images.
  • Exploring hybrid approaches that combine base models with selective instruction fine-tuning to optimize both accuracy and adherence.

In conclusion, this paper challenges prevailing assumptions about the superior performance of instruct LLMs in RAG tasks, highlighting the nuanced trade-offs between accuracy and trust. By fostering deeper discussions and prompting new research directions, this work contributes significantly to the ongoing development of more efficient and reliable AI systems.

Authors
  1. Florin Cuconasu
  2. Giovanni Trappolini
  3. Nicola Tonellotto
  4. Fabrizio Silvestri