From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries (2406.12824v1)

Published 18 Jun 2024 in cs.CL and cs.AI

Abstract: Retrieval Augmented Generation (RAG) enriches the ability of LLMs to reason using external context to augment responses for a given user prompt. This approach has risen in popularity due to its practical use in LLM applications such as search, question answering, and chatbots. However, the exact nature of how this approach works isn't clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that LLMs take a shortcut and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in LLMs with: (i) Causal Mediation Analysis, to show that the parametric memory is minimally utilized when answering a question, and (ii) Attention Contributions and Knockouts, to show that the last token's residual stream is enriched not from the subject token in the question but from other informative tokens in the context. We find this pronounced shortcut behaviour to hold across both the LLaMa and Phi families of models.

Citations (3)

Summary

  • The paper demonstrates that language models prioritize external RAG context, showing a fivefold reduction in reliance on internal parametric memory for factual queries.
  • It employs causal tracing and attention knockout analyses to reveal that the last token’s residual stream is enriched more by external context than by subject tokens.
  • The study provides actionable insights for QA system design by quantifying the interplay between retrieval augmentation and internal model knowledge.

Mechanistic Examination of RAG in LLMs for Factual Queries

The paper "From RAGs to rich parameters: Probing how LLMs utilize external knowledge over parametric information for factual queries" explores the interaction between internal model knowledge and externally provided context through Retrieval Augmented Generation (RAG). This research aims to reveal how LLMs leverage retrieved context as opposed to their parametric memories when handling factual queries. This paper employs robust analytical methods, including Causal Mediation Analysis, Attention Contributions, and Knockouts, to scrutinize the mechanistic behavior of LLMs like LLaMa-2 and Phi-2.

Overview

The authors set out from the foundational observation that LLMs have an inherent bias towards using external context when it is available, often bypassing their internal knowledge. This predisposition, termed taking a "shortcut," is investigated through a combination of probing techniques. The primary goal is to understand to what extent LLMs rely on external context over their parametric memory when generating responses to factual queries.
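To make this question concrete, the basic comparison can be reproduced in a few lines: score the model's probability of the gold answer token with and without a retrieved passage prepended to the query. The sketch below is illustrative only; it uses GPT-2 and a hand-written prompt as stand-ins, whereas the paper works with LLaMa-2 and Phi-2 on a curated set of factual queries.

    # Hedged sketch: compare the probability a causal LM assigns to the gold
    # answer token with and without retrieved context prepended.
    # Model name, prompt, and context are illustrative stand-ins.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; the paper studies LLaMa-2 and Phi-2
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    def answer_prob(prompt: str, answer: str) -> float:
        ids = tok(prompt, return_tensors="pt").input_ids
        ans_id = tok(" " + answer, add_special_tokens=False).input_ids[0]
        with torch.no_grad():
            logits = model(ids).logits[0, -1]  # next-token logits
        return torch.softmax(logits, dim=-1)[ans_id].item()

    query = "The Eiffel Tower is located in the city of"
    context = "Context: The Eiffel Tower is a famous landmark in Paris, France.\n"
    print("parametric only :", answer_prob(query, "Paris"))
    print("with RAG context:", answer_prob(context + query, "Paris"))

A large gap between these two probabilities on facts the model already knows is exactly the kind of behavioral signal the paper then explains mechanistically.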

Key Findings

  1. Minimal Use of Parametric Memory: The paper demonstrates that, in the presence of retrieved context, LLMs make minimal use of their internal parametric knowledge. This conclusion is supported by Causal Tracing, where Average Indirect Effect (AIE) measurements indicate a significant decrease in reliance on the models' Multi-Layer Perceptrons (MLPs). Specifically, the AIE for LLaMa-2 (7B) and Phi-2 shows a fivefold decrease when RAG context is introduced, highlighting that the models lean heavily on external context for factual information (the AIE is written out in the sketch after this list).
  2. Enrichment from Retrieved Context: Attention Contribution and Knockout analyses reveal that the last token's residual stream derives its enrichment from the context rather than from the subject token in the original query. Attention Contribution metrics indicate that, in both models, the presence of RAG context significantly reduces the attention given to subject tokens, shifting focus primarily to the attribute tokens explicitly present in the context. Supporting this, Attention Knockouts confirm that eliminating attention weights from the subject token results in minimal degradation in prediction quality, further solidifying the preference for external context.
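For reference, the AIE from Key Finding 1 follows the standard causal-tracing recipe: corrupt the subject-token embeddings, restore a single clean hidden state at layer l and token position t, and measure how much probability of the gold answer o* is recovered. The LaTeX below is a generic rendering of that recipe, not necessarily the paper's exact formulation:

    \mathrm{IE}_{l,t} \;=\; \mathbb{P}_{\text{corrupt},\,\text{restore}(l,t)}\!\left[o^{*}\right] \;-\; \mathbb{P}_{\text{corrupt}}\!\left[o^{*}\right],
    \qquad
    \mathrm{AIE}_{l,t} \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathrm{IE}^{(i)}_{l,t}

The fivefold decrease reported above is a drop in this AIE at the MLP sites once the RAG passage is present.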

Empirical Methods

The research employs rigorous empirical techniques:

  • Causal Tracing: This method identifies the hidden states critical to factual predictions by measuring the AIE. The findings from causal tracing substantiate that the MLPs' contributions decrease significantly in the presence of RAG, in both the LLaMa-2 and Phi-2 models.
  • Attention Contributions and Knockouts: By examining attention patterns and knocking out specific attention weights, the paper quantifies the dependency on subject tokens versus external context. The pronounced reduction in attention to subject tokens when RAG context is present suggests a strong reliance on retrieved information for factual accuracy; a toy version of the knockout intervention is sketched below.
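As a concrete illustration of the knockout intervention, the toy single-head attention below blocks the last token from attending to a hypothetical subject-token span and measures how far its output vector moves. Everything here (dimensions, positions, the degradation proxy) is illustrative; the paper performs the knockout inside full LLaMa-2 and Phi-2 models and tracks the drop in the predicted answer's probability.

    # Toy attention knockout: forbid the last token from reading the
    # subject positions and see how much its output representation changes.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    seq_len, d = 8, 16
    x = torch.randn(seq_len, d)                      # toy token hidden states
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

    def attention(x, blocked=()):
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / d ** 0.5
        # Causal mask: position i may only attend to positions j <= i.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        # Knockout: block the last token's attention to the given positions.
        for j in blocked:
            scores[-1, j] = float("-inf")
        return F.softmax(scores, dim=-1) @ v

    subject_pos = [2, 3]                             # hypothetical subject span
    out_full = attention(x)
    out_ko = attention(x, blocked=subject_pos)
    print(torch.norm(out_full[-1] - out_ko[-1]).item())

The paper's finding is that, with RAG context present, the analogous intervention on real models barely hurts the answer probability, whereas removing attention to the attribute tokens in the context does.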

Practical and Theoretical Implications

The findings of this paper have substantial implications both practically and theoretically. Practically, the paper provides insights into designing more effective QA systems that better balance internal model knowledge and external retrieval systems. This has direct applications in enhancing the reliability of LLMs in real-world tasks such as chatbots, search algorithms, and other AI-driven applications.

Theoretically, this research advances our understanding of the interplay between parametric and non-parametric knowledge in LLMs. It unveils the underlying mechanisms that drive models to prioritize external context, paving the way for future studies to further refine and optimize token attention mechanisms and memory utilization in such models.

Future Developments

Exploring the impact of longer and more complex RAG contexts is a natural progression of this work. Addressing the computational overhead associated with causal tracing for extensive contexts can yield deeper insights into proximity and recency biases in LLMs. Additionally, there is potential to extend this analysis to instruction-tuned models and models fine-tuned with RLHF objectives, to evaluate consistency across varied model architectures and training paradigms.

Conclusion

This paper provides a nuanced understanding of how LLMs, when augmented with RAG context, preferentially utilize external information over their internal parametric knowledge. This significant shift has profound implications for the development of more accurate and efficient LLMs. The paper's use of Causal Tracing, Attention Contributions, and Knockouts offers valuable methodological contributions, establishing a robust framework for future research in this domain.