How Much Can RAG Help the Reasoning of LLM? (2410.02338v2)

Published 3 Oct 2024 in cs.CL and cs.AI

Abstract: Retrieval-Augmented Generation (RAG) has gained significant popularity in modern LLMs due to its effectiveness in introducing new knowledge and reducing hallucinations. However, deep understanding of RAG remains limited: how RAG helps the reasoning process, and whether RAG can improve reasoning capability, remain open questions. While external documents are typically considered a means of incorporating domain-specific information, they also contain intermediate reasoning results related to the query; this suggests that documents could enhance the reasoning capability of LLMs, which has not been previously explored. In this paper, we investigate this issue in depth and find that while RAG can assist with reasoning, the help is limited. If we conceptualize the reasoning process as a tree with fixed depth, then RAG struggles to assist LLMs in performing deeper reasoning. Additionally, the information in the documents requires preprocessing to filter out noise. We demonstrate that this preprocessing is difficult to achieve by simply fine-tuning the LLM; it often necessitates numerous additional transformer layers. To simplify the problem, we propose DPrompt tuning, which effectively resolves the issue within a limited number of transformer layers, leading to improved performance.

The paper "How Much Can RAG Help the Reasoning of LLM?" investigates the extent to which Retrieval-Augmented Generation (RAG) enhances the reasoning capabilities of LLMs. The paper posits that while RAG can improve reasoning, its assistance is limited, and the information retrieved often necessitates preprocessing to filter out noise. The paper introduces DPrompt tuning to address these challenges effectively.

The authors begin by framing the reasoning process of an LLM as computing $p(y \mid q)$, where $q$ is the query and $y$ is the answer. In this framework, RAG is expressed as $p(y \mid q, d_1, d_2, \ldots, d_k)$, where $d_i$ is the $i$-th retrieved document. This is contrasted with Chain of Thought (CoT), represented as $p(y \mid q, x_1, x_2, \ldots, x_k)$, where $x_i$ is an intermediate step-by-step reasoning result. The paper explores whether RAG can enhance reasoning capabilities akin to CoT.
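
To make the three conditionals concrete, here is a minimal Python sketch of how each one corresponds to a prompt layout for a causal LM. The prompt templates and function names are illustrative assumptions, not the paper's implementation.

```python
from typing import List

def plain_prompt(q: str) -> str:
    # p(y | q): the model answers from the query alone.
    return f"Question: {q}\nAnswer:"

def rag_prompt(q: str, docs: List[str]) -> str:
    # p(y | q, d_1, ..., d_k): condition the answer on retrieved documents.
    context = "\n".join(f"Document {i+1}: {d}" for i, d in enumerate(docs))
    return f"{context}\nQuestion: {q}\nAnswer:"

def cot_prompt(q: str, steps: List[str]) -> str:
    # p(y | q, x_1, ..., x_k): condition on intermediate reasoning steps.
    reasoning = "\n".join(f"Step {i+1}: {x}" for i, x in enumerate(steps))
    return f"Question: {q}\n{reasoning}\nAnswer:"

print(rag_prompt("Who wrote Hamlet?", ["Hamlet is a tragedy by William Shakespeare."]))
```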

The authors define a Reasoning Tree by modeling the reasoning steps of an LLM as a tree, where the maximum reasoning depth is $L$, the number of nodes in layer $i$ is $n_i$, and $u_{i,j}$ denotes the $j$-th node in the $i$-th layer. A retrieved document $d$ contains relevant information that can replace certain reasoning nodes, and the authors introduce the concept of "fission" to describe this elimination of nodes. The paper posits that the cost of extracting information from documents is lower than that of pure inference, formalized in Assumption 1: extracting information from a document requires $\lambda \cdot l$ layers, where $\lambda < 1$.
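
A toy Python model of this setup, under our own naming and data-structure assumptions, shows how a document "erases" nodes whose intermediate results it already supplies, reducing the depth the LLM must reason through:

```python
from dataclasses import dataclass

@dataclass
class Node:
    layer: int
    index: int
    erased: bool = False  # True once a document supplies this node's result

@dataclass
class ReasoningTree:
    layers: list  # layers[i] holds the n_i nodes of layer i

    def apply_document(self, covered: set) -> None:
        # "Fission": erase every node u_{i,j} whose result the document covers.
        for layer in self.layers:
            for node in layer:
                if (node.layer, node.index) in covered:
                    node.erased = True

    def remaining_depth(self) -> int:
        # Layers the LLM must still reason through after erasure.
        return sum(1 for layer in self.layers if any(not n.erased for n in layer))

tree = ReasoningTree([[Node(i, j) for j in range(3)] for i in range(4)])
tree.apply_document({(0, 0), (0, 1), (0, 2)})  # a document covering all of layer 0
print(tree.remaining_depth())  # 3 of the original depth L = 4 remain
```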

Theorem 1 (Fission of RAG) states that erased nodes cannot propagate indefinitely like a fission reaction but stop at a certain threshold. Specifically, when $p_l < 1 - \frac{1}{q_l^n - \ln q_l^n}$, there exists $\hat{t} \in (0,1)$ satisfying $g(\hat{t}) = 0$, and for $t > \hat{t}$, $g(t) < 0$. Theorem 2 indicates that high-probability layer erasure is primarily achievable in lower layers, suggesting that the assistance of RAG is limited since most retrieved documents contain only basic information.
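
A quick numeric check of Theorem 1's threshold condition; the sample values of $p_l$ and $q_l^n$ below are illustrative, not from the paper:

```python
import math

def fission_stops(p_l: float, q_l_n: float) -> bool:
    # Theorem 1's condition: fission halts when p_l < 1 - 1/(q_l^n - ln q_l^n).
    return p_l < 1 - 1 / (q_l_n - math.log(q_l_n))

print(fission_stops(p_l=0.6, q_l_n=4.0))  # True: 1 - 1/(4 - ln 4) is about 0.617
```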

The paper also addresses the impact of noise in retrieved documents. Theorem 3 (Impact of Noise) provides an upper bound on the generalization error $\mathcal{L}(\delta) - \mathcal{L}(1)$ in terms of the mutual information $I(\cdot;\cdot)$ between the document embedding and the relevance of tokens. Equation 2, $\mathcal{L}(\delta) - \mathcal{L}(1) \leq c_1 \sqrt{\frac{(1-\delta)\,H(\cdot \mid \cdot)}{\delta\,(I(\cdot;\cdot)+1)} \cdot H(\cdot_r)} + c_2$, shows that the generalization error decreases as $\delta$ increases, where $\delta$ is the fraction of related documents and $H(\cdot)$ and $I(\cdot;\cdot)$ denote entropy and mutual information, respectively.
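
The shape of this bound is easy to see numerically. In the sketch below, all entropy and mutual-information terms, along with the constants $c_1$ and $c_2$, are placeholder scalars we chose for illustration, not values from the paper:

```python
import math

def noise_bound(delta, H_cond=2.0, I_mut=1.5, H_r=1.0, c1=1.0, c2=0.0):
    # Equation 2's right-hand side with all entropy/MI terms as scalars.
    return c1 * math.sqrt((1 - delta) * H_cond / (delta * (I_mut + 1)) * H_r) + c2

for delta in (0.2, 0.5, 0.8, 1.0):
    print(f"delta={delta:.1f}  bound={noise_bound(delta):.3f}")
# The bound shrinks monotonically as delta grows and vanishes at delta = 1.
```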

The authors explore the challenges of fine-tuning the model to filter out irrelevant information. Theorem 4 shows that approximating the optimal attention pattern is difficult without significantly altering the original attention of relevant tokens. Theorem 5 states that achieving better performance requires the input to effectively distinguish irrelevant tokens and to carry the information necessary for inference. Equation 5, $\Pr\left(\|f(x) - \hat{f}(x)\| > t'\right) > \frac{1}{C}\left(H(\cdot) - \delta\,\frac{I(\hat{\cdot};\cdot)+1}{H(\hat{\cdot})} \cdot I(\cdot;\cdot)\right)$, shows that the probability of deviating from the optimal function increases with noise.
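
As with Equation 2, the trend can be illustrated with placeholder scalars standing in for the entropy and mutual-information terms; the numbers below are our assumptions, chosen only to show the direction of the effect:

```python
def deviation_lower_bound(delta, H=2.0, I_hat=1.0, H_hat=1.5, I_mut=1.2, C=4.0):
    # Equation 5's right-hand side with all entropy/MI terms as scalars.
    return (H - delta * (I_hat + 1) / H_hat * I_mut) / C

for delta in (0.2, 0.5, 0.8):
    print(f"delta={delta:.1f}  lower bound={deviation_lower_bound(delta):.3f}")
# Less relevant context (smaller delta, i.e., more noise) leaves a larger
# lower bound on the probability of missing the optimal function.
```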

The paper then considers how many layers are required for effective filtering, noting that judging the relevance of a token necessitates attending to multiple tokens at once. Theorem 6 demonstrates that a single layer of multi-head attention struggles to assess a token's relevance. The authors therefore propose transforming the triple-wise problem into a pair-wise one by precomputing information from the document and representing it as virtual prompt tokens, and Proposition 1 shows that placing the query ahead of the documents aids this relevance judgment.

Finally, the authors introduce DPrompt tuning. This method trains an additional model to extract information from each document and encapsulates that information in embeddings; inference is performed with the query placed both before and after the documents, and the model is fine-tuned with LoRA. Experiments on a subset of the Natural Questions (NQ) dataset show that DPrompt tuning yields significant performance improvements.
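
A minimal PyTorch sketch of how we read this setup: a small encoder pre-digests each document into a handful of virtual prompt embeddings in the LLM's embedding space, and the input sequence places the query both before and after those tokens, consistent with Proposition 1; the base LLM would then be fine-tuned with LoRA on such inputs. The module names, dimensions, and projection design here are our assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DocToVirtualTokens(nn.Module):
    """Compress a document embedding into virtual prompt tokens (assumed design)."""
    def __init__(self, doc_dim: int, llm_dim: int, n_virtual: int = 8):
        super().__init__()
        self.proj = nn.Linear(doc_dim, n_virtual * llm_dim)
        self.n_virtual, self.llm_dim = n_virtual, llm_dim

    def forward(self, doc_emb: torch.Tensor) -> torch.Tensor:
        # doc_emb: (batch, doc_dim) -> (batch, n_virtual, llm_dim)
        return self.proj(doc_emb).view(-1, self.n_virtual, self.llm_dim)

def build_inputs(query_emb: torch.Tensor, virtual: torch.Tensor) -> torch.Tensor:
    # Query embeddings placed both before and after the virtual document
    # tokens, so the query precedes the documents as Proposition 1 suggests.
    return torch.cat([query_emb, virtual, query_emb], dim=1)

encoder = DocToVirtualTokens(doc_dim=768, llm_dim=4096)
virtual = encoder(torch.randn(2, 768))                   # two documents in a batch
inputs = build_inputs(torch.randn(2, 10, 4096), virtual)
print(inputs.shape)  # torch.Size([2, 28, 4096]), fed to a LoRA-tuned LLM
```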

In summary, the paper provides a theoretical and empirical analysis of the capabilities and limitations of RAG in enhancing the reasoning of LLMs, highlighting the challenges posed by noise and the potential of DPrompt tuning to address these challenges.

Authors (3)
  1. Jingyu Liu
  2. Jiaen Lin
  3. Yong Liu