The paper "How Much Can RAG Help the Reasoning of LLM?" investigates the extent to which Retrieval-Augmented Generation (RAG) enhances the reasoning capabilities of LLMs. The paper posits that while RAG can improve reasoning, its assistance is limited, and the information retrieved often necessitates preprocessing to filter out noise. The paper introduces DPrompt tuning to address these challenges effectively.
The authors begin by framing the reasoning process of an LLM as computing $p(a \mid q)$, where $q$ is the query and $a$ is the answer. In this framework, RAG is expressed as $p(a \mid q, d)$, where $d$ represents the retrieved document. This is contrasted with Chain of Thought (CoT), represented as $p(a \mid q, c)$, where $c$ is a step-by-step reasoning result. The paper explores whether RAG can enhance reasoning capabilities akin to CoT.
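To make the contrast between the three formulations concrete, the following minimal sketch builds the corresponding prompts for a plain query, a RAG query conditioned on a retrieved document, and a CoT query; the template strings and example texts are illustrative assumptions, not the paper's implementation.

```python
# Illustrative prompt construction for the three formulations:
# plain reasoning p(a | q), RAG p(a | q, d), and CoT p(a | q, c).
# Templates are assumptions for illustration only.

def plain_prompt(query: str) -> str:
    # The model must derive the answer from the query alone.
    return f"Question: {query}\nAnswer:"

def rag_prompt(query: str, document: str) -> str:
    # The retrieved document d is supplied as extra conditioning context.
    return f"Document: {document}\nQuestion: {query}\nAnswer:"

def cot_prompt(query: str) -> str:
    # The model is asked to produce intermediate reasoning c before the answer.
    return f"Question: {query}\nLet's think step by step."

if __name__ == "__main__":
    q = "In which year was the author of 'Hamlet' born?"
    d = "William Shakespeare, the author of 'Hamlet', was born in 1564."
    print(plain_prompt(q))
    print(rag_prompt(q, d))
    print(cot_prompt(q))
```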
The authors model the reasoning steps of an LLM as a tree: the maximum reasoning depth is $L$, layer $l$ contains $n_l$ nodes, and $v_{l,i}$ denotes the $i$-th node in the $l$-th layer; together these define a Reasoning Tree. A retrieved document contains relevant information that can replace certain reasoning nodes, and the authors introduce the concept of "fission" to describe how the elimination of nodes can trigger further eliminations. The paper posits that the cost of extracting information from documents is lower than that of pure inference, formalized in Assumption 1: information that would take $l$ layers of pure inference to derive can be extracted from a document in only $\lambda l$ layers, where $0 < \lambda < 1$.
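A minimal sketch of the reasoning-tree abstraction is given below, assuming a complete tree with a fixed branching factor and a simplified erasure rule (a node is skipped only when the document already supplies the results of all of its children); the function name and parameters are illustrative, not the paper's formal construction.

```python
import random

def count_remaining_nodes(depth: int, branching: int, p_doc: float, seed: int = 0) -> tuple[int, int]:
    """Toy reasoning tree: layer 1 holds basic facts, the top layer holds the answer.

    Each node depends on `branching` children in the layer below. A retrieved
    document independently supplies each layer-1 fact with probability p_doc;
    a higher node is erased (skipped) only if *all* of its children are erased.
    Returns (total nodes, nodes that still must be computed).
    """
    rng = random.Random(seed)
    # True means the node need not be computed (it was erased).
    layer = [rng.random() < p_doc for _ in range(branching ** (depth - 1))]
    total, remaining = len(layer), sum(not e for e in layer)
    for _ in range(depth - 1):
        nxt = []
        for i in range(0, len(layer), branching):
            nxt.append(all(layer[i:i + branching]))  # erased only if all children are erased
        layer = nxt
        total += len(layer)
        remaining += sum(not e for e in layer)
    return total, remaining

if __name__ == "__main__":
    print(count_remaining_nodes(depth=5, branching=2, p_doc=0.6))
```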
Theorem 1 (Fission of RAG) states that erased nodes cannot propagate indefinitely like a fission chain reaction but stop at a certain threshold: there exists a threshold layer $l^*$ beyond which further nodes are erased only with negligible probability, so the chain of erasures terminates. Theorem 2 indicates that erasure with high probability is achievable mainly in the lower layers, suggesting that the assistance of RAG is limited, since most retrieved documents contain only relatively basic information.
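The qualitative claim of Theorem 1 can be illustrated with the toy erasure rule from the sketch above: if a fraction $p$ of the nodes in one layer is erased, and a node in the next layer is erased only when all $b$ of its children are, the erased fraction follows $p_{l+1} = p_l^{\,b}$ and collapses to zero within a few layers. This recursion is a simplifying assumption for illustration, not the condition used in the paper's proof.

```python
# Toy recursion illustrating why erasure cannot propagate indefinitely:
# assuming a node is erased only when all `b` of its children are erased,
# the erased fraction per layer follows p_{l+1} = p_l ** b.
def erased_fractions(p0: float, b: int, layers: int) -> list[float]:
    fractions, p = [p0], p0
    for _ in range(layers - 1):
        p = p ** b
        fractions.append(p)
    return fractions

print([round(p, 4) for p in erased_fractions(p0=0.7, b=2, layers=6)])
# [0.7, 0.49, 0.2401, 0.0576, 0.0033, 0.0]
```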
The paper also addresses the impact of noise in retrieved documents. Theorem 3 (Impact of Noise) provides an upper bound on the generalization error in terms of the mutual information between the document embedding and the relevance of tokens. Equation 2 shows that this bound decreases as the mutual information between the embedding and the token-relevance variable increases; the bound also involves the percentage of relevant documents, with $H$ and $I$ denoting entropy and mutual information, respectively.
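To make the role of the mutual-information term concrete, the sketch below estimates $I$ between a discretized document-embedding feature and a binary token-relevance label from a joint histogram; corrupting the feature with noise drives $I$ toward zero, which, per the qualitative claim of Theorem 3, loosens the generalization-error bound. The variable names and the simple bit-flip noise model are assumptions for illustration.

```python
import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray) -> float:
    """Plug-in estimate of I(X; Y) in nats from two discrete label arrays."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
relevance = rng.integers(0, 2, size=5000)        # 1 = token is relevant
clean_feature = relevance.copy()                 # embedding feature perfectly tracks relevance
for noise in (0.0, 0.2, 0.4):
    flip = rng.random(relevance.size) < noise    # corrupt the feature with noise
    feature = np.where(flip, 1 - clean_feature, clean_feature)
    print(f"noise={noise:.1f}  I(feature; relevance)={mutual_information(feature, relevance):.3f} nats")
```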
The authors then explore the challenge of fine-tuning the model to filter out irrelevant information. Theorem 4 shows that approximating the optimal attention pattern is difficult without significantly altering the original attention over relevant tokens. Theorem 5 states that achieving better performance requires the input representation to both distinguish irrelevant tokens and retain the information necessary for inference. Equation 5 shows that the probability of deviating from the optimal function increases with the amount of noise.
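The difficulty described in Theorems 4 and 5 can be seen in a small numerical example: with softmax attention, adding irrelevant tokens whose logits are only slightly lower than the relevant ones dilutes the attention mass placed on the relevant tokens, so filtering the noise out requires a substantial change to the original attention pattern. The scores below are made up for illustration.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Attention logits of one query token against 4 relevant document tokens.
relevant_scores = np.array([2.0, 1.5, 1.8, 1.2])

for n_noise in (0, 8, 32, 128):
    # Irrelevant tokens with logits only slightly below the relevant ones.
    noise_scores = np.full(n_noise, 1.0)
    weights = softmax(np.concatenate([relevant_scores, noise_scores]))
    mass_on_relevant = weights[: len(relevant_scores)].sum()
    print(f"{n_noise:4d} noise tokens -> attention mass on relevant tokens: {mass_on_relevant:.2f}")
```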
The paper then discusses the number of layers required for effective filtering, noting that judging the relevance of a token requires jointly considering several tokens at once, which makes it a triple-wise rather than pair-wise problem. Theorem 6 demonstrates that a single layer of multi-head attention struggles to assess a token's relevance. The authors propose transforming the triple-wise problem into a pair-wise one by precalculating information from the document and representing it as virtual prompt tokens. Proposition 1 shows that placing the query ahead of the documents aids this relevance judgment.
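The intuition behind Proposition 1 can be illustrated with a causal attention mask: under causal attention a token only sees earlier positions, so document-token representations can be computed with awareness of the query only if the query precedes them. The sketch below simply counts which document positions can attend to the query under the two orderings; it is illustrative, not the paper's construction.

```python
import numpy as np

def doc_positions_seeing_query(order: list[str], n_query: int, n_doc: int) -> int:
    """Count document tokens that can attend to at least one query token
    under a causal (lower-triangular) attention mask."""
    segments = {"query": n_query, "doc": n_doc}
    labels = [name for name in order for _ in range(segments[name])]
    n = len(labels)
    causal = np.tril(np.ones((n, n), dtype=bool))  # position i sees positions <= i
    count = 0
    for i, lab in enumerate(labels):
        if lab == "doc" and any(causal[i, j] and labels[j] == "query" for j in range(n)):
            count += 1
    return count

for order in (["doc", "query"], ["query", "doc"]):
    seen = doc_positions_seeing_query(order, n_query=3, n_doc=5)
    print(f"order={order}: {seen}/5 document tokens can attend to the query")
```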
Finally, the authors introduce DPrompt tuning. This method trains an additional model to extract information from the document and encapsulates that information in embeddings used as virtual prompt tokens. Inference is performed with the query placed both before and after the documents, and the base model is adapted with LoRA fine-tuning. Experiments on a subset of the Natural Questions (NQ) dataset show that DPrompt tuning achieves significant performance improvements.
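A rough sketch of how such a setup could be wired together with Hugging Face transformers and peft is given below. The model names, the linear projection of encoder outputs into virtual prompt tokens, the query-document-query ordering, and the LoRA hyperparameters are all assumptions based on the summary above, not the authors' released code.

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumed model choices for illustration; the paper's actual backbones may differ.
llm_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
encoder_name = "bert-base-uncased"

tok = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name)
enc_tok = AutoTokenizer.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name)      # additional model that "reads" the document

embed_layer = llm.get_input_embeddings()               # maps token ids to LLM embeddings
# Project encoder outputs into the LLM embedding space to form virtual prompt tokens.
proj = torch.nn.Linear(encoder.config.hidden_size, llm.config.hidden_size)

# LoRA adapters on the attention projections (hyperparameters are illustrative).
llm = get_peft_model(llm, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                                     target_modules=["q_proj", "v_proj"],
                                     task_type="CAUSAL_LM"))

def build_inputs_embeds(query: str, document: str) -> torch.Tensor:
    """Assemble [query][virtual document tokens][query] as one embedding sequence."""
    with torch.no_grad():
        doc_inputs = enc_tok(document, return_tensors="pt", truncation=True)
        doc_hidden = encoder(**doc_inputs).last_hidden_state    # (1, T_doc, H_enc)
    virtual_tokens = proj(doc_hidden)                           # (1, T_doc, H_llm)
    q_ids = tok(query, return_tensors="pt").input_ids
    q_embeds = embed_layer(q_ids)                               # (1, T_q, H_llm)
    # Query placed both before and after the document's virtual tokens.
    return torch.cat([q_embeds, virtual_tokens, q_embeds], dim=1)

embeds = build_inputs_embeds("Who wrote 'Hamlet'?",
                             "Hamlet is a tragedy written by William Shakespeare.")
outputs = llm(inputs_embeds=embeds)    # forward pass through the LoRA-adapted LLM
print(outputs.logits.shape)
```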
In summary, the paper provides a theoretical and empirical analysis of the capabilities and limitations of RAG in enhancing the reasoning of LLMs, highlighting the challenges posed by noise and the potential of DPrompt tuning to address these challenges.