- The paper demonstrates that RAFT, which integrates retrieval augmented generation with chain-of-thought reasoning, significantly improves dialogue model accuracy and coherence.
- It uses supervised fine-tuning on datasets with both oracle and distractor documents to achieve marked gains in Exact Match and F1 scores.
- The study validates RAFT’s effectiveness across multi-hop, biomedical, and Chinese QA datasets, emphasizing the critical role of intermediate reasoning steps.
An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought
The paper "An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought," by Yuetong Zhao et al., presents a methodical investigation into enhancing the performance of generative dialogue models through an innovative technique called RAFT (Retrieval Augmented Fine-Tuning). RAFT synergizes the strengths of chain-of-thought (CoT) reasoning and retrieval augmented generation (RAG) under the paradigm of supervised fine-tuning (SFT), aiming to address persistent challenges in generative dialogue models, including accuracy, coherence, and logical reasoning.
Methodology
The paper introduces RAFT, a method that combines RAG and SFT and augments the training targets with CoT reasoning. RAFT fine-tunes generative dialogue models on datasets that pair each question with oracle documents, distractor documents, and a CoT-style response. This setup aims to leverage externally retrieved documents while incorporating intermediate reasoning steps, improving the model's logical consistency and robustness to irrelevant information.
Two distinct methods are employed to construct the RAFT fine-tuning dataset (a sketch of the construction step follows the list):
- For datasets with multiple reference documents per question, the oracle documents are extracted and complemented by randomly selected distractor documents.
- For datasets with single oracle documents per question, distractor documents are chosen from other questions' references.
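To make the construction concrete, here is a minimal sketch of assembling one training example; the function name, the default of three distractors, and the prompt layout are illustrative assumptions rather than details taken from the paper:

```python
import random

def build_raft_example(question, oracle_docs, doc_pool, cot_answer, num_distractors=3):
    """Assemble one RAFT-style training example: the question, its oracle
    document(s), sampled distractor documents, and a CoT target response."""
    # Distractors come from documents that do not answer this question
    # (e.g. other questions' reference documents, per the second method above).
    candidates = [doc for doc in doc_pool if doc not in oracle_docs]
    distractors = random.sample(candidates, k=min(num_distractors, len(candidates)))

    # Shuffle oracle and distractor documents so answer position is not a cue.
    context_docs = oracle_docs + distractors
    random.shuffle(context_docs)

    prompt = "Documents:\n" + "\n\n".join(context_docs) + f"\n\nQuestion: {question}"
    return {"prompt": prompt, "response": cot_answer}
```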
Chain-of-thought responses are then generated using GPT-3.5, emphasizing reasoning processes that cite referenced content and produce a final answer.
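Generating those CoT targets might look roughly like the sketch below, which uses the OpenAI chat completions API; the instruction wording, model alias, and greedy decoding are assumptions for illustration, not the paper's exact prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COT_INSTRUCTION = (
    "Answer the question using only the reference documents. "
    "First quote the relevant passages, then reason step by step, "
    "and finish with a line of the form 'Final answer: <answer>'."
)

def generate_cot_response(question, documents):
    """Ask GPT-3.5 for a chain-of-thought style target response that cites
    the referenced content before stating the final answer."""
    context = "\n\n".join(documents)
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[
            {"role": "system", "content": COT_INSTRUCTION},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```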
Experiment Setup
The researchers evaluate the RAFT method across several datasets:
- HotpotQA: A dataset containing multi-hop reasoning questions from Wikipedia.
- PubMedQA: A biomedical QA dataset from PubMed abstracts, with both short and long answers.
- DuReader_robust: A Chinese dataset composed of user search queries and responses from Baidu.
The paper employs baseline models for comparison:
- Zero-shot prompting with LLaMA2-7B-chat and Qwen-1.5-7B-chat
- RAG combined with the aforementioned models (a sketch of this baseline follows the list)
- Domain-Specific Fine-Tuning (DSF) with and without RAG
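As a rough illustration of the RAG baseline, the following sketch prepends retrieved passages to the question and generates with a Hugging Face chat model; the prompt layout and decoding settings are assumptions, and in practice each model's own chat template would likely be applied:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # or "Qwen/Qwen1.5-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def rag_baseline_answer(question, retrieved_docs):
    """RAG baseline: prepend retrieved passages to the question, then let the
    (non-fine-tuned) chat model generate an answer greedily."""
    context = "\n\n".join(retrieved_docs)
    prompt = f"Documents:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```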
Key evaluation metrics are Exact Match (EM) and F1, with answers normalized before scoring to keep comparisons consistent across models.
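The paper does not spell out the normalization recipe, so the sketch below follows the common SQuAD-style convention (lowercasing, stripping punctuation and articles, token-level F1); treat it as an assumption of what the scoring looks like, and note that Chinese answers from DuReader_robust would need character-level tokenization instead:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """EM is 1.0 if the normalized prediction equals the normalized reference."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```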
Results
The empirical results demonstrate substantial improvements brought by the RAFT method:
- On HotpotQA, RAFT improved the EM score by 42.13% over plain RAG, and by 30.76% even when distractor documents were included in the context; F1 improvements were similarly robust.
- On PubMedQA, RAFT showed a notable increase in F1 score for long-form QA, though the improvement was less prominent compared to short-form QA.
- On DuReader_robust, RAFT obtained significant performance gains in F1 score (57.81%), highlighting its effectiveness on Chinese datasets.
Notably, the inclusion of CoT in RAFT proved critical. In an ablation study, RAFT without CoT performed markedly worse, confirming the importance of CoT in guiding models through complex logical reasoning.
Implications and Future Work
The implications of this paper are multi-faceted:
- Practically, RAFT enhances the robustness and accuracy of generative dialogue models in handling both short and long-form QA across different languages and complexities.
- Theoretically, the combined use of CoT and retrieval methods signals a promising direction for future NLP research focused on improving model reasoning abilities without resorting to ever-larger parameter counts.
Future work may focus on strengthening long-form QA performance and extending the approach to a wider range of languages. Additionally, refining the integration between retrieved documents and the model's inherent knowledge remains a ripe area for exploration.
In conclusion, this paper underscores the efficacy of the RAFT method, which not only bridges current gaps in QA performance but also sets a structured path for ongoing advancements in the domain of generative dialogue models.