- The paper demonstrates that RAFT, which integrates retrieval augmented generation with chain-of-thought reasoning, significantly improves dialogue model accuracy and coherence.
- It uses supervised fine-tuning on datasets with both oracle and distractor documents to achieve marked gains in Exact Match and F1 scores.
- The study validates RAFT’s effectiveness across multi-hop, biomedical, and Chinese QA datasets, emphasizing the critical role of intermediate reasoning steps.
An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought
The paper "An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought," by Yuetong Zhao et al., presents a methodical investigation into enhancing the performance of generative dialogue models through an innovative technique called RAFT (Retrieval Augmented Fine-Tuning). RAFT synergizes the strengths of chain-of-thought (CoT) reasoning and retrieval augmented generation (RAG) under the paradigm of supervised fine-tuning (SFT), aiming to address persistent challenges in generative dialogue models, including accuracy, coherence, and logical reasoning.
Methodology
The paper introduces RAFT, a method that combines RAG and SFT and augments the training targets with CoT reasoning. RAFT fine-tunes generative dialogue models on datasets that pair each question with oracle documents, distractor documents, and a CoT-style response. This setup aims to leverage externally retrieved documents while incorporating intermediate reasoning steps, improving the model's logical consistency and robustness to irrelevant information.
Two distinct methods are employed to construct the RAFT fine-tuning dataset (a sketch of the construction step follows the list):
- For datasets with multiple reference documents per question, the oracle documents are extracted and complemented by randomly selected distractor documents.
- For datasets with single oracle documents per question, distractor documents are chosen from other questions' references.
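To make the construction concrete, here is a minimal sketch of assembling one training example; the function name, the default of three distractors, and the prompt layout are illustrative assumptions rather than details taken from the paper:

```python
import random

def build_raft_example(question, oracle_docs, doc_pool, cot_answer, num_distractors=3):
    """Assemble one RAFT-style training example: the question, its oracle
    document(s), sampled distractor documents, and a CoT target response."""
    # Distractors come from documents that do not answer this question
    # (e.g. other questions' reference documents, per the second method above).
    candidates = [doc for doc in doc_pool if doc not in oracle_docs]
    distractors = random.sample(candidates, k=min(num_distractors, len(candidates)))

    # Shuffle oracle and distractor documents so answer position is not a cue.
    context_docs = oracle_docs + distractors
    random.shuffle(context_docs)

    prompt = "Documents:\n" + "\n\n".join(context_docs) + f"\n\nQuestion: {question}"
    return {"prompt": prompt, "response": cot_answer}
```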
Chain-of-thought responses are then generated using GPT-3.5, emphasizing reasoning processes that cite referenced content and produce a final answer.
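Generating those CoT targets might look roughly like the sketch below, which uses the OpenAI chat completions API; the instruction wording, model alias, and greedy decoding are assumptions for illustration, not the paper's exact prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COT_INSTRUCTION = (
    "Answer the question using only the reference documents. "
    "First quote the relevant passages, then reason step by step, "
    "and finish with a line of the form 'Final answer: <answer>'."
)

def generate_cot_response(question, documents):
    """Ask GPT-3.5 for a chain-of-thought style target response that cites
    the referenced content before stating the final answer."""
    context = "\n\n".join(documents)
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[
            {"role": "system", "content": COT_INSTRUCTION},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```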
Experiment Setup
The researchers evaluate the RAFT method across several datasets:
- HotpotQA: A dataset containing multi-hop reasoning questions from Wikipedia.
- PubMedQA: A biomedical QA dataset from PubMed abstracts, with both short and long answers.
- DuReader_robust: A Chinese dataset composed of user search queries and responses from Baidu.
The paper employs baseline models for comparison:
- Zero-shot prompting with LLaMA2-7B-chat and Qwen-1.5-7B-chat
- RAG combined with the aforementioned models (a sketch of this baseline follows the list)
- Domain-Specific Fine-Tuning (DSF) with and without RAG
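As a rough illustration of the RAG baseline, the following sketch prepends retrieved passages to the question and generates with a Hugging Face chat model; the prompt layout and decoding settings are assumptions, and in practice each model's own chat template would likely be applied:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # or "Qwen/Qwen1.5-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def rag_baseline_answer(question, retrieved_docs):
    """RAG baseline: prepend retrieved passages to the question, then let the
    (non-fine-tuned) chat model generate an answer greedily."""
    context = "\n\n".join(retrieved_docs)
    prompt = f"Documents:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```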
Key evaluation metrics are Exact Match (EM) and F1, with answers normalized before scoring to keep comparisons consistent across models.
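The paper does not spell out the normalization recipe, so the sketch below follows the common SQuAD-style convention (lowercasing, stripping punctuation and articles, token-level F1); treat it as an assumption of what the scoring looks like, and note that Chinese answers from DuReader_robust would need character-level tokenization instead:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """EM is 1.0 if the normalized prediction equals the normalized reference."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```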
Results
The empirical results demonstrate substantial improvements brought by the RAFT method:
- On HotpotQA, RAFT improved the EM score by 42.13% over plain RAG, and by 30.76% even when distractor documents were included in the context; F1 improvements were similarly robust.
- On PubMedQA, RAFT showed a notable increase in F1 score for long-form QA, though the improvement was less prominent compared to short-form QA.
- On DuReader_robust, RAFT obtained significant performance gains in F1 score (57.81%), highlighting its effectiveness on Chinese datasets.
Notably, the inclusion of CoT in RAFT proved critical. In an ablation study, RAFT without CoT performed markedly worse, confirming the importance of CoT in guiding models through complex logical reasoning.
Implications and Future Work
The implications of this paper are multi-faceted:
- Practically, RAFT enhances the robustness and accuracy of generative dialogue models in handling both short and long-form QA across different languages and complexities.
- Theoretically, the combined use of CoT and retrieval methods signals a promising direction for future NLP research focused on improving model reasoning abilities without resorting to ever-larger parameter counts.
Future work may focus on strengthening long-form QA performance and extending the approach to a wider range of languages. Additionally, refining the integration between retrieved documents and the model's inherent knowledge remains a ripe area for exploration.
In conclusion, this paper underscores the efficacy of the RAFT method, which not only bridges current gaps in QA performance but also sets a structured path for ongoing advancements in the domain of generative dialogue models.