Introduction
Prompting LLMs to perform natural language reasoning step by step, a paradigm known as chain-of-thought (CoT) prompting, has seen significant success, especially on tasks where all the necessary knowledge can be assumed to reside in the LLM's parameters. However, LLMs often falter on open-domain, multi-step question answering (QA), where external knowledge is required. The traditional remedy is to augment the LLM with a single retrieval step that uses the question as the query, but this one-step retrieve-and-read approach is less effective for complex questions in which what needs to be retrieved only becomes apparent as the reasoning progresses.
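To make the baseline concrete, here is a minimal sketch of such a one-step retrieve-and-read pipeline. The toy lexical `retrieve` function and the `generate` placeholder are illustrative assumptions, not the retrievers or models used in the paper.

```python
def retrieve(query: str, corpus: list[str], k: int = 5) -> list[str]:
    """Toy lexical retriever: rank paragraphs by word overlap with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(q_terms & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; plug in your own client here."""
    raise NotImplementedError

def one_step_qa(question: str, corpus: list[str]) -> str:
    # Retrieve exactly once, using only the question as the query...
    paragraphs = retrieve(question, corpus)
    # ...then read: answer from this fixed set of paragraphs, with no
    # chance to fetch new evidence as intermediate facts are inferred.
    context = "\n\n".join(paragraphs)
    return generate(f"{context}\n\nQ: {question}\nA:")
```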
The Interleaving Retrieval with Chain-of-Thought (IRCoT) Method
This work proposes Interleaving Retrieval with Chain-of-Thought (IRCoT), an approach that interlaces retrieval with the CoT process. Initially, paragraphs are retrieved using the question as the query. From then on, retrieval and reasoning inform each other in alternation: each CoT generation step conditions on the question, the reasoning so far, and all paragraphs collected so far to produce the next CoT sentence, and, conversely, that newly generated sentence serves as the query for retrieving additional evidence. The cycle repeats until a termination criterion is met, such as the CoT stating an answer or a step limit being reached, simultaneously improving both the quality of the generated CoT and the relevance of the retrieved information.
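The loop can be summarized in a short sketch, reusing the toy `retrieve` from above and assuming a hypothetical `generate_next_cot_sentence` LLM call; the check for the phrase "answer is" mirrors the paper's stopping condition, while `MAX_STEPS` is an assumed safeguard, not a value from the paper.

```python
MAX_STEPS = 8  # assumed cap on reason/retrieve iterations

def generate_next_cot_sentence(
    question: str, paragraphs: list[str], cot_so_far: list[str]
) -> str:
    """Placeholder for an LLM call that extends the CoT by one sentence."""
    raise NotImplementedError

def ircot(question: str, corpus: list[str]) -> tuple[str, list[str]]:
    # Step 0: seed the paragraph pool by retrieving with the question itself.
    collected = list(retrieve(question, corpus))
    cot_sentences: list[str] = []

    for _ in range(MAX_STEPS):
        # Reason step: generate the next CoT sentence from the question,
        # the paragraphs collected so far, and the reasoning so far.
        sentence = generate_next_cot_sentence(question, collected, cot_sentences)
        cot_sentences.append(sentence)

        # Terminate once the CoT states an answer (or the step cap is hit).
        if "answer is" in sentence.lower():
            break

        # Retrieve step: the new CoT sentence becomes the next query;
        # unseen paragraphs are added to the pool for later reason steps.
        for p in retrieve(sentence, corpus):
            if p not in collected:
                collected.append(p)

    return " ".join(cot_sentences), collected
```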
Efficacy of IRCoT
IRCoT's performance has been evaluated across multiple datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. Employing GPT3 and Flan-T5 models, the IRCoT approach surpasses baseline single-step retrieval by a substantial margin, both in terms of recall and QA performance. Furthermore, IRCoT demonstrates robustness in out-of-distribution scenarios and is effective even with smaller LLMs. To aid practical replication and future research, resources including code and data prompts are publicly accessible online.
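For clarity, retrieval recall here is paragraph-level: the fraction of a question's gold supporting paragraphs that appear in the retrieved set. The following is a generic sketch of that metric under this assumption, not the paper's evaluation code.

```python
def retrieval_recall(retrieved: list[str], gold: set[str]) -> float:
    """Fraction of gold supporting paragraphs found among the retrieved ones."""
    if not gold:
        return 0.0
    return len(gold & set(retrieved)) / len(gold)

# Example: two of the three gold paragraphs were retrieved -> recall ~ 0.67.
print(retrieval_recall(["p1", "p2", "p4"], {"p1", "p2", "p3"}))
```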
Conclusions and Further Remarks
IRCoT is a notable approach that intertwines retrieval and CoT generation to tackle open-domain, multi-step QA effectively. The technique improves both the relevance of the retrieved information and the factual reliability of the generated CoTs, with gains observed across diverse LLM sizes and evaluation conditions. Although it depends on certain LLM capabilities, namely zero- or few-shot CoT generation and support for longer inputs, IRCoT represents a clear step forward for knowledge-intensive QA and may inform a range of future retrieval-augmented LLM applications.