Introduction
The advent of large language models (LLMs) has considerably impacted NLP, paving the way for applications such as intelligent chatbots. A notable development in this space is Retrieval-Augmented Generation (RAG), which combines the capabilities of LLMs with an external knowledge base to improve response quality and reduce hallucination, i.e., cases where the model generates incorrect or nonsensical information not supported by data. Existing RAG systems perform satisfactorily on single-hop queries, whose answers can be derived from a single piece of evidence, but fall short on multi-hop queries that require collecting and synthesizing information from multiple sources. This paper by Tang and Yang introduces MultiHop-RAG, a benchmarking dataset dedicated to evaluating how well RAG systems handle multi-hop queries.
Dataset Construction and Analysis
MultiHop-RAG is tailored to multi-hop queries that are realistic, complex, and reflective of challenges faced in practice, such as financial analysis spanning multiple years of data. The dataset consists of a coherent knowledge base built from an English news article corpus and an extensive collection of multi-hop queries categorized into inference, comparison, temporal, and null types; the null type assesses an LLM's tendency to produce an answer when none is available in the knowledge base.
The dataset was built with a meticulous pipeline covering supporting-evidence extraction, claim generation, and multi-hop query construction, with verification steps for data quality assurance. Each query is paired with a simple, definitive ground-truth answer, which allows the generation ability of LLMs to be evaluated with accuracy metrics. The dataset's varied query forms and its evidence sets of up to four pieces per query pose a significant retrieval challenge for any RAG system.
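To make this structure concrete, the minimal sketch below shows what a dataset entry might look like and how exact-match accuracy could be computed against the definitive answers. The field names and the toy example are illustrative assumptions, not the dataset's actual schema or contents.

```python
# Illustrative sketch of a MultiHop-RAG-style entry and an exact-match
# accuracy check. Field names and the example are assumptions, not the
# dataset's real schema or data.
from dataclasses import dataclass


@dataclass
class MultiHopEntry:
    query: str                # the multi-hop question
    query_type: str           # "inference" | "comparison" | "temporal" | "null"
    evidence_list: list[str]  # up to four supporting passages from the news corpus
    answer: str               # simple, definitive ground-truth answer


def accuracy(predictions: list[str], entries: list[MultiHopEntry]) -> float:
    """Exact-match accuracy after light normalization."""
    def norm(s: str) -> str:
        return s.strip().lower()
    correct = sum(norm(p) == norm(e.answer) for p, e in zip(predictions, entries))
    return correct / len(entries) if entries else 0.0


# Toy usage with placeholder passages.
example = MultiHopEntry(
    query="<a comparison question spanning two sources>",
    query_type="comparison",
    evidence_list=["<passage from source A>", "<passage from source B>"],
    answer="Yes",
)
print(accuracy(["yes"], [example]))  # 1.0
```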
Benchmarking Experiments
To benchmark multi-hop retrieval and reasoning, two separate experiments were conducted. The first evaluated various embedding models for evidence retrieval; the best-performing model, even when enhanced with a reranker, attained only 0.7467 Hits@10, indicating a substantial gap in multi-hop retrieval effectiveness. The second examined the reasoning capability of state-of-the-art LLMs by having them generate responses from retrieved evidence. The leading commercial LLM, GPT-4, achieved an accuracy of 0.89 when given ground-truth evidence, while open-source models such as Llama2-70B and Mixtral-8x7B hovered around 0.32 to 0.36, pointing to considerable room for improvement in reasoning over multi-source evidence.
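As a rough illustration of the retrieval experiment, the sketch below computes one common variant of Hits@k over precomputed, normalized embeddings; it is not the paper's actual evaluation code, and a reranker would simply re-score the top candidates before the metric is taken.

```python
# Minimal sketch of a Hits@k retrieval evaluation over precomputed,
# L2-normalized embeddings; not the paper's exact pipeline.
import numpy as np


def hits_at_k(query_vecs: np.ndarray, chunk_vecs: np.ndarray,
              gold_evidence: list[set[int]], k: int = 10) -> float:
    """Fraction of queries for which at least one gold evidence chunk
    appears among the top-k retrieved chunks (one common Hits@k variant)."""
    scores = query_vecs @ chunk_vecs.T           # cosine similarity for normalized vectors
    top_k = np.argsort(-scores, axis=1)[:, :k]   # top-k chunk indices per query
    hits = sum(bool(gold & set(idxs.tolist()))
               for gold, idxs in zip(gold_evidence, top_k))
    return hits / len(gold_evidence)


# Toy usage: 2 queries, 5 chunks, 2-dimensional embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 2)); q /= np.linalg.norm(q, axis=1, keepdims=True)
c = rng.normal(size=(5, 2)); c /= np.linalg.norm(c, axis=1, keepdims=True)
print(hits_at_k(q, c, gold_evidence=[{0, 3}, {4}], k=2))
```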
Conclusion and Future Work
While MultiHop-RAG exposes the limitations of current RAG systems in multi-hop evidence retrieval and reasoning, it also opens up opportunities for novel RAG-related tasks. Future research could explore directions such as query decomposition or hybrid retrieval approaches for retrieving from vast knowledge bases. Furthermore, as RAG systems continue to develop, more advanced free-text answer evaluation metrics and higher evidence requirements per query could be integrated into the dataset. The publicly available MultiHop-RAG dataset is thus positioned as a catalyst for advancing generative AI and fostering greater LLM adoption in practice.
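As one way the query-decomposition direction might look in practice, a RAG pipeline could prompt an LLM to split a multi-hop question into single-hop sub-queries and retrieve for each before synthesizing an answer. The sketch below is hypothetical: `call_llm` and `retrieve` are stand-ins for a real LLM client and retriever, and the prompts are illustrative only.

```python
# Hypothetical sketch of query decomposition for multi-hop retrieval;
# `call_llm` and `retrieve` are placeholders, not real APIs.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # e.g., an API call to a chat model


def retrieve(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError  # e.g., embedding search over the news corpus


def answer_multihop(question: str) -> str:
    # 1. Ask the LLM to break the question into single-hop sub-questions.
    decomposition = call_llm(
        "Decompose this question into independent single-hop sub-questions, "
        f"one per line:\n{question}"
    )
    sub_queries = [q.strip() for q in decomposition.splitlines() if q.strip()]

    # 2. Retrieve evidence for each sub-query and pool the results.
    evidence: list[str] = []
    for sub_query in sub_queries:
        evidence.extend(retrieve(sub_query))

    # 3. Ask the LLM to synthesize a final answer from the pooled evidence.
    context = "\n".join(dict.fromkeys(evidence))  # de-duplicate, keep order
    return call_llm(
        f"Answer the question using only this evidence:\n{context}\n\nQuestion: {question}"
    )
```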
Advances in RAG benchmarking and in the robustness of LLMs on complex queries are crucial to realizing the full potential of AI models in real-world applications, where multi-document comprehension and knowledge retrieval are essential for generating accurate responses. MultiHop-RAG is envisioned as an indispensable tool in the pursuit of such advances.