Introduction
The advent of large language models (LLMs) has considerably impacted NLP, paving the way for applications such as intelligent chatbots. A notable development in this space is Retrieval-Augmented Generation (RAG), which combines the capabilities of LLMs with an external knowledge base to improve response quality and reduce hallucination, i.e., cases where the model generates incorrect or nonsensical information not supported by data. Existing RAG systems perform satisfactorily on single-hop queries, whose answers can be derived from a single piece of evidence, but fall short on multi-hop queries that require collecting and synthesizing information from multiple sources. This paper by Tang and Yang introduces MultiHop-RAG, a benchmarking dataset dedicated to evaluating how well RAG systems handle multi-hop queries.
Dataset Construction and Analysis
MultiHop-RAG is tailored to multi-hop queries that are realistic, complex, and reflective of challenges faced in practice, such as financial analysis spanning multiple years of data. The dataset consists of a coherent knowledge base built from an English news article corpus and an extensive collection of multi-hop queries categorized into inference, comparison, temporal, and null types; the null type assesses an LLM's tendency to produce an answer when none is available in the knowledge base.
The dataset was built with a meticulous pipeline covering supporting-evidence extraction, claim generation, and multi-hop query construction, with verification steps for data quality assurance. Each query is paired with a simple, definitive ground-truth answer, which allows the generation ability of LLMs to be evaluated with accuracy metrics. The dataset's varied query forms and its evidence sets of up to four pieces per query pose a significant retrieval challenge for any RAG system.
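To make this structure concrete, the minimal sketch below shows what a dataset entry might look like and how exact-match accuracy could be computed against the definitive answers. The field names and the toy example are illustrative assumptions, not the dataset's actual schema or contents.

```python
# Illustrative sketch of a MultiHop-RAG-style entry and an exact-match
# accuracy check. Field names and the example are assumptions, not the
# dataset's real schema or data.
from dataclasses import dataclass


@dataclass
class MultiHopEntry:
    query: str                # the multi-hop question
    query_type: str           # "inference" | "comparison" | "temporal" | "null"
    evidence_list: list[str]  # up to four supporting passages from the news corpus
    answer: str               # simple, definitive ground-truth answer


def accuracy(predictions: list[str], entries: list[MultiHopEntry]) -> float:
    """Exact-match accuracy after light normalization."""
    def norm(s: str) -> str:
        return s.strip().lower()
    correct = sum(norm(p) == norm(e.answer) for p, e in zip(predictions, entries))
    return correct / len(entries) if entries else 0.0


# Toy usage with placeholder passages.
example = MultiHopEntry(
    query="<a comparison question spanning two sources>",
    query_type="comparison",
    evidence_list=["<passage from source A>", "<passage from source B>"],
    answer="Yes",
)
print(accuracy(["yes"], [example]))  # 1.0
```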
Benchmarking Experiments
To benchmark multi-hop retrieval and reasoning, two separate experiments were conducted. The first evaluated various embedding models for evidence retrieval; the best-performing model, even when enhanced with a reranker, attained only 0.7467 Hits@10, indicating a substantial gap in multi-hop retrieval effectiveness. The second examined the reasoning capability of state-of-the-art LLMs by having them generate responses from retrieved evidence. The leading commercial LLM, GPT-4, achieved an accuracy of 0.89 when given ground-truth evidence, while open-source models such as Llama2-70B and Mixtral-8x7B hovered around 0.32 to 0.36, pointing to considerable room for improvement in reasoning over multi-source evidence.
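As a rough illustration of the retrieval experiment, the sketch below computes one common variant of Hits@k over precomputed, normalized embeddings; it is not the paper's actual evaluation code, and a reranker would simply re-score the top candidates before the metric is taken.

```python
# Minimal sketch of a Hits@k retrieval evaluation over precomputed,
# L2-normalized embeddings; not the paper's exact pipeline.
import numpy as np


def hits_at_k(query_vecs: np.ndarray, chunk_vecs: np.ndarray,
              gold_evidence: list[set[int]], k: int = 10) -> float:
    """Fraction of queries for which at least one gold evidence chunk
    appears among the top-k retrieved chunks (one common Hits@k variant)."""
    scores = query_vecs @ chunk_vecs.T           # cosine similarity for normalized vectors
    top_k = np.argsort(-scores, axis=1)[:, :k]   # top-k chunk indices per query
    hits = sum(bool(gold & set(idxs.tolist()))
               for gold, idxs in zip(gold_evidence, top_k))
    return hits / len(gold_evidence)


# Toy usage: 2 queries, 5 chunks, 2-dimensional embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 2)); q /= np.linalg.norm(q, axis=1, keepdims=True)
c = rng.normal(size=(5, 2)); c /= np.linalg.norm(c, axis=1, keepdims=True)
print(hits_at_k(q, c, gold_evidence=[{0, 3}, {4}], k=2))
```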
Conclusion and Future Work
While MultiHop-RAG exposes the limitations of current RAG systems in multi-hop evidence retrieval and reasoning, it also opens up opportunities for novel RAG-related tasks. Future research could explore directions such as query decomposition or hybrid retrieval approaches for retrieving from vast knowledge bases. Furthermore, as RAG systems continue to develop, more advanced free-text answer evaluation metrics and higher evidence requirements per query could be integrated into the dataset. The publicly available MultiHop-RAG dataset is thus positioned as a catalyst for advancing generative AI and fostering greater LLM adoption in practice.
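As one way the query-decomposition direction might look in practice, a RAG pipeline could prompt an LLM to split a multi-hop question into single-hop sub-queries and retrieve for each before synthesizing an answer. The sketch below is hypothetical: `call_llm` and `retrieve` are stand-ins for a real LLM client and retriever, and the prompts are illustrative only.

```python
# Hypothetical sketch of query decomposition for multi-hop retrieval;
# `call_llm` and `retrieve` are placeholders, not real APIs.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # e.g., an API call to a chat model


def retrieve(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError  # e.g., embedding search over the news corpus


def answer_multihop(question: str) -> str:
    # 1. Ask the LLM to break the question into single-hop sub-questions.
    decomposition = call_llm(
        "Decompose this question into independent single-hop sub-questions, "
        f"one per line:\n{question}"
    )
    sub_queries = [q.strip() for q in decomposition.splitlines() if q.strip()]

    # 2. Retrieve evidence for each sub-query and pool the results.
    evidence: list[str] = []
    for sub_query in sub_queries:
        evidence.extend(retrieve(sub_query))

    # 3. Ask the LLM to synthesize a final answer from the pooled evidence.
    context = "\n".join(dict.fromkeys(evidence))  # de-duplicate, keep order
    return call_llm(
        f"Answer the question using only this evidence:\n{context}\n\nQuestion: {question}"
    )
```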
Advances in RAG benchmarking and in the robustness of LLMs on complex queries are crucial to realizing the full potential of AI models in real-world applications, where multi-document comprehension and knowledge retrieval are essential for generating accurate responses. MultiHop-RAG is envisioned as an indispensable tool in the pursuit of such advances.