ChatQA: Surpassing GPT-4 on Conversational QA and RAG (2401.10225v5)

Published 18 Jan 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA). To enhance generation, we propose a two-stage instruction tuning method that significantly boosts the performance of RAG. For effective retrieval, we introduce a dense retriever optimized for conversational QA, which yields results comparable to the alternative state-of-the-art query rewriting models, while substantially reducing deployment costs. We also present the ChatRAG Bench, which encompasses ten datasets covering comprehensive evaluations on RAG, table-related QA, arithmetic calculations, and scenarios involving unanswerable questions. Our ChatQA-1.0-70B (score: 54.14), built on Llama2, a weaker foundation model than GPT-4, can slightly outperform GPT-4-0613 (score: 53.90) and GPT-4-Turbo-2024-04-09 (score: 54.03) on the ChatRAG Bench, without relying on any synthetic data from OpenAI GPT models. Notably, the Llama3-ChatQA-1.5-70B model surpasses the accuracy of GPT-4-Turbo-2024-04-09, achieving a 4.4% improvement. To advance research in this field, we open-sourced the model weights, instruction tuning data, ChatRAG Bench, and retriever for the community: https://chatqa-project.github.io/.


Summary

  • The paper presents ChatQA-70B, a white-box model that reaches GPT-4-level accuracy on conversational QA through a two-stage instruction tuning method.
  • It pairs supervised fine-tuning on diverse, high-quality datasets with a dense retriever fine-tuned on multi-turn QA data, enabling effective context retrieval without a standalone query rewriter.
  • Training with unanswerable samples steers the model to say it cannot answer when the context lacks the needed information, reducing hallucination while preserving answerability.

Introduction

The development of conversational question answering (QA) models has seen a significant leap with the advent of models like ChatGPT and its successors. These models hold great promise for real-world applications: they can engage with users conversationally, generate answers in a zero-shot manner, and draw on retrieved information beyond an LLM's typical context window. A central challenge in this domain is building a conversational QA model that matches the performance of cutting-edge models like GPT-4 while remaining cost-effective.

ChatQA Model Architecture

The paper introduces ChatQA-70B, a white-box conversational QA model that achieves GPT-4-level accuracy through a two-stage instruction tuning method. The first stage applies supervised fine-tuning on diverse datasets to strengthen general instruction following. The second stage, context-enhanced instruction tuning, sharpens the model's ability to understand and integrate user-provided or retrieved context in context-rich, multi-turn conversations. The paper also advances the retrieval side of conversational QA by fine-tuning a dense retriever on a high-quality multi-turn QA dataset, matching the performance of LLM-based query rewriting models at substantially lower deployment cost. A schematic of how training samples differ between the two stages is sketched below.
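
As a minimal illustration, the sketch below shows how training samples might be formatted in each stage. This is plain Python with a hypothetical "System/User/Assistant" turn template; the paper's exact prompt format is not reproduced here. Stage 1 uses generic instruction-response pairs, while stage 2 prepends a grounding document to a multi-turn dialogue:

```python
# Hypothetical formatting of training samples for the two tuning stages.
# The "System/User/Assistant" template is an assumption, not the paper's
# exact prompt format.

def format_stage1(instruction: str, response: str) -> str:
    """Stage 1: generic instruction-following sample, no grounding context."""
    return (
        "System: You are a helpful assistant.\n\n"
        f"User: {instruction}\n\nAssistant: {response}"
    )

def format_stage2(context: str, turns: list[tuple[str, str]]) -> str:
    """Stage 2: context-enhanced sample; a document is prepended and the
    dialogue history precedes the answer the model is trained to produce."""
    prompt = f"System: Answer the questions using only the given context.\n\n{context}\n\n"
    for user_msg, assistant_msg in turns:
        prompt += f"User: {user_msg}\n\nAssistant: {assistant_msg}\n\n"
    return prompt.rstrip()

example = format_stage2(
    context="The Eiffel Tower is 330 metres tall and was completed in 1889.",
    turns=[("How tall is the Eiffel Tower?", "It is 330 metres tall."),
           ("When was it completed?", "It was completed in 1889.")],
)
print(example)
```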

Methodology

The two-stage instruction tuning comprises supervised fine-tuning on diverse, high-quality datasets, followed by tuning on a blend of single-turn and multi-turn conversational QA datasets. The model's robustness against hallucination is also addressed through scenarios where the required information is absent from the context, steering the model to produce a "cannot answer" response when necessary. This strikes a balance between answerability and non-answerability that is crucial for reducing misinformation.
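
One plausible way to construct such unanswerable training samples is sketched below. This is a hypothetical construction: the refusal wording and the context-swapping scheme are assumptions, not taken from the paper. The idea is to pair a question with a context that does not support its answer and supervise a fixed refusal response:

```python
import random

# Hypothetical refusal string; the exact wording used for training is an assumption.
REFUSAL = "Sorry, I cannot find the answer in the given context."

def make_unanswerable(samples: list[dict], rate: float = 0.2, seed: int = 0) -> list[dict]:
    """For a fraction of (context, question, answer) triples, swap in a
    context drawn from a different sample so the answer is no longer
    supported, and retarget the label to a refusal."""
    rng = random.Random(seed)
    out = []
    for s in samples:
        if rng.random() < rate and len(samples) > 1:
            other = rng.choice([t for t in samples if t is not s])
            out.append({"context": other["context"],
                        "question": s["question"],
                        "answer": REFUSAL})
        else:
            out.append(dict(s))
    return out
```

Mixing such samples into the stage-2 data gives the model explicit supervision for the non-answerable case, rather than relying on it to abstain spontaneously.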

Retrieval Optimization

In addition to the focus on conversational generation, the retrieval process is fine-tuned to handle multi-turn conversational queries more effectively. This is achieved by fine-tuning a single-turn query retriever on high-quality conversational datasets, so that the dialogue history and current question can serve directly as the retrieval query. This paves the way for better context retrieval without an expensive standalone query rewriter (see the sketch below). The retrieval-optimized ChatQA is then compared to current state-of-the-art solutions on ten conversational QA datasets, showcasing its superior performance.
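
The sketch below shows the basic mechanic: concatenate the dialogue history with the current question into one query string and encode it directly, avoiding a separate LLM-based rewriting step. An off-the-shelf encoder stands in here for the paper's fine-tuned retriever:

```python
# Minimal sketch of multi-turn dense retrieval. The encoder model is an
# off-the-shelf stand-in, not the retriever trained in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

history = ["User: Who designed the Eiffel Tower?",
           "Assistant: It was designed by Gustave Eiffel's company."]
question = "User: When was it completed?"

# Concatenate history and the latest question into a single retrieval query,
# so coreferences like "it" are resolved by context rather than rewriting.
query = " ".join(history + [question])

passages = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "The Statue of Liberty was dedicated in 1886.",
]

q_emb = encoder.encode([query], normalize_embeddings=True)
p_emb = encoder.encode(passages, normalize_embeddings=True)
scores = (q_emb @ p_emb.T)[0]  # cosine similarity via normalized dot product
print(passages[int(np.argmax(scores))])  # expected: the 1889 passage
```

Because the retriever runs a single forward pass per query, this design avoids the per-turn LLM call that query-rewriting pipelines require, which is the source of the cost savings the paper reports.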

Conclusion

The comprehensive evaluation shows that ChatQA-70B outperforms or matches industry-standard models such as GPT-3.5-turbo and GPT-4. This performance is particularly notable given that the model does not rely on synthetic data from existing LLMs. The paper also highlights cost efficiency in retrieval as a key contribution, achieving similar or better retrieval quality without the extra computational expense of LLM-based query rewriting. This work represents a milestone in conversational QA modeling and points to promising directions for future research and practical applications.
