ChatQA: Surpassing GPT-4 on Conversational QA and RAG (2401.10225v5)

Published 18 Jan 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA). To enhance generation, we propose a two-stage instruction tuning method that significantly boosts the performance of RAG. For effective retrieval, we introduce a dense retriever optimized for conversational QA, which yields results comparable to the alternative state-of-the-art query rewriting models, while substantially reducing deployment costs. We also present the ChatRAG Bench, which encompasses ten datasets covering comprehensive evaluations on RAG, table-related QA, arithmetic calculations, and scenarios involving unanswerable questions. Our ChatQA-1.0-70B (score: 54.14), built on Llama2, a weaker foundation model than GPT-4, can slightly outperform GPT-4-0613 (score: 53.90) and GPT-4-Turbo-2024-04-09 (score: 54.03) on the ChatRAG Bench, without relying on any synthetic data from OpenAI GPT models. Notably, the Llama3-ChatQA-1.5-70B model surpasses the accuracy of GPT-4-Turbo-2024-04-09, achieving a 4.4% improvement. To advance research in this field, we open-sourced the model weights, instruction tuning data, ChatRAG Bench, and retriever for the community: https://chatqa-project.github.io/.


Summary

  • The paper presents ChatQA-70B, a white-box model that reaches GPT-4-level accuracy on conversational QA through a two-stage instruction tuning method.
  • It pairs supervised fine-tuning on diverse, high-quality datasets with a dense retriever fine-tuned on multi-turn QA data, enabling effective context retrieval without a standalone query rewriter.
  • Training with unanswerable samples steers the model to say it cannot answer when the context lacks the needed information, reducing hallucination while preserving answerability.

Introduction

The development of conversational question answering (QA) models has seen a significant leap with the advent of models like ChatGPT and its successors. These models hold great promise for real-world applications: they can engage with users conversationally, generate answers in a zero-shot manner, and draw on retrieved information beyond an LLM's typical context window. A central challenge in this domain is building a conversational QA model that matches the performance of cutting-edge models like GPT-4 while remaining cost-effective.

ChatQA Model Architecture

The paper introduces ChatQA-70B, a white-box conversational QA model that achieves GPT-4-level accuracy through a two-stage instruction tuning method. The first stage applies supervised fine-tuning on diverse datasets to strengthen general instruction following. The second stage, context-enhanced instruction tuning, sharpens the model's ability to understand and integrate user-provided or retrieved context in context-rich, multi-turn conversations. The paper also advances the retrieval side of conversational QA by fine-tuning a dense retriever on a high-quality multi-turn QA dataset, matching the performance of LLM-based query rewriting models at substantially lower deployment cost. A schematic of how training samples differ between the two stages is sketched below.
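
As a minimal illustration, the sketch below shows how training samples might be formatted in each stage. This is plain Python with a hypothetical "System/User/Assistant" turn template; the paper's exact prompt format is not reproduced here. Stage 1 uses generic instruction-response pairs, while stage 2 prepends a grounding document to a multi-turn dialogue:

```python
# Hypothetical formatting of training samples for the two tuning stages.
# The "System/User/Assistant" template is an assumption, not the paper's
# exact prompt format.

def format_stage1(instruction: str, response: str) -> str:
    """Stage 1: generic instruction-following sample, no grounding context."""
    return (
        "System: You are a helpful assistant.\n\n"
        f"User: {instruction}\n\nAssistant: {response}"
    )

def format_stage2(context: str, turns: list[tuple[str, str]]) -> str:
    """Stage 2: context-enhanced sample; a document is prepended and the
    dialogue history precedes the answer the model is trained to produce."""
    prompt = f"System: Answer the questions using only the given context.\n\n{context}\n\n"
    for user_msg, assistant_msg in turns:
        prompt += f"User: {user_msg}\n\nAssistant: {assistant_msg}\n\n"
    return prompt.rstrip()

example = format_stage2(
    context="The Eiffel Tower is 330 metres tall and was completed in 1889.",
    turns=[("How tall is the Eiffel Tower?", "It is 330 metres tall."),
           ("When was it completed?", "It was completed in 1889.")],
)
print(example)
```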

Methodology

The two-stage instruction tuning comprises supervised fine-tuning on diverse, high-quality datasets, followed by tuning on a blend of single-turn and multi-turn conversational QA datasets. The model's robustness against hallucination is also addressed through scenarios where the required information is absent from the context, steering the model to produce a "cannot answer" response when necessary. This strikes a balance between answerability and non-answerability that is crucial for reducing misinformation.
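
One plausible way to construct such unanswerable training samples is sketched below. This is a hypothetical construction: the refusal wording and the context-swapping scheme are assumptions, not taken from the paper. The idea is to pair a question with a context that does not support its answer and supervise a fixed refusal response:

```python
import random

# Hypothetical refusal string; the exact wording used for training is an assumption.
REFUSAL = "Sorry, I cannot find the answer in the given context."

def make_unanswerable(samples: list[dict], rate: float = 0.2, seed: int = 0) -> list[dict]:
    """For a fraction of (context, question, answer) triples, swap in a
    context drawn from a different sample so the answer is no longer
    supported, and retarget the label to a refusal."""
    rng = random.Random(seed)
    out = []
    for s in samples:
        if rng.random() < rate and len(samples) > 1:
            other = rng.choice([t for t in samples if t is not s])
            out.append({"context": other["context"],
                        "question": s["question"],
                        "answer": REFUSAL})
        else:
            out.append(dict(s))
    return out
```

Mixing such samples into the stage-2 data gives the model explicit supervision for the non-answerable case, rather than relying on it to abstain spontaneously.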

Retrieval Optimization

In addition to the focus on conversational generation, the retrieval process is fine-tuned to handle multi-turn conversational queries more effectively. This is achieved by fine-tuning a single-turn query retriever on high-quality conversational datasets, so that the dialogue history and current question can serve directly as the retrieval query. This paves the way for better context retrieval without an expensive standalone query rewriter (see the sketch below). The retrieval-optimized ChatQA is then compared to current state-of-the-art solutions on ten conversational QA datasets, showcasing its superior performance.
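
The sketch below shows the basic mechanic: concatenate the dialogue history with the current question into one query string and encode it directly, avoiding a separate LLM-based rewriting step. An off-the-shelf encoder stands in here for the paper's fine-tuned retriever:

```python
# Minimal sketch of multi-turn dense retrieval. The encoder model is an
# off-the-shelf stand-in, not the retriever trained in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

history = ["User: Who designed the Eiffel Tower?",
           "Assistant: It was designed by Gustave Eiffel's company."]
question = "User: When was it completed?"

# Concatenate history and the latest question into a single retrieval query,
# so coreferences like "it" are resolved by context rather than rewriting.
query = " ".join(history + [question])

passages = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "The Statue of Liberty was dedicated in 1886.",
]

q_emb = encoder.encode([query], normalize_embeddings=True)
p_emb = encoder.encode(passages, normalize_embeddings=True)
scores = (q_emb @ p_emb.T)[0]  # cosine similarity via normalized dot product
print(passages[int(np.argmax(scores))])  # expected: the 1889 passage
```

Because the retriever runs a single forward pass per query, this design avoids the per-turn LLM call that query-rewriting pipelines require, which is the source of the cost savings the paper reports.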

Conclusion

The comprehensive evaluation shows that ChatQA-70B outperforms or matches industry-standard models such as GPT-3.5-turbo and GPT-4. This performance is particularly notable given that the model does not rely on synthetic data from existing LLMs. The paper also highlights cost efficiency in retrieval as a key contribution, achieving similar or better retrieval quality without the extra computational expense of LLM-based query rewriting. This work represents a milestone in conversational QA modeling and points to promising directions for future research and practical applications.
