ChatQA: Surpassing GPT-4 on Conversational QA and RAG (2401.10225v5)
Abstract: In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA). To enhance generation, we propose a two-stage instruction tuning method that significantly boosts RAG performance. For effective retrieval, we introduce a dense retriever optimized for conversational QA, which yields results comparable to alternative state-of-the-art query rewriting models while substantially reducing deployment costs. We also present the ChatRAG Bench, which encompasses ten datasets covering comprehensive evaluations of RAG, table-related QA, arithmetic calculations, and scenarios involving unanswerable questions. Our ChatQA-1.0-70B (score: 54.14), built on Llama2, a weaker foundation model than GPT-4, slightly outperforms GPT-4-0613 (score: 53.90) and GPT-4-Turbo-2024-04-09 (score: 54.03) on the ChatRAG Bench, without relying on any synthetic data from OpenAI GPT models. Notably, the Llama3-ChatQA-1.5-70B model surpasses the accuracy of GPT-4-Turbo-2024-04-09, achieving a 4.4% improvement. To advance research in this field, we have open-sourced the model weights, instruction tuning data, ChatRAG Bench, and retriever for the community: https://chatqa-project.github.io/.
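The retrieve-then-generate loop the abstract describes can be sketched minimally as follows. This is an illustrative toy, not the paper's trained retriever: the `embed` function here is a hashed bag-of-words stand-in for a learned dense encoder, and the document store is an in-memory list. The one idea it does mirror is that a conversational dense retriever encodes the full dialogue history as the query, rather than first rewriting the last turn into a standalone question.

```python
import hashlib
import math

DIM = 64  # toy embedding dimensionality

def embed(text: str) -> list[float]:
    # Toy stand-in for a dense encoder: an L2-normalized
    # hashed bag-of-words vector.
    vec = [0.0] * DIM
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(history: list[str], docs: list[str], k: int = 2) -> list[str]:
    # Conversational retrieval: embed the whole dialogue history as
    # the query (no query rewriting step), then rank documents by
    # cosine similarity (dot product of normalized vectors).
    q = embed(" ".join(history))
    score = lambda d: sum(a * b for a, b in zip(q, embed(d)))
    return sorted(docs, key=score, reverse=True)[:k]

docs = [
    "ChatRAG Bench covers ten conversational QA datasets.",
    "Dense retrievers encode queries and passages into vectors.",
    "Tables require special handling in question answering.",
]
history = [
    "What is a dense retriever?",
    "It encodes text into vectors.",
    "How are passages scored against the conversation?",
]
top = retrieve(history, docs)
```

In a real system, the top-k passages would then be concatenated into the context of an instruction-tuned generator; the paper's contribution is training both the retriever and the generator specifically for this multi-turn setting.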