Evaluating ChatGPT as a Question Answering System: A Comprehensive Analysis and Comparison with Existing Models (2312.07592v1)
Abstract: A multitude of Large Language Models (LLMs) has emerged in recent years to answer user questions. Notably, GPT-3.5 Turbo, the model underlying ChatGPT, has attracted substantial attention. Thanks to its large parameter count, it responds fluently to a wide range of questions; however, because it relies solely on internal knowledge, its answers are not always accurate. This article evaluates ChatGPT as a Question Answering System (QAS) and compares its performance with existing QASs. The primary focus is ChatGPT's ability to extract answers from a provided paragraph, a core QAS capability; its performance is also measured when no surrounding passage is supplied. Further experiments examine response hallucination and the effect of question complexity. The evaluation uses well-known Question Answering (QA) datasets covering English and Persian, namely SQuAD, NewsQA, and PersianQuAD, with F-score, exact match, and accuracy as metrics. The study shows that, while ChatGPT is a competent generative model, it is less effective at question answering than task-specific models. Providing context improves its performance, and prompt engineering improves its precision, particularly on questions whose answers are not explicitly present in the provided paragraph. ChatGPT handles simple factual questions better than "how" and "why" questions. The evaluation also documents hallucinations, where ChatGPT answers questions whose answers are absent from the provided context.
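For concreteness, the exact match and F-score reported in the evaluation are the standard SQuAD-style answer metrics. The following is a minimal sketch of how they are typically computed for one prediction/gold pair; the normalization (lowercasing, stripping punctuation and English articles) mirrors the official SQuAD evaluation script, and the example strings are hypothetical rather than taken from the paper's data.

```python
# SQuAD-style answer scoring: exact match (EM) and token-level F1.
# Illustrative re-implementation, not the authors' evaluation code.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """All-or-nothing credit: normalized strings must be identical."""
    return normalize(prediction) == normalize(gold)

def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall, giving partial
    credit when a generated answer overlaps the gold span."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A paraphrased answer can score 0 on EM yet earn partial F1 credit,
# which is why both metrics matter for a generative model like ChatGPT.
print(exact_match("the Eiffel Tower", "Eiffel Tower"))  # True (article stripped)
print(round(f1_score("in Paris, France", "Paris"), 2))  # 0.5
```

Accuracy can then be read as the fraction of questions whose answer is judged correct; when a question has multiple gold answers, SQuAD-style scoring takes the maximum score over them.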
- Bert F. Green Jr., Alice K. Wolf, Carol Chomsky, and Kenneth Laughery. Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference, pages 219–224, 1961.
- William A Woods. Progress in natural language understanding: an application to lunar geology. In Proceedings of the June 4-8, 1973, national computer conference and exposition, pages 441–450, 1973.
- Ellen M Voorhees. The TREC question answering track. Natural Language Engineering, 7(4):361–378, 2001.
- Fengbin Zhu, Wenqiang Lei, Chao Wang, Jianming Zheng, Soujanya Poria, and Tat-Seng Chua. Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv preprint arXiv:2101.00774, 2021.
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
- Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human-generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
- Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, 2015.
- Tom Kwiatkowski et al. Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
- Reham Omar, Omij Mangukiya, Panos Kalnis, and Essam Mansour. ChatGPT versus traditional question answering for knowledge graphs: Current status and future directions towards knowledge graph chatbots. arXiv preprint arXiv:2302.06466, 2023.
- Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822, 2018.
- Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830, 2016.
- PersianQuAD: The native question answering dataset for the Persian language. IEEE Access, 10:26045–26057, 2022.
- Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. LUKE: Deep contextualized entity representations with entity-aware self-attention. arXiv preprint arXiv:2010.01057, 2020.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
- Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529, 2019.
- Zhuosheng Zhang, Junjie Yang, and Hai Zhao. Retrospective reader for machine reading comprehension. arXiv preprint arXiv:2001.09694, 2020.
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations (ICLR), 2020.
- Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and Suranga Nanayakkara. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. arXiv preprint arXiv:2210.02627, 2022.
- Learning to generate questions by learning to recover answer-containing sentences. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1516–1529. Association for Computational Linguistics, August 2021.
- Souvik Kundu and Hwee Tou Ng. A question-focused multi-factor attention network for question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- Yi Tay, Luu Anh Tuan, Siu Cheung Hui, and Jian Su. Densely connected attention propagation for reading comprehension. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Authors: Hossein Bahak, Farzaneh Taheri, Zahra Zojaji, Arefeh Kazemi