GPT-4's assessment of its performance in a USMLE-based case study (2402.09654v2)
Abstract: This study investigates how GPT-4 assesses its own performance in healthcare applications. Using a simple prompting technique, the LLM was given questions drawn from the United States Medical Licensing Examination (USMLE) and asked to report its confidence both before and after each question. The questions were split into two groups: questions followed by feedback (WF) and questions with no feedback (NF). In both groups the model provided absolute and relative confidence scores before and after each question. The results were analyzed statistically to study the variability of confidence in the WF and NF groups, and a sequential analysis was conducted to observe how performance varied across each group. Results indicate that feedback influences relative confidence but does not consistently increase or decrease it. Understanding the performance of LLMs is paramount in exploring their utility in sensitive areas like healthcare. This study contributes to the ongoing discourse on the reliability of AI, particularly of LLMs like GPT-4, within healthcare, offering insights into how feedback mechanisms might be optimized to enhance AI-assisted medical education and decision support.
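As a rough illustration of the kind of comparison the abstract describes, the sketch below computes the mean and spread of the change in confidence (post minus pre) for the WF and NF groups. The records, the 1-10 scale, and the function name `confidence_shift` are assumptions made up for this example; they are not the study's data or its actual analysis.

```python
from statistics import mean, stdev

# Hypothetical records: (group, pre_confidence, post_confidence).
# All values are invented for illustration, not taken from the study.
records = [
    ("WF", 8, 9),
    ("WF", 7, 6),
    ("WF", 9, 9),
    ("NF", 8, 8),
    ("NF", 7, 8),
    ("NF", 9, 8),
]

def confidence_shift(rows, group):
    """Return (mean, sample standard deviation) of post - pre confidence for one group."""
    deltas = [post - pre for g, pre, post in rows if g == group]
    return mean(deltas), stdev(deltas)

for group in ("WF", "NF"):
    m, s = confidence_shift(records, group)
    print(f"{group}: mean shift = {m:.2f}, std = {s:.2f}")
```

A mean shift near zero with a non-zero spread would match the abstract's finding that feedback changes confidence without pushing it consistently in one direction.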