Do Large Language Models have Shared Weaknesses in Medical Question Answering? (2310.07225v3)
Abstract: LLMs have improved rapidly on medical benchmarks, but their unreliability remains a persistent challenge for safe real-world use. Designing for LLMs as a category, rather than for specific models, requires an understanding of the shared strengths and weaknesses that appear across models. To address this challenge, we benchmark a range of top LLMs and identify consistent patterns across models. We test $16$ well-known LLMs on $874$ newly collected questions from Polish medical licensing exams. For each question, we score each model on top-1 accuracy and the distribution of probabilities it assigns. We then compare these results with factors such as question difficulty for humans, question length, and the scores of the other models. LLM accuracies were positively correlated pairwise ($0.39$ to $0.58$). Model performance was also correlated with human performance ($0.09$ to $0.13$), but negatively correlated with the difference between the question-level accuracy of top-scoring and bottom-scoring humans ($-0.09$ to $-0.14$). The top output probability and question length were positive and negative predictors of accuracy, respectively ($p < 0.05$). The top-scoring LLM, GPT-4o, scored $84\%$, with Claude Opus, Gemini 1.5 Pro, and Llama 3/3.1 between $74\%$ and $79\%$. We found evidence that models resemble one another in which questions they answer correctly, and that they also show similarities with human test takers. Larger models typically performed better, but differences in training, architecture, and data also had large effects. Model accuracy was positively correlated with confidence, but negatively correlated with question length. We find similar results with older models, and argue that these patterns are likely to persist in future models trained with similar methods.
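The abstract's two main analyses can be illustrated with a short sketch. The following is a minimal illustration, not the authors' code: it assumes per-question results are available in a CSV with columns `model`, `question_id`, `correct`, `top_prob`, and `question_length` (all names hypothetical), and uses pandas and statsmodels to compute the two kinds of statistics the abstract reports: pairwise correlations of per-question correctness across models, and a per-model logistic regression of correctness on top output probability and question length.

```python
# Minimal sketch (not the authors' code) of the abstract's analyses:
# (1) pairwise correlation of per-question correctness across models, and
# (2) logistic regression of correctness on top output probability and
#     question length. File and column names below are assumptions.
import pandas as pd
import statsmodels.api as sm

# One row per (model, question); `correct` is 0/1 top-1 accuracy,
# `top_prob` is the probability assigned to the model's chosen answer.
results = pd.read_csv("benchmark_results.csv")  # hypothetical file

# (1) Pivot to a questions x models matrix of 0/1 scores, then take
# pairwise Pearson correlations between the models' score vectors
# (the abstract reports off-diagonal values of 0.39 to 0.58).
scores = results.pivot(index="question_id", columns="model", values="correct")
pairwise_corr = scores.corr()

# (2) For a single model, regress correctness on the top output
# probability (expected positive) and question length (expected negative).
one_model = results[results["model"] == "gpt-4o"].copy()  # illustrative name
X = sm.add_constant(one_model[["top_prob", "question_length"]])
fit = sm.Logit(one_model["correct"], X).fit(disp=0)

print(pairwise_corr.round(2))
print(fit.summary())
```

Pearson correlation on binary score vectors and a logistic regression are one plausible way to obtain the reported statistics; the paper itself may use different estimators or controls.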