
Do Large Language Models have Shared Weaknesses in Medical Question Answering? (2310.07225v3)

Published 11 Oct 2023 in cs.CL

Abstract: LLMs have made rapid improvements on medical benchmarks, but their unreliability remains a persistent challenge for safe real-world use. Designing for the use of LLMs as a category, rather than for specific models, requires developing an understanding of the shared strengths and weaknesses that appear across models. To address this challenge, we benchmark a range of top LLMs and identify consistent patterns across models. We test $16$ well-known LLMs on $874$ newly collected questions from Polish medical licensing exams. For each question, we score each model on top-1 accuracy and the distribution of probabilities assigned. We then compare these results with factors such as question difficulty for humans, question length, and the scores of the other models. LLM accuracies were positively correlated pairwise ($0.39$ to $0.58$). Model performance was also correlated with human performance ($0.09$ to $0.13$), but negatively correlated with the difference between the question-level accuracy of top-scoring and bottom-scoring humans ($-0.09$ to $-0.14$). The top output probability and question length were positive and negative predictors of accuracy, respectively (p $< 0.05$). The top-scoring LLM, GPT-4o Turbo, scored $84\%$, with Claude Opus, Gemini 1.5 Pro, and Llama 3/3.1 between $74\%$ and $79\%$. We found evidence of similarities between models in which questions they answer correctly, as well as similarities with human test takers. Larger models typically performed better, but differences in training, architecture, and data were also highly impactful. Model accuracy was positively correlated with confidence, but negatively correlated with question length. We find similar results with older models, and argue that these patterns are likely to persist across future models using similar training methods.
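
The sketch below illustrates the kind of analysis the abstract describes: correlating per-question correctness pairwise across models, and testing whether a model's top output probability and the question length predict whether it answers correctly. It is not the authors' code; the array names, the use of Pearson correlation, and the logistic-regression setup are assumptions, and the data here are random placeholders standing in for the real exam results.

```python
# Illustrative sketch (assumed setup, not the paper's implementation):
# correct[m, q] = 1 if model m answered question q correctly,
# conf[m, q]    = top output probability model m assigned on question q,
# q_len[q]      = length of question q in characters.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_models, n_questions = 16, 874                                # sizes from the paper
correct = rng.integers(0, 2, size=(n_models, n_questions))     # placeholder data
conf = rng.uniform(0.25, 1.0, size=(n_models, n_questions))    # placeholder data
q_len = rng.integers(100, 1500, size=n_questions)              # placeholder data

# Pairwise correlation of per-question correctness between models
# (the paper reports roughly 0.39-0.58 on its real data).
pairwise = np.zeros((n_models, n_models))
for i in range(n_models):
    for j in range(n_models):
        pairwise[i, j], _ = pearsonr(correct[i], correct[j])

# Per-model logistic regression: is top output probability a positive,
# and question length a negative, predictor of answering correctly?
for m in range(n_models):
    X = np.column_stack([conf[m], q_len])
    fit = LogisticRegression().fit(X, correct[m])
    print(f"model {m}: coef(confidence)={fit.coef_[0][0]:+.3f}, "
          f"coef(length)={fit.coef_[0][1]:+.3f}")
```

With the paper's real results matrix in place of the placeholder arrays, the off-diagonal entries of `pairwise` correspond to the reported inter-model correlations, and the signs of the two regression coefficients correspond to the reported positive effect of confidence and negative effect of question length.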
