• GPT-4, a large language model, demonstrates remarkable performance on medical competency examinations and benchmark datasets without specialized training.
  • The model outperforms both earlier general-purpose models and models fine-tuned on medical knowledge, and it is better calibrated: its predicted likelihood that an answer is correct more closely tracks how often the answer actually is correct.

Key terms:

  • USMLE (United States Medical Licensing Examination): A three-step examination program used to assess clinical competency and grant medical licensure in the United States
  • MultiMedQA: A suite of benchmark datasets for evaluating model performance in the medical domain
  • Probability calibration: The ability of a model to predict the likelihood that its answers are correct, critical in high-stakes applications like medicine
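
To make the calibration term concrete, here is a minimal sketch of one common way to quantify it, the expected calibration error (ECE). This is an illustration only, not necessarily the metric used in the study summarized here; the function name, the equal-width binning, and the sample numbers are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between stated confidence and observed accuracy.

    Hypothetical helper for illustration: predictions are grouped into
    equal-width confidence bins, and each bin contributes
    |mean confidence - accuracy| weighted by its share of the predictions.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, is_correct in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, is_correct))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_confidence - accuracy)
    return ece


# Made-up data: a model that says "90% sure" on ten questions.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

With this made-up data, the first model is right on 9 of 10 answers, so its stated confidence matches its accuracy and the error is essentially zero. The second is right on only 5 of 10, so the gap is about 0.4. A lower value means the model's confidence is a more trustworthy guide to whether its answer is correct.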

