Unknown training data for GPT-4 and GPT-3.5 and potential MedQA contamination
Determine which datasets were used to pretrain OpenAI's GPT-4 (gpt-4-0613) and GPT-3.5 (gpt-3.5-turbo-0613), and in particular whether the training or test sets of the MedQA medical question-answering benchmark were included. This is needed to assess potential data contamination and its effect on the performance reported in the AgentClinic-MedQA evaluations.
References
One limitation for the evaluations presented in this benchmark is that it is currently unknown what data was used to train GPT-4 and GPT-3.5. While previous works have cited GPT-4's accuracy as a valid measure, it is entirely possible that GPT-4/3.5 could have been trained on the MedQA test set, giving it an unfair advantage on the task.
— AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
(2405.07960 - Schmidgall et al., 13 May 2024) in Discussion (paragraph beginning “One limitation for the evaluations…”).