
Unknown training data for GPT-4 and GPT-3.5 and potential MedQA contamination

Determine the training datasets used to pretrain OpenAI GPT-4 (gpt-4-0613) and GPT-3.5 (gpt-3.5-turbo-0613), and in particular whether the test or training sets of the MedQA medical question-answering benchmark were included, in order to assess potential data contamination and its impact on reported performance in AgentClinic-MedQA evaluations.


Background

The AgentClinic-MedQA benchmark evaluates doctor agents powered by various LLMs, including GPT-4 and GPT-3.5. The validity of comparative results depends on whether these models were exposed to benchmark items during pretraining.

The authors note that the training data for GPT-4 and GPT-3.5 has not been disclosed, raising the possibility that these models were trained on the MedQA test set, which would confer an unfair advantage. In contrast, open models such as Mixtral-8x7B and Llama 2-70B-Chat do not report training on the MedQA test or training sets.

Clarifying whether GPT-4 and GPT-3.5 training corpora included MedQA is essential for interpreting diagnostic accuracy results and for ensuring fair comparisons across models on AgentClinic-MedQA.
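While the question can only be settled definitively by disclosure of the training corpora, contamination can be probed behaviorally. The sketch below illustrates one common heuristic, a verbatim-completion probe: if a model can reproduce the exact continuation of benchmark items given only their prefixes, those items were plausibly in its training data. This is a minimal illustration, not the AgentClinic authors' methodology; it assumes the official `openai` Python client (v1+), and `load_medqa_test` is a hypothetical placeholder loader.

```python
"""Minimal sketch of a completion-based contamination probe.

Heuristic: prompt the model with the first half of a benchmark
question and measure how closely its output matches the true,
held-out continuation. Consistently near-verbatim continuations
across many items would be evidence of memorization.
"""
import difflib

from openai import OpenAI  # assumes openai>=1.0

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def completion_similarity(model: str, question: str, split: float = 0.5) -> float:
    """Ask `model` to continue a truncated question; return the string
    similarity between its output and the true continuation (0.0-1.0)."""
    cut = int(len(question) * split)
    prefix, reference = question[:cut], question[cut:]
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        max_tokens=200,
        messages=[
            {
                "role": "system",
                "content": "Continue the text exactly as it originally appeared.",
            },
            {"role": "user", "content": prefix},
        ],
    )
    candidate = resp.choices[0].message.content or ""
    return difflib.SequenceMatcher(
        None, reference.strip(), candidate.strip()
    ).ratio()


# Hypothetical usage over a sample of MedQA test items:
# questions = load_medqa_test()  # placeholder loader, not a real API
# scores = [completion_similarity("gpt-4-0613", q) for q in questions[:100]]
# print(sum(scores) / len(scores))
```

Note that such probes are one-sided evidence: high similarity suggests contamination, but low similarity does not rule it out, since models can be trained on data they cannot reproduce verbatim.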

References

One limitation for the evaluations presented in this benchmark is that it is currently unknown what data was used to train GPT-4 and GPT-3.5. While previous works have cited GPT-4's accuracy as a valid measure, it is entirely possible that GPT-4/3.5 could have been trained on the MedQA test set, giving it an unfair advantage on the task.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments (arXiv:2405.07960, Schmidgall et al., 13 May 2024), Discussion, paragraph beginning "One limitation for the evaluations…".