Cause of GPT-4’s notably strong performance in the Matter at Extremes module

Investigate the factors underlying GPT-4’s strong performance on assessments in the Matter at Extremes module (covering topics such as particle colliders and superconductivity), specifically testing whether this is due to a limited variety of canonical question forms or to fortuitous overlap with the distribution of GPT-4’s training data.

Background

In the Matter at Extremes module, GPT-4 achieved a first-class overall mark (85%), performing strongly on computational and explanatory tasks despite persistent issues with diagrammatic questions and some equations embedded as images. This contrasts with weaker performance on multi-step reasoning tasks in other modules.

The authors explicitly state uncertainty about why GPT-4’s answers were of unusually high quality in this module. They suggest two possible explanations — limited question diversity or coincidental alignment with the model’s training data — and thereby motivate a targeted investigation.

References

We were impressed by the quality of answers here but are unclear why this might be the case - perhaps a limited set of questions (or variations thereof) exist, or perhaps the nature of the questions is fortuitously coincidental with GPT-4's training set.

Can ChatGPT pass a physics degree? Making a case for reformation of assessment of undergraduate degrees (arXiv:2412.01312, Pimbblet et al., 2 Dec 2024), Section 4, “Matter at Extremes”