
Exploring Durham University Physics exams with Large Language Models

Published 27 Jun 2023 in physics.ed-ph (arXiv:2306.15609v1)

Abstract: The emergence of advanced NLP models like ChatGPT has raised concerns among universities regarding AI-driven exam completion. This paper provides a comprehensive evaluation of the proficiency of GPT-4 and GPT-3.5 in answering a set of 42 exam papers drawn from 10 distinct physics courses administered at Durham University between 2018 and 2022, totalling 593 questions and 2504 available marks. These exams, spanning both undergraduate and postgraduate levels, include traditional pre-COVID and adaptive COVID-era formats. Questions from 2018-2020 were designed for pre-COVID, in-person invigilated examinations, whereas the 2021-2022 exams were set for varying COVID-adapted conditions, including open-book conditions. To ensure a fair evaluation of AI performance, the exams completed by AI were assessed by the original exam markers. However, due to staffing constraints, only the aforementioned 593 out of the total 1280 questions were marked. GPT-4 and GPT-3.5 scored an average of 49.4% and 38.6%, respectively, suggesting that only the weaker students would potentially improve their marks by using AI. For exams from the pre-COVID era, the average scores for GPT-4 and GPT-3.5 were 50.8% and 41.6%, respectively; post-COVID, these dropped to 47.5% and 33.6%. Thus, contrary to expectations, the change to less fact-based questions in the COVID era did not significantly impact AI performance for state-of-the-art models such as GPT-4. These findings suggest that while current AI models struggle with university-level physics questions, an improving trend is observable. The code used for automated AI completion is made publicly available for further research.
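The abstract does not detail the completion or scoring pipeline, but the headline figures (e.g. 49.4% for GPT-4) follow from summing awarded marks over available marks across the marked questions. A minimal, runnable sketch of that workflow is below; `ask_model` is a hypothetical stand-in for a real LLM API call, and the names here are illustrative, not the paper's actual code:

```python
# Hypothetical sketch of an exam-completion pipeline.
# `ask_model` is a stub; a real pipeline would call an LLM API
# (e.g. GPT-4 / GPT-3.5) with each question as the prompt.

def ask_model(question: str) -> str:
    """Placeholder for an LLM call; returns a dummy answer."""
    return f"[model answer to: {question}]"

def complete_exam(questions):
    """Collect the model's answer for every question in a paper."""
    return [ask_model(q) for q in questions]

def overall_percentage(marked):
    """Aggregate (marks_awarded, marks_available) pairs into an
    overall score, as reported in the paper: total awarded divided
    by total available, as a percentage."""
    awarded = sum(a for a, _ in marked)
    available = sum(m for _, m in marked)
    return 100.0 * awarded / available

# Example: three marked questions worth 10 marks each.
marked = [(6, 10), (4, 10), (5, 10)]
print(f"{overall_percentage(marked):.1f}%")  # prints "50.0%"
```

The aggregation is mark-weighted rather than a per-question mean, so a high-mark question influences the overall percentage proportionally more than a short one.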
