Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses (2306.10073v1)
Abstract: This paper studies recent developments in the ability of large language models (LLMs) to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. The emergence of ChatGPT resulted in heated debates about its potential uses (e.g., exercise generation, code explanation) as well as misuses (e.g., cheating) in programming classes. Recent studies show that while the technology performs surprisingly well on the diverse sets of assessment instruments employed in typical programming classes, its performance is usually not sufficient to pass the courses. The release of GPT-4 brought notable improvements in the capabilities related to handling assessments originally designed for human test-takers. This study provides a necessary analysis in the context of this ongoing transition towards mature generative AI systems. Specifically, we report the performance of GPT-4, comparing it to previous generations of GPT models, on three Python courses with assessments ranging from simple multiple-choice questions (no code involved) to complex programming projects with code bases distributed across multiple files (599 exercises overall). Additionally, we analyze the assessments that GPT-4 did not handle well in order to understand the current limitations of the model, as well as its capability to leverage feedback provided by an auto-grader. We found that the GPT models evolved from completely failing typical programming class assessments (the original GPT-3) to confidently passing the courses with no human involvement (GPT-4). While we identified certain limitations in GPT-4's handling of MCQs and coding exercises, the rate of improvement across recent generations of GPT models strongly suggests their potential to handle almost any type of assessment widely used in higher education programming courses. These findings could be leveraged by educators and institutions to adapt the design of programming assessments, as well as to fuel the necessary discussions about how programming classes should be updated to reflect recent technological developments. This study provides evidence that programming instructors need to prepare for a world in which an easy-to-use, widely accessible technology can be utilized by learners to collect passing scores, with no effort whatsoever, on what today counts as viable assessments of programming knowledge and skills.
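The auto-grader experiments mentioned in the abstract follow a simple loop: prompt the model with an exercise, run the returned code through the grader, and, on failure, hand the grader's feedback back to the model for another attempt. The Python sketch below illustrates that loop under stated assumptions; it is not the authors' actual evaluation harness. The `run_autograder` helper, the system prompt, and the retry budget are hypothetical, while the OpenAI client calls follow the publicly documented openai-python library.

```python
# Minimal sketch of a feedback-leveraging evaluation loop, as described
# in the abstract. The run_autograder helper is hypothetical; a real
# course grader would execute the submission against unit tests.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_autograder(code: str) -> tuple[bool, str]:
    """Hypothetical stand-in for a course auto-grader.

    Returns (passed, feedback), where feedback describes failing tests.
    """
    raise NotImplementedError


def solve_with_feedback(exercise: str, max_attempts: int = 3) -> bool:
    """Ask the model to solve an exercise, retrying with grader feedback."""
    messages = [
        {"role": "system",
         "content": "You are solving a Python course exercise. Reply with code only."},
        {"role": "user", "content": exercise},
    ]
    for _ in range(max_attempts):
        response = client.chat.completions.create(model="gpt-4", messages=messages)
        code = response.choices[0].message.content
        passed, feedback = run_autograder(code)
        if passed:
            return True
        # Return the grader's output to the model, mirroring the
        # feedback-leveraging setup the abstract describes.
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user",
                         "content": f"The auto-grader reported:\n{feedback}\nPlease fix the code."})
    return False
```

In a real course setting, the grader would typically run the submission against hidden unit tests and report failing cases, and the number of permitted feedback rounds bounds how much the model can improve an initially failing solution.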
Authors: Jaromir Savelka, Arav Agarwal, Marshall An, Chris Bogart, Majd Sakr