Feedback-Generation for Programming Exercises With GPT-4 (2403.04449v2)
Abstract: Ever since LLMs and related applications became broadly available, several studies have investigated their potential for assisting educators and supporting students in higher education. LLMs such as Codex, GPT-3.5, and GPT-4 have shown promising results in the context of large programming courses, where students can benefit from feedback and hints if these are provided in a timely manner and at scale. This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input. Two assignments from an introductory programming course were selected, and GPT-4 was asked to generate feedback for 55 randomly chosen, authentic student programming submissions. The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material. Compared to prior work and analyses of GPT-3.5, GPT-4 Turbo shows notable improvements. For example, the output is more structured and consistent. GPT-4 Turbo can also accurately identify invalid casing in student programs' output. In some cases, the feedback also includes the output of the student program. At the same time, inconsistent feedback was noted, such as stating that a submission is correct while also saying that an error needs to be fixed. The present work increases our understanding of LLMs' potential and limitations, of how to integrate them into e-assessment systems and pedagogical scenarios, and of how to instruct students who are using applications based on GPT-4.
Authors: Imen Azaiz, Natalie Kiesler, Sven Strickroth