
Feedback-Generation for Programming Exercises With GPT-4 (2403.04449v2)

Published 7 Mar 2024 in cs.AI

Abstract: Ever since LLMs and related applications became broadly available, several studies have investigated their potential for assisting educators and supporting students in higher education. LLMs such as Codex, GPT-3.5, and GPT-4 have shown promising results in the context of large programming courses, where students can benefit from feedback and hints if these are provided in a timely manner and at scale. This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input. Two assignments from an introductory programming course were selected, and GPT-4 was asked to generate feedback for 55 randomly chosen, authentic student programming submissions. The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material. Compared to prior work and analyses of GPT-3.5, GPT-4 Turbo shows notable improvements: for example, the output is more structured and consistent, and GPT-4 Turbo can accurately identify invalid casing in student programs' output. In some cases, the feedback also includes the output of the student program. At the same time, inconsistent feedback was noted, such as stating that the submission is correct while also pointing out an error that needs to be fixed. The present work increases our understanding of LLMs' potential and limitations, of how to integrate them into e-assessment systems and pedagogical scenarios, and of how to instruct students who are using applications based on GPT-4.
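
The core setup described above, prompting the model with both the task specification and a student's submission, can be sketched as follows. This is a minimal illustration assuming the OpenAI chat completions API; the model identifier, the prompt wording, and the helper generate_feedback are hypothetical and are not the authors' exact protocol.

    # Minimal sketch: combine the task specification and a student's
    # submission into one prompt and ask GPT-4 Turbo for formative feedback.
    # Model name and prompt wording are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate_feedback(task_spec: str, submission: str) -> str:
        """Request formative feedback for one submission (hypothetical helper)."""
        response = client.chat.completions.create(
            model="gpt-4-turbo",  # assumption; the paper evaluates GPT-4 Turbo
            messages=[
                {"role": "system",
                 "content": ("You are a tutor in an introductory programming course. "
                             "Give formative feedback: localize faults, but do not "
                             "reveal a complete solution.")},
                {"role": "user",
                 "content": (f"Task specification:\n{task_spec}\n\n"
                             f"Student submission:\n{submission}\n\n"
                             "Comment on correctness and how to fix any errors.")},
            ],
        )
        return response.choices[0].message.content

    # Example usage with placeholder inputs:
    # print(generate_feedback(open("task.md").read(), open("Main.java").read()))

A helper like this would be called once per submission; the paper's analysis then codes each generated answer for features such as correctness, personalization, and fault localization.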

Authors (3)
  1. Imen Azaiz (3 papers)
  2. Natalie Kiesler (17 papers)
  3. Sven Strickroth (6 papers)
Citations (9)