Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation (2310.03780v4)
Abstract: Generative AI and LLMs hold great promise in enhancing programming education by automatically generating individualized feedback for students. We investigate the role of generative AI models in providing human tutor-style programming hints to help students resolve errors in their buggy programs. Recent works have benchmarked state-of-the-art models for various feedback generation scenarios; however, their overall quality is still inferior to human tutors and not yet ready for real-world deployment. In this paper, we seek to push the limits of generative AI models toward providing high-quality programming hints and develop a novel technique, GPT4Hints-GPT3.5Val. As a first step, our technique leverages GPT-4 as a "tutor" model to generate hints -- it boosts the generative quality by using symbolic information of failing test cases and fixes in prompts. As a next step, our technique leverages GPT-3.5, a weaker model, as a "student" model to further validate the hint quality -- it performs an automatic quality validation by simulating the potential utility of providing this feedback. We show the efficacy of our technique via extensive evaluation using three real-world datasets of Python programs covering a variety of concepts ranging from basic algorithms to regular expressions and data analysis using the pandas library.
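The abstract describes a two-stage pipeline: a stronger "tutor" model generates a hint from the buggy program plus symbolic information (failing test cases and a fix), and a weaker "student" model is used to check whether the hint would actually help. The sketch below shows one plausible way to wire this up with the OpenAI chat API; the model names, prompt wording, the `run_tests` checker, and the simple pass-the-tests acceptance rule are illustrative assumptions that simplify the validation procedure described in the paper.

```python
# Hedged sketch of a GPT4Hints-GPT3.5Val-style pipeline (simplified assumptions,
# not the paper's exact prompts or validation criteria).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_hint(problem: str, buggy_program: str,
                  failing_tests: str, fixed_program: str) -> str:
    """Tutor stage: ask GPT-4 for one natural-language hint, grounding the
    prompt in symbolic information (failing tests and a candidate fix)."""
    prompt = (
        f"Problem description:\n{problem}\n\n"
        f"Student's buggy Python program:\n{buggy_program}\n\n"
        f"Failing test cases:\n{failing_tests}\n\n"
        f"A possible fixed program:\n{fixed_program}\n\n"
        "Give one short hint (no code) that helps the student find and fix the bug."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def validate_hint(problem: str, buggy_program: str, hint: str, run_tests) -> bool:
    """Student stage: simulate a weaker student (GPT-3.5) repairing the program
    using only the hint; accept the hint if the repaired program passes the tests.
    `run_tests` is a user-supplied callable mapping program text to True/False."""
    prompt = (
        f"Problem description:\n{problem}\n\n"
        f"Your buggy Python program:\n{buggy_program}\n\n"
        f"Tutor hint:\n{hint}\n\n"
        "Using only this hint, output a corrected version of the program."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    repaired_program = resp.choices[0].message.content
    return run_tests(repaired_program)
```

In this sketch, a hint is only surfaced to the student when `validate_hint` returns True, i.e., when the weaker model can turn the hint into a passing program; this mirrors the idea of simulating the potential utility of the feedback before delivering it.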
Authors: Tung Phung, Victor-Alexandru Pădurean, Anjali Singh, Christopher Brooks, José Cambronero, Sumit Gulwani, Adish Singla, Gustavo Soares