
Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation (2310.03780v4)

Published 5 Oct 2023 in cs.AI

Abstract: Generative AI and LLMs hold great promise in enhancing programming education by automatically generating individualized feedback for students. We investigate the role of generative AI models in providing human tutor-style programming hints to help students resolve errors in their buggy programs. Recent works have benchmarked state-of-the-art models for various feedback generation scenarios; however, their overall quality is still inferior to human tutors and not yet ready for real-world deployment. In this paper, we seek to push the limits of generative AI models toward providing high-quality programming hints and develop a novel technique, GPT4Hints-GPT3.5Val. As a first step, our technique leverages GPT-4 as a "tutor" model to generate hints -- it boosts the generative quality by using symbolic information of failing test cases and fixes in prompts. As a next step, our technique leverages GPT-3.5, a weaker model, as a "student" model to further validate the hint quality -- it performs an automatic quality validation by simulating the potential utility of providing this feedback. We show the efficacy of our technique via extensive evaluation using three real-world datasets of Python programs covering a variety of concepts ranging from basic algorithms to regular expressions and data analysis using the pandas library.
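The abstract describes a two-stage pipeline: a GPT-4 "tutor" generates a hint from the buggy program plus symbolic information (failing test cases), and a GPT-3.5 "student" validates the hint by simulating whether it would actually help. A minimal sketch of that control flow, with the model calls stubbed out (all function names and the stubbed repair are illustrative assumptions, not the paper's API):

```python
# Hypothetical sketch of a GPT4Hints-GPT3.5Val-style pipeline.
# The two LLM calls are replaced by stubs; only the control flow is illustrated.

def run_failing_tests(program, tests):
    """Collect symbolic information: the test cases the buggy program fails."""
    failures = []
    for inp, expected in tests:
        try:
            if program(inp) != expected:
                failures.append((inp, expected))
        except Exception:
            failures.append((inp, expected))
    return failures

def tutor_generate_hint(buggy_code, failures):
    """Stub for the GPT-4 'tutor' call: in the paper, the prompt includes
    the failing test cases and a program fix to boost hint quality."""
    return f"Hint: your program fails on input {failures[0][0]}; check the loop bound."

def student_validates(hint, tests):
    """Stub for the GPT-3.5 'student' check: does a weaker model, given the
    hint, now produce a passing program? Here we simulate a successful repair."""
    repaired = lambda n: sum(range(1, n + 1))  # pretend the student fixed the bug
    return all(repaired(inp) == expected for inp, expected in tests)

# Example buggy program: off-by-one when summing 1..n
buggy = lambda n: sum(range(1, n))
tests = [(3, 6), (5, 15)]

failures = run_failing_tests(buggy, tests)
hint = tutor_generate_hint(buggy, failures)
accepted = student_validates(hint, tests)  # hint is shown to the student only if validated
```

The design point is that validation needs no human in the loop: a deliberately weaker model stands in for the student, so a hint is accepted only if it is plausibly useful to someone who could not fix the bug unaided.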

Authors (8)
  1. Tung Phung
  2. Victor-Alexandru Pădurean
  3. Anjali Singh
  4. Christopher Brooks
  5. José Cambronero
  6. Sumit Gulwani
  7. Adish Singla
  8. Gustavo Soares
Citations (26)