
How to Teach Programming in the AI Era? Using LLMs as a Teachable Agent for Debugging (2310.05292v5)

Published 8 Oct 2023 in cs.HC and cs.SE

Abstract: LLMs now excel at generative skills and can create content at impeccable speeds. However, they are imperfect and still make various mistakes. In a Computer Science education context, as these models are widely recognized as "AI pair programmers," it becomes increasingly important to train students on evaluating and debugging the LLM-generated code. In this work, we introduce HypoCompass, a novel system to facilitate deliberate practice on debugging, where human novices play the role of Teaching Assistants and help LLM-powered teachable agents debug code. We enable effective task delegation between students and LLMs in this learning-by-teaching environment: students focus on hypothesizing the cause of code errors, while adjacent skills like code completion are offloaded to LLM-agents. Our evaluations demonstrate that HypoCompass generates high-quality training materials (e.g., bugs and fixes), outperforming human counterparts fourfold in efficiency, and significantly improves student performance on debugging by 12% in the pre-to-post test.
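The abstract's core mechanism can be sketched as a simple loop: an LLM-powered teachable agent produces buggy code, the student (acting as a TA) only hypothesizes the cause of the error, and the mechanical fix is delegated back to the agent. The sketch below is a minimal, runnable illustration of that delegation pattern, not the paper's implementation; `ask_llm` is a hypothetical stand-in for a real LLM call, stubbed with a canned bug and fix so it runs offline.

```python
# Planted bug: the mean divides by (len - 1) instead of len.
BUGGY = "def mean(xs):\n    return sum(xs) / (len(xs) - 1)  # off-by-one bug\n"
FIXED = "def mean(xs):\n    return sum(xs) / len(xs)\n"

def ask_llm(task: str) -> str:
    """Hypothetical stand-in for the LLM-powered teachable agent (assumption,
    not the paper's API); returns canned outputs for each delegated task."""
    return {"write_buggy_code": BUGGY, "apply_fix": FIXED}[task]

def debugging_session(student_hypothesis: str) -> dict:
    # The agent generates code containing a planted bug.
    buggy = ask_llm("write_buggy_code")
    # The student's only job is to hypothesize the cause of the failure;
    # adjacent skills like writing the fix are offloaded to the agent.
    correct = "off-by-one" in student_hypothesis.lower()
    code = ask_llm("apply_fix") if correct else buggy
    return {"buggy": buggy, "hypothesis_correct": correct, "code": code}

result = debugging_session("the denominator has an off-by-one error")
```

In a real system the two `ask_llm` tasks would be separate prompts to a chat model, and the hypothesis check would itself be graded by the model rather than by keyword match.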

Authors (4)
  1. Qianou Ma (7 papers)
  2. Hua Shen (32 papers)
  3. Kenneth Koedinger (12 papers)
  4. Tongshuang Wu (53 papers)
Citations (12)