How to Teach Programming in the AI Era? Using LLMs as a Teachable Agent for Debugging (2310.05292v5)
Abstract: Large language models (LLMs) now excel at generative tasks and can produce content at remarkable speed. However, they are imperfect and still make various mistakes. In a Computer Science education context, where these models are widely promoted as "AI pair programmers," it becomes increasingly important to train students to evaluate and debug LLM-generated code. In this work, we introduce HypoCompass, a novel system that facilitates deliberate practice in debugging: human novices play the role of Teaching Assistants and help LLM-powered teachable agents debug code. We enable effective task delegation between students and LLMs in this learning-by-teaching environment: students focus on hypothesizing the causes of code errors, while adjacent skills such as code completion are offloaded to LLM agents. Our evaluations show that HypoCompass generates high-quality training materials (e.g., bugs and fixes) four times more efficiently than human counterparts, and significantly improves student debugging performance, with a 12% gain from pre-test to post-test.
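The abstract describes an interaction pattern rather than an implementation, but the core delegation loop, an LLM role-playing a novice who presents buggy code while the human "TA" only hypothesizes about the bug's cause, can be sketched as a simple prompt loop. The sketch below is a minimal illustration assuming an OpenAI-style chat API; the persona prompt, model name, and function names are hypothetical stand-ins, not HypoCompass's actual prompts or architecture.

```python
# Minimal sketch of a learning-by-teaching debugging loop, in the spirit of
# HypoCompass's task delegation. All prompts, names, and the model choice
# are illustrative assumptions, not the paper's actual implementation.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Hypothetical persona: the LLM plays a novice who wrote buggy code and
# answers the TA's questions, but never fixes the bug itself.
NOVICE_PERSONA = (
    "You are a novice programmer. Present a short Python function containing "
    "one subtle bug, then answer the teaching assistant's questions about its "
    "behavior truthfully, but never reveal or fix the bug yourself."
)

def novice_turn(history: list[dict]) -> str:
    """One turn of the LLM-powered teachable agent."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model choice
        messages=[{"role": "system", "content": NOVICE_PERSONA}, *history],
    )
    return response.choices[0].message.content

def ta_session(rounds: int = 5) -> None:
    """The human TA proposes hypotheses about the bug's cause; code-level
    work stays delegated to the agent, mirroring the paper's division of labor."""
    history: list[dict] = [{"role": "user", "content": "Show me your code."}]
    for _ in range(rounds):
        reply = novice_turn(history)
        print(reply)
        history.append({"role": "assistant", "content": reply})
        hypothesis = input("Your hypothesis about the bug (or a question): ")
        history.append({"role": "user", "content": hypothesis})

if __name__ == "__main__":
    ta_session()
```

In the actual system, the bugs and fixes are produced by a curated LLM pipeline rather than improvised per session; this sketch only illustrates the student-facing interaction pattern.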
Authors: Qianou Ma, Hua Shen, Kenneth Koedinger, Tongshuang Wu