Benchmarking Educational Program Repair (2405.05347v1)
Abstract: The emergence of LLMs has sparked enormous interest due to their potential application across a range of educational tasks. For example, recent work in programming education has used LLMs to generate learning resources, improve error messages, and provide feedback on code. However, one factor that limits progress within the field is that much of the research uses bespoke datasets and different evaluation metrics, making direct comparisons between results unreliable. Thus, there is a pressing need for standardization and benchmarks that facilitate the equitable comparison of competing approaches. One task where LLMs show great promise is program repair, which can be used to provide debugging support and next-step hints to students. In this article, we propose a novel educational program repair benchmark. We curate two high-quality, publicly available programming datasets, present a unified evaluation procedure that introduces a novel evaluation metric, rouge@k, for approximating the quality of repairs, and evaluate a set of five recent models to establish baseline performance.
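As a rough illustration of the kind of best-of-k similarity metric the abstract describes, the sketch below scores each sampled repair against a reference repair with ROUGE-L and keeps the best score among the first k samples. This is a minimal sketch, not the paper's implementation: the choice of the ROUGE-L F-measure, the function names, and the plain maximum over k samples (rather than an unbiased pass@k-style estimator) are all assumptions made here for illustration.

```python
# Minimal sketch of a rouge@k-style metric (assumed definition, not the
# paper's code): best ROUGE-L F-measure among the first k candidate repairs.
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def rouge_at_k(candidates: list[str], reference: str, k: int) -> float:
    """Return the best ROUGE-L F-measure among the first k candidate repairs."""
    scores = [
        scorer.score(reference, cand)["rougeL"].fmeasure
        for cand in candidates[:k]
    ]
    return max(scores, default=0.0)

def mean_rouge_at_k(samples: list[tuple[list[str], str]], k: int) -> float:
    """Average rouge@k over (candidate repairs, reference repair) pairs."""
    return sum(rouge_at_k(cands, ref, k) for cands, ref in samples) / len(samples)
```

In practice one would normalize or tokenize the programs consistently before scoring; ROUGE was originally designed for evaluating natural-language summaries, so treating source code as plain text is itself a simplifying assumption.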
Authors: Charles Koutcheme, Nicola Dainese, Sami Sarsa, Juho Leinonen, Arto Hellas, Paul Denny