Decoding Logic Errors: A Comparative Study on Bug Detection by Students and Large Language Models (2311.16017v1)
Abstract: Identifying and resolving logic errors can be one of the most frustrating challenges for novice programmers. Unlike syntax errors, for which a compiler or interpreter can issue a message, logic errors can be subtle. In certain conditions, buggy code may even exhibit correct behavior -- in other cases, the issue might lie in how a problem statement has been interpreted. Such errors can be hard to spot when reading the code, and they can also be missed by automated tests. There is great educational potential in automatically detecting logic errors, especially when paired with suitable feedback for novices. Large language models (LLMs) have recently demonstrated surprising performance on a range of computing tasks, including generating and explaining code. These capabilities are closely linked to code syntax, which aligns with the next-token prediction behavior of LLMs. Logic errors, on the other hand, relate to the runtime behavior of code and thus may not be as well suited to analysis by LLMs. To explore this, we investigate the performance of two popular LLMs, GPT-3 and GPT-4, at detecting logic errors and providing novice-friendly explanations of them. We compare LLM performance with that of a large cohort of introductory computing students $(n=964)$ solving the same error detection task. Through a mixed-methods analysis of student and model responses, we observe significant improvement in logic error identification between the previous and current generation of LLMs, and find that both LLM generations significantly outperform students. We outline how such models could be integrated into computing education tools, and discuss their potential for supporting students as they learn programming.
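The abstract's point that buggy code can behave correctly under some conditions, and even pass tests, is easy to see with a small example. The following Python sketch is purely illustrative (the function, test values, and inputs are hypothetical and not drawn from the study's error detection task):

```python
# Hypothetical illustration of a subtle logic error: the function is meant
# to return the largest value in a list, but initializing `largest` to 0
# instead of the first element makes it wrong for all-negative inputs.

def largest_value(numbers):
    largest = 0  # Bug: should be numbers[0] (or float('-inf'))
    for n in numbers:
        if n > largest:
            largest = n
    return largest

# The bug hides behind tests that happen to include positive values...
assert largest_value([3, 1, 4, 1, 5]) == 5  # passes

# ...and only surfaces on all-negative input, where 0 is silently returned.
print(largest_value([-7, -3, -9]))  # prints 0, but the correct answer is -3
```

No compiler or interpreter message flags this mistake, and a test suite without an all-negative case would miss it, which is precisely why detecting such errors and explaining them in novice-friendly terms is the task studied here.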
Authors: Stephen MacNeil, Paul Denny, Andrew Tran, Juho Leinonen, Seth Bernstein, Arto Hellas, Sami Sarsa, Joanne Kim