Let's Ask AI About Their Programs: Exploring ChatGPT's Answers To Program Comprehension Questions (2404.11734v1)

Published 17 Apr 2024 in cs.CY

Abstract: Recent research has explored the creation of questions from code submitted by students. These Questions about Learners' Code (QLCs) are created through program analysis, exploring execution paths, and then creating code comprehension questions from these paths and the broader code structure. Responding to the questions requires reading and tracing the code, which is known to support students' learning. At the same time, computing education researchers have witnessed the emergence of LLMs that have taken the community by storm. Researchers have demonstrated the applicability of these models especially in the introductory programming context, outlining their performance in solving introductory programming problems and their utility in creating new learning resources. In this work, we explore the capability of the state-of-the-art LLMs (GPT-3.5 and GPT-4) in answering QLCs that are generated from code that the LLMs have created. Our results show that although the state-of-the-art LLMs can create programs and trace program execution when prompted, they easily succumb to similar errors that have previously been recorded for novice programmers. These results demonstrate the fallibility of these models and perhaps dampen the expectations fueled by the recent LLM hype. At the same time, we also highlight future research possibilities such as using LLMs to mimic students as their behavior can indeed be similar for some specific tasks.

Exploring ChatGPT's Capacity to Answer Program Comprehension Questions from Self-Generated Code

Introduction

Researchers at Aalto University examined how well state-of-the-art LLMs, specifically GPT-3.5 and GPT-4, answer Questions about Learners' Code (QLCs). These QLCs were formulated from code snippets generated by the LLMs themselves, serving a dual purpose: assessing the models' comprehension of the programming constructs they create and identifying common error patterns in their responses.

Experiment Design

The experiment followed a structured sequence, sketched in code after the list:

  1. The LLMs were tasked with generating program code based on provided exercise descriptions.
  2. From these generated programs, QLCs were automatically produced using the QLCpy library.
  3. The LLMs subsequently attempted to answer these QLCs.
  4. Finally, the researchers manually analyzed the correctness of the LLM responses and categorized errors.
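
To make this workflow concrete, the following is a minimal Python sketch of how such a pipeline could be wired together, assuming the OpenAI Python SDK (version 1 or later). The generate_qlcs helper is a hypothetical stand-in for the QLCpy library rather than its real API, and the exercise text, prompts, and model name are illustrative only.

from openai import OpenAI  # assumes the openai Python package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4") -> str:
    """Send a single-turn prompt to a chat model and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1: have the model write a program for an exercise description.
exercise = "Write a Python function that counts how many numbers in a list are even."
program = ask("Solve this exercise in Python:\n" + exercise)

# Step 2: derive QLCs from the generated program. The study used the QLCpy
# library; generate_qlcs is a hypothetical stand-in, not QLCpy's real API.
def generate_qlcs(source: str) -> list[str]:
    return ["In the following program, which line contains a loop?\n" + source]

# Step 3: ask the same model to answer each question about its own code.
questions = generate_qlcs(program)
answers = [ask(q) for q in questions]

# Step 4 (manual in the study): inspect the answers and categorize errors.
for question, answer in zip(questions, answers):
    print(question, "->", answer)
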

The QLCs aimed to test various aspects of program comprehension, such as variable roles, loop behaviors, and line-specific purposes, reflecting different cognitive levels in program understanding.
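
For illustration, the program elements that such questions target can be enumerated statically. The simplified sketch below is not QLCpy's actual implementation; it uses Python's standard ast module to list function parameters, assigned variables, and loop locations in a small example program, each of which could seed a question such as "Which of these is a parameter of count_even?" or "On which line does a loop begin?".

import ast

# A small program of the kind the models were asked to generate.
source = """\
def count_even(numbers):
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += 1
    return total
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        params = [arg.arg for arg in node.args.args]
        print(f"Function '{node.name}' takes parameters {params}")  # parameter QLC
    elif isinstance(node, ast.Assign):
        names = [t.id for t in node.targets if isinstance(t, ast.Name)]
        print(f"Line {node.lineno} assigns to {names}")             # variable-role QLC
    elif isinstance(node, ast.For):
        print(f"Line {node.lineno} starts a loop")                  # loop-behavior QLC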

Findings and Observations

Performance Summary

Overall, GPT-4 demonstrated superior performance over GPT-3.5 across most QLC types, consistent with the incremental improvements of newer LLM generations. Success rates varied considerably by QLC type: both models performed robustly when identifying function parameters and variable names, but struggled with more dynamic aspects such as loop behaviors and questions that require tracing execution.

Error Analysis

A detailed error analysis highlighted both models' pitfalls (a small illustrative example follows the list):

  • Logical Errors: Both models occasionally produced illogical steps in code execution or misunderstood code semantics, issues also common among novice programmers.
  • Line Numbering Issues: Both models misinterpreted references to specific lines of code, suggesting room for improvement in how LLMs track the physical structure of code during generation and comprehension tasks.
  • Response Inconsistencies: Particularly in GPT-3.5, inconsistencies in answer justification revealed a lack of coherence, where valid logical deductions were followed by incorrect final answers, or vice versa.
  • Hallucination in Justifications: GPT-4 occasionally committed to an initially incorrect answer and fabricated justifications to support it, a pattern less commonly observed in human learners.
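
As a concrete but hypothetical illustration of the loop-behavior questions both models found difficult, consider a QLC asking how many times a loop body executes; the snippet below (not taken from the study's materials) shows the correct trace and, in comments, a plausible novice-style miscount.

# Loop-behavior QLC: "How many times does the body of the for loop execute?"
values = [3, 8, 5, 10]
count = 0
for v in values:       # the body runs once per element, i.e. 4 times
    if v % 2 == 0:     # the counter is incremented only for even values
        count += 1
print(count)           # prints 2
# Correct answer to the QLC: 4 (one iteration per element).
# A plausible novice-style slip is to answer 2, conflating the number of
# iterations with the number of times the counter is incremented.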

Implications and Future Opportunities

This research illuminates several pathways and considerations:

  • Model Training and Fine-Tuning: Enhancing training regimes to better encompass and distinguish between syntactic and semantic elements of code could improve LLM performance in both generating and comprehending code.
  • Educational Tools Development: LLMs could be integrated into educational platforms not only for solving problems but also for generating pedagogical content, such as automatically generated questions and explanations of answers.
  • Comparative Studies with Human Learners: Similarities in error patterns between LLMs and students invite further studies to compare learning behaviors and miscomprehensions, potentially using LLM outputs as training data for educational research.

Conclusions

While the LLMs exhibited remarkable capabilities in answering comprehension questions about their own generated code, evident limitations call for cautious optimism. The observed errors, especially in logical reasoning and structural interpretation, underscore the challenges that remain before these models reach human-like code comprehension. Future LLM developments and applications, particularly in educational contexts, must carefully consider these aspects to leverage strengths and mitigate shortcomings effectively.

Authors (3)
  1. Teemu Lehtinen
  2. Charles Koutcheme
  3. Arto Hellas