Exploring ChatGPT's Capacity to Answer Program Comprehension Questions from Self-Generated Code
Introduction
Researchers at Aalto University have analyzed how well state-of-the-art LLMs, specifically GPT-3.5 and GPT-4, answer Questions about Learners' Code (QLCs). The QLCs were formulated from code snippets generated by the LLMs themselves, serving a dual purpose: assessing the models' comprehension of the programming constructs they create and characterizing common error patterns in their responses.
Experiment Design
The experiment followed a structured sequence:
- The LLMs were tasked with generating program code based on provided exercise descriptions.
- From these generated programs, QLCs were automatically produced using the QLCpy library.
- The LLMs subsequently attempted to answer these QLCs.
- Finally, the researchers manually analyzed the correctness of the LLM responses and categorized the errors (a sketch of this pipeline follows below).
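The pipeline can be approximated along the following lines. This is a minimal sketch, not the authors' code: the prompts, the model name, and the `generate_qlcs` stub (standing in for whatever interface the QLCpy library actually exposes) are assumptions.

```python
from dataclasses import dataclass

from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # expects OPENAI_API_KEY in the environment


@dataclass
class QLC:
    """One question about a specific, LLM-generated program."""
    kind: str
    text: str


def ask_llm(prompt: str, model: str = "gpt-4") -> str:
    """Send a single prompt to the chat completions API and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def generate_qlcs(program: str) -> list[QLC]:
    """Stand-in for the QLCpy library, which derives questions automatically
    from the program's structure; here we simply return fixed examples."""
    return [
        QLC("parameter name", "What is the name of the function's parameter?"),
        QLC("loop end count", "How many times does the loop body execute for the input [1, 2, 3, 4]?"),
    ]


# Step 1: have the model write a program for a given exercise description.
exercise = "Write a Python function that counts how many numbers in a list are even."
program = ask_llm(f"Solve the following exercise in Python:\n\n{exercise}")

# Step 2: derive QLCs from the generated program (QLCpy in the original study).
questions = generate_qlcs(program)

# Step 3: ask the same model to answer each question about its own code.
# Step 4 (manual in the study): inspect the paired questions and answers.
for q in questions:
    answer = ask_llm(f"Consider this program:\n\n{program}\n\nQuestion: {q.text}")
    print(f"[{q.kind}] {q.text}\n-> {answer}\n")
```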
The QLCs aimed to test various aspects of program comprehension, such as variable roles, loop behaviors, and line-specific purposes, reflecting different cognitive levels in program understanding.
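To make these question types concrete, consider the following invented snippet (not taken from the study), annotated with the kinds of questions a QLC might pose:

```python
def count_even(numbers):    # parameter: what is the name of the function's parameter?
    count = 0               # variable role: 'count' serves as a counter
    for n in numbers:       # loop behavior: how many times does the loop body run for a given input?
        if n % 2 == 0:
            count += 1      # line purpose: what does this line accomplish?
    return count            # trace: what value is returned for count_even([1, 2, 4])?
```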
Findings and Observations
Performance Summary
Overall, GPT-4 outperformed GPT-3.5 across most QLC types, consistent with the incremental improvements expected from newer LLM generations. Success rates varied considerably by QLC type: both models were reliable at identifying function parameters and variable names but struggled with more dynamic aspects such as loop behavior and execution tracing.
Error Analysis
A detailed error analysis highlighted both models' pitfalls:
- Logical Errors: Both models occasionally reasoned through code execution illogically or misread code semantics, mistakes that are also common among novice programmers.
- Line Numbering Issues: Both models sometimes misinterpreted references to specific lines within the code, suggesting room for improvement in how LLMs map the physical structure of a program during generation and comprehension tasks (see the illustrative snippet after this list).
- Response Inconsistencies: GPT-3.5 in particular showed a lack of coherence between reasoning and answers: valid logical deductions were sometimes followed by an incorrect final answer, and vice versa.
- Hallucination in Justifications: GPT-4 occasionally stuck with an initially incorrect answer and fabricated justifications to support it, a phenomenon less commonly observed in human cognition.
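The line-numbering issue is easiest to see with a small, invented example (not drawn from the study's data); the question below refers to physical line numbers counted from the function header:

```python
def first_negative(values):  # line 1
    for v in values:         # line 2
        if v < 0:            # line 3
            return v         # line 4
    return None              # line 5

# QLC: "What is the purpose of line 5?"
# Expected: it returns None when no negative number is found.
# A typical line-reference slip is to describe line 4 instead, for example
# by counting from zero or by skipping the function header.
```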
Implications and Future Opportunities
This research illuminates several pathways and considerations:
- Model Training and Fine-Tuning: Enhancing training regimes to better encompass and distinguish between syntactic and semantic elements of code could improve LLM performance in both generating and comprehending code.
- Educational Tools Development: LLMs could be integrated into educational platforms not just for solving problems but for generating pedagogical content, such as automatically generated questions and answer explanations (a speculative sketch follows after this list).
- Comparative Studies with Human Learners: Similarities in error patterns between LLMs and students invite further studies to compare learning behaviors and miscomprehensions, potentially using LLM outputs as training data for educational research.
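As one concrete direction, a tutoring tool could pair automatically generated questions with LLM-written explanations of the expected answer. The sketch below is speculative: the `explain_qlc` helper, the prompt wording, and the model choice are assumptions, not a system described in the study.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def explain_qlc(program: str, question: str, correct_answer: str) -> str:
    """Ask the model to explain a known-correct answer to a student,
    rather than to produce the answer itself."""
    prompt = (
        "A student is learning to read code. Given this program:\n\n"
        f"{program}\n\n"
        f"Question: {question}\n"
        f"Correct answer: {correct_answer}\n\n"
        "Explain in two or three sentences why this answer is correct."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```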
Conclusions
While the LLMs exhibited remarkable capabilities in answering comprehension questions about their own code, evident limitations call for cautious optimism. The observed errors, especially in logical reasoning and in interpreting program structure, underscore the challenges that remain before AI achieves human-like code comprehension. Future LLM developments and applications, particularly in educational contexts, must weigh these aspects carefully to leverage strengths and mitigate shortcomings.