A Comparative Analysis of GPT-3.5, GPT-4, and Student Performance in University-Level Coding Assessments
Introduction
This paper closely examines the performance gap between advanced large language models (LLMs), specifically GPT-3.5 and GPT-4, and university students on coding assignments. These assignments form a core component of a physics curriculum at Durham University and emphasize Python programming. The study compares LLM outputs, produced both with minimal prompting and with prompt engineering, against student work and a mixed category combining student and GPT-4 contributions. The results could illuminate the evolving capabilities of AI in educational settings, offering insights into the utility, integrity, and future direction of coding assessments.
Methodological Overview
To assess the effectiveness of LLMs on university-level coding tasks, the paper used blinded marking to evaluate code produced by students and by AI. Its emphasis on producing clear, well-labeled plots that elucidate physics scenarios distinguishes it from prior research focused on AI's ability to solve coding puzzles. The coursework comes from the Laboratory Skills and Electronics module, with submissions from 103 consenting students forming a comparative base against 50 AI-generated outputs spread across several categories. The AI submissions were produced through either a minimal pathway or a prompt-engineering-enhanced pathway, in order to discern performance variations attributable to these methods.
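The paper's exact prompts and API settings are not reproduced here, so the sketch below is only an illustration of how the two submission pathways might look, assuming the OpenAI Python client (openai>=1.0) and hypothetical prompt wording; the system prompt shown is not the paper's actual prompt-engineering text.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def minimal_submission(assignment_text: str, model: str = "gpt-4") -> str:
    """Minimal pathway: pass the assignment brief to the model verbatim."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": assignment_text}],
    )
    return response.choices[0].message.content


def engineered_submission(assignment_text: str, model: str = "gpt-4") -> str:
    """Prompt-engineered pathway: prepend illustrative instructions
    (hypothetical wording) emphasizing runnable code and well-labeled plots."""
    system_prompt = (
        "You are an expert Python programmer completing a university physics "
        "coding assignment. Return complete, runnable code with clear, "
        "well-labeled matplotlib plots and explanatory comments."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": assignment_text},
        ],
    )
    return response.choices[0].message.content
```

In outline, running each pathway repeatedly for each model would yield a pool of AI-generated outputs of the kind that was then marked blind alongside the student work.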
Analysis Dimensions
The comparison of human- and AI-generated coding submissions involved three dimensions:
- Prompt Engineering: Evaluating its influence on AI performance.
- Authorship Identification: Analysis of markers' capability to discern the origin of submissions (AI vs. Human).
- Score Comparison: A statistical examination of performance across the submission categories (a sketch of one such comparison follows this list).
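The section does not name the statistical test behind the score comparison; as a minimal sketch, assuming per-submission percentage marks grouped by category, a nonparametric Mann-Whitney U test (via scipy) could compare, for example, student scores with the GPT-4 prompt-engineered scores. The arrays below are placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

# Placeholder percentage marks for two categories (not the paper's data).
student_scores = np.array([95.0, 88.5, 93.0, 90.5, 92.0, 94.0])
gpt4_pe_scores = np.array([83.0, 79.5, 82.0, 80.0, 81.5, 78.5])

# Mann-Whitney U: a nonparametric test for a difference between two independent
# groups, avoiding any assumption that marks are normally distributed.
u_stat, p_value = stats.mannwhitneyu(
    student_scores, gpt4_pe_scores, alternative="two-sided"
)
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```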
Key Findings
Score Disparity and the Impact of Prompt Engineering
The paper revealed a statistically significant performance gap between student submissions, which averaged 91.9%, and the highest-performing AI category (GPT-4 with prompt engineering) at 81.1%. Notably, prompt engineering consistently improved AI performance for both GPT-3.5 and GPT-4, reinforcing its value in enhancing LLM output quality. However, mixed submissions, which combined student and GPT-4 efforts, unexpectedly underperformed both the purely AI and the purely student submissions, underscoring the challenges of merging AI with human work.
Efficacy in Distinguishing Between Human and AI Submissions
Blind markers tasked with identifying the authorship of submissions correctly identified human-authored work in 92.1% of cases, and their accuracy averaged 85.3% under a simplified binary categorization (AI vs. human). This demonstrates a notable capability to detect the subtle differences between AI and human outputs, particularly within physics coding assignments.
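As a rough sketch of how such figures could be computed, assuming each submission is recorded with its true origin and the marker's guess under the binary AI-vs-human scheme (the labels below are invented placeholders, not the study's data):

```python
# Placeholder records of true origin and marker guess for eight submissions.
true_origin  = ["human", "human", "ai", "ai", "human", "ai", "human", "ai"]
marker_guess = ["human", "human", "ai", "human", "human", "ai", "ai", "ai"]

# Overall binary identification accuracy (cf. the reported 85.3%).
overall = sum(t == g for t, g in zip(true_origin, marker_guess)) / len(true_origin)

# Accuracy restricted to human-authored submissions (cf. the reported 92.1%).
human_ids = [i for i, t in enumerate(true_origin) if t == "human"]
human_acc = sum(marker_guess[i] == "human" for i in human_ids) / len(human_ids)

print(f"overall: {overall:.1%}, human-authored: {human_acc:.1%}")
```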
Implications of the Research
The discernible quality difference between student and AI submissions, alongside the markers' success in identifying authorship, suggests that while AI can closely approach the quality of human work, subtle distinctions remain detectable. These findings have immediate relevance for the academic sector, particularly regarding academic integrity and the tailored use of AI as an educational tool. More broadly, they signal a need for continued investigation into integrating AI in educational settings so as to maximize benefits while mitigating potential drawbacks.
Future Directions
As AI continues to evolve, so too will its potential impact on educational practices and assessments. This paper's insights into the current limitations and capabilities of LLMs in coding assessments could guide the future development of curricula that harmonize traditional educational objectives with the innovative possibilities presented by AI. It also raises pivotal questions about the nature of learning and assessment, encouraging a reassessment of what skills and knowledge are prioritized and evaluated in our rapidly changing digital landscape.
Conclusion
In summary, this analysis offers a timely exploration of the intersection of AI capabilities with university-level education, specifically within the coding component of a physics degree. While AI, particularly GPT-4 with prompt engineering, comes close to matching the quality of university students' work, distinct differences remain, especially in the more creative aspects of the output. These nuances not only capture the current state of AI in educational settings but also chart a course for integrating these technologies in ways that enhance learning outcomes and protect academic integrity. As such, the paper serves both as a benchmark of current AI performance in education and as a guide for future exploration of this rapidly developing field.