
A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course (2403.16977v1)

Published 25 Mar 2024 in cs.CL

Abstract: This study evaluates the performance of ChatGPT variants, GPT-3.5 and GPT-4, both with and without prompt engineering, against solely student work and a mixed category containing both student and GPT-4 contributions in university-level physics coding assignments using the Python language. Comparing 50 student submissions to 50 AI-generated submissions across different categories, and marked blindly by three independent markers, we amassed $n = 300$ data points. Students averaged 91.9% (SE: 0.4), surpassing the highest performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference ($p = 2.482 \times 10^{-10}$). Prompt engineering significantly improved scores for both GPT-4 ($p = 1.661 \times 10^{-4}$) and GPT-3.5 ($p = 4.967 \times 10^{-9}$). Additionally, the blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale from 'Definitely AI' to 'Definitely Human'. They accurately identified the authorship, with 92.1% of the work categorized as 'Definitely Human' being human-authored. Simplifying this to a binary 'AI' or 'Human' categorization resulted in an average accuracy rate of 85.3%. These findings suggest that while AI-generated work closely approaches the quality of university students' work, it often remains detectable by human evaluators.

A Comparative Analysis of GPT-3.5, GPT-4, and Student Performance in University-Level Coding Assessments

Introduction

This paper examines the performance gap between advanced LLMs, specifically GPT-3.5 and GPT-4, and university students on coding assignments. The assignments form a core component of the physics curriculum at Durham University and are written in Python. The inquiry compares LLM output, both raw and with prompt engineering, against student submissions and a mixed category combining student and GPT-4 contributions. Results from this comparison could illuminate the evolving capabilities of AI in educational settings, offering insights into the utility, integrity, and future direction of coding assessments.

Methodological Overview

To assess the effectiveness of LLMs on university-level coding tasks, the paper used a blinded marking mechanism to evaluate code produced by students and by AI. Its emphasis on clear, well-labeled plots that elucidate physics scenarios distinguishes it from prior research focused on AI's ability to solve coding puzzles. The coursework comes from the Laboratory Skills and Electronics module, with submissions from 103 consenting students forming a comparative base against 50 AI-generated outputs across different categories. The AI submissions were produced through both a minimal pathway and a prompt-engineering-enhanced pathway to discern performance variations attributable to these methods.
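Concretely, the blinded comparison can be pictured as a table of marks with one row per (submission, marker) pair, from which category-level means and standard errors are computed. The sketch below is a minimal illustration only; the file name, column names, and category labels are assumptions, not taken from the paper's materials.

```python
import pandas as pd

# Hypothetical marks table: one row per (submission, marker) pair.
# In the paper's setup, 100 submissions x 3 markers gives n = 300 rows.
marks = pd.read_csv("blinded_marks.csv")  # columns: submission_id, category, marker, score

# Mean and standard error of the mean per submission category,
# mirroring the "91.9% (SE: 0.4)" style of reporting in the abstract.
summary = (
    marks.groupby("category")["score"]
         .agg(mean="mean", se="sem", n="count")
         .round(2)
)
print(summary)
```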

Analysis Dimensions

The nuanced approach to comparing human and AI-generated coding submissions involved:

  • Prompt Engineering: Evaluating its influence on AI performance.
  • Authorship Identification: Analysis of markers' capability to discern the origin of submissions (AI vs. Human).
  • Score Comparison: A statistical examination of performance across the various submission categories (a sketch of this step follows the list).
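As a concrete illustration of the score-comparison step, the sketch below runs Welch's two-sample t-test on placeholder score arrays. The paper reports p-values of this kind but does not state here which test produced them, so this is one plausible way to perform such a comparison, not a reproduction of the authors' analysis.

```python
import numpy as np
from scipy import stats

# Placeholder percentage scores for two submission categories
# (illustrative values, not the paper's measurements).
student_scores = np.array([93.0, 90.5, 92.1, 91.0, 94.2, 89.8])
gpt4_pe_scores = np.array([82.0, 79.5, 81.3, 80.0, 83.1, 78.9])

# Welch's t-test (no equal-variance assumption) comparing category means.
t_stat, p_value = stats.ttest_ind(student_scores, gpt4_pe_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```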

Key Findings

Score Disparity and the Impact of Prompt Engineering

The paper unveiled a statistically significant performance gap between student submissions, which averaged 91.9%, and the highest-performing AI category (GPT-4 with prompt engineering) at 81.1%. Remarkably, prompt engineering consistently improved AI performance across both GPT-3.5 and GPT-4 models, reinforcing its significance in enhancing LLM output quality. However, mixed submissions, integrating student and GPT-4 efforts, unexpectedly underperformed in comparison to solely AI or student submissions, underscoring the complexity and challenges in merging AI with human efforts.
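To make the distinction between the minimal and prompt-engineered pathways concrete, the sketch below contrasts two ways of posing the same assignment task. Both prompts, and the example task itself, are invented for illustration and are not quoted from the paper's prompt set.

```python
# Two hypothetical ways of posing the same coding task to an LLM.
# The prompt-engineered variant adds a role, explicit requirements,
# and an output format on top of the bare assignment text.

minimal_prompt = (
    "Write Python code that simulates a damped harmonic oscillator "
    "and plots displacement against time."
)

engineered_prompt = (
    "You are a physics undergraduate completing a marked coding assignment.\n"
    "Task: simulate a damped harmonic oscillator in Python and plot "
    "displacement against time.\n"
    "Requirements:\n"
    "- Label both axes with quantities and units.\n"
    "- Add a descriptive title and a legend.\n"
    "- Comment the code so a marker can follow the physics.\n"
    "Return only a complete, runnable Python script."
)
```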

Efficacy in Distinguishing Between Human and AI Submissions

Blind markers tasked with identifying the authorship of submissions proved highly reliable: 92.1% of the work they rated 'Definitely Human' was indeed human-authored. Their accuracy averaged 85.3% under a simplified binary categorization (AI vs. Human), demonstrating an intriguing capability to detect the nuanced differences between AI and human outputs, particularly in physics coding assignments.
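One way to picture the collapse from the four-point Likert scale to a binary 'AI'/'Human' decision, and the resulting accuracy figure, is sketched below; the label strings and example data are illustrative assumptions rather than the paper's records.

```python
# Collapse four-point Likert authorship guesses into a binary decision
# and score them against the true authorship (illustrative data only).

likert_to_binary = {
    "Definitely AI": "AI",
    "Probably AI": "AI",
    "Probably Human": "Human",
    "Definitely Human": "Human",
}

guesses = ["Definitely Human", "Probably AI", "Definitely AI", "Probably Human"]
truth = ["Human", "AI", "AI", "AI"]

binary_guesses = [likert_to_binary[g] for g in guesses]
accuracy = sum(g == t for g, t in zip(binary_guesses, truth)) / len(truth)
print(f"Binary authorship accuracy: {accuracy:.1%}")
```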

Implications of the Research

The discernible difference in quality between student and AI submissions, alongside the successful identification by markers, suggests that while AI can closely simulate human work quality, subtle distinctions remain detectable. These findings may have immediate applications in the academic sector, especially concerning academic integrity and the customizable implementation of AI as an educational tool. More broadly, they signal a need for continued investigation into the integration of AI in educational settings to maximize benefits while mitigating potential drawbacks.

Future Directions

As AI continues to evolve, so too will its potential impact on educational practices and assessments. This paper's insights into the current limitations and capabilities of LLMs in coding assessments could guide the future development of curricula that harmonize traditional educational objectives with the innovative possibilities presented by AI. It also raises pivotal questions about the nature of learning and assessment, encouraging a reassessment of what skills and knowledge are prioritized and evaluated in our rapidly changing digital landscape.

Conclusion

In summary, this analysis offers a timely exploration of the intersection between AI capabilities and university-level education, specifically within the coding component of a physics degree. While AI, particularly GPT-4 with prompt engineering, comes close to matching the quality of university students' work, distinct differences remain, especially in aspects requiring creative output. These nuances not only highlight the current state of AI in educational settings but also chart a course for integrating these technologies in a manner that enhances learning outcomes and academic integrity. As such, the paper serves as both a benchmark for current AI performance in education and a guide for future explorations of this dynamic field.

Authors (3)
  1. Will Yeadon (9 papers)
  2. Alex Peach (6 papers)
  3. Craig P. Testrow (1 paper)