
A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course

Published 25 Mar 2024 in cs.CL (arXiv:2403.16977v1)

Abstract: This study evaluates the performance of ChatGPT variants, GPT-3.5 and GPT-4, both with and without prompt engineering, against solely student work and a mixed category containing both student and GPT-4 contributions in university-level physics coding assignments using the Python language. Comparing 50 student submissions to 50 AI-generated submissions across different categories, and marked blindly by three independent markers, we amassed $n = 300$ data points. Students averaged 91.9% (SE: 0.4), surpassing the highest performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference (p = $2.482 \times 10^{-10}$). Prompt engineering significantly improved scores for both GPT-4 (p = $1.661 \times 10^{-4}$) and GPT-3.5 (p = $4.967 \times 10^{-9}$). Additionally, the blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale from 'Definitely AI' to 'Definitely Human'. They accurately identified the authorship, with 92.1% of the work categorized as 'Definitely Human' being human-authored. Simplifying this to a binary 'AI' or 'Human' categorization resulted in an average accuracy rate of 85.3%. These findings suggest that while AI-generated work closely approaches the quality of university students' work, it often remains detectable by human evaluators.


Summary

  • The paper demonstrates that student coding submissions averaged 91.9% versus GPT-4's 81.1% with prompt engineering, highlighting current AI limitations.
  • The paper employs blinded marking to compare outputs, achieving up to 92.1% accuracy in distinguishing human from AI-generated code.
  • The paper emphasizes practical implications for education as it calls for further exploration into combining AI and human work in coding assessments.

A Comparative Analysis of GPT-3.5, GPT-4, and Student Performance in University-Level Coding Assessments

Introduction

This study closely examines the performance gap between advanced LLMs, specifically GPT-3.5 and GPT-4, and university students on coding assignments. These assignments form a core component of the physics curriculum at Durham University and emphasize Python programming. The inquiry compares LLM outputs, generated both with minimal prompting and with prompt engineering, against student work and a mixed category combining student and GPT-4 contributions. Results from this comparative analysis illuminate the evolving capabilities of AI in educational settings, offering insights into the utility, integrity, and future direction of coding assessments.

Methodological Overview

In assessing the effectiveness of LLMs on university-level coding tasks, this study employed blinded marking to evaluate code produced by students and AI. The assignments emphasize producing clear, well-labeled plots that elucidate physics scenarios, which distinguishes this study from prior research focused on AI's ability to solve coding puzzles. The coursework comes from the Laboratory Skills and Electronics module, with submissions from 103 consenting students forming the comparative base against 50 AI-generated outputs across the different categories. The AI submissions were generated through both a minimal pathway and a prompt-engineering-enhanced pathway, to isolate performance differences attributable to the prompting method.
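
Score summaries of the kind reported here (e.g. a mean of 91.9% with SE 0.4) can be reproduced from raw marks with a short sketch; the marks below are hypothetical, for illustration only, and are not the study's data:

```python
from statistics import mean, stdev

def mean_and_se(scores):
    """Return the mean and standard error of a list of percentage marks."""
    m = mean(scores)
    se = stdev(scores) / len(scores) ** 0.5
    return m, se

# Hypothetical marks for one submission category (illustrative only).
marks = [94, 90, 92, 89, 93, 91, 95, 90, 92, 93]
m, se = mean_and_se(marks)
```

With 50 submissions each marked by three markers, the same calculation over the pooled marks yields the per-category figures quoted in the abstract.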

Analysis Dimensions

The nuanced approach to comparing human and AI-generated coding submissions involved:

  • Prompt Engineering: Evaluating its influence on AI performance.
  • Authorship Identification: Analysis of markers' capability to discern the origin of submissions (AI vs. Human).
  • Score Comparison: A statistical examination of performance across various submission categories.
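
The score comparison in the last bullet amounts to a two-sample significance test. The paper does not state which test it used, so the sketch below uses Welch's t statistic with a large-sample normal approximation for the two-sided p-value (adequate for roughly 50 marks per group); the scores are hypothetical:

```python
from statistics import mean, stdev, NormalDist

def welch_test(a, b):
    """Welch's t statistic for two independent samples, with a
    large-sample normal approximation for the two-sided p-value."""
    se2 = stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / se2 ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Hypothetical score samples (illustrative only; not the study's data).
students = [92, 91, 93, 90, 92, 91, 94, 90, 92, 93]
gpt4_pe  = [82, 80, 83, 79, 81, 82, 80, 81, 83, 80]
t, p = welch_test(students, gpt4_pe)
```

A gap of roughly ten percentage points with small within-group spread drives the t statistic high and the p-value far below conventional thresholds, mirroring the very small p-values reported in the abstract.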

Key Findings

Score Disparity and the Impact of Prompt Engineering

The study unveiled a statistically significant performance gap between student submissions, which averaged 91.9%, and the highest-performing AI category (GPT-4 with prompt engineering) at 81.1%. Remarkably, prompt engineering consistently improved AI performance across both GPT-3.5 and GPT-4 models, reinforcing its significance in enhancing LLM output quality. However, mixed submissions, integrating student and GPT-4 efforts, unexpectedly underperformed in comparison to solely AI or student submissions, underscoring the complexity and challenges in merging AI with human efforts.
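
As an illustration of the prompt-engineering pathway, the sketch below wraps a raw assignment brief with the kinds of instructions OpenAI's best-practices guide recommends (an explicit role, output requirements, formatting constraints). The wording and requirements here are hypothetical, not the study's actual prompts:

```python
def engineered_prompt(task: str) -> str:
    """Wrap a raw assignment brief with explicit role, requirement,
    and formatting instructions. Hypothetical wording, for illustration."""
    return (
        "You are an expert Python programmer completing a university "
        "physics coding assignment.\n"
        f"Task: {task}\n"
        "Requirements:\n"
        "- Use numpy and matplotlib.\n"
        "- Label all plot axes with quantities and units.\n"
        "- Comment the physics behind each step."
    )

minimal = "Simulate a damped pendulum and plot its angle over time."
prompt = engineered_prompt(minimal)
```

The minimal pathway would pass `minimal` to the model directly; the engineered pathway passes `prompt` instead, holding the underlying task constant so that score differences can be attributed to the prompting method.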

Efficacy in Distinguishing Between Human and AI Submissions

Blinded markers tasked with identifying the authorship of submissions correctly identified human-authored work in 92.1% of cases. This high accuracy, which averaged 85.3% under a simplified binary categorization (AI vs. Human), demonstrates a notable capability to detect the subtle differences between AI and human outputs, particularly for coding assignments in physics.
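
Collapsing the four-point Likert scale to a binary label and scoring marker accuracy can be sketched as follows. The intermediate labels ('Probably AI', 'Probably Human') and the sample data are assumptions, since the summary names only the scale's endpoints:

```python
# Assumed four-point Likert labels, collapsed to a binary verdict.
LIKERT_TO_BINARY = {
    "Definitely AI": "AI",
    "Probably AI": "AI",
    "Probably Human": "Human",
    "Definitely Human": "Human",
}

def binary_accuracy(guesses, truths):
    """Fraction of submissions whose collapsed guess matches the true author."""
    hits = sum(LIKERT_TO_BINARY[g] == t for g, t in zip(guesses, truths))
    return hits / len(truths)

# Hypothetical marker guesses and ground-truth authorship (illustrative only).
guesses = ["Definitely Human", "Probably AI", "Probably Human", "Definitely AI"]
truths  = ["Human", "AI", "AI", "AI"]
acc = binary_accuracy(guesses, truths)
```

Applied to all 300 marker judgments, this collapse is what yields the 85.3% average binary accuracy reported in the abstract.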

Implications of the Research

The discernible difference in quality between student and AI submissions, alongside the successful identification by markers, suggests that while AI can closely simulate human work quality, subtle distinctions remain detectable. These findings may have immediate applications in the academic sector, especially concerning academic integrity and the customizable implementation of AI as an educational tool. More broadly, they signal a need for continued investigation into the integration of AI in educational settings to maximize benefits while mitigating potential drawbacks.

Future Directions

As AI continues to evolve, so too will its potential impact on educational practices and assessments. This study's insights into the current limitations and capabilities of LLMs in coding assessments could guide the future development of curricula that harmonize traditional educational objectives with the innovative possibilities presented by AI. It also raises pivotal questions about the nature of learning and assessment, encouraging a reassessment of what skills and knowledge are prioritized and evaluated in our rapidly changing digital landscape.

Conclusion

In summary, this analysis offers a timely exploration into the intersections of AI capabilities with university-level education, specifically within the coding discipline of a physics degree. While AI, particularly GPT-4 with prompt engineering, demonstrates considerable prowess in approaching the task quality of university students, distinct differences, especially in creative output aspects, remain. These nuances not only highlight the current state of AI in educational settings but also chart a course for integrating these technologies in a manner that enhances learning outcomes and academic integrity. As such, the study functions as both a benchmark for current AI performance in education and a beacon for future explorations into this dynamic field.
