Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Large Language Models in Student Assessment: Comparing ChatGPT and Human Graders (2406.16510v1)

Published 24 Jun 2024 in econ.GN and q-fin.EC

Abstract: This study investigates the efficacy of LLMs as tools for grading master-level student essays. Utilizing a sample of 60 essays in political science, the study compares the accuracy of grades suggested by the GPT-4 model with those awarded by university teachers. Results indicate that while GPT-4 aligns with human grading standards on mean scores, it exhibits a risk-averse grading pattern and its interrater reliability with human raters is low. Furthermore, modifications in the grading instructions (prompt engineering) do not significantly alter AI performance, suggesting that GPT-4 primarily assesses generic essay characteristics such as language quality rather than adapting to nuanced grading criteria. These findings contribute to the understanding of AI's potential and limitations in higher education, highlighting the need for further development to enhance its adaptability and sensitivity to specific educational assessment requirements.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (1)
  1. Magnus Lundgren (4 papers)
X Twitter Logo Streamline Icon: https://streamlinehq.com