
Using Large Language Models to Assign Partial Credit to Students' Explanations of Problem-Solving Process: Grade at Human Level Accuracy with Grading Confidence Index and Personalized Student-facing Feedback

Published 9 Dec 2024 in physics.ed-ph (arXiv:2412.06910v2)

Abstract: This study examines the feasibility and potential advantages of using LLMs, in particular GPT-4o, to perform partial-credit grading of large numbers of students' written responses to introductory-level physics problems. Students were instructed to write down verbal explanations of their reasoning process when solving one conceptual and two numerical calculation problems on in-class exams. The explanations were then graded according to a 3-item rubric, with each item graded as binary (1 or 0). We first demonstrate that machine grading using GPT-4o, with no examples or reference answers, can reliably agree with human graders on 70-80% of all cases, a rate equal to or higher than that at which two human graders agree with each other. Two methods are essential for achieving this level of accuracy: (1) adding explanatory language to each rubric item that targets the errors of initial machine grading, and (2) running the grading process 5 times and taking the most frequent outcome. Next, we show that the variation in outcomes across the 5 machine grading attempts, as measured by Shannon entropy, can serve as a grading confidence index, allowing a human instructor to identify ~40% of all potentially incorrect gradings by reviewing just 10-15% of all responses. Finally, we show that it is straightforward to use GPT-4o to write clear explanations of the partial-credit grading outcomes. These explanations can be used as feedback for students, allowing them to understand their grades and to raise objections when necessary. Almost all feedback messages generated were rated 3 or above on a 5-point scale by two experienced instructors. The entire grading and feedback-generation process cost roughly $5 per 100 student answers, which shows immense promise for automating the labor-intensive grading process through a combination of machine grading and human input and supervision.
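The paper does not include code, but the aggregation and confidence-index steps in the abstract are straightforward to sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `majority_grade` and `grading_entropy` and the five sample rubric outcomes are hypothetical. It takes the binary rubric vectors returned by 5 independent grading runs, reports the most frequent outcome as the grade, and computes the Shannon entropy of the outcome distribution as the confidence index (0 bits means all runs agreed; larger values flag the response for human review).

```python
import math
from collections import Counter

# 3-item binary rubric: each item scored 1 or 0 (as described in the paper)
Rubric = tuple[int, int, int]

def majority_grade(runs: list[Rubric]) -> Rubric:
    """Most frequent rubric outcome across repeated grading runs."""
    return Counter(runs).most_common(1)[0][0]

def grading_entropy(runs: list[Rubric]) -> float:
    """Shannon entropy (in bits) of the grading-outcome distribution.

    0.0 = all runs agreed (high confidence); higher values indicate
    disagreement between runs and flag the response for human review.
    """
    n = len(runs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(runs).values())

# Made-up example: 5 grading runs on one student response.
runs = [(1, 0, 1), (1, 0, 1), (1, 1, 1), (1, 0, 1), (1, 0, 1)]
print(majority_grade(runs))                 # (1, 0, 1) -> 2 of 3 rubric points
print(f"{grading_entropy(runs):.3f} bits")  # 0.722 -> review if above threshold
```

Sorting responses by this entropy value and reviewing only the highest-entropy 10-15% is one plausible way to realize the paper's reported catch rate of ~40% of potentially incorrect gradings.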
