
Automated Assignment Grading with Large Language Models: Insights From a Bioinformatics Course (2501.14499v1)

Published 24 Jan 2025 in cs.LG and cs.CY

Abstract: Providing students with individualized feedback through assignments is a cornerstone of education that supports their learning and development. Studies have shown that timely, high-quality feedback plays a critical role in improving learning outcomes. However, providing personalized feedback on a large scale in classes with large numbers of students is often impractical due to the significant time and effort required. Recent advances in natural language processing and LLMs offer a promising solution by enabling the efficient delivery of personalized feedback. These technologies can reduce the workload of course staff while improving student satisfaction and learning outcomes. Their successful implementation, however, requires thorough evaluation and validation in real classrooms. We present the results of a practical evaluation of LLM-based graders for written assignments in the 2024/25 iteration of the Introduction to Bioinformatics course at the University of Ljubljana. Over the course of the semester, more than 100 students answered 36 text-based questions, most of which were automatically graded using LLMs. In a blind study, students received feedback from both LLMs and human teaching assistants without knowing the source, and later rated the quality of the feedback. We conducted a systematic evaluation of six commercial and open-source LLMs and compared their grading performance with human teaching assistants. Our results show that with well-designed prompts, LLMs can achieve grading accuracy and feedback quality comparable to human graders. Our results also suggest that open-source LLMs perform as well as commercial LLMs, allowing schools to implement their own grading systems while maintaining privacy.

Summary

  • The paper explores using Large Language Models (LLMs) for automated assignment grading in a university bioinformatics course, assessing their accuracy, feedback quality, and student preferences.
  • LLMs achieved 85-90% classification accuracy on grading criteria relative to human TAs, with performance decreasing on harder questions; grading was most effective when prompts included both rubrics and graded examples.
  • Automated LLM grading can perform comparably to human TAs, open-source models perform on par with commercial ones, and successful implementation requires structured rubrics, graded examples, and the option for students to request a manual review.

The paper presents a study of LLM graders in a university classroom setting, specifically the Introduction to Bioinformatics course at the University of Ljubljana during the 2024/25 winter semester. It investigates whether LLMs can replace human teaching assistants in assessing and grading written text answers. The key objective is to evaluate the accuracy and quality of feedback provided by LLMs compared to human graders.

The paper involved 119 students, primarily master's-level computer science students, who participated in five take-home assignments. Each assignment included multiple exercises where students implemented bioinformatics algorithms, applied them to real-world data, visualized findings, and discussed results in written answers to specific questions. The students' text-based answers were reviewed and graded by an LLM. The LLM-assigned grades were used in the final grade unless a human review was requested. Participation was voluntary, and students were not informed whether their submissions were graded by a human or an LLM. After receiving their assignment grade and feedback, students filled out a survey rating their satisfaction with the feedback on each text-based question.

Six LLMs were evaluated: GPT-4o, four versions of the open-source Llama 3 models (8B, 70B, and 405B parameters), and Nvidia-70B (Llama-3.1-Nemotron-70B). The Llama 3 models included full-precision versions of Llama-8B and Llama-70B, as well as quantized versions of Llama-70B and Llama-405B, denoted Llama-70Bq4 and Llama-405Bq4, respectively. Quantization reduced the hardware requirements of the larger models. A grading group with human teaching assistant-written feedback revised to match the LLMs' tone of writing (TA-GPT-revised) was included to disentangle tone preferences from content preferences.
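The paper does not specify its serving stack, but 4-bit quantization of this kind is commonly achieved with Hugging Face Transformers and bitsandbytes. The sketch below is illustrative only; the model identifier and settings are assumptions, not the paper's configuration.

```python
# Illustrative sketch: loading a quantized open-source grader model with
# Transformers + bitsandbytes 4-bit quantization. Model name is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # assumed identifier

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights to fit on fewer GPUs
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # spread layers across available GPUs
)
```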

The prompts used for evaluating student answers consisted of a generic system prompt and an exercise-specific user prompt. The user prompt included the question, the predefined correct answer, the student submission, the grading rubric, and several TA-graded examples. The grading rubric specified criteria and point allotments, with each criterion including an explanation section. The points from the satisfied criteria were tallied into a final numeric score. The LLMs were prompted to return a structured response containing the submission score, written feedback, and a list of satisfied rubric criteria. The grading examples section contained up to 10 graded submissions, grouped by the unique combinations of grading criteria they satisfied. A minimal sketch of this prompt and response structure follows.
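The following Python sketch assembles a user prompt from those parts and validates a structured grading reply. The field names, prompt wording, and the example JSON reply are assumptions for illustration, not the paper's exact prompts or schema.

```python
from pydantic import BaseModel


class GradingResponse(BaseModel):
    score: float                   # points tallied from satisfied criteria
    feedback: str                  # written feedback returned to the student
    satisfied_criteria: list[str]  # rubric criteria the submission meets


def build_user_prompt(question: str, reference_answer: str, rubric: str,
                      examples: list[str], submission: str) -> str:
    """Assemble the exercise-specific user prompt from the parts listed above."""
    graded_examples = "\n\n".join(examples[:10])  # at most 10 graded examples
    return (
        f"Question:\n{question}\n\n"
        f"Reference answer:\n{reference_answer}\n\n"
        f"Grading rubric:\n{rubric}\n\n"
        f"Graded examples:\n{graded_examples}\n\n"
        f"Student submission:\n{submission}\n\n"
        "Respond with JSON containing: score, feedback, satisfied_criteria."
    )


# The LLM's raw JSON reply (shown here as a literal) is validated against the schema.
raw_reply = ('{"score": 2.0, "feedback": "Correct, but justify the p-value cutoff.", '
             '"satisfied_criteria": ["correct interpretation"]}')
grade = GradingResponse.model_validate_json(raw_reply)
```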

The paper assessed the accuracy of the LLMs' grades by comparing them to submissions manually evaluated by human TAs. The 36 exercises were categorized into five difficulty levels: trivial, easy, medium, hard, and open-ended. Classification accuracy was used to evaluate how well each LLM determined whether a submission satisfied a particular criterion. The LLMs achieved classification accuracy scores ranging from 85% to 90%, except for Llama-8B, which reached 75%. Classification accuracy decreased as exercise difficulty increased. An analysis of the average score differences on matched grading criteria indicated whether each model tended to grade more leniently or more stringently than the TAs.
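A minimal sketch of these two comparisons, using made-up labels and scores: per-criterion classification accuracy against TA decisions, and the mean signed point difference as a lenience/stringency indicator.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# One entry per (submission, criterion): 1 = satisfied, 0 = not satisfied
ta_criteria  = np.array([1, 0, 1, 1, 0, 1])   # human TA decisions (ground truth)
llm_criteria = np.array([1, 0, 1, 0, 0, 1])   # LLM grader decisions
print(f"classification accuracy: {accuracy_score(ta_criteria, llm_criteria):.0%}")

# Per-submission scores on the same exercises; positive mean = more lenient LLM
ta_scores  = np.array([2.0, 1.5, 3.0])
llm_scores = np.array([2.0, 2.0, 2.5])
print(f"mean difference (LLM - TA): {np.mean(llm_scores - ta_scores):+.2f} points")
```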

The impact of including the grading rubric and grading examples in the prompts was investigated. Prompts with only grading rubrics led to stricter grading, while prompts with only grading examples resulted in more lenient grading. Including both the grading rubric and grading examples produced the best results. The paper found that for simpler questions, LLMs achieved satisfactory performance using grading examples alone, but for harder and open-ended questions, providing a grading rubric was essential.

Student preferences for LLM-based feedback were assessed using satisfaction scores for a total of 1,527 answers. A Bayesian mixed-effects linear regression model was used to estimate preferences for the feedback of individual graders. Overall, students exhibited no significant preference for any grader, except for Llama-405Bq4, whose feedback they appeared to prefer slightly. When feedback preferences were examined separately for correctly and incorrectly answered questions, students showed no significant dislike of feedback from any of the LLMs, with the exception of Nvidia-70B.
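The paper does not name the software used to fit this model; the sketch below shows one plausible specification using the bambi/PyMC stack, where the column names, formula, and input file are assumptions for illustration.

```python
import arviz as az
import bambi as bmb
import pandas as pd

# Hypothetical table: one row per rated answer, with the satisfaction score,
# the grader that produced the feedback, and student/question identifiers.
ratings = pd.read_csv("feedback_ratings.csv")

model = bmb.Model(
    "satisfaction ~ grader + (1|student) + (1|question)",  # grader as fixed effect
    data=ratings,
)
idata = model.fit(draws=2000, chains=4)          # NUTS sampling via PyMC
print(az.summary(idata, var_names=["grader"]))   # posterior grader effects
```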

At the end of the semester, a survey was conducted to gather student attitudes toward using LLMs as assignment graders. After completing the course, students were more open to LLM graders. Most students reported using LLM-enabled tools while working on the assignments and felt it was fair to use LLMs for solving the assignments if the assignments were also graded by LLMs. Students strongly felt it would be unacceptable not to have the option to request a manual review.

Based on the findings, the paper recommends using structured grading rubrics, including graded examples, testing new grading rubrics, selecting the largest open-source LLM supported by the available hardware, and allowing students to request a manual review.

The paper concludes that automated grading can achieve performance comparable to human teaching assistants in scoring and feedback generation. Open-source models can perform on par with commercial alternatives, offering institutions greater control over their grading processes. The paper acknowledges limitations, such as the probabilistic nature of LLMs and the potential for prompt-hacking, and suggests implementing robust safeguards.