Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring (2411.16337v1)

Published 25 Nov 2024 in cs.CL, cs.AI, and cs.HC

Abstract: The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as LLMs, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (i.e., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman's $r = .74$ with human assessments in the overall score, and an internal consistency of $ICC=.80$. These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency for higher scores, the models require further refinement to better capture aspects of content quality.

Authors (4)
  1. Kathrin Seßler (7 papers)
  2. Maurice Fürstenberg (1 paper)
  3. Babette Bühler (7 papers)
  4. Enkelejda Kasneci (97 papers)

Summary

Evaluating the Efficacy of LLMs in Automated Essay Scoring

The continuous advancement of LLMs presents new opportunities to address longstanding educational challenges, such as the time-intensive task of essay grading. The paper "Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring" examines the viability of LLMs for grading student essays in the German educational context. It offers an evaluative framework for comparing LLM assessments against teacher ratings and distinguishes the strengths and limitations of the individual models.

LLM Performance in Multidimensional Scoring

The research analyzes five LLMs: GPT-3.5, GPT-4, the o1 model, LLaMA 3-70B, and Mixtral 8x7B, assessing German student essays through ten criteria split into content and language categories. The closed-source models, especially GPT-4 and o1, showed high reliability and strong correlations with human ratings, particularly in language-focused criteria such as spelling and expression. Impressively, the o1 model achieved a Spearman correlation of 0.74 with human assessments. These results emphasize the proficiency of closed-source models in mirroring human judgment, primarily due to their extensive pre-training on diverse linguistic datasets that hone their pattern recognition skills in grammar and stylistic nuances.
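
To illustrate the headline metric, the sketch below computes Spearman's rank correlation between teacher and LLM scores with scipy. The score arrays and the 1-6 scale are hypothetical placeholders, not data from the study.

```python
# Minimal sketch: agreement between LLM and teacher scores via Spearman's rank
# correlation. The arrays below are illustrative placeholders, not the study's data.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical overall scores for 20 essays (1-6 scale is an assumption).
teacher_scores = np.array([4, 3, 5, 2, 4, 3, 5, 4, 2, 3, 4, 5, 3, 2, 4, 5, 3, 4, 2, 5])
llm_scores     = np.array([5, 3, 5, 3, 4, 4, 5, 4, 3, 3, 5, 5, 4, 3, 4, 5, 4, 4, 3, 5])

rho, p_value = spearmanr(teacher_scores, llm_scores)
print(f"Spearman's r = {rho:.2f} (p = {p_value:.3f})")
```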

In contrast, the open-source models LLaMA 3 and Mixtral aligned less closely with human evaluations, showing only minimal correlation and lower inter-run consistency. A key limitation was their reduced ability to differentiate essay quality: they frequently assigned mid-range scores irrespective of the actual quality of the writing. Such inconsistency is a significant obstacle to their immediate use in educational settings.
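
Inter-run consistency of this kind can be quantified with an intraclass correlation coefficient, as in the paper's reported ICC values. The sketch below uses the pingouin library on hypothetical run-level scores; the ICC variant chosen (ICC3k) and the data are assumptions, not details taken from the paper.

```python
# Minimal sketch: internal consistency across repeated LLM runs, expressed as an
# intraclass correlation using pingouin (an assumed dependency; ICC variant is a choice).
import pandas as pd
import pingouin as pg

# Hypothetical scores from three independent runs over five essays.
df = pd.DataFrame({
    "essay": [1, 2, 3, 4, 5] * 3,
    "run":   ["run1"] * 5 + ["run2"] * 5 + ["run3"] * 5,
    "score": [4, 3, 5, 2, 4,
              4, 3, 5, 3, 4,
              5, 3, 5, 2, 4],
})

icc = pg.intraclass_corr(data=df, targets="essay", raters="run", ratings="score")
# Report the average-rater, fixed-raters variant as one reasonable choice.
print(icc[icc["Type"] == "ICC3k"][["Type", "ICC"]])
```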

Bias, Calibration, and Over-Scoring

A notable outcome of the paper is the tendency of the models, particularly GPT-4 and o1, to rate essays more generously than human evaluators. This over-scoring likely stems from biases ingrained during training, where certain stylistic features may be preferentially rewarded. It underscores the need for calibration strategies, such as dataset-specific fine-tuning, to align LLM behavior more closely with educational grading norms.
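
One simple, hypothetical way to address such systematic over-scoring, distinct from the dataset-specific fine-tuning mentioned above, is a post-hoc linear recalibration fitted on a small set of teacher-graded essays. The sketch below illustrates the idea with invented scores and an assumed 1-6 grading scale; it is not the paper's method.

```python
# Minimal sketch of post-hoc calibration: fit a linear mapping from LLM scores to
# teacher scores on a small labeled set, then rescale new LLM scores.
# Illustrative only; scores and the 1-6 scale are assumptions.
import numpy as np

def fit_linear_calibration(llm_scores, teacher_scores):
    """Least-squares fit of teacher ~ a * llm + b."""
    a, b = np.polyfit(llm_scores, teacher_scores, deg=1)
    return a, b

def calibrate(llm_scores, a, b, lo=1, hi=6):
    """Apply the fitted mapping and clip to the assumed grading scale."""
    return np.clip(a * np.asarray(llm_scores) + b, lo, hi)

# Hypothetical calibration set: the LLM tends to over-score relative to teachers.
llm = np.array([5, 5, 4, 5, 3, 4, 5, 4])
teacher = np.array([4, 4, 3, 5, 2, 3, 4, 3])

a, b = fit_linear_calibration(llm, teacher)
print(calibrate([5, 4, 3], a, b))  # rescaled scores for new essays
```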

Implications for Educational Applications

The differentiation between language and content assessment capabilities underscores a broader implication: while LLMs can reliably address surface-level linguistic evaluation, they need further refinement to mirror human evaluators' depth of understanding in content aspects. This poses a challenge for integrating LLMs as standalone grading entities but positions them as promising adjunct tools in educational settings.

The research also highlights discrepancies in inter-criteria correlations between human raters and LLMs, suggesting that the models' auto-regressive generation carries earlier criterion scores into subsequent ones and thereby skews their assessments. Future work should explore advanced prompting and fine-tuning techniques to mitigate this carry-over and improve the robustness of content evaluation.
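
A straightforward mitigation, sketched below, is to score each criterion in a separate, independent model call so that earlier ratings cannot condition later ones. Here `call_llm` is an assumed helper around whichever model API is used, and the criterion names, prompt wording, and 1-6 scale are illustrative.

```python
# Minimal sketch: per-criterion scoring in independent calls, so earlier ratings
# cannot influence later ones within the same generation. `call_llm` is an assumed
# helper; criteria, prompt, and scale are illustrative.
from typing import Callable, Dict

CRITERIA = ["plot logic", "expression", "spelling"]  # subset for illustration

def score_essay(essay: str, call_llm: Callable[[str], str]) -> Dict[str, int]:
    scores = {}
    for criterion in CRITERIA:
        prompt = (
            "Rate the following German student essay on the criterion "
            f"'{criterion}' using a scale from 1 to 6. "
            "Reply with the number only.\n\n" + essay
        )
        # Each criterion gets a fresh context in its own call.
        scores[criterion] = int(call_llm(prompt).strip())
    return scores
```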

Conclusion

This paper presents a critical exploration of the capabilities and limitations of LLMs in automated essay scoring. While closed-source models such as GPT-4 and o1 demonstrate potential as reliable evaluative tools, particularly in linguistic assessment, they require further refinement to achieve cohesion with human evaluators in content-level assessments. As the field evolves, these insights will be crucial in harnessing LLMs' full potential in educational technology, enhancing essay evaluation's efficiency and reliability, and ultimately supporting teachers in delivering timely, accurate feedback. Future research should continue to refine model training techniques and investigate real-world applications of LLM-based assessments to bridge the gap between technological capability and educational utility.
