LLMs can Perform Multi-Dimensional Analytic Writing Assessments: A Case Study of L2 Graduate-Level Academic English Writing (2502.11368v2)

Published 17 Feb 2025 in cs.CL and cs.AI

Abstract: The paper explores the performance of LLMs in the context of multi-dimensional analytic writing assessments, i.e. their ability to provide both scores and comments based on multiple assessment criteria. Using a corpus of literature reviews written by L2 graduate students and assessed by human experts against 9 analytic criteria, we prompt several popular LLMs to perform the same task under various conditions. To evaluate the quality of feedback comments, we apply a novel feedback comment quality evaluation framework. This framework is interpretable, cost-efficient, scalable, and reproducible, compared to existing methods that rely on manual judgments. We find that LLMs can generate reasonably good and generally reliable multi-dimensional analytic assessments. We release our corpus and code for reproducibility.

Summary

  • The paper demonstrates that LLMs can closely match human scoring on L2 writing assessments, differing by at most one point on a 10-point scale.
  • The methodology involves prompting LLMs to generate feedback on nine analytic criteria using literature reviews annotated by human experts.
  • The study highlights LLM-generated comments as more nuanced and detailed, suggesting scalable solutions for L2 English writing assessment.

Introduction

LLMs have increasingly demonstrated potential in automating complex tasks traditionally performed by human experts. This paper investigates the capability of LLMs to conduct multi-dimensional analytic writing assessments, specifically focusing on L2 (second language) English writing by graduate students. The primary goal is to assess whether LLMs can provide reliable and informative scores and feedback comments across multiple predefined analytic criteria (Figure 1).

Figure 1: Multi-dimensional analytic assessments, where each assessment contains a score and a comment.

Methodology

The paper utilizes a corpus of literature reviews composed by L2 graduate students. These reviews are annotated by human experts against nine analytic criteria, including material selection, citation practices, grammatical accuracy, and logical structure. Several prominent LLMs were prompted to assess the same corpus under various controlled conditions. To evaluate the quality of the feedback comments, the paper applies a novel feedback comment quality evaluation framework that compares LLM-generated comments with human-written ones (Figure 2).

Figure 2: Left: Pipeline of the proposed feedback comment quality evaluation framework. Right: Validation results for the pipeline, showcasing high inter-annotator agreement and exact match rates.
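As a concrete illustration of the per-criterion assessment setup described above, the following Python sketch asks a chat model for a score and a comment on each criterion separately. The criterion labels, prompt wording, and model name are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of per-criterion analytic assessment prompting.
# Criterion labels, prompt wording, and the model name are assumptions,
# not the paper's released implementation.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = [  # placeholder labels standing in for the paper's nine criteria
    "material selection",
    "citation practices",
    "grammatical accuracy",
    "logical structure",
]

def assess(essay: str, criterion: str) -> dict:
    """Ask the model for a 1-10 score and a feedback comment on one criterion."""
    prompt = (
        "You are assessing a literature review written by an L2 graduate student.\n"
        f"Criterion: {criterion}.\n"
        "Return a JSON object with keys 'score' (integer 1-10) and 'comment'.\n\n"
        f"Literature review:\n{essay}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the paper compares several LLMs
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def assess_all(essay: str) -> dict:
    # Asking each criterion independently mirrors the interaction condition
    # that the summary reports as yielding the most detailed feedback.
    return {criterion: assess(essay, criterion) for criterion in CRITERIA}
```

Under a setup of this kind, each essay yields one score-comment pair per criterion, which can then be compared directly against the human annotations.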

Results

The paper presents substantial evidence that LLM scores correlate reasonably well with human assessments, often differing by no more than one point on a 10-point scale for most criteria. Critically, the LLMs consistently provide comments that identify problems, often more elaborately than human assessors under specific interaction conditions. For example, the LLMs are found to generate particularly detailed feedback when each assessment question is asked independently, highlighting their capability for nuanced analysis (Figure 3). Agreement among assessors, measured by QWK and AAR1, varies across criteria (Figure 4).

Figure 4: Heatmaps of overall QWK and AAR1 among assessors, indicating varying degrees of agreement across criteria.
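Assuming QWK denotes quadratic weighted kappa and AAR1 the rate of score pairs agreeing within one point (both common conventions for ordinal rating agreement; the expansions are inferred from the figure, not stated in the summary), the two metrics can be computed as follows.

```python
# Agreement metrics of the kind shown in Figure 4, assuming QWK is quadratic
# weighted kappa and AAR1 is the proportion of score pairs within one point.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk(human_scores, llm_scores) -> float:
    """Quadratic weighted kappa between two raters on an ordinal 1-10 scale."""
    return cohen_kappa_score(human_scores, llm_scores, weights="quadratic")

def adjacent_agreement(human_scores, llm_scores, tolerance: int = 1) -> float:
    """Fraction of essays whose two scores differ by at most `tolerance`."""
    diffs = np.abs(np.asarray(human_scores) - np.asarray(llm_scores))
    return float(np.mean(diffs <= tolerance))

# Toy example: one criterion scored by a human and an LLM on five essays.
human = [7, 8, 6, 9, 5]
llm = [7, 7, 6, 8, 6]
print(qwk(human, llm), adjacent_agreement(human, llm))
```

Computing both metrics per criterion for each pair of assessors yields heatmaps of the kind described in Figure 4.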

Analysis of Feedback Comments

The assessment of feedback comment quality indicates that LLM-generated comments can be precise and potentially more helpful than those written by humans, given the models' ability to operate continuously without cognitive fatigue. The paper proposes a novel framework that characterizes comments by their specificity, helpfulness, and ability to identify problems, offering finer granularity than traditional manual evaluation (Figure 2). The resulting distribution of comment characteristics across criteria is summarized in Figure 3.

Figure 3: Distribution of comment characteristics across criteria, including specificity and the frequency of corrections offered.
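One way to obtain such characteristic labels automatically, in the spirit of the proposed evaluation framework, is to pose a fixed set of yes/no questions about each comment to an LLM judge. The question wording, label names, and judge model below are assumptions for illustration; the paper's released code defines the actual rubric.

```python
# Hedged sketch of labelling feedback comments with characteristics such as
# problem identification, correction offered, and specificity. The question
# wording and label names are assumptions, not the paper's actual rubric.
import json
from openai import OpenAI

client = OpenAI()

QUESTIONS = {
    "identifies_problem": "Does the comment point out a concrete problem in the writing?",
    "offers_correction": "Does the comment suggest a specific correction or revision?",
    "is_specific": "Does the comment refer to particular passages rather than general traits?",
}

def characterize(comment: str) -> dict:
    """Return a boolean label for each characteristic of one feedback comment."""
    prompt = (
        "Answer each question about the feedback comment below.\n"
        "Return a JSON object with boolean values for these keys:\n"
        + "\n".join(f"- {key}: {question}" for key, question in QUESTIONS.items())
        + f"\n\nFeedback comment:\n{comment}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Aggregating these labels per criterion gives distributions of the kind plotted in Figure 3, without requiring manual judgments for every comment.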

Implications and Future Directions

The findings suggest that LLMs hold promise for integration into educational contexts, particularly for aiding in the self-regulated learning of L2 students. The potential applications extend to forming a basis for scalable solutions in automated writing assessments that provide consistent and informative feedback. Future research could explore enhancing LLM interactions to further improve their feedback specificity and alignment with pedagogical standards.

Conclusion

This case study provides concrete evidence of the role LLMs can play in performing multi-dimensional analytic writing assessments, demonstrating their ability to match human evaluations in scoring consistency and comment relevance. The results underscore the potential for adopting LLMs in educational settings to enhance writing assessment efficiency and feedback quality, though further work is needed to refine these models and explore their applicability across diverse writing contexts. Correlations between scores and comment characteristics for human and LLM assessments are shown in Figure 5.

Figure 5: Heatmaps showing correlations between scores and comment characteristics among human and LLM assessments.
