Automated Essay Scoring Using Grammatical Variety and Errors with Multi-Task Learning and Item Response Theory (2406.08817v1)

Published 13 Jun 2024 in cs.CL

Abstract: This study examines the effect of grammatical features in automatic essay scoring (AES). We use two kinds of grammatical features as input to an AES model: (1) grammatical items that writers used correctly in essays, and (2) the number of grammatical errors. Experimental results show that grammatical features improve the performance of AES models that predict the holistic scores of essays. Multi-task learning with the holistic and grammar scores, alongside using grammatical features, resulted in a larger improvement in model performance. We also show that a model using grammar abilities estimated using Item Response Theory (IRT) as the labels for the auxiliary task achieved comparable performance to when we used grammar scores assigned by human raters. In addition, we weight the grammatical features using IRT to consider the difficulty of grammatical items and writers' grammar abilities. We found that weighting grammatical features with the difficulty led to further improvement in performance.

Summary

  • The paper introduces an AES model that integrates positive and negative grammatical features using a multi-task learning framework to improve holistic score prediction.
  • It employs Item Response Theory to weight grammatical items, enabling automatic estimation of grammar difficulty and reducing reliance on manual scores.
  • Extensive experiments on ASAP datasets demonstrate that the combined approach of detailed grammatical analysis and multi-task learning substantially boosts scoring reliability.

Automated Essay Scoring Using Grammatical Variety and Errors with Multi-Task Learning and Item Response Theory

This paper examines the application of grammatical features in Automated Essay Scoring (AES). The researchers employ two primary types of grammatical features as inputs to their AES model: correctly used grammatical items and the frequency of grammatical errors within the essays. The incorporation of these features aims to enhance the performance of AES models in predicting holistic essay scores.

Methodology and Architecture

The proposed approach involves several innovative steps:

  1. Grammatical Features:
    • Positive Linguistic Features (PFs): These refer to grammatical items that writers have used correctly. The paper uses the CEFR-J Grammar Profile to extract 256 grammatical items.
    • Negative Linguistic Features (NFs): These refer to the number and types of grammatical errors, categorized into 54 types, as detected by the GECToR-large model.
  2. Multi-Task Learning (MTL):
    • The researchers designed an MTL framework to predict both holistic scores and grammar scores.
    • Two types of grammar scores were used: grammar scores assigned by human raters and writers' grammar abilities estimated using Item Response Theory (IRT).
  3. Item Response Theory (IRT):
    • IRT is utilized to estimate grammatical abilities and to weight grammatical items based on their difficulty.
    • Various modifications of the PFs were weighted using IRT parameters, such as the difficulty of individual grammatical items and the probability of correct usage given a writer's ability (see the sketch after this list).
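
As a concrete illustration of the IRT weighting, the sketch below scores each grammatical item under a two-parameter logistic (2PL) IRT model and scales positive-feature counts by item difficulty, in the spirit of the paper's multiply_b variant. The function names, the use of NumPy, and the example values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def irt_2pl_prob(theta: float, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """P(correct usage) of each grammatical item under a 2PL IRT model,
    given writer ability theta, discriminations a, and difficulties b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def weight_positive_features(pf_counts: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Scale each positive-feature count by its item difficulty
    (a multiply_b-style weighting): correctly using a harder item
    contributes more to the feature vector."""
    return pf_counts * b

# Hypothetical example with 5 of the 256 CEFR-J grammatical items.
pf_counts = np.array([3, 0, 1, 2, 0], dtype=float)  # correct-usage counts
b = np.array([-1.2, 0.3, 0.8, 1.5, 2.1])            # estimated difficulties
a = np.ones(5)                                      # discriminations

weighted_pf = weight_positive_features(pf_counts, b)
usage_prob = irt_2pl_prob(theta=0.5, a=a, b=b)      # P(correct | ability 0.5)
```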

The architecture of the AES models is based on BERT, with the essay representations obtained from the [CLS] token. The grammatical features are combined with these representations using several architectures: simple concatenation, feeding through a neural network before concatenation (net), and implementing an auxiliary task predicting grammar scores (multi and dual).
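
A minimal sketch of one such dual architecture is shown below, assuming Hugging Face transformers and PyTorch: the BERT [CLS] embedding is concatenated with a transformed grammatical feature vector, and two linear heads predict the holistic score (main task) and the grammar score (auxiliary task). The class name, hidden sizes, and the loss weight lam are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class DualTaskAES(nn.Module):
    """Illustrative AES model: BERT [CLS] embedding fused with grammatical
    features, with a holistic-score head (main task) and a grammar-score
    head (auxiliary task), as in a dual/multi-task setup."""

    def __init__(self, feat_dim: int = 256 + 54, hidden: int = 100):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.feat_net = nn.Sequential(              # the "net" variant:
            nn.Linear(feat_dim, hidden), nn.ReLU()  # transform features first
        )
        fused_dim = self.bert.config.hidden_size + hidden
        self.holistic_head = nn.Linear(fused_dim, 1)  # main task
        self.grammar_head = nn.Linear(fused_dim, 1)   # auxiliary task

    def forward(self, input_ids, attention_mask, gram_feats):
        # [CLS] token representation of the essay.
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        fused = torch.cat([cls, self.feat_net(gram_feats)], dim=-1)
        return self.holistic_head(fused), self.grammar_head(fused)

def mtl_loss(h_pred, g_pred, h_gold, g_gold, lam: float = 0.5):
    """Joint objective: MSE on both tasks with an assumed weight lam."""
    mse = nn.functional.mse_loss
    return mse(h_pred.squeeze(-1), h_gold) + lam * mse(g_pred.squeeze(-1), g_gold)
```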

Experimental Results

The researchers conducted extensive experiments using the ASAP and ASAP++ datasets, which include essays with both holistic and analytic scores:

  • Multi-Task Learning Performance: The use of MTL yielded superior results compared to simple concatenation methods. Specifically, the dual-task model, which outputs grammar scores, showed significant performance improvements.
  • Grammatical Feature Impact: Incorporating individual grammatical items and error frequencies as features yielded consistent improvements over baseline models. The best performance was achieved by models taking into account the difficulty of grammatical items via IRT (multiply_b).
  • IRT-Based Scoring: The paper showed that models using estimated grammatical abilities via IRT for the auxiliary task achieved performance comparable to those using human-rated grammar scores. This is particularly beneficial as it avoids the need for labor-intensive human annotation of grammar scores.

Implications and Future Directions

The findings underline the importance of grammatical features in AES models. While traditional AES models emphasize holistic and content-based evaluation, the introduction of detailed grammatical analysis enhances their scoring accuracy. Practically, this approach can lead to more reliable, nuanced grading systems that are less dependent on human raters, thus addressing reliability concerns tied to human scoring.

The paper’s use of IRT to weight grammatical features introduces a novel method of accounting for grammatical complexity and writer ability, thereby refining score predictions. Additionally, the effective use of IRT provides a richer interpretative framework, allowing educational practitioners to better understand the contribution of specific linguistic features to writing proficiency.

Future research could refine the techniques for combining PFs and NFs, test the efficacy of grammatical features in conjunction with LLMs, and explore more interpretable model outputs that provide educational insights. Furthermore, extending grammatical feature extraction to other languages and contexts could broaden the utility of this research in global educational assessment systems.

Overall, this paper contributes to the field of Automated Essay Scoring by demonstrating that detailed grammatical analysis and multi-task learning frameworks substantially improve the accuracy and reliability of AES models.
