
Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals (2405.19433v2)

Published 29 May 2024 in cs.CL

Abstract: While current Automated Essay Scoring (AES) methods demonstrate high scoring agreement with human raters, their decision-making mechanisms are not fully understood. Our proposed method, using counterfactual intervention assisted by LLMs, reveals that BERT-like models primarily focus on sentence-level features, whereas LLMs such as GPT-3.5, GPT-4 and Llama-3 are sensitive to conventions & accuracy, language complexity, and organization, indicating a more comprehensive rationale alignment with scoring rubrics. Moreover, LLMs can discern counterfactual interventions when giving feedback on essays. Our approach improves understanding of neural AES methods and can also apply to other domains seeking transparency in model-driven decisions.

Authors (3)
  1. Yupei Wang
  2. Renfen Hu
  3. Zhe Zhao
Citations (1)

Summary

Diagnosing the Rationale Alignment in Automated Essay Scoring Methods

Overview

The paper "Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals" addresses a crucial aspect of automated essay scoring (AES) systems. While AES models have shown high alignment with human raters, their decision-making mechanisms remain inadequately explained. This work introduces a novel diagnostic approach using linguistically-informed counterfactual interventions to probe these mechanisms in both traditional NLP models and LLMs.

Key Contributions

The authors present a robust methodology that integrates linguistic knowledge from essay scoring rubrics—such as conventions, language complexity, and organization—with LLMs to generate counterfactual interventions. This approach systematically reveals the models' scoring basis beyond mere agreement with human raters.

Methodology

The paper involves several detailed steps:

  1. Concept Extraction: Target linguistic concepts are identified from the essay scoring rubrics of major standardized tests, including IELTS, TOEFL iBT, and others. The focus is placed on:
     - Conventions: adherence to standard English rules.
     - Language Complexity: vocabulary richness and syntactic variety.
     - Organization: logical structure and coherence.
  2. Counterfactual Generation: Using both LLMs and rule-based techniques, counterfactual essays are generated by altering specific linguistic features while preserving content and fluency (a sketch of this step follows the list).
  3. Model Evaluation: The authors fine-tune BERT, RoBERTa, and DeBERTa on the TOEFL11 and ELLIPSE datasets and compare them with LLMs such as GPT-3.5 and GPT-4 in zero-shot and few-shot learning settings.
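
To make the counterfactual-generation step concrete, here is a minimal sketch assuming an OpenAI-style chat API. The prompt wording, model name, and the `generate_counterfactual` helper are illustrative assumptions, not the authors' exact prompts or pipeline; rule-based interventions and checks for content preservation are omitted.

```python
# A minimal sketch of LLM-based counterfactual intervention on one rubric concept.
# Prompts, model choice, and helper names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical intervention instructions, one per rubric concept.
INTERVENTIONS = {
    "conventions": (
        "Introduce several spelling, punctuation, and grammar errors, "
        "but keep the meaning and content of the essay unchanged."
    ),
    "language_complexity": (
        "Replace advanced vocabulary with simpler words and split complex "
        "sentences into short, simple ones, keeping the content unchanged."
    ),
    "organization": (
        "Shuffle the paragraph order and remove transition phrases so the "
        "logical flow is weakened, keeping each sentence intact."
    ),
}

def generate_counterfactual(essay: str, concept: str, model: str = "gpt-4") -> str:
    """Return a counterfactual essay with one target concept degraded."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.7,
        messages=[
            {"role": "system",
             "content": "You rewrite student essays for a controlled experiment."},
            {"role": "user",
             "content": f"{INTERVENTIONS[concept]}\n\nEssay:\n{essay}"},
        ],
    )
    return response.choices[0].message.content
```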

Experiments and Results

The experimental results provide several insights:

  • Agreement and Alignment: BERT-like models exhibit higher agreement with human raters but show limitations in recognizing organizational features of essays. In contrast, LLMs, particularly after few-shot learning or fine-tuning, not only align better with scoring rubrics but also achieve high score agreement.
  • Counterfactual Interventions: Traditional models respond to interventions on conventions and language complexity but fail to account for logical structure and coherence. LLMs show sensitivity to all targeted linguistic concepts, indicating a more comprehensive rationale alignment (a sketch of such a probe follows this list).
  • Feedback Generation: LLMs are employed to generate feedback for essays, which further supports their adherence to the scoring rubrics. Feedback quality is evaluated manually, and LLMs produce discernibly different feedback for original versus counterfactual essays.
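
To illustrate how sensitivity to an intervention can be measured, the sketch below scores an original essay and its counterfactual with a fine-tuned BERT-style scorer and reports the shift. The checkpoint path and the regression head (`num_labels=1`) are assumptions for illustration, not the authors' released code.

```python
# Illustrative counterfactual probe of a fine-tuned BERT-style scorer.
# The checkpoint path is hypothetical; a regression head (num_labels=1) is assumed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "path/to/fine-tuned-aes-scorer"  # hypothetical local checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=1)
model.eval()

def score(essay: str) -> float:
    """Predict a holistic essay score with the regression head."""
    inputs = tokenizer(essay, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

def score_shift(original: str, counterfactual: str) -> float:
    """Negative values mean the scorer penalised the degraded essay,
    i.e. it is sensitive to the intervened linguistic concept."""
    return score(counterfactual) - score(original)
```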

Implications and Future Work

This research underscores the importance of assessing both agreement and rationale alignment in AES systems. The findings suggest that while BERT-like models may rank higher on traditional agreement metrics, LLMs offer superior alignment with human rationale when properly fine-tuned.

The implications of this paper are significant for the development and deployment of AES systems in educational settings. By ensuring that models not only agree with human raters but also follow a similar rationale, we can enhance their reliability and transparency in high-stakes testing scenarios.

Moreover, the approach can be generalized to other domains where transparency in model-driven decisions is critical. The use of LLMs for generating counterfactual samples marks a substantial advancement in the explainability and accountability of machine learning models.

Conclusion

The paper "Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals" provides a significant contribution to the field of AES. By employing linguistically-informed counterfactuals, the authors reveal important distinctions in how traditional models and LLMs process and score essays. This method enhances our understanding of model reasoning, paving the way for more transparent and accountable applications of neural AES systems. Future research could extend these findings by exploring additional scoring dimensions and evaluating comprehensive feedback mechanisms further.