Can GPT-4 do L2 analytic assessment? (2404.18557v1)

Published 29 Apr 2024 in cs.CL

Abstract: Automated essay scoring (AES) to evaluate second language (L2) proficiency has been a firmly established technology used in educational contexts for decades. Although holistic scoring has seen advancements in AES that match or even exceed human performance, analytic scoring still encounters issues as it inherits flaws and shortcomings from the human scoring process. The recent introduction of LLMs presents new opportunities for automating the evaluation of specific aspects of L2 writing proficiency. In this paper, we perform a series of experiments using GPT-4 in a zero-shot fashion on a publicly available dataset annotated with holistic scores based on the Common European Framework of Reference and aim to extract detailed information about their underlying analytic components. We observe significant correlations between the automatically predicted analytic scores and multiple features associated with the individual proficiency components.

Authors (4)
  1. Stefano Bannò (11 papers)
  2. Hari Krishna Vydana (7 papers)
  3. Kate M. Knill (13 papers)
  4. Mark J. F. Gales (37 papers)
Citations (5)

Summary

  • The paper demonstrates GPT-4’s capability to perform analytic scoring on individual language components, showing strong alignment with holistic scores.
  • It employs a zero-shot approach on CEFR-rated essays to assess grammar, vocabulary, and coherence, demonstrating nuanced, component-level language evaluation.
  • The study implies that detailed, component-specific feedback from GPT-4 could enhance targeted language instruction and learner improvement.

Exploring GPT-4 for Analytic Scoring in Language Proficiency Assessments

Introduction: The Evolution of Automated Essay Scoring

Automated Essay Scoring (AES) has long been a vital tool in educational technology, allowing for quick and efficient assessment of learner essays. Systems ranging from the feature-based e-rater to Transformer-based models such as Longformer have focused on holistic scoring, which assigns a single overall grade to an essay. Holistic scoring, though efficient, overlooks the detailed feedback required for language learning, where specific skills like grammar, vocabulary, and coherence need individual attention.

Transition to Analytic Scoring Using GPT-4

The move from holistic to analytic scoring using AI tools like GPT-4 introduces nuanced evaluation, where each language component such as grammar or vocabulary is graded separately. This detailed assessment is crucial for language learners as it helps pinpoint specific areas of strength and weakness, thus enabling targeted improvements.
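
To make the contrast concrete, an analytic result can be represented as one CEFR level per component rather than a single grade. The sketch below is purely illustrative: the component names and the numeric mapping of CEFR levels (A1=1 through C2=6, a common convention for averaging and correlating ordinal levels) are assumptions, not taken from the paper.

```python
# Illustrative analytic score sheet: one CEFR level per language component,
# plus a conventional numeric mapping (A1=1 ... C2=6) so levels can be
# averaged and compared. Component names and mapping are assumptions.
CEFR_TO_NUM = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}

analytic = {  # hypothetical output for one essay
    "grammatical accuracy": "B1",
    "vocabulary": "B2",
    "coherence and cohesion": "B1",
}
holistic = "B1"

numeric = {comp: CEFR_TO_NUM[level] for comp, level in analytic.items()}
mean_analytic = sum(numeric.values()) / len(numeric)
print(f"per-component: {numeric}")
print(f"mean analytic: {mean_analytic:.2f} vs holistic: {CEFR_TO_NUM[holistic]}")
```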

Experiment Setup: Leveraging GPT-4 for Specific Language Components

To test the effectiveness of GPT-4 in analytic scoring, the researchers employed it in a zero-shot setting, meaning the model had no task-specific training and relied on its general language capabilities. They used essays from the W&I (Write & Improve) dataset, rated on the Common European Framework of Reference (CEFR) scale, and prompted the model to produce scores for specific language components.
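
A zero-shot scoring call of this kind might look like the following sketch. It is a minimal reconstruction under stated assumptions: the prompt wording, the output format, and the use of the OpenAI chat-completions client with the `gpt-4` model name are illustrative choices, not the authors' actual pipeline.

```python
# Minimal zero-shot analytic scoring sketch. The prompt wording and output
# format are assumptions for illustration, not the paper's actual prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

COMPONENTS = ["grammatical accuracy", "vocabulary", "coherence and cohesion"]

def analytic_scores(essay: str) -> str:
    """Ask GPT-4 for a CEFR level (A1-C2) per analytic component."""
    prompt = (
        "You are an examiner of second-language English writing. For the "
        "essay below, assign a CEFR level (A1, A2, B1, B2, C1, or C2) to "
        f"each of these components: {', '.join(COMPONENTS)}. "
        "Answer with one 'component: level' line per component.\n\n"
        f"Essay:\n{essay}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduce sampling variance in scoring
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```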

Key Findings: Correlations and Insights

A highlight of the experiments is the significant correlation between the GPT-4-derived scores for individual language components and the holistic scores. This suggests that GPT-4's analytic judgments align with the human judgments of overall proficiency expressed in holistic scores. Below are some more detailed observations, followed by a sketch of the kind of correlation analysis involved:

  • Grammatical Accuracy: GPT-4's scores for grammatical accuracy showed a strong correlation with traditional grammatical assessment metrics, underlining its capability to assess syntax and grammar effectively.
  • Vocabulary Analysis: The model's assessment of vocabulary usage and control correlated well with the actual complexity and variation in the learner essays, indicating a nuanced understanding of vocabulary use in context.
  • Coherence and Cohesion: GPT-4 also managed to evaluate how well ideas were structured and connected in essays, a key component for judging essay quality.
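
The observations above rest on correlating GPT-4's component scores with text-derived features. A minimal sketch of such an analysis follows; the proxy features here (type-token ratio for vocabulary range, connective density for cohesion) are illustrative stand-ins, not the specific features analysed in the paper.

```python
# Correlate GPT-4 analytic scores with simple text-derived proxy features.
# These proxies (type-token ratio, connective density) are illustrative
# stand-ins, not the specific features analysed in the paper.
from scipy.stats import spearmanr

CONNECTIVES = {"however", "therefore", "moreover", "furthermore", "although"}

def type_token_ratio(text: str) -> float:
    """Crude lexical-diversity proxy: unique tokens / total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def connective_density(text: str) -> float:
    """Crude cohesion proxy: share of tokens that are discourse connectives."""
    tokens = [t.strip(",.;") for t in text.lower().split()]
    return sum(t in CONNECTIVES for t in tokens) / max(len(tokens), 1)

def correlate(essays: list[str], gpt4_scores: list[float], feature) -> float:
    """Spearman rank correlation between a proxy feature and GPT-4's scores."""
    rho, _p = spearmanr([feature(e) for e in essays], gpt4_scores)
    return rho
```

Spearman's rank correlation is a natural choice in this setting, since CEFR-derived scores are ordinal rather than interval-scaled.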

Implications and Prospects for Language Learning

Analytic scoring by GPT-4 could substantially change language learning by providing learners with detailed feedback on specific skills, something holistic models do not offer. This capability could enable more personalized and effective language instruction, helping learners improve incrementally in particular areas of need. These findings also suggest that AI can evaluate human language at a nuanced level, which has broader implications for AI's role in educational settings.

Future Directions

Given the promising results with GPT-4, future research could explore more extensive educational applications, such as real-time language skill assessments or integrating these tools into virtual language learning environments. Furthermore, extending this approach to other aspects of language learning, like spoken language assessment, could provide a more comprehensive evaluation tool.

Conclusion

The application of GPT-4 for analytic scoring in language proficiency assessments shows notable promise. It not only adheres to established language frameworks like CEFR but also enhances the feedback quality for language learners. The potential for detailed and scalable language assessments could make AI an indispensable tool in the educational technology landscape, paving the way for more personalized and effective learning methodologies.