
ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark (2303.13648v1)

Published 15 Mar 2023 in cs.CL

Abstract: ChatGPT is a cutting-edge artificial intelligence LLM developed by OpenAI, which has attracted a lot of attention due to its surprisingly strong ability in answering follow-up questions. In this report, we aim to evaluate ChatGPT on the Grammatical Error Correction (GEC) task, and compare it with a commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT does not perform as well as those baselines in terms of the automatic evaluation metrics (e.g., $F_{0.5}$ score), particularly on long sentences. We inspect the outputs and find that ChatGPT goes beyond one-by-one corrections. Specifically, it prefers to change the surface expression of certain phrases or the sentence structure while maintaining grammatical correctness. Human evaluation quantitatively confirms this and suggests that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate that ChatGPT is severely under-estimated by the automatic evaluation metrics and could be a promising tool for GEC.

Evaluation of ChatGPT for Grammatical Error Correction

The paper "ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark" provides a detailed evaluation of ChatGPT's capabilities in the domain of Grammatical Error Correction (GEC). The paper primarily compares ChatGPT with Grammarly, a leading commercial product, and GECToR, a state-of-the-art academic model, utilizing the CoNLL2014 benchmark dataset. This examination is crucial, as ChatGPT is frequently utilized for various NLP tasks, including writing assistance, yet its effectiveness in refining grammatical aspects of text has not been thoroughly scrutinized.

Key Findings

The paper reveals several critical insights regarding ChatGPT's performance in GEC:

  • Performance Metrics: When evaluated on standard metrics such as Precision, Recall, and the F0.5 score, ChatGPT underperforms compared to Grammarly and GECToR. ChatGPT's precision is relatively lower, indicating a higher number of over-corrections, while its recall is notably higher, suggesting an inclination to correct more errors, albeit sometimes excessively.
  • Sentence Structure Modifications: Unlike its counterparts, ChatGPT frequently makes substantial modifications to sentence structure. It readily rephrases or restructures sentences while preserving grammatical accuracy, though such edits register as over-corrections under standard metrics.
  • Vulnerability with Long Sentences: ChatGPT's performance diminishes with increasing sentence length, as demonstrated by lower F0.5 scores on longer sentences. This observation underlines a potential scalability limitation in its current deployment for GEC tasks.
  • Human Evaluation: Manual annotation shows that ChatGPT produces fewer under-corrections and mis-corrections than the other systems, but more over-corrections. This highlights its ability to detect a broader range of errors, alongside a propensity for generating varied expressions.
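The F0.5 ranking driving these comparisons can be sketched concretely. The snippet below uses a simplified set-of-edits representation (the official CoNLL-2014 evaluation uses the MaxMatch/M2 scorer; the edit tuples here are invented for illustration) to show how over-correction depresses precision, and how F0.5, which weights precision twice as heavily as recall, depresses the final score in turn.

```python
# Sketch: F_beta over edit sets, illustrating why over-correction is costly
# under F0.5. Edits are (start, end, replacement) tuples; the real CoNLL-2014
# benchmark uses the MaxMatch (M2) scorer, so this is a simplified stand-in.

def f_beta(precision, recall, beta=0.5):
    """F_beta combines precision and recall; beta=0.5 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def score_edits(system_edits, gold_edits):
    """Precision, recall, and F0.5 over sets of edits."""
    system, gold = set(system_edits), set(gold_edits)
    tp = len(system & gold)
    precision = tp / len(system) if system else 1.0
    recall = tp / len(gold) if gold else 1.0
    return precision, recall, f_beta(precision, recall)

# A system that recovers every gold edit but also proposes extra, unannotated
# rewrites gets perfect recall yet only moderate F0.5.
gold = [(3, 4, "went"), (7, 8, "the")]
over_correcting = [(3, 4, "went"), (0, 1, "Yesterday,"),
                   (5, 6, "quickly"), (7, 8, "the")]
p, r, f = score_edits(over_correcting, gold)
```

Here precision is 0.5 and recall is 1.0, yet F0.5 lands closer to precision than to recall, which mirrors how ChatGPT's fluent extra edits drag its benchmark score down.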

Implications and Future Directions

The paper underscores the limitation of relying solely on traditional automatic evaluation metrics for assessing GEC tools. These metrics may not capture the qualitative strengths of LLMs like ChatGPT, such as producing grammatically correct but substantially rephrased sentences. This finding invites a reconsideration of existing evaluation frameworks when applied to cutting-edge models.
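One way to see why reference-based metrics undervalue fluent rewrites is to compare exact edit overlap. The sketch below stands in for the M2 alignment with a simple `difflib` token diff over invented example sentences: a minimal gold correction and a ChatGPT-style rewrite of the same source, both grammatical, yet sharing no identical edits.

```python
# Sketch: edit-overlap metrics only credit edits that exactly match the gold
# annotation. A token diff via difflib stands in for the real M2 alignment;
# the sentences are invented examples.
import difflib

def token_edits(source, hypothesis):
    """Return (i, j, replacement) edits turning source tokens into hypothesis tokens."""
    s, h = source.split(), hypothesis.split()
    sm = difflib.SequenceMatcher(a=s, b=h)
    return {(i1, i2, " ".join(h[j1:j2]))
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"}

source  = "He go to school yesterday ."
gold    = "He went to school yesterday ."   # minimal correction
rewrite = "Yesterday he went to school ."   # fluent, ChatGPT-style rewrite

gold_edits = token_edits(source, gold)
rewrite_edits = token_edits(source, rewrite)
matched = gold_edits & rewrite_edits
# The rewrite fixes the verb and is fully grammatical, yet none of its edits
# coincide exactly with the gold edit, so its measured precision collapses.
```

Under this kind of matching, the rewrite scores zero true positives despite being an acceptable correction, which is precisely the under-estimation the paper's human evaluation surfaces.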

Practically, the results suggest ChatGPT could be an asset for applications where free rephrasing is acceptable or even desirable, but less suited for contexts where fidelity to the original sentence structure is paramount. Theoretically, this provokes a broader discussion on how LLMs are pushing the boundaries of traditional NLP task definitions, presenting both challenges and opportunities for the community.

Future research could pursue more in-depth investigations across diverse datasets and enhance ChatGPT's performance with advanced in-context learning methods. The paper also points to the value of developing more nuanced evaluation metrics that account for the complex outputs produced by models such as ChatGPT.
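One in-context learning direction can be sketched as simple prompt assembly: prepending a few corrected input/output pairs before the target sentence. The instruction wording and example pairs below are invented for illustration, not taken from the paper.

```python
# Sketch: building a few-shot in-context prompt for GEC. The instruction text
# and the example pairs are hypothetical; a real study would tune both.

FEWSHOT = [
    ("She go to school every days .", "She goes to school every day ."),
    ("I am agree with this opinion .", "I agree with this opinion ."),
]

def build_gec_prompt(sentence, examples=FEWSHOT):
    """Assemble an instruction, demonstration pairs, and the target sentence."""
    lines = ["Correct the grammatical errors in the last sentence, "
             "changing as little as possible."]
    for src, tgt in examples:
        lines.append(f"Input: {src}\nOutput: {tgt}")
    lines.append(f"Input: {sentence}\nOutput:")
    return "\n\n".join(lines)

prompt = build_gec_prompt("He have many informations .")
```

Explicitly instructing the model to change as little as possible is one plausible lever against the over-correction behavior the evaluation identifies.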

In conclusion, while ChatGPT does not outperform more specialized systems on standard GEC metrics, it exhibits unique features that suggest unexplored potential within the GEC paradigm. Refining evaluation methods and harnessing LLMs' diverse output capabilities remain critical steps for future advancements in this field.

Authors (5)
  1. Haoran Wu (18 papers)
  2. Wenxuan Wang (128 papers)
  3. Yuxuan Wan (28 papers)
  4. Wenxiang Jiao (44 papers)
  5. Michael Lyu (27 papers)
Citations (99)