
ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark (2303.13648v1)

Published 15 Mar 2023 in cs.CL

Abstract: ChatGPT is a cutting-edge artificial intelligence LLM developed by OpenAI, which has attracted a lot of attention due to its surprisingly strong ability in answering follow-up questions. In this report, we aim to evaluate ChatGPT on the Grammatical Error Correction (GEC) task, and compare it with a commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT does not perform as well as those baselines in terms of the automatic evaluation metrics (e.g., $F_{0.5}$ score), particularly on long sentences. We inspect the outputs and find that ChatGPT goes beyond one-by-one corrections. Specifically, it prefers to change the surface expression of certain phrases or the sentence structure while maintaining grammatical correctness. Human evaluation quantitatively confirms this and suggests that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate that ChatGPT is severely under-estimated by the automatic evaluation metrics and could be a promising tool for GEC.

Evaluation of ChatGPT for Grammatical Error Correction

The paper "ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark" provides a detailed evaluation of ChatGPT's capabilities in the domain of Grammatical Error Correction (GEC). The paper primarily compares ChatGPT with Grammarly, a leading commercial product, and GECToR, a state-of-the-art academic model, utilizing the CoNLL2014 benchmark dataset. This examination is crucial, as ChatGPT is frequently utilized for various NLP tasks, including writing assistance, yet its effectiveness in refining grammatical aspects of text has not been thoroughly scrutinized.

Key Findings

The paper reveals several critical insights regarding ChatGPT's performance in GEC:

  • Performance Metrics: When evaluated on standard metrics such as Precision, Recall, and the F0.5 score, ChatGPT underperforms compared to Grammarly and GECToR. ChatGPT's precision is relatively lower, indicating a higher number of over-corrections, while its recall is notably higher, suggesting an inclination to correct more errors, albeit sometimes excessively.
  • Sentence Structure Modifications: Unlike its counterparts, ChatGPT frequently makes substantial modifications to sentence structure. It readily rephrases or restructures sentences while preserving grammatical accuracy, though such edits register as over-corrections under standard metrics.
  • Vulnerability with Long Sentences: ChatGPT's performance diminishes with increasing sentence length, as demonstrated by lower F0.5 scores on longer sentences. This observation underlines a potential scalability limitation in its current deployment for GEC tasks.
  • Human Evaluation: Manual annotation shows that ChatGPT produces fewer under-corrections and mis-corrections than the other systems, but more over-corrections. This highlights its ability to detect a broader range of errors, alongside a propensity for generating varied expressions.
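The F0.5 ranking driving these comparisons can be sketched concretely. The snippet below uses a simplified set-of-edits representation (the official CoNLL-2014 evaluation uses the MaxMatch/M2 scorer; the edit tuples here are invented for illustration) to show how over-correction depresses precision, and how F0.5, which weights precision twice as heavily as recall, depresses the final score in turn.

```python
# Sketch: F_beta over edit sets, illustrating why over-correction is costly
# under F0.5. Edits are (start, end, replacement) tuples; the real CoNLL-2014
# benchmark uses the MaxMatch (M2) scorer, so this is a simplified stand-in.

def f_beta(precision, recall, beta=0.5):
    """F_beta combines precision and recall; beta=0.5 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def score_edits(system_edits, gold_edits):
    """Precision, recall, and F0.5 over sets of edits."""
    system, gold = set(system_edits), set(gold_edits)
    tp = len(system & gold)
    precision = tp / len(system) if system else 1.0
    recall = tp / len(gold) if gold else 1.0
    return precision, recall, f_beta(precision, recall)

# A system that recovers every gold edit but also proposes extra, unannotated
# rewrites gets perfect recall yet only moderate F0.5.
gold = [(3, 4, "went"), (7, 8, "the")]
over_correcting = [(3, 4, "went"), (0, 1, "Yesterday,"),
                   (5, 6, "quickly"), (7, 8, "the")]
p, r, f = score_edits(over_correcting, gold)
```

Here precision is 0.5 and recall is 1.0, yet F0.5 lands closer to precision than to recall, which mirrors how ChatGPT's fluent extra edits drag its benchmark score down.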

Implications and Future Directions

The paper underscores the limitation of relying solely on traditional automatic evaluation metrics for assessing GEC tools. These metrics may not capture the qualitative strengths of LLMs like ChatGPT, such as producing grammatically correct but substantially rephrased sentences. This finding invites a reconsideration of existing evaluation frameworks when applied to cutting-edge models.
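One way to see why reference-based metrics undervalue fluent rewrites is to compare exact edit overlap. The sketch below stands in for the M2 alignment with a simple `difflib` token diff over invented example sentences: a minimal gold correction and a ChatGPT-style rewrite of the same source, both grammatical, yet sharing no identical edits.

```python
# Sketch: edit-overlap metrics only credit edits that exactly match the gold
# annotation. A token diff via difflib stands in for the real M2 alignment;
# the sentences are invented examples.
import difflib

def token_edits(source, hypothesis):
    """Return (i, j, replacement) edits turning source tokens into hypothesis tokens."""
    s, h = source.split(), hypothesis.split()
    sm = difflib.SequenceMatcher(a=s, b=h)
    return {(i1, i2, " ".join(h[j1:j2]))
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"}

source  = "He go to school yesterday ."
gold    = "He went to school yesterday ."   # minimal correction
rewrite = "Yesterday he went to school ."   # fluent, ChatGPT-style rewrite

gold_edits = token_edits(source, gold)
rewrite_edits = token_edits(source, rewrite)
matched = gold_edits & rewrite_edits
# The rewrite fixes the verb and is fully grammatical, yet none of its edits
# coincide exactly with the gold edit, so its measured precision collapses.
```

Under this kind of matching, the rewrite scores zero true positives despite being an acceptable correction, which is precisely the under-estimation the paper's human evaluation surfaces.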

Practically, the results suggest ChatGPT could be an asset for applications where free rephrasing is acceptable or even desirable, but less suited for contexts where fidelity to the original sentence structure is paramount. Theoretically, this provokes a broader discussion on how LLMs are pushing the boundaries of traditional NLP task definitions, presenting both challenges and opportunities for the community.

Future research could pursue more in-depth investigations across diverse datasets and enhance ChatGPT's performance with advanced in-context learning methods. The paper also points to the value of developing more nuanced evaluation metrics that account for the complex outputs produced by models such as ChatGPT.
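One in-context learning direction can be sketched as simple prompt assembly: prepending a few corrected input/output pairs before the target sentence. The instruction wording and example pairs below are invented for illustration, not taken from the paper.

```python
# Sketch: building a few-shot in-context prompt for GEC. The instruction text
# and the example pairs are hypothetical; a real study would tune both.

FEWSHOT = [
    ("She go to school every days .", "She goes to school every day ."),
    ("I am agree with this opinion .", "I agree with this opinion ."),
]

def build_gec_prompt(sentence, examples=FEWSHOT):
    """Assemble an instruction, demonstration pairs, and the target sentence."""
    lines = ["Correct the grammatical errors in the last sentence, "
             "changing as little as possible."]
    for src, tgt in examples:
        lines.append(f"Input: {src}\nOutput: {tgt}")
    lines.append(f"Input: {sentence}\nOutput:")
    return "\n\n".join(lines)

prompt = build_gec_prompt("He have many informations .")
```

Explicitly instructing the model to change as little as possible is one plausible lever against the over-correction behavior the evaluation identifies.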

In conclusion, while ChatGPT does not outperform more specialized systems on standard GEC metrics, it exhibits unique features that suggest unexplored potential within the GEC paradigm. Refining evaluation methods and harnessing LLMs' diverse output capabilities remain critical steps for future advancements in this field.

Authors (5)
  1. Haoran Wu (18 papers)
  2. Wenxuan Wang (128 papers)
  3. Yuxuan Wan (28 papers)
  4. Wenxiang Jiao (44 papers)
  5. Michael Lyu (27 papers)
Citations (99)