
FlaCGEC: A Chinese Grammatical Error Correction Dataset with Fine-grained Linguistic Annotation (2311.04906v1)

Published 26 Sep 2023 in cs.CL and cs.AI

Abstract: Chinese Grammatical Error Correction (CGEC) has attracted growing attention from researchers in recent years. Although multiple CGEC datasets have been developed to support this research, they do not provide a deep linguistic topology of grammar errors, which is critical for interpreting and diagnosing CGEC approaches. To address this limitation, we introduce FlaCGEC, a new CGEC dataset featuring fine-grained linguistic annotation. Specifically, we collect a raw corpus from the linguistic schema defined by Chinese language experts, edit sentences via rules, and manually refine the generated samples, which results in 10k sentences with 78 instantiated grammar points and 3 types of edits. We evaluate various cutting-edge CGEC methods on FlaCGEC, and their unremarkable results indicate that the dataset is challenging because it covers a wide range of grammatical errors. In addition, we treat FlaCGEC as a diagnostic dataset for testing generalization skills and conduct a thorough evaluation of existing CGEC models.

Authors (7)
  1. Hanyue Du
  2. Yike Zhao
  3. Qingyuan Tian
  4. Jiani Wang
  5. Lei Wang
  6. Yunshi Lan
  7. Xuesong Lu
Citations (3)