
NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts (2305.16023v1)

Published 25 May 2023 in cs.CL

Abstract: We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research primarily focuses on correcting texts from a single domain, especially learner essays. To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination. We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data. We further perform detailed analyses of the connections and gaps between our domains from both empirical and statistical views. We hope this work can inspire future studies on an important but under-explored direction--cross-domain GEC.

Authors (7)
  1. Yue Zhang (618 papers)
  2. Bo Zhang (633 papers)
  3. Haochen Jiang (7 papers)
  4. Zhenghua Li (38 papers)
  5. Chen Li (386 papers)
  6. Fei Huang (408 papers)
  7. Min Zhang (630 papers)
Citations (8)