
Improving Grammatical Error Correction via Contextual Data Augmentation (2406.17456v1)

Published 25 Jun 2024 in cs.CL and cs.AI

Abstract: Data augmentation with synthetic data is widely used in Grammatical Error Correction (GEC) to alleviate data scarcity. However, such synthetic data is mainly used in the pre-training phase rather than the data-limited fine-tuning phase, because of its inconsistent error distribution and noisy labels. In this paper, we propose a synthetic data construction method based on contextual augmentation, which efficiently augments the original data while preserving a consistent error distribution. Specifically, we combine rule-based substitution with model-based generation, using a generative model to produce richer contexts for the extracted error patterns. In addition, we propose a relabeling-based data cleaning method to mitigate the effect of noisy labels in synthetic data. Experiments on CoNLL14 and BEA19-Test show that the proposed augmentation method consistently and substantially outperforms strong baselines and reaches state-of-the-art performance with only a small amount of synthetic data.
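The abstract only sketches the pipeline, but its steps (extract error patterns, generate context with a model, corrupt it by rule-based substitution, then clean by relabeling) can be illustrated in a few lines of Python. This is a minimal sketch under assumed interfaces: the error-pattern table, the `generate_context` stub standing in for the generative model, and the `relabel` stub standing in for the relabeling GEC model are all hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of contextual augmentation for GEC, with a tiny
# hand-written error-pattern table and stubbed-out model calls.

# Hypothetical error patterns extracted from annotated GEC data:
# each maps a correct phrase to an erroneous variant seen in real errors.
ERROR_PATTERNS = {
    "depends on": "depends of",    # preposition error
    "consists of": "consists in",  # preposition error
}

def generate_context(pattern: str) -> str:
    """Stand-in for the generative model: the paper has a generator write
    a fluent sentence around the correct pattern; this stub just embeds
    the pattern in a fixed sentence frame."""
    return f"The result ultimately {pattern} several independent factors."

def make_synthetic_pair(correct: str, erroneous: str) -> tuple[str, str]:
    """Rule-based substitution: corrupt the generated context by swapping
    in the erroneous variant, yielding a (source, target) training pair."""
    target = generate_context(correct)           # grammatical target side
    source = target.replace(correct, erroneous)  # corrupted source side
    return source, target

def relabel(source: str) -> str:
    """Stand-in for relabeling-based cleaning: a trained GEC model would
    re-correct the source, and pairs whose relabeled output disagrees with
    the constructed target would be treated as noisy and dropped."""
    return source  # placeholder; a real model returns its own correction

synthetic = []
for correct, erroneous in ERROR_PATTERNS.items():
    src, tgt = make_synthetic_pair(correct, erroneous)
    # Keep the pair only if the relabeling model's output is consistent
    # with either side (trivially true for the placeholder above).
    if relabel(src) in (src, tgt):
        synthetic.append((src, tgt))

for src, tgt in synthetic:
    print(f"source: {src}\ntarget: {tgt}\n")
```

In the paper's actual setup the generator produces diverse contexts for each extracted pattern and the relabeling model filters pairs whose labels it cannot reproduce; the stubs above only mark where those models would plug in.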

Authors (6)
  1. Yixuan Wang (95 papers)
  2. Baoxin Wang (15 papers)
  3. Yijun Liu (23 papers)
  4. Qingfu Zhu (42 papers)
  5. Dayong Wu (16 papers)
  6. Wanxiang Che (155 papers)
Citations (1)
