A Simple Recipe for Multilingual Grammatical Error Correction (2106.03830v2)

Published 7 Jun 2021 in cs.CL

Abstract: This paper presents a simple recipe to train state-of-the-art multilingual Grammatical Error Correction (GEC) models. We achieve this by first proposing a language-agnostic method to generate a large number of synthetic examples. The second ingredient is to use large-scale multilingual LLMs (up to 11B parameters). Once fine-tuned on language-specific supervised sets, we surpass the previous state-of-the-art results on GEC benchmarks in four languages: English, Czech, German, and Russian. Having established a new set of baselines for GEC, we make our results easily reproducible and accessible by releasing the cLang-8 dataset. It is produced by using our best model, which we call gT5, to clean the targets of the widely used yet noisy lang-8 dataset. cLang-8 greatly simplifies typical GEC training pipelines composed of multiple fine-tuning stages -- we demonstrate that performing a single fine-tuning step on cLang-8 with off-the-shelf LLMs yields further accuracy improvements over the already top-performing gT5 model for English.
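The simplified pipeline the abstract describes amounts to a single supervised fine-tuning pass of an off-the-shelf seq2seq LLM over cLang-8 source/target pairs. The sketch below illustrates that step. It is a minimal illustration, not the authors' code: it assumes Hugging Face Transformers as the framework, google/mt5-base as a small stand-in for the paper's much larger models, and a hypothetical local TSV file clang8_en.tsv of (ungrammatical source, corrected target) pairs.

```python
# Minimal sketch of single-stage GEC fine-tuning on cLang-8 pairs.
# Assumptions (not from the paper): Hugging Face Transformers, an mT5-base
# checkpoint, and a local tab-separated file "clang8_en.tsv".
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/mt5-base"  # stand-in; the paper scales up to 11B params

class ClangDataset(Dataset):
    """Reads tab-separated (source, corrected target) sentence pairs."""
    def __init__(self, path, tokenizer, max_len=128):
        self.pairs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                src, tgt = line.rstrip("\n").split("\t")
                self.pairs.append((src, tgt))
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, tgt = self.pairs[idx]
        enc = self.tokenizer(src, truncation=True, max_length=self.max_len,
                             padding="max_length", return_tensors="pt")
        lab = self.tokenizer(tgt, truncation=True, max_length=self.max_len,
                             padding="max_length", return_tensors="pt")
        labels = lab.input_ids.squeeze(0)
        # Mask padding positions so they are ignored by the cross-entropy loss.
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {"input_ids": enc.input_ids.squeeze(0),
                "attention_mask": enc.attention_mask.squeeze(0),
                "labels": labels}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
loader = DataLoader(ClangDataset("clang8_en.tsv", tokenizer),
                    batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for batch in loader:
    loss = model(**batch).loss  # standard seq2seq cross-entropy objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The design point the paper makes is that, with cLang-8's cleaned targets, this one pass replaces the multi-stage fine-tuning schedules common in earlier GEC work.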

Authors (5)
  1. Sascha Rothe (16 papers)
  2. Jonathan Mallinson (13 papers)
  3. Eric Malmi (26 papers)
  4. Sebastian Krause (9 papers)
  5. Aliaksei Severyn (29 papers)
Citations (145)