Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
119 tokens/sec
GPT-4o
56 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction (2406.03202v2)

Published 5 Jun 2024 in cs.CL and cs.AI

Abstract: We explore and improve the capabilities of LLMs to generate data for grammatical error correction (GEC). When merely producing parallel sentences, their patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator. Additionally, we introduce a new dataset for GEC tasks, named ChatLang-8, which encompasses eight types of subject nouns and 23 types of grammar. It consists of 1 million pairs featuring human-like grammatical errors. Our experiments reveal that ChatLang-8 exhibits a more uniform pattern composition compared to existing GEC datasets. Furthermore, we observe improved model performance when using ChatLang-8 instead of existing GEC datasets. The experimental results suggest that our framework and ChatLang-8 are valuable resources for enhancing ChatGPT's data generation capabilities.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Jeiyoon Park (5 papers)
  2. Chanjun Park (49 papers)
  3. Heuiseok Lim (49 papers)
Citations (2)

Summary

We haven't generated a summary for this paper yet.

X Twitter Logo Streamline Icon: https://streamlinehq.com