
Organic Data-Driven Approach for Turkish Grammatical Error Correction and LLMs (2405.15320v1)

Published 24 May 2024 in cs.CL and cs.AI

Abstract: Grammatical Error Correction has seen significant progress with recent advances in deep learning. Because those methods require huge amounts of data, synthetic datasets are being built to fill this gap. Unfortunately, synthetic datasets are not organic enough in some cases, and they even require clean data to start with. Furthermore, most existing work focuses on English. In this work, we introduce a new organic data-driven approach, clean insertions, to build parallel Turkish Grammatical Error Correction datasets from any organic data, and to clean the data used for training LLMs. We achieve state-of-the-art results on two of the three publicly available Turkish Grammatical Error Correction test sets. We also show the effectiveness of our method on the training losses of LLMs.

Authors (2)
  1. Asım Ersoy
  2. Olcay Taner Yıldız
