Grammatical Error Correction for Code-Switched Sentences by Learners of English (2404.12489v2)
Abstract: Code-switching (CSW) is a common phenomenon among multilingual speakers where multiple languages are used in a single discourse or utterance. Mixed-language utterances may still contain grammatical errors, yet most existing Grammatical Error Correction (GEC) systems have been trained on monolingual data and were not developed with CSW in mind. In this work, we conduct the first exploration into the use of GEC systems on CSW text. Through this exploration, we propose a novel method of generating synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora. We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints, and identify how they affect the performance of GEC systems on CSW text. Our best model achieves an average increase of 1.57 $F_{0.5}$ across three CSW test sets (English-Chinese, English-Korean and English-Japanese) without affecting the model's performance on a monolingual dataset. We also find that models trained on one CSW language generalise relatively well to other typologically similar CSW languages.
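For readers unfamiliar with the $F_{0.5}$ metric mentioned above, the following is a minimal sketch of how an F-beta score is computed from edit-level counts. The function name `f_beta` and the example counts are illustrative assumptions, not taken from the paper; the paper's own scores come from its evaluation pipeline, which is not reproduced here.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from true-positive, false-positive and false-negative edit counts.

    With beta = 0.5, precision is weighted more heavily than recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 6 correct edits, 2 spurious edits, 6 missed edits.
# Precision is 0.75 and recall is 0.5, so F0.5 sits above F1.
f05 = f_beta(tp=6, fp=2, fn=6)
f1 = f_beta(tp=6, fp=2, fn=6, beta=1.0)
print(f"F0.5 = {f05:.4f}, F1 = {f1:.4f}")
```

Because $\beta < 1$ favours precision, a system that proposes fewer but more reliable corrections scores higher under $F_{0.5}$ than under $F_1$.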