Native Language Identification with Large Language Models (2312.07819v1)
Abstract: We present the first experiments on Native Language Identification (NLI) using LLMs such as GPT-4. NLI is the task of predicting a writer's first language by analyzing their writing in a second language, and it is used in second language acquisition and forensic linguistics. Our results show that GPT models are proficient at NLI classification, with GPT-4 setting a new performance record of 91.7% on the benchmark TOEFL11 test set in a zero-shot setting. We also show that, unlike previous fully supervised approaches, LLMs can perform NLI without being limited to a set of known classes, which has practical implications for real-world applications. Finally, we show that LLMs can justify their choices with reasoning based on spelling errors, syntactic patterns, and directly translated linguistic patterns.
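To make the distinction between the closed-set and open-set zero-shot settings concrete, here is a minimal sketch of how such prompts could be built and how a model reply could be mapped back to a label. The prompt wording and the helper names (`build_nli_prompt`, `parse_prediction`, `accuracy`) are illustrative assumptions, not the prompts or code used in the paper. The only fixed fact is the list of the 11 native languages in TOEFL11. No model is called here; the reply string stands in for whatever an LLM would return.

```python
from typing import Optional, Sequence

# The 11 native languages covered by the TOEFL11 corpus.
TOEFL11_L1S = ("Arabic", "Chinese", "French", "German", "Hindi", "Italian",
               "Japanese", "Korean", "Spanish", "Telugu", "Turkish")


def build_nli_prompt(essay: str, classes: Optional[Sequence[str]] = None) -> str:
    """Build a zero-shot prompt (hypothetical wording).

    With `classes` the task is closed-set; without it the model may name
    any language (open-set).
    """
    task = ("Read the English text below, written by a non-native speaker, "
            "and identify the writer's native language.")
    if classes:
        constraint = "Choose exactly one of: " + ", ".join(classes) + "."
    else:
        constraint = "Name the single most likely language; any language is allowed."
    return f"{task}\n{constraint}\nAnswer with the language name only.\n\nText:\n{essay}"


def parse_prediction(reply: str, classes: Optional[Sequence[str]] = None) -> Optional[str]:
    """Map a raw model reply to a label; None if no known class is mentioned."""
    cleaned = reply.strip().strip(".").strip()
    if classes is None:
        # Open-set: accept whatever language the model named.
        return cleaned or None
    for label in classes:
        # Closed-set: first class (in list order) mentioned in the reply wins.
        if label.lower() in cleaned.lower():
            return label
    return None


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

In the closed-set case a reply that names a language outside the class list is treated as unparseable (`None`) and counts as an error; in the open-set case the reply is kept as is, which is what allows predictions beyond a fixed set of known classes.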