An Empirical Investigation of Domain Adaptation Ability for Chinese Spelling Check Models (2401.14630v1)
Abstract: Chinese Spelling Check (CSC) is an NLP task that aims to detect and correct spelling errors in Chinese text. However, CSC models are based on pretrained LLMs, which are trained on a general corpus. Consequently, their performance may drop on downstream tasks involving domain-specific terms. In this paper, we thoroughly evaluate the domain adaptation ability of several typical CSC models by building three new datasets rich in domain-specific terms from the financial, medical, and legal domains. We then conduct empirical investigations on the corresponding domain-specific test sets to ascertain the cross-domain adaptation ability of these models. We also test the performance of the popular LLM ChatGPT. Our experiments show that the performance of the CSC models drops significantly in the new domains.
- Xi Wang
- Ruoqing Zhao
- Hongliang Dai
- Piji Li