On the Role of Morphological Information for Contextual Lemmatization (2302.00407v3)
Abstract: Lemmatization is an NLP task which consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process of obtaining a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without considering whether that is optimal in terms of downstream performance. To address this issue, in this paper we empirically investigate the role of morphological information in developing contextual lemmatizers in six languages within a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitute, after all, their most common application use. The results of our study are rather surprising. It turns out that providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages. In fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain competitive contextual lemmatizers without seeing any explicit morphological signal. Moreover, our experiments suggest that the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology and, finally, that current evaluation practices for lemmatization are not adequate to clearly discriminate between models.