DANSK and DaCy 2.6.0: Domain Generalization of Danish Named Entity Recognition (2402.18209v1)
Abstract: Named entity recognition is one of the cornerstones of Danish NLP, essential for language technology applications within both industry and research. However, Danish NER is inhibited by a lack of available datasets. As a consequence, no current models are capable of fine-grained named entity recognition, nor have they been evaluated for potential generalizability issues across datasets and domains. To alleviate these limitations, this paper introduces: 1) DANSK: a named entity dataset providing for high-granularity tagging as well as within-domain evaluation of models across a diverse set of domains; 2) DaCy 2.6.0 that includes three generalizable models with fine-grained annotation; and 3) an evaluation of current state-of-the-art models' ability to generalize across domains. The evaluation of existing and new models revealed notable performance discrepancies across domains, which should be addressed within the field. Shortcomings of the annotation quality of the dataset and its impact on model training and evaluation are also discussed. Despite these limitations, we advocate for the use of the new dataset DANSK alongside further work on the generalizability within Danish NER.
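The cross-domain evaluation described in the abstract can be illustrated with a minimal sketch that computes exact-match, span-level F1 separately for each domain. This is not the authors' evaluation code. The function names, the document layout (`domain`, `gold`, `pred` keys), the domain names, and the toy spans are all assumptions made for illustration.

```python
from collections import defaultdict


def span_f1(gold, pred):
    """Micro-averaged exact-match F1 between two sets of entity spans."""
    tp = len(gold & pred)  # spans whose boundaries and label both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_by_domain(docs):
    """Group documents by domain and score each domain separately.

    Each doc is a dict with hypothetical keys:
      'domain': str, 'gold' and 'pred': lists of (start, end, label) tuples.
    """
    gold_by_domain = defaultdict(set)
    pred_by_domain = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        # Prefix each span with the document index so that identical
        # offsets in different documents are not merged.
        gold_by_domain[doc["domain"]] |= {(doc_id, *s) for s in doc["gold"]}
        pred_by_domain[doc["domain"]] |= {(doc_id, *s) for s in doc["pred"]}
    domains = set(gold_by_domain) | set(pred_by_domain)
    return {d: span_f1(gold_by_domain[d], pred_by_domain[d]) for d in sorted(domains)}


# Toy data: domain names, offsets, and labels are illustrative only.
docs = [
    {"domain": "news",
     "gold": [(0, 2, "PERSON"), (5, 6, "GPE")],
     "pred": [(0, 2, "PERSON"), (5, 6, "GPE")]},
    {"domain": "legal",
     "gold": [(1, 3, "LAW"), (7, 8, "ORG")],
     "pred": [(1, 3, "LAW"), (7, 9, "ORG")]},  # second span has a boundary error
]
scores = f1_by_domain(docs)
for domain, f1 in scores.items():
    print(f"{domain}: F1 = {f1:.2f}")
```

Reporting one score per domain, rather than a single pooled score, is what exposes the kind of performance discrepancy across domains that the abstract refers to.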