CrossNER: Evaluating Cross-Domain Named Entity Recognition (2012.04373v2)

Published 8 Dec 2020 in cs.CL and cs.AI

Abstract: Cross-domain named entity recognition (NER) models are able to cope with the scarcity issue of NER samples in target domains. However, most of the existing NER benchmarks lack domain-specialized entity types or do not focus on a certain domain, leading to a less effective cross-domain evaluation. To address these obstacles, we introduce a cross-domain NER dataset (CrossNER), a fully-labeled collection of NER data spanning over five diverse domains with specialized entity categories for different domains. Additionally, we also provide a domain-related corpus since using it to continue pre-training language models (domain-adaptive pre-training) is effective for the domain adaptation. We then conduct comprehensive experiments to explore the effectiveness of leveraging different levels of the domain corpus and pre-training strategies to do domain-adaptive pre-training for the cross-domain task. Results show that focusing on the fractional corpus containing domain-specialized entities and utilizing a more challenging pre-training strategy in domain-adaptive pre-training are beneficial for the NER domain adaptation, and our proposed method can consistently outperform existing cross-domain NER baselines. Nevertheless, experiments also illustrate the challenge of this cross-domain NER task. We hope that our dataset and baselines will catalyze research in the NER domain adaptation area. The code and data are available at https://github.com/zliucr/CrossNER.

Authors (8)
  1. Zihan Liu (102 papers)
  2. Yan Xu (258 papers)
  3. Tiezheng Yu (29 papers)
  4. Wenliang Dai (24 papers)
  5. Ziwei Ji (42 papers)
  6. Samuel Cahyawijaya (75 papers)
  7. Andrea Madotto (65 papers)
  8. Pascale Fung (151 papers)
Citations (128)

Summary

  • The paper introduces the CrossNER dataset and a domain-adaptive pre-training process that enhances NER across multiple domains.
  • It details a novel span-level masking technique that enables BERT models to capture broader semantic contexts in domain-specific text.
  • Empirical experiments demonstrate that even with limited domain data, the approach significantly improves NER accuracy and adaptability.

CrossNER: Advancements in Cross-Domain Named Entity Recognition

The paper "CrossNER: Evaluating Cross-Domain Named Entity Recognition" addresses the challenge of adapting Named Entity Recognition (NER) models across multiple domains. It introduces a new dataset, CrossNER, together with a domain-adaptive pre-training methodology designed to improve cross-domain NER. The work highlights the importance and practicality of extending NER to diverse domains and presents a structured approach for doing so in resource-constrained settings.

Key Contributions and Methodology

The paper's primary contribution is the introduction of the CrossNER dataset. It spans five distinct domains—politics, natural science, music, literature, and artificial intelligence—with meticulously curated entity categories peculiar to each domain. This dataset addresses the inadequacies of previous NER benchmarks that either lacked domain specificity or were ineffective for cross-domain evaluations. The authors detail the comprehensive process of constructing this dataset, emphasizing human annotation backed by automatic pre-annotation methods using resources like the DBpedia Ontology.
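For readers who want to inspect the released data, a minimal loader sketch is shown below. It assumes the splits follow the common CoNLL convention of one token and BIO tag per line, with blank lines separating sentences; the file path in the usage comment is illustrative rather than a confirmed layout of the repository.

```python
# Minimal sketch for reading a CrossNER-style split, assuming the common
# CoNLL convention: one "token  BIO-tag" pair per line, blank lines between
# sentences. The path used in the example is illustrative.
from typing import List, Tuple

def read_conll(path: str) -> List[Tuple[List[str], List[str]]]:
    """Return a list of (tokens, tags) pairs, one per sentence."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if not parts:  # sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            if len(parts) < 2:  # skip malformed lines defensively
                continue
            tokens.append(parts[0])
            tags.append(parts[1])
    if tokens:  # flush the last sentence if the file has no trailing blank line
        sentences.append((tokens, tags))
    return sentences

# Example (path is hypothetical):
# politics_train = read_conll("ner_data/politics/train.txt")
```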

Moreover, the authors introduce a domain-adaptive pre-training (DAPT) process based on BERT, which continues pre-training on domain-specific data. They explore multiple levels of domain corpus for pre-training and propose a more challenging span-level masking strategy. Compared with simpler token-level masking, this approach aims to give the language model a more nuanced understanding of domain-specific contexts.
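To make the distinction concrete, the sketch below contrasts token-level and span-level masking in simplified form. The span-length range and the 15% masking budget are assumptions for illustration, not the authors' exact settings.

```python
# Illustrative sketch of token-level vs. span-level masking for masked
# language modeling. This approximates the idea described in the paper;
# span lengths and the masking budget are assumed values.
import random

MASK = "[MASK]"

def token_level_mask(tokens, mask_prob=0.15):
    """Mask each token independently with probability mask_prob."""
    return [MASK if random.random() < mask_prob else t for t in tokens]

def span_level_mask(tokens, mask_prob=0.15, max_span=5):
    """Mask contiguous spans until roughly mask_prob of tokens are covered."""
    tokens = list(tokens)
    if not tokens:
        return tokens
    budget = max(1, int(len(tokens) * mask_prob))
    masked = 0
    while masked < budget:
        span_len = random.randint(1, max_span)
        start = random.randrange(0, max(1, len(tokens) - span_len + 1))
        for i in range(start, min(start + span_len, len(tokens))):
            if tokens[i] != MASK:
                tokens[i] = MASK
                masked += 1
    return tokens

# Example (toy sentence, output varies with the random seed):
# print(span_level_mask("deep learning improves named entity recognition".split()))
```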

Experimental Insights

Detailed experiments confirm the efficacy of these methods. BERT models pre-trained with the proposed DAPT on domain-specific corpora consistently outperform traditional cross-domain NER baselines. The experiments suggest that focusing pre-training on text with a higher density of domain-specific entities, obtained via the task-level and entity-level corpora, is beneficial for domain adaptation.
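As a rough illustration of the entity-level corpus idea, the following sketch keeps only sentences that mention at least one entity from a domain entity list. The entity set, string-matching rule, and threshold are hypothetical simplifications, not the paper's exact extraction procedure.

```python
# Hedged sketch of building an "entity-level" pre-training corpus: retain only
# sentences that mention domain entities. The entity list and matching rule
# below are illustrative simplifications.
from typing import Iterable, List, Set

def entity_level_corpus(sentences: Iterable[str],
                        domain_entities: Set[str],
                        min_hits: int = 1) -> List[str]:
    """Keep sentences containing at least `min_hits` domain-entity mentions."""
    kept = []
    for sent in sentences:
        lowered = sent.lower()
        hits = sum(1 for e in domain_entities if e.lower() in lowered)
        if hits >= min_hits:
            kept.append(sent)
    return kept

# Example with toy data (values are hypothetical):
# corpus = ["The guitarist joined Radiohead in 1991.", "It rained all day."]
# print(entity_level_corpus(corpus, {"Radiohead"}))  # keeps only the first sentence
```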

Span-level pre-training, in particular, was identified as a robust technique, enabling models to learn representations that cover broader semantic contexts compared to token-level masking. Notably, experimental results provide evidence that even when domain-related corpora are limited in size, significant improvements can be achieved.

Practical Implications and Future Directions

The introduction of CrossNER opens new possibilities for applying NER in diverse fields where entity categories are domain-specific. The dataset's availability is poised to encourage further research and development in adapting NER systems to handle domain variability effectively. The methodologies proposed could also extend to real-world applications where training data is scarce but domain adaptability is critical, such as medical text mining or the automated analysis of emerging scientific literature.

Future directions include expanding the corpus to cover more domains and refining the pre-training methods to handle even more specialized contexts. Additionally, the adaptability to languages other than English remains an unexplored area that could broaden the applicability of the CrossNER dataset and associated methodologies.

While the present work has achieved significant strides in cross-domain NER, the journey towards truly domain-agnostic NER models remains a challenging yet promising field of investigation.
