CrossWeigh: Training Named Entity Tagger from Imperfect Annotations (1909.01441v1)

Published 3 Sep 2019 in cs.CL

Abstract: Everyone makes mistakes. So do human annotators when curating labels for named entity recognition (NER). Such label mistakes might hurt model training and interfere with model comparison. In this study, we dive deep into one of the widely-adopted NER benchmark datasets, CoNLL03 NER. We are able to identify label mistakes in about 5.38% of test sentences, which is a significant ratio considering that the state-of-the-art test F1 score is already around 93%. Therefore, we manually correct these label mistakes and form a cleaner test set. Our re-evaluation of popular models on this corrected test set leads to more accurate assessments, compared to those on the original test set. More importantly, we propose a simple yet effective framework, CrossWeigh, to handle label mistakes during NER model training. Specifically, it partitions the training data into several folds and trains independent NER models to identify potential mistakes in each fold. Then it adjusts the weights of training data accordingly to train the final NER model. Extensive experiments demonstrate significant improvements of plugging various NER models into our proposed framework on three datasets. All implementations and the corrected test set are available at our GitHub repo: https://github.com/ZihanWangKi/CrossWeigh.

CrossWeigh: Addressing Imperfect Annotations in Named Entity Recognition

The paper "CrossWeigh: Training Named Entity Tagger from Imperfect Annotations" focuses on the critical issue of label mistakes within benchmark datasets for Named Entity Recognition (NER) tasks, specifically addressing the CoNLL03 NER dataset. This paper reflects on the significant impact these annotation errors can have on both the model evaluation and training processes, considering the already high baseline performance metrics that might mask these discrepancies.

Problem Statement and Contributions

The paper identifies label mistakes in approximately 5.38% of the test sentences in the CoNLL03 dataset, a non-negligible proportion given that state-of-the-art F1 scores already hover around 93%. Incorrect annotations distort both evaluation and training: models are scored against wrong labels and also learn from them, so the errors propagate into the trained taggers.

To tackle this, the researchers propose CrossWeigh, a simple framework for handling label mistakes during NER training. It proceeds in two steps: mistake estimation and mistake re-weighing.

  1. Mistake Estimation: The training data is partitioned into several folds, much like k-fold cross-validation, with the key addition that sentences containing entities from the held-out fold are excluded from the folds used for training (entity disjoint filtering). Independent NER models trained on the remaining data then tag the held-out sentences, and disagreements with the original labels flag potential labeling errors.
  2. Mistake Re-weighing: Sentences flagged as potential mistakes have their weights lowered in the final training phase, minimizing their influence on the final model. This mistake-aware training improves robustness and accuracy across several NER models and datasets; a minimal sketch of both steps follows this list.
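
The sketch below illustrates the two steps. It is not the authors' implementation (that is available in the linked GitHub repository): the `Sentence` type and the `train_fn` / `predict_fn` callables are hypothetical stand-ins for whatever NER toolkit is plugged in, and the default constants (10 folds, 3 iterations, a decay factor of 0.7) are illustrative assumptions rather than the paper's reported settings.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

# A sentence is (tokens, BIO tags), e.g. (["John", "lives"], ["B-PER", "O"]).
Sentence = Tuple[List[str], List[str]]


def entity_surface_forms(sent: Sentence) -> Set[Tuple[str, ...]]:
    """Collect entity surface forms (token spans) from a BIO-tagged sentence."""
    tokens, tags = sent
    spans, current = set(), []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.add(tuple(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.add(tuple(current))
            current = []
    if current:
        spans.add(tuple(current))
    return spans


def crossweigh_weights(
    data: Sequence[Sentence],
    train_fn: Callable[[List[Sentence]], object],         # fits and returns a model
    predict_fn: Callable[[object, Sentence], List[str]],  # returns predicted tags
    k: int = 10,            # number of folds per iteration (assumed default)
    iterations: int = 3,    # independent repetitions with different partitions
    epsilon: float = 0.7,   # weight decay per flagged mistake (assumed default)
    seed: int = 0,
) -> List[float]:
    """Estimate per-sentence training weights: epsilon ** (#times flagged)."""
    rng = random.Random(seed)
    mistake_counts = [0] * len(data)

    for _ in range(iterations):
        order = list(range(len(data)))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k roughly equal folds

        for held_out in folds:
            held = set(held_out)
            held_entities = set().union(
                *(entity_surface_forms(data[i]) for i in held_out)
            )
            # Entity-disjoint filtering: drop training sentences that share an
            # entity with the held-out fold, so the fold model cannot simply
            # memorize those entities.
            train_set = [
                data[i] for i in order
                if i not in held
                and not (entity_surface_forms(data[i]) & held_entities)
            ]

            model = train_fn(train_set)
            for i in held_out:
                if predict_fn(model, data[i]) != data[i][1]:
                    mistake_counts[i] += 1  # disagreement => potential mistake

    return [epsilon ** c for c in mistake_counts]
```

The returned weights are then used in the final training run, scaling each sentence's contribution to the training objective.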

Experimental Validation

The framework was subjected to extensive experimentation across three datasets, including the original and corrected versions of the CoNLL03 dataset. Results consistently demonstrated that integrating CrossWeigh into mainstream NER models (such as LSTM-CRF, VanillaNER, and Flair) improves their F1 scores and reduces performance variability, providing a systematic mechanism for coping with label inaccuracies.
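
To make the integration concrete, here is a deliberately tiny, framework-agnostic illustration (not the paper's code) of how such per-sentence weights typically enter training: each sentence's loss term, for example a CRF negative log-likelihood, is scaled by its weight so that frequently flagged sentences contribute less to the objective.

```python
def weighted_total_loss(per_sentence_losses, weights):
    """Sum of per-sentence losses, each scaled by its CrossWeigh weight."""
    assert len(per_sentence_losses) == len(weights)
    return sum(w * loss for w, loss in zip(weights, per_sentence_losses))


# Toy usage: the second sentence was flagged twice (0.7 ** 2 = 0.49), so it
# contributes roughly half as much to the objective as an unflagged one.
losses = [2.3, 1.1, 0.8]
weights = [1.0, 0.49, 1.0]
print(weighted_total_loss(losses, weights))  # 2.3 + 0.49 * 1.1 + 0.8 ≈ 3.64
```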

Broader Implications

CrossWeigh's impact extends beyond immediate performance improvements in NER tasks. It offers a structured framework for addressing data quality issues, which could be adapted across other domains where label noise is a concern. The approach also introduces a potential method for semi-automated quality enhancement of datasets, especially valuable for evolving datasets with emerging entities or in low-resource language settings.

Future Directions

This research opens several avenues for future work. One direction, discussed by the authors, is making CrossWeigh iterative, in the spirit of boosting, so that mistake estimates are refined over successive rounds and remain effective on evolving datasets. Meta-learning strategies could also help CrossWeigh adapt more quickly to new datasets and model architectures.

In summary, the paper delivers both a meaningful critique of annotation errors in NER and a practical remedy, offering a viable pathway toward more precise model training and evaluation. By making training robust to noise in the underlying data, CrossWeigh improves the reliability and interpretability of NER systems and represents a useful advance in natural language processing.

Authors (6)
  1. Zihan Wang (181 papers)
  2. Jingbo Shang (141 papers)
  3. Liyuan Liu (49 papers)
  4. Lihao Lu (1 paper)
  5. Jiacheng Liu (67 papers)
  6. Jiawei Han (263 papers)
Citations (97)