BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision
This paper addresses the open-domain named entity recognition (NER) problem under the challenging setup of distant supervision. Distant supervision mitigates the demand for manual annotations by using existing knowledge bases (KBs) to generate labels automatically, typically by matching text spans against KB entries. While efficient, this approach produces noisy and incomplete annotations due to the limited coverage of KBs and the ambiguity of entity mentions, which degrades the performance of downstream NER models.
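To make the distant-labeling idea concrete, the following is a minimal sketch of gazetteer-based label generation. The tokenization, gazetteer contents, and greedy longest-match strategy are illustrative assumptions, not the paper's exact matching pipeline.

```python
# Minimal sketch of distant labeling: match token spans against a
# gazetteer derived from a knowledge base and emit BIO tags.
# Gazetteer contents and tokenization are illustrative assumptions.

from typing import Dict, List, Tuple

def distant_label(tokens: List[str],
                  gazetteer: Dict[Tuple[str, ...], str],
                  max_span: int = 4) -> List[str]:
    """Greedy longest-match labeling; unmatched tokens stay 'O'."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span first so "New York City" beats "New York".
        for length in range(min(max_span, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + length])
            if span in gazetteer:
                etype = gazetteer[span]
                labels[i] = f"B-{etype}"
                for j in range(i + 1, i + length):
                    labels[j] = f"I-{etype}"
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return labels

# Toy knowledge base; real entries would come from a KB such as Wikidata.
kb = {("new", "york", "city"): "LOC", ("barack", "obama"): "PER"}
print(distant_label("Barack Obama visited New York City .".split(), kb))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O']
```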
Contribution and Methodology
The authors introduce a novel two-stage framework, BOND, which enhances NER models by leveraging pre-trained language models such as BERT and RoBERTa. The methodology unfolds in two distinct stages:
- Adapting Pre-trained Models with Distant Labels: The first stage fine-tunes a pre-trained language model on the distantly labeled data. This transfers general semantic knowledge from models like BERT into the NER task, substantially improving precision and recall over the raw distant labels. Early stopping is used to keep the model from overfitting to the noisy labels (see the fine-tuning sketch after this list).
- Self-training with Pseudo-labels: In the second stage, the authors drop the distant labels in favor of a self-training scheme. Soft pseudo-labels generated by the model itself are used in a teacher-student framework to iteratively refine the NER predictions. This step further alleviates the effects of noisy and incomplete annotations and turns the pre-trained knowledge into better domain-specific predictions (see the teacher-student sketch after this list).
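A minimal sketch of the first stage is shown below: fine-tuning a pre-trained token-classification model on distantly labeled data, with early stopping so the model does not memorize label noise. The data loaders, label count, and `evaluate_f1` helper are assumptions made for illustration, not the paper's exact training setup.

```python
# Stage I sketch: fine-tune a pre-trained encoder for token classification
# on distant labels, stopping early before it memorizes label noise.
# train_loader, dev_loader, and evaluate_f1 are assumed, not shown.

import torch
from transformers import AutoModelForTokenClassification

NUM_LABELS = 9  # e.g. BIO tags over four entity types (an assumption)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=NUM_LABELS)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

best_f1, patience, bad_epochs = 0.0, 2, 0
for epoch in range(10):
    model.train()
    for batch in train_loader:  # dicts with input_ids, attention_mask, labels
        loss = model(**batch).loss          # token-level cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dev_f1 = evaluate_f1(model, dev_loader)  # hypothetical evaluation helper
    if dev_f1 > best_f1:
        best_f1, bad_epochs = dev_f1, 0
        torch.save(model.state_dict(), "stage1.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # early stopping on noisy labels
            break
```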
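The second stage can be sketched as a teacher-student loop: the teacher's soft predictions supervise the student through a KL-divergence loss, and the teacher is refreshed from the student between rounds. The `unlabeled_loader`, learning rate, and number of self-training rounds are illustrative assumptions building on the Stage I sketch above.

```python
# Stage II sketch: teacher-student self-training with soft pseudo-labels.
# `model` is the Stage I model from the previous sketch.

import copy
import torch
import torch.nn.functional as F

teacher = copy.deepcopy(model).eval()       # Stage I model as initial teacher
student = copy.deepcopy(model)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

for iteration in range(3):                  # outer self-training rounds
    for batch in unlabeled_loader:          # text batches without gold labels
        with torch.no_grad():
            soft_targets = F.softmax(teacher(**batch).logits, dim=-1)

        student_log_probs = F.log_softmax(student(**batch).logits, dim=-1)
        loss = F.kl_div(student_log_probs, soft_targets,
                        reduction="batchmean")
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Refresh the teacher with the improved student for the next round.
    teacher = copy.deepcopy(student).eval()
```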
Results and Analysis
Thorough experimental evaluation on five benchmark datasets shows significant gains over existing distantly supervised NER techniques; the framework outperforms state-of-the-art models by margins of up to 21.91% in F1 score. These results underscore the efficacy of combining distant supervision with pre-trained language models for open-domain NER.
Theoretical and Practical Implications
The presented framework not only addresses the inherent challenges of distant supervision, such as label noise and incomplete coverage, but also demonstrates the flexibility and utility of pre-trained language models in specialized tasks. In particular, it improves data utilization through pseudo-labels that are filtered by prediction confidence.
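One way to realize such confidence-based filtering is to keep only tokens whose teacher probability exceeds a threshold and mask the remaining tokens out of the self-training loss. The threshold value and masking scheme below are illustrative assumptions, not the paper's exact selection rule.

```python
# Sketch of confidence-based pseudo-label selection: only tokens whose
# teacher probability exceeds a threshold contribute to the student loss.

import torch
import torch.nn.functional as F

def selection_mask(teacher_logits: torch.Tensor,
                   threshold: float = 0.9) -> torch.Tensor:
    """Boolean mask of shape (batch, seq_len) marking high-confidence tokens."""
    probs = F.softmax(teacher_logits, dim=-1)
    confidence, _ = probs.max(dim=-1)
    return confidence >= threshold

def masked_kl_loss(student_logits, teacher_logits, mask):
    """KL divergence averaged over high-confidence tokens only."""
    mask = mask.float()
    log_p = F.log_softmax(student_logits, dim=-1)
    q = F.softmax(teacher_logits, dim=-1)
    per_token = F.kl_div(log_p, q, reduction="none").sum(-1)  # (batch, seq)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```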
Broader Impact and Future Directions
This work suggests substantial potential for scaling NER to settings without manual annotation, enabling knowledge extraction in domains that lack comprehensive labeling. The two-stage design could also benefit from advances in larger and more expressive pre-trained language models. Future research could further explore student re-initialization strategies, alternative teacher-student schemes, or application to multilingual text, mitigating the barriers posed by label scarcity across languages.
In summary, BOND offers a well-conceived, thoroughly evaluated approach to extending the capabilities of existing NER models under the constraints of limited, noisy supervision, charting a new path for open-domain NER methodologies.