
Learning Named Entity Tagger using Domain-Specific Dictionary (1809.03599v1)

Published 10 Sep 2018 in cs.CL

Abstract: Recent advances in deep neural models allow us to build reliable named entity recognition (NER) systems without handcrafting features. However, such methods require large amounts of manually-labeled training data. There have been efforts on replacing human annotations with distant supervision (in conjunction with external dictionaries), but the generated noisy labels pose significant challenges on learning effective neural models. Here we propose two neural models to suit noisy distant supervision from the dictionary. First, under the traditional sequence labeling framework, we propose a revised fuzzy CRF layer to handle tokens with multiple possible labels. After identifying the nature of noisy labels in distant supervision, we go beyond the traditional framework and propose a novel, more effective neural model AutoNER with a new Tie or Break scheme. In addition, we discuss how to refine distant supervision for better NER performance. Extensive experiments on three benchmark datasets demonstrate that AutoNER achieves the best performance when only using dictionaries with no additional human effort, and delivers competitive results with state-of-the-art supervised benchmarks.

Authors (6)
  1. Jingbo Shang
  2. Liyuan Liu
  3. Xiang Ren
  4. Xiaotao Gu
  5. Teng Ren
  6. Jiawei Han
Citations (204)

Summary

An Insightful Overview of "Learning Named Entity Tagger using Domain-Specific Dictionary"

The paper "Learning Named Entity Tagger using Domain-Specific Dictionary" presents an innovative approach to Named Entity Recognition (NER) utilizing only domain-specific dictionaries. The authors address the limitations of traditional supervised NER, which relies heavily on large, annotated datasets that may be cumbersome to generate, especially in specialized domains. This work aims to alleviate such challenges by leveraging external dictionaries to automatically generate training data through distant supervision.

Key Contributions and Methodology

The paper introduces two novel neural architectures tailored for managing the noisy labels inherent in distant supervision. These architectures are designed to function effectively while using a dictionary as the sole data source for training.

  1. Fuzzy-LSTM-CRF Model:
    • The first proposed model is a modification of the LSTM-CRF architecture, dubbed Fuzzy-LSTM-CRF. It incorporates a fuzzy CRF layer that accommodates the multi-label nature of dictionary-supervised training, together with a modified IOBES tagging scheme that permits a token to carry multiple possible labels (a minimal sketch of the objective follows this list).
    • This architecture is optimized to handle the uncertain boundaries of entities produced by distant supervision.
  2. AutoNER with Tie or Break Scheme:
    • Moving beyond traditional sequence labeling, AutoNER employs a novel "Tie or Break" scheme. This method focuses on predicting whether adjacent tokens are part of the same entity or separate, enhancing robustness against noisy labels.
    • The separation of entity span detection from type prediction distinguishes this model, providing improved noise resilience compared to standard CRF-based models (see the labeling sketch after this list).
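
To make the fuzzy CRF layer concrete, here is a minimal sketch of its training objective, assuming PyTorch and an unbatched, single-sentence setup; the tensor names are illustrative, not the authors' released code. Instead of maximizing the probability of one gold label path, the loss maximizes the total probability of every path consistent with the dictionary-derived label mask over the (modified IOBES) label set.

```python
import torch

def fuzzy_crf_loss(emissions, transitions, allowed):
    """emissions:   (seq_len, num_labels) per-token scores from a BiLSTM
    transitions: (num_labels, num_labels) label-transition scores
    allowed:     (seq_len, num_labels) boolean mask of the labels each
                 token may take under the dictionary match (tokens with
                 no match allow all labels)."""
    neg_inf = torch.tensor(-1e4)

    def forward_score(mask):
        # Standard CRF forward algorithm, restricted to labels in `mask`.
        alpha = torch.where(mask[0], emissions[0], neg_inf)
        for t in range(1, emissions.size(0)):
            scores = alpha.unsqueeze(1) + transitions + emissions[t]
            alpha = torch.where(mask[t],
                                torch.logsumexp(scores, dim=0), neg_inf)
        return torch.logsumexp(alpha, dim=0)

    # -log( total score of allowed paths / total score of all paths )
    return forward_score(torch.ones_like(allowed)) - forward_score(allowed)
```

At inference time, decoding proceeds as in an ordinary CRF (e.g., Viterbi over the full label set).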
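The "Tie or Break" scheme itself can be illustrated with a small labeling routine. This is a sketch under one simplifying assumption: matched entity spans and untyped high-quality phrase spans arrive as (start, end) token-index pairs; the function and variable names are hypothetical.

```python
# Tie: adjacent tokens fall inside the same dictionary-matched entity.
# Unknown: they fall inside an untyped high-quality phrase.
# Break: everything else (an entity boundary between the two tokens).
TIE, BREAK, UNKNOWN = "Tie", "Break", "Unknown"

def tie_or_break_labels(tokens, entity_spans, unknown_spans):
    """entity_spans / unknown_spans: iterables of (start, end) token
    index pairs, end exclusive."""
    def same_span(i, spans):
        # True if tokens i and i+1 both lie inside one span.
        return any(s <= i and i + 1 < e for s, e in spans)

    labels = []
    for i in range(len(tokens) - 1):
        if same_span(i, entity_spans):
            labels.append(TIE)
        elif same_span(i, unknown_spans):
            labels.append(UNKNOWN)
        else:
            labels.append(BREAK)
    return labels

# e.g. tokens = ["prostaglandin", "synthesis", "inhibitors", "block", "it"]
# with entity_spans = [(0, 3)] yields [Tie, Tie, Break, Break]
```

Positions labeled Unknown are simply excluded when computing the loss, which is how the model avoids training on the most uncertain distant labels.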

Experimental Evaluation

The authors evaluate on three benchmark datasets (BC5CDR, NCBI-Disease, and LaptopReview) and show that AutoNER significantly outperforms other distantly supervised models such as SwellShark and Distant-LSTM-CRF. Notably, AutoNER achieves results competitive with state-of-the-art supervised benchmarks without requiring any human annotation beyond the initial dictionary.

Refinement Techniques

To further enhance performance, the authors propose two refinement techniques:

  • Corpus-Aware Dictionary Tailoring: This strategy decreases false positives by restricting the dictionary to entities whose canonical names appear in the corpus, thus maintaining a balance between precision and recall (a filtering sketch follows this list).
  • Incorporation of High-Quality Unknown Phrases: By integrating high-quality, out-of-dictionary phrases identified through phrase mining, the model reduces false negatives and enhances type prediction accuracy.
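
As a concrete illustration of the first refinement, here is a minimal sketch of corpus-aware dictionary tailoring; the (canonical_name, surface_forms) entry format is an assumption made for illustration, not the paper's exact data layout.

```python
def tailor_dictionary(dictionary, corpus_text):
    """Keep only dictionary entries whose canonical name occurs in the
    target corpus, trimming out-of-domain names that would otherwise
    create false-positive distant matches.

    dictionary:  list of (canonical_name, surface_forms) pairs
    corpus_text: the raw target corpus as one string."""
    corpus = corpus_text.lower()
    return [(canonical, surfaces)
            for canonical, surfaces in dictionary
            if canonical.lower() in corpus]
```

The second refinement works in the opposite direction: high-quality phrases mined from the corpus but absent from the tailored dictionary are added as untyped entries, which supply the Unknown spans used by the Tie or Break scheme above.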

Implications and Future Directions

Practically, this research enables scalable NER across specialized domains with minimal manual intervention. Theoretically, it challenges the traditional reliance on heavily annotated data by demonstrating the viability of dictionary-based distant supervision.

Proposed future work includes extending these techniques to multilingual settings and to nested and multi-typed entity recognition. The framework's design also suggests it could be adapted to other sequence labeling tasks.

In conclusion, this paper provides a sophisticated and efficient approach to NER in resource-constrained scenarios, showing promise for broader AI applications where labeled data is sparse or unavailable.
