
Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition (2406.11192v1)

Published 17 Jun 2024 in cs.CL

Abstract: Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for LLMs. Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets faces issues due to inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain generalization. To address this, we present B2NERD, a cohesive and efficient dataset for Open NER, normalized from 54 existing English or Chinese datasets using a two-step approach. First, we detect inconsistent entity definitions across datasets and disambiguate them with distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly improves LLMs' generalization on Open NER. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods on 3 out-of-domain benchmarks across 15 datasets and 6 languages.
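
The abstract does not specify how the pruning step is implemented, so the following is only an illustrative sketch of the general idea of selecting fewer samples with greater category and semantic diversity. Everything in it is an assumption: the function names `jaccard` and `prune`, the parameters `per_type` and `max_sim`, the toy sentences, and the use of token overlap as a crude stand-in for semantic similarity (a real system would more plausibly compare sentence embeddings). It is not the authors' method.

```python
from collections import defaultdict


def jaccard(a, b):
    """Token-set overlap; a crude, hypothetical proxy for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def prune(samples, per_type=2, max_sim=0.5):
    """Greedy diversity-aware selection (illustrative assumption only).

    A sample is kept if, for at least one of its entity types, that type
    still has fewer than `per_type` recorded samples and the text is not
    a near-duplicate (similarity above `max_sim`) of any text already
    recorded for that type.
    """
    kept = []
    kept_by_type = defaultdict(list)
    for text, types in samples:
        useful = [
            t for t in types
            if len(kept_by_type[t]) < per_type
            and all(jaccard(text, k) <= max_sim for k in kept_by_type[t])
        ]
        if useful:
            kept.append((text, types))
            # Record the text under every type it carries.
            for t in types:
                kept_by_type[t].append(text)
    return kept


# Toy data: (sentence, set of entity types it contains). Hypothetical.
samples = [
    ("Apple opened a store in Paris", {"company", "city"}),
    ("Apple opened a store in Berlin", {"company", "city"}),
    ("Aspirin relieves mild headaches", {"drug"}),
    ("Google hired engineers in Tokyo last year", {"company", "city"}),
]

kept = prune(samples)
for text, types in kept:
    print(sorted(types), text)
```

On this toy input the second sentence is dropped because it nearly duplicates the first for both of its types, while the other three are retained. The greedy pass depends on input order, which is one of several simplifications a real pruning pipeline would need to revisit.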

Authors (14)
  1. Yuming Yang (14 papers)
  2. Wantong Zhao (1 paper)
  3. Caishuang Huang (13 papers)
  4. Junjie Ye (66 papers)
  5. Xiao Wang (507 papers)
  6. Huiyuan Zheng (10 papers)
  7. Yang Nan (40 papers)
  8. Yuran Wang (17 papers)
  9. Xueying Xu (1 paper)
  10. Kaixin Huang (4 papers)
  11. Yunke Zhang (18 papers)
  12. Tao Gui (127 papers)
  13. Qi Zhang (784 papers)
  14. Xuanjing Huang (287 papers)
Citations (1)