Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains (2401.13129v2)

Published 23 Jan 2024 in cs.CL and cs.SE

Abstract: Accurately typing entity mentions from text segments is a fundamental task for various natural language processing applications. Many previous approaches rely on massive human-annotated data to perform entity typing. Nevertheless, collecting such data in highly specialized science and engineering domains (e.g., software engineering and security) can be time-consuming and costly, not to mention the domain gap between training and inference data if the model needs to be applied to confidential datasets. In this paper, we study the task of seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types (i.e., those without seed entities). To solve this problem, we propose SEType, which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus using the contextualized representations of pre-trained language models. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types. Extensive experiments on two datasets covering four domains demonstrate the effectiveness of SEType in comparison with various baselines.
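The three-stage pipeline described in the abstract (seed-set enrichment, pseudo-labeling, entailment-based typing) can be sketched in simplified form. This is a minimal illustration, not the authors' implementation: the toy embedding table and entity names below are hypothetical stand-ins for the contextualized representations from a pre-trained language model, and the hypothesis template is one plausible verbalization, not necessarily the one used in the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_seed_set(seeds, candidates, embed, top_k=1):
    """Step 1: enrich a seen type's seed set by ranking unlabeled candidate
    entities by their average embedding similarity to the seed entities."""
    scored = []
    for c in candidates:
        if c in seeds:
            continue
        score = np.mean([cosine(embed[c], embed[s]) for s in seeds])
        scored.append((c, score))
    scored.sort(key=lambda x: -x[1])
    return [c for c, _ in scored[:top_k]]

def pseudo_label(sentences, typed_entities):
    """Step 2: match enriched entities against unlabeled text to obtain
    (sentence, entity, type) pseudo-labeled training samples."""
    samples = []
    for sent in sentences:
        for ent, etype in typed_entities.items():
            if ent in sent:
                samples.append((sent, ent, etype))
    return samples

def entailment_hypothesis(entity, type_name):
    """Step 3: typing is cast as textual entailment. At inference time the
    premise is the sentence and the hypothesis verbalizes a candidate type,
    so unseen types (those with no seeds) can be scored by the same model."""
    return f"{entity} is a {type_name}."

# Toy example with hypothetical 2-d "embeddings" standing in for PLM vectors.
embed = {
    "Linux": np.array([1.0, 0.1]),
    "Windows": np.array([0.9, 0.2]),
    "Ubuntu": np.array([0.95, 0.15]),
    "TensorFlow": np.array([0.1, 1.0]),
}
expanded = expand_seed_set(["Linux", "Windows"], list(embed), embed, top_k=1)
samples = pseudo_label(
    ["We deploy on Ubuntu servers."],
    {"Ubuntu": "operating system"},
)
```

With these made-up vectors, the expansion step pulls "Ubuntu" (near the OS seeds) into the operating-system type rather than "TensorFlow", and the matched sentence becomes a pseudo-labeled sample that a real system would feed to an entailment model such as one fine-tuned on NLI data.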

Authors (8)
  1. Yu Zhang
  2. Yunyi Zhang
  3. Yanzhen Shen
  4. Yu Deng
  5. Lucian Popa
  6. Larisa Shwartz
  7. ChengXiang Zhai
  8. Jiawei Han