A Unified Taxonomy-Guided Instruction Tuning Framework for Entity Set Expansion and Taxonomy Expansion (2402.13405v4)
Abstract: Entity set expansion, taxonomy expansion, and seed-guided taxonomy construction are three representative tasks that can be applied to automatically populate an existing taxonomy with emerging concepts. Previous studies treat them as three separate tasks, so their proposed techniques usually work for one specific task only, lacking generalizability and a holistic perspective. In this paper, we aim for a unified solution to the three tasks. Specifically, we identify two common skills needed for entity set expansion, taxonomy expansion, and seed-guided taxonomy construction: finding "siblings" and finding "parents". We propose a taxonomy-guided instruction tuning framework to teach an LLM to generate siblings and parents for query entities, where the joint pre-training process facilitates the mutual enhancement of the two skills. Extensive experiments on multiple benchmark datasets demonstrate the efficacy of our proposed TaxoInstruct framework, which outperforms task-specific baselines across all three tasks.
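To make the two skills concrete, the following is a minimal sketch of how an existing taxonomy could be turned into instruction-tuning examples for "find the parent" and "find siblings". It is an illustration under stated assumptions, not the paper's implementation: the toy taxonomy, the prompt wording, the record fields, and the helper names `siblings` and `build_examples` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical toy taxonomy (child -> parent); not taken from the paper's datasets.
PARENT_OF: Dict[str, str] = {
    "machine learning": "computer science",
    "databases": "computer science",
    "computer science": "science",
    "physics": "science",
}


def siblings(entity: str, parent_of: Dict[str, str]) -> List[str]:
    """Return the other entities that share the same parent as `entity`."""
    parent = parent_of.get(entity)
    if parent is None:
        return []
    return sorted(e for e, p in parent_of.items() if p == parent and e != entity)


def build_examples(parent_of: Dict[str, str]) -> List[Dict[str, str]]:
    """Build one 'parent' example per entity and one 'siblings' example
    per entity that has at least one sibling. Prompt text is an assumption."""
    examples: List[Dict[str, str]] = []
    for entity, parent in sorted(parent_of.items()):
        examples.append({
            "instruction": "Given an entity, name its parent concept in the taxonomy.",
            "input": entity,
            "output": parent,
        })
        sibs = siblings(entity, parent_of)
        if sibs:
            examples.append({
                "instruction": "Given a parent concept and a seed entity, list sibling entities.",
                "input": f"parent: {parent}; seed: {entity}",
                "output": ", ".join(sibs),
            })
    return examples


if __name__ == "__main__":
    for example in build_examples(PARENT_OF):
        print(example)
```

Mixing both kinds of examples in one training set is one simple way to picture the joint training the abstract describes; how the actual framework formats prompts and combines the two skills is not specified here.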