Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction (2404.03868v2)
Abstract: In this work, we are interested in automated methods for knowledge graph creation (KGC) from input text. Progress on LLMs has prompted a series of recent works applying them to KGC, e.g., via zero/few-shot prompting. Despite successes on small domain-specific datasets, these models face difficulties scaling up to text common in many real-world applications. A principal issue is that, in prior methods, the KG schema has to be included in the LLM prompt to generate valid triplets; larger and more complex schemas easily exceed the LLMs' context window length. Furthermore, there are scenarios where a fixed pre-defined schema is not available and we would like the method to construct a high-quality KG with a succinct self-generated schema. To address these problems, we propose a three-phase framework named Extract-Define-Canonicalize (EDC): open information extraction followed by schema definition and post-hoc canonicalization. EDC is flexible in that it can be applied to settings where a pre-defined target schema is available and when it is not; in the latter case, it constructs a schema automatically and applies self-canonicalization. To further improve performance, we introduce a trained component that retrieves schema elements relevant to the input text; this improves the LLMs' extraction performance in a retrieval-augmented generation-like manner. We demonstrate on three KGC benchmarks that EDC is able to extract high-quality triplets without any parameter tuning and with significantly larger schemas compared to prior works. Code for EDC is available at https://github.com/clear-nus/edc.
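The three phases described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The function `edc`, the `extract` and `define` callables, the token-overlap similarity, the `threshold` value, and the toy data are all hypothetical stand-ins. In the framework itself the first two phases would be LLM calls, and the trained schema retriever is omitted here.

```python
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a toy stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def edc(
    text: str,
    extract: Callable[[str], List[Triplet]],  # hypothetical open-IE step
    define: Callable[[str], str],             # hypothetical definition step
    schema: Dict[str, str],                   # relation -> definition
    fixed_schema: bool,
    threshold: float = 0.5,                   # assumed cut-off, not from the paper
) -> List[Triplet]:
    result: List[Triplet] = []
    # Phase 1 (Extract): open information extraction, no schema in the prompt.
    for subj, rel, obj in extract(text):
        # Phase 2 (Define): describe the extracted relation in natural language.
        definition = define(rel)
        # Phase 3 (Canonicalize): find the closest existing schema relation.
        best = max(schema, key=lambda r: jaccard(definition, schema[r]), default=None)
        if best is not None and jaccard(definition, schema[best]) >= threshold:
            result.append((subj, best, obj))
        elif not fixed_schema:
            # Self-canonicalization: no close match, so the schema grows.
            schema[rel] = definition
            result.append((subj, rel, obj))
        # Assumption: with a fixed target schema, unmatched triplets are dropped.
    return result


# Toy stand-ins for the LLM-backed steps (hypothetical data).
TOY_DEFINITIONS = {
    "was born in": "the subject was born in the place given by the object",
    "birthplace": "the place in which the subject was born",
}


def toy_extract(text: str) -> List[Triplet]:
    return [
        ("Alan Turing", "was born in", "London"),
        ("Alan Turing", "birthplace", "London"),
    ]


schema: Dict[str, str] = {}
triplets = edc("Alan Turing was born in London.", toy_extract,
               TOY_DEFINITIONS.__getitem__, schema, fixed_schema=False)
```

With an empty starting schema, the first relation seeds the schema and the second is merged into it because their definitions overlap. Passing a pre-populated `schema` with `fixed_schema=True` corresponds to the setting where a target schema is available.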
Authors: Bowen Zhang, Harold Soh