Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction (2404.03868v2)

Published 5 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: In this work, we are interested in automated methods for knowledge graph creation (KGC) from input text. Progress on LLMs has prompted a series of recent works applying them to KGC, e.g., via zero/few-shot prompting. Despite successes on small domain-specific datasets, these models face difficulties scaling up to text common in many real-world applications. A principal issue is that, in prior methods, the KG schema has to be included in the LLM prompt to generate valid triplets; larger and more complex schemas easily exceed the LLMs' context window length. Furthermore, there are scenarios where a fixed pre-defined schema is not available and we would like the method to construct a high-quality KG with a succinct self-generated schema. To address these problems, we propose a three-phase framework named Extract-Define-Canonicalize (EDC): open information extraction followed by schema definition and post-hoc canonicalization. EDC is flexible in that it can be applied to settings where a pre-defined target schema is available and when it is not; in the latter case, it constructs a schema automatically and applies self-canonicalization. To further improve performance, we introduce a trained component that retrieves schema elements relevant to the input text; this improves the LLMs' extraction performance in a retrieval-augmented generation-like manner. We demonstrate on three KGC benchmarks that EDC is able to extract high-quality triplets without any parameter tuning and with significantly larger schemas compared to prior works. Code for EDC is available at https://github.com/clear-nus/edc.
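
The abstract describes EDC's three phases only at a high level; the sketch below shows one way such a pipeline could be wired together. It is an illustration under stated assumptions, not the authors' released implementation (see the linked repository for that): `call_llm`, `embed`, and the line-based parsing of LLM output are hypothetical placeholders.

```python
# Minimal sketch of the Extract-Define-Canonicalize (EDC) flow, assuming a
# user-supplied LLM completion function and sentence-embedding function.
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)

def extract(text: str, call_llm: Callable[[str], str]) -> List[Triplet]:
    """Phase 1: open information extraction with a schema-free prompt."""
    prompt = f"Extract (subject, relation, object) triplets from:\n{text}"
    raw = call_llm(prompt)
    # Output parsing is application-specific; assume one
    # "subject | relation | object" triplet per line here.
    return [tuple(p.strip() for p in line.split("|"))
            for line in raw.splitlines() if line.count("|") == 2]

def define(triplets: List[Triplet],
           call_llm: Callable[[str], str]) -> Dict[str, str]:
    """Phase 2: obtain a natural-language definition for each extracted relation."""
    return {r: call_llm(f"Define the relation '{r}' in one sentence.")
            for _, r, _ in triplets}

def canonicalize(triplets: List[Triplet], definitions: Dict[str, str],
                 schema: List[str],
                 embed: Callable[[str], List[float]]) -> List[Triplet]:
    """Phase 3: map each relation to its closest counterpart in a target schema
    via embedding similarity; in the schema-free setting, unmatched relations
    would instead be merged into a growing self-generated schema."""
    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb + 1e-9)

    schema_vecs = {rel: embed(rel) for rel in schema}
    canonical = []
    for s, r, o in triplets:
        vec = embed(definitions.get(r, r))
        best = max(schema, key=lambda rel: cosine(vec, schema_vecs[rel]))
        canonical.append((s, best, o))
    return canonical
```

The paper additionally trains a retriever that selects schema elements relevant to the input text before extraction, analogous to retrieval-augmented generation; that component is omitted from this sketch.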

Authors (2)
  1. Bowen Zhang (161 papers)
  2. Harold Soh (54 papers)
Citations (5)