KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction (2403.07969v2)

Published 12 Mar 2024 in cs.LG and cs.AI

Abstract: In this paper, we propose KnowCoder, an LLM for Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these goals, KnowCoder introduces a code-style schema representation method that uniformly transforms different schemas into Python classes, with which complex schema information, such as constraints among tasks in UIE, can be captured in an LLM-friendly manner. We further construct a code-style schema library covering over $\textbf{30,000}$ types of knowledge, which, to the best of our knowledge, is the largest for UIE. To ease the learning process of LLMs, KnowCoder adopts a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning. After code pretraining on around $1.5$B automatically constructed data, KnowCoder already attains remarkable generalization ability, achieving a relative improvement of $\textbf{49.8}\%$ F1 over LLaMA2 under the few-shot setting. After instruction tuning, KnowCoder further exhibits strong generalization on unseen schemas, with relative improvements of up to $\textbf{12.5}\%$ and $\textbf{21.9}\%$ over state-of-the-art baselines under the zero-shot and low-resource settings, respectively. Additionally, based on our unified schema representations, various human-annotated datasets can be utilized simultaneously to refine KnowCoder, yielding significant improvements of up to $\textbf{7.5}\%$ under the supervised setting.
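
The abstract's central idea, representing extraction schemas as Python classes, can be illustrated with a minimal sketch. The class names below (Person, Organization, WorksFor) and their field layout are hypothetical examples, not the paper's actual schema library; they only show how an extraction schema, including type constraints among UIE tasks, can be encoded as code that an LLM reads and then instantiates.

```python
# Minimal sketch of a code-style schema representation (hypothetical types):
# each entity/relation type is declared as a Python class, so the schema
# itself is LLM-readable code and extractions are class instantiations.
from dataclasses import dataclass

@dataclass
class Entity:
    """Base class for all entity types; holds the extracted text span."""
    span: str

@dataclass
class Person(Entity):
    """A person mentioned in the input text."""

@dataclass
class Organization(Entity):
    """An organization mentioned in the input text."""

@dataclass
class WorksFor:
    """Relation schema: the typed arguments encode the constraint that
    the head must be a Person and the tail an Organization."""
    head: Person
    tail: Organization

# The model would be prompted with class definitions like these plus an
# input sentence, and asked to generate instantiations such as:
results = [WorksFor(head=Person(span="Marie Curie"),
                    tail=Organization(span="Sorbonne"))]
print(results)
```

Encoding the schema this way lets a single representation cover entities, relations, and events, and makes constraint checking as simple as type checking the generated instantiations.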

Authors (17)
  1. Zixuan Li (63 papers)
  2. Yutao Zeng (18 papers)
  3. Yuxin Zuo (11 papers)
  4. Weicheng Ren (2 papers)
  5. Wenxuan Liu (28 papers)
  6. Miao Su (4 papers)
  7. Yucan Guo (4 papers)
  8. Yantao Liu (13 papers)
  9. Xiang Li (1003 papers)
  10. Zhilei Hu (4 papers)
  11. Long Bai (87 papers)
  12. Wei Li (1122 papers)
  13. Yidan Liu (4 papers)
  14. Pan Yang (11 papers)
  15. Xiaolong Jin (38 papers)
  16. Jiafeng Guo (161 papers)
  17. Xueqi Cheng (274 papers)
Citations (13)