IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus (2402.14710v3)

Published 22 Feb 2024 in cs.CL, cs.AI, cs.DB, cs.IR, and cs.LG

Abstract: LLMs demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking a standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.
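The schema-based instruction generation mentioned in the abstract can be pictured with a minimal sketch: the full label schema of a dataset is grouped into small chunks (optionally padded with a few negative labels the text does not contain), and each chunk becomes one instruction/answer pair. The chunk size, the negative-sampling step, and the prompt wording below are illustrative assumptions for a toy NER-style record, not the exact format used in IEPile.

```python
import json
import random

def build_instructions(text, gold, full_schema, chunk_size=4, num_negatives=2):
    """Illustrative sketch of schema-based instruction generation.

    text        : source sentence
    gold        : dict mapping schema labels present in `text` to extracted spans
    full_schema : list of all candidate labels (e.g. entity or relation types)
    """
    positives = [s for s in full_schema if s in gold]
    negatives = [s for s in full_schema if s not in gold]
    # Mix a few negative labels with the positives so the model also learns
    # to emit empty outputs for schema items absent from the text (assumed step).
    candidates = positives + random.sample(negatives, min(num_negatives, len(negatives)))
    random.shuffle(candidates)

    instructions = []
    # Split the candidate schema into fixed-size chunks; each chunk yields one
    # instruction/answer pair, keeping prompts short while covering the schema.
    for i in range(0, len(candidates), chunk_size):
        chunk = candidates[i:i + chunk_size]
        prompt = (
            "You are an information extraction assistant. "
            f"Extract the following types from the text: {chunk}.\n"
            f"Text: {text}"
        )
        answer = {label: gold.get(label, []) for label in chunk}
        instructions.append(
            {"instruction": prompt, "output": json.dumps(answer, ensure_ascii=False)}
        )
    return instructions

# Toy usage on a single sentence with a six-label schema
sample = build_instructions(
    text="Marie Curie was born in Warsaw and worked at the University of Paris.",
    gold={
        "person": ["Marie Curie"],
        "location": ["Warsaw"],
        "organization": ["University of Paris"],
    },
    full_schema=["person", "location", "organization", "date", "event", "product"],
)
print(json.dumps(sample, indent=2, ensure_ascii=False))
```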

Authors (7)
  1. Honghao Gui (8 papers)
  2. Hongbin Ye (16 papers)
  3. Lin Yuan (37 papers)
  4. Ningyu Zhang (148 papers)
  5. Mengshu Sun (41 papers)
  6. Lei Liang (37 papers)
  7. Huajun Chen (198 papers)