Mirror: A Universal Framework for Various Information Extraction Tasks (2311.05419v2)

Published 9 Nov 2023 in cs.CL and cs.AI

Abstract: Sharing knowledge between information extraction tasks has always been a challenge due to the diverse data formats and task variations. Meanwhile, this divergence leads to information waste and increases difficulties in building complex applications in real scenarios. Recent studies often formulate IE tasks as a triplet extraction problem. However, such a paradigm does not support multi-span and n-ary extraction, leading to weak versatility. To this end, we reorganize IE problems into unified multi-slot tuples and propose a universal framework for various IE tasks, namely Mirror. Specifically, we recast existing IE tasks as a multi-span cyclic graph extraction problem and devise a non-autoregressive graph decoding algorithm to extract all spans in a single step. It is worth noting that this graph structure is incredibly versatile, and it supports not only complex IE tasks, but also machine reading comprehension and classification tasks. We manually construct a corpus containing 57 datasets for model pretraining, and conduct experiments on 30 datasets across 8 downstream tasks. The experimental results demonstrate that our model has decent compatibility and outperforms or reaches competitive performance with SOTA systems under few-shot and zero-shot settings. The code, model weights, and pretraining corpus are available at https://github.com/Spico197/Mirror .

Authors (10)
  1. Tong Zhu
  2. Junfei Ren
  3. Zijian Yu
  4. Mengsong Wu
  5. Guoliang Zhang
  6. Xiaoye Qu
  7. Wenliang Chen
  8. Zhefeng Wang
  9. Baoxing Huai
  10. Min Zhang

Summary

An Overview of Mirror: A Universal Framework for Information Extraction

The paper "Mirror: A Universal Framework for Various Information Extraction Tasks" presents an innovative approach to unifying Information Extraction (IE) tasks through a proposed framework called Mirror. The authors address the prevalent issue within the NLP domain of handling diverse data formats and task variations, which have traditionally limited efficiency and versatility in building comprehensive information extraction systems. This paper introduces a method that reformulates these tasks into a unified paradigm that employs multi-slot tuples and cyclic graphs, thereby enhancing the scope and adaptability of IE models.

Key Contributions

  1. Unified Multi-Slot Tuple Framework: The authors represent diverse IE tasks as multi-slot tuple extraction problems. Each tuple is rendered as a multi-span cyclic graph, which is decoded with a non-autoregressive algorithm. A single framework thereby covers tasks ranging from complex entity recognition to relation extraction and machine reading comprehension (a data-structure sketch follows this list).
  2. Non-Autoregressive Decoding: Mirror's graph decoding algorithm extracts all spans for a task in a single step rather than generating them token by token. This design substantially improves inference speed while keeping accuracy competitive with state-of-the-art models, particularly in few-shot and zero-shot settings (see the decoding sketch after this list).
  3. Corpus Construction for Pretraining: A corpus of 57 manually assembled datasets, spanning 8 types of downstream tasks, is used for model pretraining. This resource helps the model generalize across information extraction tasks and underpins its few-shot and zero-shot performance.
  4. Performance Evaluation: The framework was evaluated on 30 datasets covering 8 tasks. Mirror matches or surpasses state-of-the-art systems across a range of settings, including traditionally difficult tasks such as multi-span discontinuous Named Entity Recognition (NER) and n-ary relation extraction.
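
To make the multi-slot tuple abstraction concrete, here is a minimal sketch of how an n-ary tuple could be linked into a span cycle. The `Span` class, the helper name, and the example indices are illustrative assumptions for exposition, not Mirror's actual data structures:

```python
# Illustrative sketch: an n-ary tuple becomes a cycle of directed edges over
# its slot spans, which is what makes multi-span and n-ary extraction possible.
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first token in the span
    end: int    # index of the last token (inclusive)


def tuple_to_cycle(slots):
    """Link consecutive slot spans and close the loop back to the first slot,
    turning an n-slot tuple into a cycle of n directed edges."""
    return [(slots[i], slots[(i + 1) % len(slots)]) for i in range(len(slots))]


# A binary relation such as (subject, relation, object) yields a 3-edge cycle
# once each slot is anchored to a span in the input or the schema prompt.
subject, relation, obj = Span(0, 0), Span(5, 6), Span(8, 8)
edges = tuple_to_cycle([subject, relation, obj])
# -> [(Span(0,0), Span(5,6)), (Span(5,6), Span(8,8)), (Span(8,8), Span(0,0))]
```

Because the structure is a cycle rather than a fixed-arity triplet, the same representation accommodates n-ary and multi-span outputs, which triplet-only formulations cannot express.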
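
The non-autoregressive decoding idea can be sketched in the same spirit: score every token-pair link in one parallel pass, threshold, and then walk the resulting graph to recover cycles. The biaffine-style scorer and the depth-first cycle walk below are stand-ins chosen for clarity; the paper's actual decoder also enforces schema constraints that this sketch omits:

```python
# Minimal sketch of one-step graph decoding, assuming a biaffine-style scorer.
# All pairwise link scores are computed in a single parallel step, so no
# token-by-token autoregressive loop is needed.
import torch


def decode_links(hidden, weight, threshold=0.0):
    """hidden: (seq_len, dim) token representations; weight: (dim, dim).
    Returns the set of predicted directed links (i, j)."""
    scores = hidden @ weight @ hidden.T        # (seq_len, seq_len) in one pass
    return {(int(i), int(j)) for i, j in (scores > threshold).nonzero()}


def recover_cycles(links, start, max_len=4):
    """Depth-first walk collecting simple paths that close back on `start`;
    each closed path is a candidate multi-slot tuple."""
    cycles, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        for i, j in links:
            if i != node:
                continue
            if j == start and len(path) > 1:
                cycles.append(path)            # found a closed tuple cycle
            elif j not in path and len(path) < max_len:
                stack.append((j, path + [j]))
    return cycles
```

Decoding then amounts to one forward pass plus a cheap graph walk, which is consistent with the large speed-ups over autoregressive baselines reported in the paper.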

Results and Comparisons

The experiments show that Mirror achieves competitive performance on traditional IE tasks. For instance, the model reports an F1 score of up to 94.25 on the NYT dataset, indicating robust relation extraction. Notably, Mirror's non-autoregressive design yields large speed advantages over autoregressive models, with a reported speed-up of more than 30x on the CoNLL03 dataset.

The few-shot and zero-shot analyses further highlight Mirror's ability to learn representations that transfer to new, unseen domains, a prerequisite for quickly adapting to real-world applications.

Implications and Future Directions

By framing IE tasks in a unified and extensible manner, Mirror offers practical benefits for building NLP applications that are both faster and scalable across domains and languages. This unification can reduce the computational resources required for deployment and broaden applicability across computational environments and platforms.

Beyond performance metrics, the Mirror framework also opens pathways for further research, such as integrating multi-modal inputs and developing more nuanced schema-guided extraction methods. These directions point toward deeper semantic understanding in natural language processing and, in turn, stronger AI comprehension capabilities.

In sum, "Mirror: A Universal Framework for Various Information Extraction Tasks" makes a significant contribution to NLP by advancing universal IE systems with the flexibility, speed, and efficiency that are pivotal for the next generation of intelligent information processing systems.
