The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation (2306.06918v2)

Published 12 Jun 2023 in cs.CL and cs.AI

Abstract: Event extraction (EE) is a crucial task aiming at extracting events from texts, which includes two subtasks: event detection (ED) and event argument extraction (EAE). In this paper, we check the reliability of EE evaluations and identify three major pitfalls: (1) The data preprocessing discrepancy makes the evaluation results on the same dataset not directly comparable, but the data preprocessing details are not widely noted and specified in papers. (2) The output space discrepancy of different model paradigms makes different-paradigm EE models lack grounds for comparison and also leads to unclear mapping issues between predictions and annotations. (3) The absence of pipeline evaluation of many EAE-only works makes them hard to be directly compared with EE works and may not well reflect the model performance in real-world pipeline scenarios. We demonstrate the significant influence of these pitfalls through comprehensive meta-analyses of papers and empirical experiments. To avoid these pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. To help implement these remedies, we develop a consistent evaluation framework OMNIEVENT, which can be obtained from https://github.com/THU-KEG/OmniEvent.


Summary

  • The paper reveals that variations in data preprocessing lead to non-comparable results in event extraction studies.
  • The paper demonstrates that inconsistent output spaces from different modeling paradigms create evaluation challenges.
  • The paper advocates for standardized pipeline evaluations using the OmniEvent framework to ensure realistic performance benchmarks.

Analysis of "The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation"

The paper "The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation" critically examines how event extraction (EE) systems are evaluated. It argues that reported results are often not directly comparable, for three reasons: discrepancies in data preprocessing, discrepancies in the output spaces of different model paradigms, and the absence of pipeline evaluation in many argument extraction studies.

Key Pitfalls in EE Evaluation

1. Data Preprocessing Discrepancy:

The paper finds that differences in data preprocessing make evaluation results on the same dataset non-comparable. EE datasets have complex, heterogeneous formats involving triggers, arguments, and entities, so different preprocessing scripts can yield noticeably different data. The authors report significant differences in dataset statistics (for example on ACE 2005) depending on which preprocessing script is used. They also note that most EE papers do not specify their preprocessing steps, which undermines both reproducibility and comparability.
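One simple way to surface this kind of discrepancy is to compute basic statistics over each preprocessed version of a dataset and compare them. The following is a minimal sketch, not the authors' tooling: the record layout (a list of sentences, each with an `events` list holding a `trigger`, a `type`, and `arguments`) and the two toy variants are assumptions made purely for illustration.

```python
from collections import Counter


def dataset_stats(sentences):
    """Count sentences, triggers, and arguments in one preprocessed split.

    `sentences` uses a hypothetical layout: each item is a dict with an
    "events" list, and each event has "trigger", "type", and "arguments".
    """
    stats = Counter()
    for sent in sentences:
        stats["sentences"] += 1
        for event in sent["events"]:
            stats["triggers"] += 1
            stats["arguments"] += len(event["arguments"])
    return dict(stats)


def diff_stats(a, b):
    """Return {name: (value_in_a, value_in_b)} for statistics that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k, 0), b.get(k, 0)) for k in keys if a.get(k, 0) != b.get(k, 0)}


# Toy data (illustrative only): the same sentence as produced by two
# hypothetical preprocessing scripts. Variant B drops the Time argument.
variant_a = [{
    "events": [{
        "trigger": (1, 3), "type": "Attack",
        "arguments": [((0, 1), "Attacker"), ((4, 5), "Time")],
    }],
}]
variant_b = [{
    "events": [{
        "trigger": (1, 3), "type": "Attack",
        "arguments": [((0, 1), "Attacker")],
    }],
}]

stats_a = dataset_stats(variant_a)
stats_b = dataset_stats(variant_b)
print(diff_stats(stats_a, stats_b))
```

Reporting such statistics alongside results lets readers check whether two papers actually evaluated on the same data.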

2. Output Space Discrepancy:

The paper highlights inconsistent output spaces across EE models built on different modeling paradigms, such as classification, sequence labeling, and conditional generation. Because each paradigm produces predictions in a different form, scores computed directly on those outputs are not comparable across paradigms. The problem is compounded by unclear rules for mapping predictions to annotations, and the choice of mapping can significantly alter evaluation outcomes.
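To make the issue concrete, the sketch below converts outputs from two paradigms into one shared representation, a set of `(start, end, type)` spans, before scoring. It is an illustrative assumption about how such alignment could be done, not the paper's exact procedure; the function names, the `"NA"` label for negative candidates, and the toy predictions are all hypothetical.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end_exclusive, type) spans."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def classified_to_spans(candidates, labels, negative="NA"):
    """Convert per-candidate classification output into the same span format."""
    return {(s, e, lab) for (s, e), lab in zip(candidates, labels) if lab != negative}


def span_f1(pred, gold):
    """Micro F1 over exact-match (start, end, type) spans."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy example: one sentence, two gold triggers.
gold = {(1, 3, "Attack"), (4, 5, "Die")}

# Sequence labeling output: one tag per token.
sl_pred = bio_to_spans(["O", "B-Attack", "I-Attack", "O", "B-Die"])

# Classification output: one label per candidate span.
cls_pred = classified_to_spans([(1, 3), (4, 5)], ["Attack", "NA"])

print(span_f1(sl_pred, gold), span_f1(cls_pred, gold))
```

Once both models are scored on the same span-level output space, their F1 values refer to the same units and can be compared directly.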

3. Absence of Pipeline Evaluation:

The authors underscore a gap between event detection (ED) and event argument extraction (EAE) research: many EAE studies evaluate their systems with gold triggers and therefore ignore errors propagated from the ED stage. Such evaluations may not reflect real-world scenarios, where triggers are predicted rather than given, and they make EAE-only results hard to compare with full EE systems.
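The difference between the two settings can be sketched with a small scoring function. In this illustration, an argument counts as correct only if its trigger span, event type, argument span, and role all match the gold annotation; this matching criterion, the record layout, and the toy predictions are assumptions for the example rather than the paper's exact protocol.

```python
def f1(pred, gold):
    """Micro F1 between two sets of exact-match items."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def argument_items(events):
    """Flatten events into (trigger_span, event_type, arg_span, role) tuples."""
    return {
        (ev["trigger"], ev["type"], span, role)
        for ev in events
        for span, role in ev["arguments"]
    }


gold_events = [
    {"trigger": (2, 3), "type": "Attack",
     "arguments": [((0, 1), "Attacker"), ((4, 5), "Target")]},
    {"trigger": (7, 8), "type": "Die",
     "arguments": [((4, 5), "Victim"), ((9, 10), "Place")]},
]

# Setting 1: the EAE model is given gold triggers and (in this toy case)
# labels every argument correctly.
preds_with_gold_triggers = gold_events

# Setting 2: triggers come from an ED model. The second trigger span is
# wrong, so its arguments cannot match any gold item.
preds_with_predicted_triggers = [
    {"trigger": (2, 3), "type": "Attack",
     "arguments": [((0, 1), "Attacker"), ((4, 5), "Target")]},
    {"trigger": (6, 7), "type": "Die",
     "arguments": [((4, 5), "Victim")]},
]

gold_items = argument_items(gold_events)
gold_trigger_f1 = f1(argument_items(preds_with_gold_triggers), gold_items)
pipeline_f1 = f1(argument_items(preds_with_predicted_triggers), gold_items)
print(gold_trigger_f1, pipeline_f1)
```

In this toy case the gold-trigger score is perfect while the pipeline score is noticeably lower, even though the argument model itself behaves identically; the gap comes entirely from trigger errors.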

Proposed Remedies

The paper proposes remedies to address these pitfalls:

  • Specifying Data Preprocessing: Advocating for standardized preprocessing methods and increased transparency about data handling in EE research to enhance result comparability.
  • Standardizing Outputs: Introducing a method to align output spaces across different paradigms, helping to ensure consistency in evaluation metrics.
  • Providing Pipeline Evaluation Results: Encouraging the inclusion of pipeline evaluations in EAE studies to assess system performance under realistic conditions.

OmniEvent Framework

To support the adoption of these remedies, the authors developed OmniEvent, a consistent evaluation framework available at https://github.com/THU-KEG/OmniEvent. The framework provides preprocessing scripts for widely used datasets, standardizes model outputs across paradigms, and releases predicted triggers so that future EAE work can run consistent pipeline evaluations.

Implications and Future Directions

For the EE community, the main implication is that benchmark numbers become more reliable once these pitfalls are addressed. Consistent preprocessing, a shared output space, and pipeline evaluation make it possible to compare models from different paradigms fairly, which in turn makes genuine modeling improvements easier to identify.

Future research might extend the scope of this investigation to emerging datasets and languages, further refining evaluation consistency. Additionally, exploring methods to incorporate more complex, real-world scenarios into evaluations could enhance the robustness of EE systems.

In summary, this paper serves as a critical resource for improving EE evaluation methodologies, advocating for clearer benchmarks and reproducibility in the field.
