Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction
This paper presents Doc2EDAG, a framework for document-level event extraction (DEE) in domains such as finance, where the arguments of an event are often scattered across sentences and several events can coexist in one document. The authors move from sentence-level event extraction to a document-level formulation and build a model that uses an entity-based directed acyclic graph (EDAG) to perform DEE end to end.
Key Contributions
- Entity-Based Directed Acyclic Graph (EDAG): The authors recast the event table as an EDAG, which turns the hard task of table filling into a series of sequential path-expanding sub-tasks. This simplifies extraction and helps the model capture arguments that are dispersed across a document.
- No-Trigger-Words Design: By dropping the dependence on trigger words for event detection, the authors reformulate DEE as directly filling event tables. This makes document-level event labeling through distant supervision (DS) easier, because no predefined trigger-word set is required.
- Document-Level Entity Encoding: To address the arguments-scattering challenge, Doc2EDAG encodes entities with document-level context, ensuring that the model considers the full context of the document in which an entity appears, rather than limiting the scope to individual sentences.
- Memory Mechanism: A memory mechanism supports path expansion by keeping a record of the entities already extracted, which helps the model handle multi-event documents and scattered arguments.
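To make the path-expanding idea concrete, here is a minimal Python sketch, not the authors' implementation. It assumes an event record is a tuple with one argument per role in a fixed order, with `None` standing for a missing (NA) argument. The names `build_edag`, `decode`, `gold_oracle`, and the sample entities are hypothetical. The graph is stored as a prefix tree, which is one simple kind of DAG, and the learned expansion classifier with its memory is replaced by an oracle that looks at the gold records.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# One argument per role, in a fixed role order; None marks an NA argument.
Record = Tuple[Optional[str], ...]


def build_edag(records: Sequence[Record]) -> Dict:
    """Merge records into a prefix graph: each root-to-leaf path is one record."""
    root: Dict = {}
    for record in records:
        node = root
        for argument in record:
            # Records that share a prefix of arguments share the same nodes.
            node = node.setdefault(argument, {})
    return root


def decode(
    entities: Sequence[str],
    n_roles: int,
    should_expand: Callable[[Record, int, str], bool],
) -> List[Record]:
    """Fill the table role by role, expanding every partial path in turn."""
    paths: List[Record] = [()]
    for role in range(n_roles):
        next_paths: List[Record] = []
        for path in paths:
            # One binary decision per candidate entity, conditioned on the path
            # so far (a stand-in for the memory of already extracted entities).
            chosen = [e for e in entities if should_expand(path, role, e)]
            # If no entity fits the role, extend the path with NA instead.
            for argument in chosen or [None]:
                next_paths.append(path + (argument,))
        paths = next_paths
    return paths


# Illustrative data: two events in one document that share their first argument.
records: List[Record] = [
    ("Alice", "100 shares", "Bank B", None),
    ("Alice", "200 shares", "Bank C", "2019-01-01"),
]
entities = ["Alice", "100 shares", "200 shares", "Bank B", "Bank C", "2019-01-01"]


def gold_oracle(path: Record, role: int, entity: str) -> bool:
    """Hypothetical stand-in for the learned classifier: consult gold records."""
    return any(r[:role] == path and r[role] == entity for r in records)


edag = build_edag(records)
decoded = decode(entities, 4, gold_oracle)
```

In this sketch, `decode` recovers both records because each decision depends on the whole path prefix, not only on the current role. That dependence is what lets a single shared entity branch into several events.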
Experimental Results
The authors conducted experiments on a comprehensive Chinese financial announcements dataset, significantly larger than previously available datasets. The results demonstrate that Doc2EDAG outperforms existing state-of-the-art methods, with notable improvements in precision, recall, and F1 scores across several event types. Specifically, the model shows strong performance on both single-event and multi-event documents, a critical improvement given the complexity of real-world applications.
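As a rough illustration of how precision, recall, and F1 can be computed for table filling, the sketch below counts matches slot by slot. It assumes predicted and gold records are already aligned one to one and use `None` for NA arguments. The function name `slot_prf` is hypothetical, and the paper's exact matching protocol may differ.

```python
from typing import Optional, Sequence, Tuple

Record = Tuple[Optional[str], ...]


def slot_prf(
    predicted: Sequence[Record], gold: Sequence[Record]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over argument slots."""
    tp = fp = fn = 0
    for pred_record, gold_record in zip(predicted, gold):
        for pred_arg, gold_arg in zip(pred_record, gold_record):
            if pred_arg is not None and pred_arg == gold_arg:
                tp += 1  # correct non-NA argument
            else:
                if pred_arg is not None:
                    fp += 1  # predicted an argument that is wrong or spurious
                if gold_arg is not None:
                    fn += 1  # missed or mislabeled a gold argument
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative example: two correct slots, one wrong entity, one missed date.
precision, recall, f1 = slot_prf(
    [("Alice", "100 shares", "Bank B", None)],
    [("Alice", "100 shares", "Bank C", "2019-01-01")],
)
```

A wrong entity counts as both a false positive and a false negative, so it lowers precision and recall at once.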
Practical and Theoretical Implications
The development of Doc2EDAG holds significant practical implications for real-world applications in domains such as finance, legislation, and healthcare, where extracting structured information from documents is critical. Theoretically, this approach opens up new avenues for DEE research, suggesting that similar methods could be applied to other languages or domains with minimal domain-specific modifications.
Future Directions
The authors suggest that future research might explore expanding the input formats beyond plain text to include richly formatted documents, further enhancing the model's utility in diverse practical settings. The framework's adaptability offers promising potential for enhancing artificial intelligence systems tasked with understanding and structuring complex document-level information.
In conclusion, Doc2EDAG is a significant advance in document-level event extraction: it offers a robust method for the two central challenges of the task, multiple events per document and arguments scattered across sentences, at the scale of large document corpora.