- The paper presents a new discourse-level dataset with 726 articles and over 29,000 sentences to enhance NER and relation extraction in Chinese literature.
- It introduces two tagging methods, heuristic tagging and machine auxiliary tagging, to improve annotation consistency and reduce workload.
- Experimental outcomes demonstrate that CRF outperforms Bi-LSTM for NER, setting a benchmark for future research in literary text analysis.
Discourse-Level Named Entity Recognition and Relation Extraction for Chinese Literature
This paper addresses the complex task of Named Entity Recognition (NER) and Relation Extraction (RE) within the context of Chinese literature. Due to the intricacies of rhetorical devices and the lack of adequate datasets, this task presents significant challenges. The authors present a new discourse-level dataset derived from hundreds of Chinese literary articles to advance research in this domain.
Dataset Construction
The dataset comprises 726 Chinese literature articles, with over 29,000 sentences and 100,000 characters. Annotation took 300 person-hours, with five annotators working over three months. Unlike conventional sentence-level datasets, this discourse-level dataset offers additional context from the surrounding passages.
To address consistency issues in annotation, the authors propose two methods:
- Heuristic Tagging: This method involves the application of generic disambiguating rules to maintain consistency across annotations. For example, adjectives are removed from entities, simplifying annotation guidelines and reducing data sparsity issues.
- Machine Auxiliary Tagging: This approach uses machine learning to assist annotators by predicting labels on a subset of the corpus based on learned annotation standards. Discrepancies between machine-predicted and human-annotated tags highlight areas requiring additional attention, thereby reducing annotator workload.
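The comparison step in machine auxiliary tagging can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `find_disagreements`, the BIO tag scheme, and the "Person" and "Location" labels in the toy data are all assumptions made for the example.

```python
from typing import List, Tuple


def find_disagreements(
    tokens: List[str],
    human_tags: List[str],
    machine_tags: List[str],
) -> List[Tuple[int, str, str, str]]:
    """Return (index, token, human_tag, machine_tag) for every position
    where the human annotation and the machine prediction differ."""
    if not (len(tokens) == len(human_tags) == len(machine_tags)):
        raise ValueError("tokens and both tag sequences must have equal length")
    return [
        (i, tok, h, m)
        for i, (tok, h, m) in enumerate(zip(tokens, human_tags, machine_tags))
        if h != m
    ]


# Toy data (hypothetical labels, BIO scheme assumed for illustration).
tokens = ["小", "明", "在", "北", "京"]
human = ["B-Person", "I-Person", "O", "B-Location", "I-Location"]
machine = ["B-Person", "I-Person", "O", "B-Location", "O"]

# Only the flagged positions need a second look from an annotator.
for item in find_disagreements(tokens, human, machine):
    print(item)
```

Surfacing only the mismatched positions is what keeps the review effort small: annotators revisit the flagged tokens rather than re-reading the whole subset.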
Tagging Framework
The paper introduces seven entity labels and nine relation labels tailored to literary text. Notably, new categories such as "Thing," "Time," and "Metric" are meant to capture the nuanced language of literature. Each entity and relation tag is described with examples.
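One way to picture a discourse-level annotation is as a record holding a document's text, its entity spans, and relations between those entities. The sketch below is a hypothetical representation, not the dataset's actual file format: the class names, the character-offset convention, and the relation label `EXAMPLE_RELATION` are assumptions, and only the three entity labels named above are included.

```python
from dataclasses import dataclass, field
from typing import List

# Only the three labels named in the text; the other four entity labels
# are not listed in this summary and are deliberately left out.
KNOWN_ENTITY_LABELS = {"Thing", "Time", "Metric"}


@dataclass(frozen=True)
class Entity:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


@dataclass(frozen=True)
class Relation:
    head: int   # index into Document.entities
    tail: int   # index into Document.entities
    label: str  # hypothetical label name


@dataclass
class Document:
    text: str
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        for i, e in enumerate(self.entities):
            if not (0 <= e.start < e.end <= len(self.text)):
                problems.append(f"entity {i}: span out of range")
            if e.label not in KNOWN_ENTITY_LABELS:
                problems.append(f"entity {i}: unknown label {e.label!r}")
        n = len(self.entities)
        for j, r in enumerate(self.relations):
            if not (0 <= r.head < n and 0 <= r.tail < n):
                problems.append(f"relation {j}: dangling entity index")
        return problems


# Toy document; the spans and the relation are illustrative only.
doc = Document(
    text="三点钟,他买了两斤苹果。",
    entities=[Entity(0, 3, "Time"), Entity(7, 9, "Metric"), Entity(9, 11, "Thing")],
    relations=[Relation(head=1, tail=2, label="EXAMPLE_RELATION")],
)
print(doc.validate())
```

Because relations refer to entities by index within the same document, a relation can link entities that sit in different sentences, which is the kind of context a sentence-level record cannot express.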
Experimental Outcomes
The paper evaluates several models on the proposed dataset, including Bi-LSTM and CRF for NER. CRF outperformed Bi-LSTM, possibly because of the feature-rich template it uses.
For RE, the paper compares several neural network architectures, including RNNs, CNNs, and LSTMs, whose performance varies with the embeddings and additional semantic features they use. Widely used models, including the recursive neural network and CNN-based methods, provide baseline metrics for this task.
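To make the idea of a CRF feature template concrete, the sketch below builds character-window features for each position in a sentence. The paper's actual template is not given in this summary, so the window size, the feature names, and the `<PAD>` marker are assumptions chosen for illustration.

```python
from typing import Dict, List


def char_features(chars: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Build features for position i from a symmetric character window."""
    if window < 1:
        raise ValueError("window must be at least 1")
    feats = {"bias": "1"}
    # Unigram features: the character at each offset in the window.
    for offset in range(-window, window + 1):
        j = i + offset
        feats[f"c[{offset}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    # Bigram features: current character joined with each neighbour.
    feats["c[-1]c[0]"] = feats["c[-1]"] + feats["c[0]"]
    feats["c[0]c[1]"] = feats["c[0]"] + feats["c[1]"]
    return feats


def sentence_features(sentence: str, window: int = 2) -> List[Dict[str, str]]:
    """Return one feature dictionary per character of the sentence."""
    chars = list(sentence)
    return [char_features(chars, i, window) for i in range(len(chars))]


feats = sentence_features("他在北京")
print(feats[0])
```

Per-position dictionaries of this shape are the kind of input that CRF toolkits such as sklearn-crfsuite accept. Hand-written context features like these give a CRF direct access to neighbouring characters, which is one plausible reading of why a feature-rich template could help on a dataset of this size.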
Implications and Future Directions
This work lays the foundation for future research by providing a discourse-level dataset explicitly designed for Chinese literature. The tagging methods it introduces improve annotation consistency, which is crucial given the complexity inherent in literary texts.
The dataset can be used to train and evaluate more advanced NER and RE models, potentially improving the understanding and analysis of literary documents. Future work could explore more sophisticated machine learning techniques, such as transformer-based models, to capture the intricate entities and relations in literature more effectively.
The paper provides an essential dataset that acts as both a benchmark and a resource for advancing NER and RE tasks in the unique and challenging field of Chinese literary text analysis.