A Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text (1711.07010v5)

Published 19 Nov 2017 in cs.CL

Abstract: Named Entity Recognition and Relation Extraction for Chinese literature text is regarded as a highly difficult problem, partially because of the lack of tagging sets. In this paper, we build a discourse-level dataset from hundreds of Chinese literature articles for improving this task. To build a high quality dataset, we propose two tagging methods to solve the problem of data inconsistency, including a heuristic tagging method and a machine auxiliary tagging method. Based on this corpus, we also introduce several widely used models to conduct experiments. Experimental results not only show the usefulness of the proposed dataset, but also provide baselines for further research. The dataset is available at https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset

Citations (54)

Summary

  • The paper presents a new discourse-level dataset with 726 articles and over 29,000 sentences to enhance NER and relation extraction in Chinese literature.
  • It introduces two tagging frameworks—heuristic and machine auxiliary tagging—to improve annotation consistency and reduce workload.
  • Experimental outcomes demonstrate that CRF outperforms Bi-LSTM for NER, setting a benchmark for future research in literary text analysis.

Discourse-Level Named Entity Recognition and Relation Extraction for Chinese Literature

This paper addresses the complex task of Named Entity Recognition (NER) and Relation Extraction (RE) within the context of Chinese literature. Due to the intricacies of rhetorical devices and the lack of adequate datasets, this task presents significant challenges. The authors present a new discourse-level dataset derived from hundreds of Chinese literary articles to advance research in this domain.

Dataset Construction

The dataset encompasses 726 Chinese literature articles, comprising over 29,000 sentences and 100,000 characters. The annotation process took 300 person-hours, with five annotators working over three months. Unlike conventional sentence-level datasets, this discourse-level dataset benefits from the additional context provided by interconnected passages.

To tackle the consistency issues in annotation, the authors proposed two methods:

  1. Heuristic Tagging: This method involves the application of generic disambiguating rules to maintain consistency across annotations. For example, adjectives are removed from entities, simplifying annotation guidelines and reducing data sparsity issues.
  2. Machine Auxiliary Tagging: This approach uses machine learning to assist annotators by predicting labels on a subset of the corpus based on learned annotation standards. Discrepancies between machine-predicted and human-annotated tags highlight areas requiring additional attention, thereby reducing annotator workload.
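The discrepancy check at the heart of machine auxiliary tagging can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `flag_discrepancies`, the BIO-style tag strings, and the example sentence are all assumptions made for demonstration, and the machine tags are hard-coded where a trained model's predictions would normally go.

```python
from typing import List, Tuple


def flag_discrepancies(
    tokens: List[str],
    human_tags: List[str],
    machine_tags: List[str],
) -> List[Tuple[int, str, str, str]]:
    """Return (index, token, human_tag, machine_tag) wherever the two disagree."""
    if not (len(tokens) == len(human_tags) == len(machine_tags)):
        raise ValueError("tokens and both tag sequences must have equal length")
    return [
        (i, tok, h, m)
        for i, (tok, h, m) in enumerate(zip(tokens, human_tags, machine_tags))
        if h != m
    ]


# Hypothetical character-level example: "昨天的月亮" ("yesterday's moon").
tokens = ["昨", "天", "的", "月", "亮"]
human_tags = ["B-Time", "I-Time", "O", "B-Thing", "I-Thing"]
# Stand-in for model output; a real system would predict these.
machine_tags = ["B-Time", "I-Time", "O", "B-Thing", "O"]

for idx, tok, h, m in flag_discrepancies(tokens, human_tags, machine_tags):
    print(f"position {idx} ({tok}): human={h}, machine={m}")
```

Only the flagged positions need a second look from an annotator, which is how this kind of check can cut down the review workload.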

Tagging Framework

The paper introduces seven entity labels and nine relation labels tailored to literary text. Notably, categories such as "Thing," "Time," and "Metric" are added to cover the nuanced language of literature. Each entity and relation tag is defined with examples, which supports consistent annotation.
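To make the idea of entity labels over text concrete, the sketch below converts a per-character tag sequence into labeled spans. It assumes a BIO encoding (B- begins an entity, I- continues it, O is outside), which is a common convention but is not confirmed here as the dataset's file format. The helper name `bio_to_spans` is hypothetical, and only labels named in this summary are used.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start, end_exclusive, label) spans."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        # Anything else closes the span that is currently open, if any.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray I- tag with no matching open span is treated like O.
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical tags for the five characters of "昨天的月亮".
example_tags = ["B-Time", "I-Time", "O", "B-Thing", "I-Thing"]
print(bio_to_spans(example_tags))
```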

Experimental Outcomes

The paper evaluates several models on the proposed dataset. For NER, it compares Bi-LSTM and CRF; CRF outperformed Bi-LSTM, possibly because of the feature-rich template that CRF uses.
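NER systems are commonly compared using entity-level precision, recall, and F1 under exact span matching. The sketch below shows that computation. Treating it as the scoring used in the paper is an assumption, since this summary does not state the metric, and the name `entity_prf` and the example spans are illustrative only.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def entity_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of entity spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: the prediction gets one entity right and
# truncates the other, so the truncated span counts as an error.
gold_spans = {(0, 2, "Time"), (3, 5, "Thing")}
pred_spans = {(0, 2, "Time"), (3, 4, "Thing")}

p, r, f = entity_prf(gold_spans, pred_spans)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching is strict: a span with the right label but a boundary off by one character earns no credit, which is one reason boundary-sensitive features can matter for literary text.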

For RE, the paper compares neural network architectures such as RNNs, CNNs, and LSTMs, whose performance varies with the embeddings and additional semantic features used. Widely used models, including the recursive neural network and CNN-based methods, provide baseline metrics for this task.

Implications and Future Directions

This work lays the foundation for future research by providing a discourse-level dataset explicitly designed for Chinese literature. The two tagging methods improve annotation consistency, which matters given the complexity inherent in literary texts.

This research opens avenues for applying the dataset to advanced NER and RE models, potentially improving the understanding and analysis of literary documents. Future developments could explore more sophisticated machine learning techniques, like transformer-based models, to capture the intricate relations and entities present in literature more effectively.

The paper provides an essential dataset that acts as both a benchmark and a resource for advancing NER and RE tasks in the unique and challenging field of Chinese literary text analysis.