CREER Dataset for NLP Relation Extraction
- CREER dataset is a large-scale corpus designed for relation extraction and entity recognition, featuring rich semantic and syntactic annotations from Wikipedia text.
- It employs the Stanford CoreNLP pipeline to systematically annotate 2.5 billion words with details like POS tags, NER, constituency parses, dependency graphs, and TAC-KBP relations.
- Its vast scale, with over 144 million sentences and 60 million relation triplets, underpins advances in pre-training, knowledge-rich representation, and benchmark evaluations in NLP.
The CREER dataset is a large-scale corpus designed for relation extraction and entity recognition in English text. Originating from Wikipedia plain text, it leverages the Stanford CoreNLP Annotator to provide sentences richly annotated with syntactic and semantic attributes, enabling advanced natural language processing workflows. Its annotations follow established linguistic conventions, incorporating document-level segmentation and granular sentence-level features, which collectively support a broad range of NLP tasks including entity recognition, relation extraction, and grammar-aware modeling (Tang et al., 2022).
1. Source Data and Preprocessing
CREER is constructed from an English Wikipedia dump of approximately 2.5 billion words, sourced from https://dumps.wikimedia.org/. Preprocessing systematically removes lists, tables, URLs, and metadata, retaining only plain-text paragraphs for annotation. At the document level, each Wikipedia article is split into contiguous sentences, yielding a sentence-oriented structure in which every sentence acts as an independent record for semantic and syntactic annotation.
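The paper does not publish the exact filtering rules, so the following is a minimal illustrative sketch of the kind of preprocessing described: dropping list items, table rows, headings, bare URLs, and metadata lines while keeping plain-text paragraphs. The specific patterns and the naive sentence splitter are assumptions (CREER itself uses CoreNLP's `ssplit` for sentence segmentation).

```python
import re

def keep_plain_paragraph(line: str) -> bool:
    """Heuristic filter illustrating the preprocessing the paper describes:
    drop lists, tables, headings, and bare URLs, keeping plain paragraphs.
    The exact rules used for CREER are not published; these are assumptions."""
    stripped = line.strip()
    if not stripped:
        return False
    # Wiki markup residue: list bullets, numbered lists, tables, headings.
    if stripped.startswith(("*", "#", "|", "{|", "==")):
        return False
    # Lines consisting of a single bare URL.
    if re.fullmatch(r"https?://\S+", stripped):
        return False
    return True

def to_sentences(paragraph: str) -> list:
    """Naive punctuation-based split, for illustration only;
    CREER relies on CoreNLP's ssplit annotator instead."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
```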
2. Annotation Schema and Automated Pipeline
All sentence annotations in CREER are produced using the Stanford CoreNLP (version 3.9+) pipeline, invoked via a Python wrapper (stanfordcorenlp). The workflow sequentially applies tokenize, ssplit, pos, lemma, ner, parse, depparse, and kbp modules, with no manual corrections reported.
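As a minimal sketch of how such an invocation might look with the stanfordcorenlp wrapper: the helper below assembles the annotator chain the paper lists and requests JSON output. The CoreNLP install path in the usage comment is a placeholder, and an actual call requires a local CoreNLP 3.9+ installation.

```python
# Illustrative configuration for the annotator chain reported for CREER.
ANNOTATORS = "tokenize,ssplit,pos,lemma,ner,parse,depparse,kbp"

def corenlp_properties(annotators: str = ANNOTATORS) -> dict:
    """Properties dict passed to the wrapper's annotate() call,
    requesting the CoreNLP JSON output that CREER records follow."""
    return {
        "annotators": annotators,
        "outputFormat": "json",
        "timeout": "60000",
    }

# Usage (needs a local CoreNLP 3.9+ install; path is a placeholder):
#   from stanfordcorenlp import StanfordCoreNLP
#   nlp = StanfordCoreNLP("/path/to/stanford-corenlp")
#   ann = nlp.annotate("Barack Obama was born in Hawaii.",
#                      properties=corenlp_properties())
#   nlp.close()
```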
Annotated Attributes
- Tokenize: Identification of word tokens with character-offset begin/end.
- POS: Penn Treebank part-of-speech tags (e.g., NN, NNP, VB).
- NER: Named-entity recognition, comprising PERSON, LOCATION, ORGANIZATION, MISC, numeric (MONEY, NUMBER, PERCENT, SET), and temporal (DATE, TIME, DURATION) categories.
- Parse: Full constituency parse tree in Penn Treebank bracketed format.
- BasicDep: Basic dependency graphs, specifying governor, dependent, and relation type.
- EnhancedDep, EnhancedPlusPlusDep: Higher-level dependency representations for semantic roles.
- KBP Relations: Knowledge-base population relation triplets per the TAC-KBP standard; relation instances are triples with relation types drawn from approximately 30 slot-filling TAC-KBP relations (per:origin, per:title, org:founders, org:members, etc.).
The entity and relation taxonomy adheres to established standards, supporting PERSON, ORGANIZATION, LOCATION, MISC, and several numerical and temporal classes. Relation types are based on the fixed TAC-KBP taxonomy.
Annotation is entirely automatic; no inter-annotator agreement (e.g., Cohen's kappa) is reported, since the dataset relies solely on automated annotation rather than human labeling.
3. Data Format and Corpus Statistics
CREER is distributed as JSON files reflecting the Stanford CoreNLP JSON schema. Each sentence-level record contains:
- tokens: List of token dictionaries (word, index, characterOffsetBegin, characterOffsetEnd, pos, ner)
- parseTree: Penn Treebank-style bracketed tree string
- basicDependencies: List of dependency edges (dep, governor, governorGloss, dependent, dependentGloss)
- enhancedDependencies, enhancedPlusPlusDependencies: Advanced dependency structures
- entityMentions: List specifying entity mention spans and metadata
- kbpRelations: List of relation triplets (subject, subjectSpan, relation, object, objectSpan)
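A hedged sketch of reading one sentence-level record: the snippet below pulls (subject, relation, object) triplets out of the kbpRelations field. The record shown is a hand-made illustration following the field names above, not actual CREER data.

```python
import json

# Hand-made example record following the CoreNLP-style field names above;
# not real CREER data.
record_json = """{
  "tokens": [{"word": "Obama", "index": 1, "pos": "NNP", "ner": "PERSON"}],
  "kbpRelations": [
    {"subject": "Obama", "subjectSpan": [0, 1],
     "relation": "per:origin",
     "object": "United States", "objectSpan": [4, 6]}
  ]
}"""

def triplets(record: dict) -> list:
    """Return (subject, relation, object) tuples from one record."""
    return [(r["subject"], r["relation"], r["object"])
            for r in record.get("kbpRelations", [])]

record = json.loads(record_json)
print(triplets(record))  # [('Obama', 'per:origin', 'United States')]
```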
Formal relation definition (informal): let s be a sentence containing entity mentions e_i and e_j; then (e_i, r, e_j) is a relation triplet of s iff the pipeline assigns a relation label r from the TAC-KBP taxonomy to the pair (e_i, e_j) within s.
Quantitative statistics:
| Statistic | Count |
|---|---|
| Sentences | 144,732,654 |
| Tokens (approx.) | 2.5 billion |
| Entity mentions | 371,186,870 |
| Relation triplets | 60,503,288 |
4. Comparative Corpus Scale
CREER is situated at the extreme end of large-scale corpora for linguistic annotation. A comparative table illustrates its scope:
| Dataset | #Sent. | #Entities | #Relations |
|---|---|---|---|
| CoNLL-2003 | 20,744 | 35,089 | – |
| SemEval-2010 Task 8 | 10,717 | – | 21,437 |
| OntoNotes 5.0 | 94,268 | 1,166,513 | 543,534 |
| CREER (wiki) | 144,732,654 | 371,186,870 | 60,503,288 |
With a scale orders of magnitude beyond existing widely used datasets, CREER is well suited to ambitious tasks in neural pre-training, knowledge-rich representation learning, and corpus-level benchmarking.
5. Intended Applications and Use Cases
CREER is designed primarily as a supervised resource for:
- Pre-training contextual representations with integrated world knowledge
- Benchmarking named entity recognition (NER), constituency parsing, dependency parsing, semantic role labeling extensions, and relation extraction
No end-task benchmarks or experimental results are reported in the original paper; CREER is presented as a foundational data resource rather than a direct evaluation platform. A plausible implication is that researchers must conduct independent downstream evaluation or benchmarking to quantify CREER’s practical impact.
6. Accessibility and Licensing
CREER is publicly accessible via https://140.116.82.111/share.cgi?ssid=000dOJ4. The dataset’s licensing terms are not explicitly stated within the paper; it is presumed free for academic research, but users are advised to contact the authors for definitive legal status and terms of use (Tang et al., 2022).