CREER Dataset for NLP Relation Extraction

Updated 1 February 2026
  • CREER dataset is a large-scale corpus designed for relation extraction and entity recognition, featuring rich semantic and syntactic annotations from Wikipedia text.
  • It employs the Stanford CoreNLP pipeline to systematically annotate 2.5 billion words with details like POS tags, NER, constituency parses, dependency graphs, and TAC-KBP relations.
  • Its vast scale, with over 144 million sentences and 60 million relation triplets, underpins advances in pre-training, knowledge-rich representation, and benchmark evaluations in NLP.

The CREER dataset is a large-scale corpus designed for relation extraction and entity recognition in English text. Originating from Wikipedia plain text, it leverages the Stanford CoreNLP Annotator to provide sentences richly annotated with syntactic and semantic attributes, enabling advanced natural language processing workflows. Its annotations follow established linguistic conventions, incorporating document-level segmentation and granular sentence-level features, which collectively support a broad range of NLP tasks including entity recognition, relation extraction, and grammar-aware modeling (Tang et al., 2022).

1. Source Data and Preprocessing

CREER is constructed from an English Wikipedia dump of approximately 2.5 billion words, sourced from https://dumps.wikimedia.org/. Preprocessing systematically removes lists, tables, URLs, and metadata, retaining only the plain-text paragraphs for annotation. At the document level, each Wikipedia article is split into its constituent sentences, yielding a sentence-oriented structure in which each sentence acts as an independent record for semantic and syntactic annotation.

2. Annotation Schema and Automated Pipeline

All sentence annotations in CREER are produced using the Stanford CoreNLP (version 3.9+) pipeline, invoked via a Python wrapper (stanfordcorenlp). The workflow sequentially applies tokenize, ssplit, pos, lemma, ner, parse, depparse, and kbp modules, with no manual corrections reported.
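The annotator sequence above can be expressed as a CoreNLP request configuration. The sketch below is illustrative: the helper function and the commented invocation via the stanfordcorenlp wrapper (including the install path and the annotate() call) are assumptions, not code from the paper.

```python
# The annotator sequence reported for CREER, applied via Stanford CoreNLP.
CREER_ANNOTATORS = ("tokenize", "ssplit", "pos", "lemma", "ner",
                    "parse", "depparse", "kbp")

def corenlp_properties(annotators=CREER_ANNOTATORS):
    """Build the request properties CoreNLP expects for JSON output."""
    return {"annotators": ",".join(annotators), "outputFormat": "json"}

# Hypothetical invocation via the stanfordcorenlp wrapper (requires a local
# CoreNLP 3.9+ distribution; the path and the annotate() call are assumptions):
#
#   from stanfordcorenlp import StanfordCoreNLP
#   nlp = StanfordCoreNLP("/path/to/stanford-corenlp")
#   raw_json = nlp.annotate("Alan Turing studied at Cambridge.",
#                           properties=corenlp_properties())
#   nlp.close()
```

Because the kbp annotator depends on the earlier stages (tokenization, NER, parsing), the order of the sequence matters.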

Annotated Attributes

  • Tokenize: Identification of word tokens with character-offset begin/end.
  • POS: Penn Treebank part-of-speech tags (e.g., NN, NNP, VB).
  • NER: Named-entity recognition, comprising PERSON, LOCATION, ORGANIZATION, MISC, numeric (MONEY, NUMBER, PERCENT, SET), and temporal (DATE, TIME, DURATION) categories.
  • Parse: Full constituency parse tree in Penn Treebank bracketed format.
  • BasicDep: Basic dependency graphs, specifying governor, dependent, and relation type.
  • EnhancedDep, EnhancedPlusPlusDep: Higher-level dependency representations for semantic roles.
  • KBP Relations: Knowledge-base population relation triplets per the TAC-KBP standard; relation instances are triples (subjectEntity, relationType, objectEntity) with relation types drawn from approximately 30 slot-filling TAC-KBP relations (per:origin, per:title, org:founders, org:members, etc.).

The entity and relation taxonomy adheres to established standards, supporting PERSON, ORGANIZATION, LOCATION, MISC, and several numerical and temporal classes. Relation types are based on the fixed TAC-KBP taxonomy.
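The triple structure can be sketched in Python. The KBPTriple class, the helper, and the example instances below are illustrative assumptions, not part of the released format; note that TAC-KBP relation names encode the subject's entity type in their prefix ("per:" or "org:").

```python
from typing import List, NamedTuple

class KBPTriple(NamedTuple):
    """One TAC-KBP slot-filling relation instance (illustrative class)."""
    subject: str
    relation: str  # e.g. "per:origin", "per:title", "org:founders"
    obj: str

def filter_by_subject_type(triples: List[KBPTriple], prefix: str) -> List[KBPTriple]:
    """Keep triples whose relation applies to the given subject type ('per' or 'org')."""
    return [t for t in triples if t.relation.startswith(prefix + ":")]

# Invented example instances, using real TAC-KBP relation names:
triples = [
    KBPTriple("Alan Turing", "per:schools_attended", "Cambridge"),
    KBPTriple("Apple", "org:founders", "Steve Jobs"),
]
```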

Annotation is entirely automatic; no inter-annotator agreement (e.g., Cohen's κ) is assessed, as the dataset relies solely on automated annotation protocols.

3. Data Format and Corpus Statistics

CREER is distributed as JSON files reflecting the Stanford CoreNLP JSON schema. Each sentence-level record contains:

  • tokens: List of token dictionaries (word, index, characterOffsetBegin, characterOffsetEnd, pos, ner)
  • parseTree: Penn Treebank-style bracketed tree string
  • basicDependencies: List of dependency edges (dep, governor, governorGloss, dependent, dependentGloss)
  • enhancedDependencies, enhancedPlusPlusDependencies: Advanced dependency structures
  • entityMentions: List specifying entity mention spans and metadata
  • kbpRelations: List of relation triplets (subject, subjectSpan, relation, object, objectSpan)
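A sentence-level record in this schema can be read with plain Python. The field names below follow the schema listed above, but the concrete sentence, spans, and values are invented for illustration:

```python
# A toy CREER-style sentence record (values invented; field names per the schema).
record = {
    "tokens": [
        {"word": "Turing", "index": 1, "characterOffsetBegin": 0,
         "characterOffsetEnd": 6, "pos": "NNP", "ner": "PERSON"},
        {"word": "studied", "index": 2, "characterOffsetBegin": 7,
         "characterOffsetEnd": 14, "pos": "VBD", "ner": "O"},
        {"word": "at", "index": 3, "characterOffsetBegin": 15,
         "characterOffsetEnd": 17, "pos": "IN", "ner": "O"},
        {"word": "Cambridge", "index": 4, "characterOffsetBegin": 18,
         "characterOffsetEnd": 27, "pos": "NNP", "ner": "ORGANIZATION"},
    ],
    "kbpRelations": [
        {"subject": "Turing", "subjectSpan": [0, 1],
         "relation": "per:schools_attended",
         "object": "Cambridge", "objectSpan": [3, 4]},
    ],
}

def entity_tokens(rec):
    """Tokens carrying a non-'O' NER label."""
    return [(t["word"], t["ner"]) for t in rec["tokens"] if t["ner"] != "O"]

def relation_triples(rec):
    """(subject, relation, object) triples from the kbpRelations field."""
    return [(r["subject"], r["relation"], r["object"])
            for r in rec.get("kbpRelations", [])]
```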

Relation definition (informal): Let s be a sentence containing entity mentions e_1 and e_2; then R(e_1, e_2) = r iff the pipeline assigns relation label r ∈ TAC-KBP to the pair (e_1, e_2) within s.

Quantitative statistics:

Statistic            Count
Sentences            144,732,654
Tokens (approx.)     2.5 billion
Entity mentions      371,186,870
Relation triplets    60,503,288
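As a rough check on these figures, the implied per-sentence densities (using the approximate token count) can be computed directly:

```python
# Per-sentence densities implied by the corpus statistics above.
SENTENCES = 144_732_654
TOKENS    = 2_500_000_000   # approximate
ENTITIES  = 371_186_870
RELATIONS = 60_503_288

tokens_per_sentence   = TOKENS / SENTENCES     # roughly 17.3
entities_per_sentence = ENTITIES / SENTENCES   # roughly 2.6
triples_per_sentence  = RELATIONS / SENTENCES  # roughly 0.42
```

That is, on average fewer than half of the sentences carry a KBP relation triplet, while almost every sentence contains multiple entity mentions.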

4. Comparative Corpus Scale

CREER is situated at the extreme end of large-scale corpora for linguistic annotation. A comparative table illustrates its scope:

Dataset                #Sent.         #Entities      #Relations
CoNLL-2003             20,744         35,089         –
SemEval-2010 Task 8    10,717         –              21,437
OntoNotes 5.0          94,268         1,166,513      543,534
CREER (wiki)           144,732,654    371,186,870    60,503,288

CREER's scale, orders of magnitude beyond existing widely used datasets, suggests its suitability for ambitious tasks in neural pre-training, knowledge-rich representation learning, and corpus-level benchmarking.

5. Intended Applications and Use Cases

CREER is designed primarily as a supervised resource for:

  • Entity recognition and relation extraction
  • Neural pre-training and knowledge-rich representation learning
  • Grammar-aware modeling and corpus-level benchmarking

No end-task benchmarks or experimental results are reported in the original paper; CREER is presented as a foundational data resource rather than a direct evaluation platform. A plausible implication is that researchers must conduct independent downstream evaluation or benchmarking to quantify CREER’s practical impact.

6. Accessibility and Licensing

CREER is publicly accessible via https://140.116.82.111/share.cgi?ssid=000dOJ4. The dataset’s licensing terms are not explicitly stated within the paper; it is presumed free for academic research, but users are advised to contact the authors for definitive legal status and terms of use (Tang et al., 2022).
