CREER Dataset for NLP Relation Extraction
- CREER dataset is a large-scale corpus designed for relation extraction and entity recognition, featuring rich semantic and syntactic annotations from Wikipedia text.
- It employs the Stanford CoreNLP pipeline to systematically annotate 2.5 billion words with details like POS tags, NER, constituency parses, dependency graphs, and TAC-KBP relations.
- Its vast scale, with over 144 million sentences and 60 million relation triplets, underpins advances in pre-training, knowledge-rich representation, and benchmark evaluations in NLP.
The CREER dataset is a large-scale corpus designed for relation extraction and entity recognition in English text. Originating from Wikipedia plain text, it leverages the Stanford CoreNLP Annotator to provide sentences richly annotated with syntactic and semantic attributes, enabling advanced natural language processing workflows. Its annotations follow established linguistic conventions, incorporating document-level segmentation and granular sentence-level features, which collectively support a broad range of NLP tasks including entity recognition, relation extraction, and grammar-aware modeling (Tang et al., 2022).
1. Source Data and Preprocessing
CREER is constructed from an English Wikipedia dump of approximately 2.5 billion words, sourced from https://dumps.wikimedia.org/. Preprocessing systematically removes lists, tables, URLs, and metadata, retaining only plain-text paragraphs for annotation. At the document level, each Wikipedia article is split into contiguous sentences, yielding a sentence-oriented structure in which every sentence acts as an independent record for semantic and syntactic annotation.
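The paper does not publish the exact filtering rules, so the following is a minimal illustrative sketch of the kind of preprocessing described: dropping list items, table rows, headings, bare URLs, and metadata lines while keeping plain-text paragraphs. The specific patterns and the naive sentence splitter are assumptions (CREER itself uses CoreNLP's `ssplit` for sentence segmentation).

```python
import re

def keep_plain_paragraph(line: str) -> bool:
    """Heuristic filter illustrating the preprocessing the paper describes:
    drop lists, tables, headings, and bare URLs, keeping plain paragraphs.
    The exact rules used for CREER are not published; these are assumptions."""
    stripped = line.strip()
    if not stripped:
        return False
    # Wiki markup residue: list bullets, numbered lists, tables, headings.
    if stripped.startswith(("*", "#", "|", "{|", "==")):
        return False
    # Lines consisting of a single bare URL.
    if re.fullmatch(r"https?://\S+", stripped):
        return False
    return True

def to_sentences(paragraph: str) -> list:
    """Naive punctuation-based split, for illustration only;
    CREER relies on CoreNLP's ssplit annotator instead."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
```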
2. Annotation Schema and Automated Pipeline
All sentence annotations in CREER are produced using the Stanford CoreNLP (version 3.9+) pipeline, invoked via a Python wrapper (stanfordcorenlp). The workflow sequentially applies tokenize, ssplit, pos, lemma, ner, parse, depparse, and kbp modules, with no manual corrections reported.
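As a minimal sketch of how such an invocation might look with the stanfordcorenlp wrapper: the helper below assembles the annotator chain the paper lists and requests JSON output. The CoreNLP install path in the usage comment is a placeholder, and an actual call requires a local CoreNLP 3.9+ installation.

```python
# Illustrative configuration for the annotator chain reported for CREER.
ANNOTATORS = "tokenize,ssplit,pos,lemma,ner,parse,depparse,kbp"

def corenlp_properties(annotators: str = ANNOTATORS) -> dict:
    """Properties dict passed to the wrapper's annotate() call,
    requesting the CoreNLP JSON output that CREER records follow."""
    return {
        "annotators": annotators,
        "outputFormat": "json",
        "timeout": "60000",
    }

# Usage (needs a local CoreNLP 3.9+ install; path is a placeholder):
#   from stanfordcorenlp import StanfordCoreNLP
#   nlp = StanfordCoreNLP("/path/to/stanford-corenlp")
#   ann = nlp.annotate("Barack Obama was born in Hawaii.",
#                      properties=corenlp_properties())
#   nlp.close()
```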
Annotated Attributes
- Tokenize: Identification of word tokens with character-offset begin/end.
- POS: Penn Treebank part-of-speech tags (e.g., NN, NNP, VB).
- NER: Named-entity recognition, comprising PERSON, LOCATION, ORGANIZATION, MISC, numeric (MONEY, NUMBER, PERCENT, SET), and temporal (DATE, TIME, DURATION) categories.
- Parse: Full constituency parse tree in Penn Treebank bracketed format.
- BasicDep: Basic dependency graphs, specifying governor, dependent, and relation type.
- EnhancedDep, EnhancedPlusPlusDep: Higher-level dependency representations for semantic roles.
- KBP Relations: Knowledge-base population relation triplets per the TAC-KBP standard; relation instances are triples with relation types drawn from approximately 30 slot-filling TAC-KBP relations (per:origin, per:title, org:founders, org:members, etc.).
The entity and relation taxonomy adheres to established standards, supporting PERSON, ORGANIZATION, LOCATION, MISC, and several numerical and temporal classes. Relation types are based on the fixed TAC-KBP taxonomy.
Annotation is entirely automatic; no inter-annotator agreement (e.g., Cohen's kappa) is reported, since the dataset relies solely on automated annotation rather than human labeling.
3. Data Format and Corpus Statistics
CREER is distributed as JSON files reflecting the Stanford CoreNLP JSON schema. Each sentence-level record contains:
- tokens: List of token dictionaries (word, index, characterOffsetBegin, characterOffsetEnd, pos, ner)
- parseTree: Penn Treebank-style bracketed tree string
- basicDependencies: List of dependency edges (dep, governor, governorGloss, dependent, dependentGloss)
- enhancedDependencies, enhancedPlusPlusDependencies: Advanced dependency structures
- entityMentions: List specifying entity mention spans and metadata
- kbpRelations: List of relation triplets (subject, subjectSpan, relation, object, objectSpan)
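A hedged sketch of reading one sentence-level record: the snippet below pulls (subject, relation, object) triplets out of the kbpRelations field. The record shown is a hand-made illustration following the field names above, not actual CREER data.

```python
import json

# Hand-made example record following the CoreNLP-style field names above;
# not real CREER data.
record_json = """{
  "tokens": [{"word": "Obama", "index": 1, "pos": "NNP", "ner": "PERSON"}],
  "kbpRelations": [
    {"subject": "Obama", "subjectSpan": [0, 1],
     "relation": "per:origin",
     "object": "United States", "objectSpan": [4, 6]}
  ]
}"""

def triplets(record: dict) -> list:
    """Return (subject, relation, object) tuples from one record."""
    return [(r["subject"], r["relation"], r["object"])
            for r in record.get("kbpRelations", [])]

record = json.loads(record_json)
print(triplets(record))  # [('Obama', 'per:origin', 'United States')]
```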
Formal relation definition (informal): let s be a sentence containing entity mentions e_i and e_j; then (e_i, r, e_j) is a relation triplet of s iff the pipeline assigns a relation label r from the TAC-KBP taxonomy to the pair (e_i, e_j) within s.
Quantitative statistics:
| Statistic | Count |
|---|---|
| Sentences | 144,732,654 |
| Tokens (approx.) | 2.5 billion |
| Entity mentions | 371,186,870 |
| Relation triplets | 60,503,288 |
4. Comparative Corpus Scale
CREER is situated at the extreme end of large-scale corpora for linguistic annotation. A comparative table illustrates its scope:
| Dataset | #Sent. | #Entities | #Relations |
|---|---|---|---|
| CoNLL-2003 | 20,744 | 35,089 | – |
| SemEval-2010 Task 8 | 10,717 | – | 21,437 |
| OntoNotes 5.0 | 94,268 | 1,166,513 | 543,534 |
| CREER (wiki) | 144,732,654 | 371,186,870 | 60,503,288 |
With a scale orders of magnitude beyond existing widely used datasets, CREER is well suited to ambitious tasks in neural pre-training, knowledge-rich representation learning, and corpus-level benchmarking.
5. Intended Applications and Use Cases
CREER is designed primarily as a supervised resource for:
- Pre-training contextual representations with integrated world knowledge
- Benchmarking named entity recognition (NER), constituency parsing, dependency parsing, semantic role labeling extensions, and relation extraction
No end-task benchmarks or experimental results are reported in the original paper; CREER is presented as a foundational data resource rather than a direct evaluation platform. A plausible implication is that researchers must conduct independent downstream evaluation or benchmarking to quantify CREER’s practical impact.
6. Accessibility and Licensing
CREER is publicly accessible via https://140.116.82.111/share.cgi?ssid=000dOJ4. The dataset’s licensing terms are not explicitly stated within the paper; it is presumed free for academic research, but users are advised to contact the authors for definitive legal status and terms of use (Tang et al., 2022).