Linked WikiText-2 Dataset Overview
- Linked WikiText-2 is an annotated language modeling dataset that aligns Wikipedia articles with Wikidata, linking tokens to real-world entities, dates, and quantities.
- It uses a multi-stage distant supervision pipeline—including neural entity linking and coreference resolution—to enhance coverage and accurately annotate tokens.
- The dataset enables language models to dynamically access external knowledge, resulting in measurable improvements in perplexity and factual completion.
The Linked WikiText-2 dataset is an annotated language modeling corpus constructed by aligning Wikipedia articles with the Wikidata knowledge graph. Developed as a replacement for the original WikiText-2 benchmark, it links tokens that correspond to real-world entities, dates, and numeric values to their canonical Wikidata Q-IDs. This enables the training and evaluation of language models that dynamically access external structured knowledge for improved factual accuracy, out-of-vocabulary generation, and controlled reasoning over both text and knowledge graphs (Logan IV et al., 2019).
1. Scope and Objectives
Linked WikiText-2 was designed to support fact-aware language modeling by providing explicit links between natural language text and entries in an external knowledge graph (KG). For approximately 10% of all tokens, the dataset provides not only the surface form (token) but also the Wikidata Q-ID, the specific relation (such as "publication date" or "birthPlace") used to justify the mention, and the parent entity in the relevant local subgraph. The dataset comprises nearly the same set of Wikipedia articles and the same standardized train/validation/test splits as WikiText-2 (∼2 million tokens in training and ∼200,000 in each of the validation and test splits), allowing a direct comparison of models trained with and without access to explicit knowledge annotations.
The key innovation is to enable language models to “reach into” Wikidata at generation time, supporting the production of rare or previously unseen factual tokens rather than relying on memorized content or fallback unknown-token mechanisms.
2. Construction and Annotation Pipeline
Linked WikiText-2 is generated via a multi-stage distant-supervision pipeline that combines automated and manual signals:
a) Article Preprocessing: Each Wikipedia article is tokenized and segmented, and candidate spans are identified.
b) Initial Entity Linking:
- Extraction of all spans manually linked by Wikipedia editors via HTML source (i.e., inline links).
- Application of the neural-el system (Gupta et al. 2017) to detect additional mentions not explicitly linked by editors.
- Use of Stanford CoreNLP coreference resolution to associate pronouns/nominals with correct entities.
c) Local Knowledge Graph Construction: For each newly introduced Q-ID, all one-hop neighbors (parent, relation, entity) are fetched from Wikidata, expanding candidate mentions. If a later token matches any of these entities, it is annotated as "related," recording the parent entity and relation for context. A self-loop relation ("Reflexive") marks repeated entity mentions.
d) String-Matching of Dates and Quantities: Surface forms for all Wikidata date and quantity values are generated and matched to the text, assigning the corresponding Q-ID and truncating or normalizing precision according to Wikidata standards.
e) Final Cleanup: Pruning stale entities (out of context window), resolving overlapping mention spans, and discarding spurious matches.
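Step (c) can be illustrated with a minimal Python sketch. It assumes a toy in-memory graph (`TOY_KG`, hypothetical, with illustrative IDs) in place of real Wikidata queries, and it assumes entity linking has already produced one Q-ID per mention. The function name `annotate_mentions` and the output tuple layout are illustrative choices, not the authors' implementation, and the pruning from step (e) is not modeled.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy graph: parent Q-ID -> list of (relation, neighbor Q-ID).
# A real pipeline would fetch these one-hop neighbors from Wikidata.
TOY_KG: Dict[str, List[Tuple[str, str]]] = {
    "Q647249": [("P577", "Q734818"), ("P123", "Q8093")],
}

# (qid, mention type, parent qid, relation)
Annotation = Tuple[str, str, Optional[str], Optional[str]]


def annotate_mentions(linked_qids: List[str]) -> List[Annotation]:
    """Label each linked mention as 'new' or 'related' using a growing local graph."""
    # Every entity currently reachable in the local graph -> (parent, relation).
    reachable: Dict[str, Tuple[str, str]] = {}
    annotations: List[Annotation] = []
    for qid in linked_qids:
        if qid in reachable:
            parent, relation = reachable[qid]
            annotations.append((qid, "related", parent, relation))
        else:
            annotations.append((qid, "new", None, None))
        # Self-loop: a later repeat of this entity resolves to itself.
        reachable[qid] = (qid, "Reflexive")
        # Expand the local graph with the entity's one-hop neighbors,
        # keeping any link that was recorded earlier.
        for relation, neighbor in TOY_KG.get(qid, []):
            reachable.setdefault(neighbor, (qid, relation))
    return annotations


result = annotate_mentions(["Q647249", "Q734818", "Q647249", "Q42"])
for row in result:
    print(row)
```

In this sketch the second mention is reached through a relation from the first entity, the third is a repeat marked with the reflexive self-loop, and the fourth has no link into the local graph and is therefore "new".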
Preprocessing for tokenization and sentence segmentation follows the original WikiText-2 conventions. Out-of-vocabulary tokens, relative to the base 33,000-token vocabulary, are either mapped to a Wikidata alias (if present) or assigned to the standard UNK token.
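The string matching in step (d) can be sketched in the same spirit. The example below generates a few tokenized renderings of a date and scans a token list for them, preferring longer matches. The specific surface formats, the tokenization of the comma as a separate token, and the function names are assumptions made for illustration; the source does not list the formats the pipeline actually generates.

```python
from datetime import date
from typing import List, Tuple

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def date_surface_forms(d: date) -> List[List[str]]:
    """Return tokenized candidate renderings of a date, longest first."""
    month, day, year = MONTHS[d.month - 1], str(d.day), str(d.year)
    return [
        [month, day, ",", year],  # April 21 , 1989
        [day, month, year],       # 21 April 1989
        [month, year],            # April 1989 (reduced precision)
        [year],                   # 1989 (reduced precision)
    ]


def find_date_spans(tokens: List[str], d: date) -> List[Tuple[int, int]]:
    """Return non-overlapping (start, end) spans, end exclusive, that match a form."""
    forms = date_surface_forms(d)
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(tokens):
        for form in forms:  # longest form is tried first
            if tokens[i:i + len(form)] == form:
                spans.append((i, i + len(form)))
                i += len(form)
                break
        else:
            i += 1  # no form matched at this position
    return spans


tokens = "It was released on April 21 , 1989 in Japan".split()
spans = find_date_spans(tokens, date(1989, 4, 21))
print(spans)
```

Trying longer forms first keeps a full date from being annotated as a bare year, which is one simple way to avoid the overlapping spans that the cleanup step has to resolve.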
3. Dataset Content, Format, and Statistics
Linked WikiText-2 is released as separate JSONL files for train, validation, and test splits. Each line corresponds to one document and contains token and mention-level annotations:
```json
{
  "doc_id": string,
  "tokens": [ "Super", "Mario", "Land", ... ],
  "mentions": [
    {
      "start": 0,
      "end": 3,
      "qid": "Q647249",
      "type": "new",
      "parent_qid": null,
      "rel": null
    },
    {
      "start": 6,
      "end": 7,
      "qid": "Q734818",
      "type": "related",
      "parent_qid": "Q647249",
      "rel": "P577"
    },
    ...
  ]
}
```
Wikidata Q-IDs are provided in "Q…" string format; relations are given as property P-numbers. Each annotated span records its mention type, "new" or "related"; tokens outside any mention carry no entity ("∅").
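A short Python sketch shows how a record in this format might be consumed. The record below is a hypothetical example written to follow the schema above, and the helper name `mention_surface_forms` is illustrative. The sketch assumes that `end` is exclusive, which is consistent with the schema example where `start` 0 and `end` 3 cover the three tokens "Super", "Mario", "Land".

```python
import json
from typing import Dict, List, Tuple

# Hypothetical JSONL line following the schema above (not taken from the dataset).
line = json.dumps({
    "doc_id": "example",
    "tokens": ["Super", "Mario", "Land", "is", "a", "1989", "game"],
    "mentions": [
        {"start": 0, "end": 3, "qid": "Q647249", "type": "new",
         "parent_qid": None, "rel": None},
        {"start": 5, "end": 6, "qid": "Q734818", "type": "related",
         "parent_qid": "Q647249", "rel": "P577"},
    ],
})


def mention_surface_forms(record: Dict) -> List[Tuple[str, str, str]]:
    """Return (surface text, qid, mention type) for every annotated span."""
    tokens = record["tokens"]
    forms = []
    for mention in record["mentions"]:
        # Assumption: `end` is exclusive, as the schema example suggests.
        text = " ".join(tokens[mention["start"]:mention["end"]])
        forms.append((text, mention["qid"], mention["type"]))
    return forms


record = json.loads(line)
forms = mention_surface_forms(record)
for text, qid, kind in forms:
    print(f"{text!r} -> {qid} ({kind})")
```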
Statistics:
| Split | Documents | Tokens | Mention-tokens | Mention-spans | Unique entities | Unique relations |
|---|---|---|---|---|---|---|
| Train | 600 | 2,019,195 | 207,803 | 122,983 | 41,058 | 1,291 |
| Validation | 60 | 207,982 | 21,226 | 12,214 | 5,415 | 484 |
| Test | 60 | 236,062 | 24,441 | 15,007 | 5,625 | 504 |
Approximately 10% of all tokens mark an entity, date, or quantity. The training split alone covers over 41,000 unique entities and 1,291 relation types. Each linked entity is observed, on average, fewer than 5 times, reflecting a strong long-tail distribution. Relation types include commonly used ones such as "instance of," "subclass of," "publication date," and "birthPlace" (Logan IV et al., 2019).
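The summary figures in this paragraph can be checked directly against the statistics table. The sketch below copies the table values and derives three ratios per split; the key names and the `summarize` helper are illustrative. Mention spans per unique entity is only a rough proxy for how often an entity is observed, since the span counts may also include dates and quantities.

```python
from typing import Dict

# Values copied from the statistics table above.
SPLITS: Dict[str, Dict[str, int]] = {
    "train": {"tokens": 2_019_195, "mention_tokens": 207_803,
              "mention_spans": 122_983, "entities": 41_058},
    "validation": {"tokens": 207_982, "mention_tokens": 21_226,
                   "mention_spans": 12_214, "entities": 5_415},
    "test": {"tokens": 236_062, "mention_tokens": 24_441,
             "mention_spans": 15_007, "entities": 5_625},
}


def summarize(stats: Dict[str, int]) -> Dict[str, float]:
    """Derive coverage ratios from the raw counts of one split."""
    return {
        "mention_token_share": stats["mention_tokens"] / stats["tokens"],
        "spans_per_entity": stats["mention_spans"] / stats["entities"],
        "tokens_per_span": stats["mention_tokens"] / stats["mention_spans"],
    }


summaries = {name: summarize(stats) for name, stats in SPLITS.items()}
for name, s in summaries.items():
    print(f"{name}: {s['mention_token_share']:.1%} of tokens in mentions, "
          f"{s['spans_per_entity']:.2f} spans per entity, "
          f"{s['tokens_per_span']:.2f} tokens per span")
```

The mention-token share comes out slightly above 10% in every split, and the training split averages about three mention spans per unique entity, in line with the "fewer than 5 times" figure.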
4. Quality Assessment and Annotation Coverage
No explicit human-annotated gold standard is reported for precision/recall evaluation of the annotation pipeline. However, specific coverage metrics include:
- Nearly 100% capture of the first occurrence of each human-inserted Wikipedia link, by construction.
- The neural-el system increases entity mention coverage by an estimated 10–20%.
- Coreference resolution collects pronouns and nominals missed by direct linking.
- Automatic string-matching covers more than 90% of dates and quantities (informal spot checks estimate precision at above 95%).
The authors state that, despite omissions and mistakes, overall coverage is “high” and annotation quality is sufficient for training the Knowledge Graph Language Model (KGLM) to achieve strong gains in perplexity-based evaluation. No detailed breakdown (e.g., F₁ by mention type) is reported (Logan IV et al., 2019).
5. Distinctive Features and Limitations
Linked WikiText-2 differs from traditional entity-linking datasets (e.g., AIDA, TAC-KBP, ACE) in several key ways:
- It links not only named entities but also dates, quantities, and generic relations, providing richer factual context.
- Document coverage consists of contiguous, full-article Wikipedia texts, rather than isolated newswire sentences or short documents.
- Annotations align to a broad, cross-topic open-domain knowledge graph (Wikidata) and encompass varied relation types not restricted to named entities.
The dataset is designed for direct applicability to language modeling, with contiguous training/test splits suitable for natural text continuation and ready evaluation under standard metrics such as perplexity or factual completion.
Limitations include:
- Annotation is based on distant supervision, introducing possible linkage errors—e.g., entities not represented or correctly linked in Wikidata yield spurious or missing annotations.
- There is no held-out human-annotated gold set for intrinsic evaluation of entity-linking scores by category.
- The knowledge coverage reflects the state of Wikidata at the crawl time; subsequent additions are not retroactively annotated.
6. Applications and Comparison to Related Resources
The dataset enables targeted research on fact-aware language modeling, specifically facilitating models such as the KGLM that can access and copy knowledge graph facts relevant to the generation context. Experiments demonstrate that such models outperform standard LSTM-based language models and even very large neural LMs on tasks involving factual completion, out-of-vocabulary handling, and perplexity reduction (Logan IV et al., 2019).
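For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below uses this standard definition with natural-log probabilities; the function name and the toy inputs are illustrative, and it does not reproduce any evaluation-specific adjustments the authors may have applied.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy check: a model that assigns probability 0.25 to each of four tokens
# is as uncertain as a uniform choice among four options.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))
```

A model that can copy a rare entity name from the knowledge graph assigns that token a higher probability than one that must fall back to an unknown-token estimate, which lowers the average negative log-likelihood and therefore the perplexity.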
Compared with other benchmarks:
- Datasets like AIDA, TAC-KBP, or ACE are typically restricted to named entities and newswire genres, lack systematic coverage of quantities/dates, and are annotated on shorter spans.
- Linked WikiText-2 is aligned to a large, freely accessible KG, is open for research use, and is specifically structured for evaluating LM–KG integration.
7. Accessibility and Usage
Linked WikiText-2 is released in JSONL format (one document per line) for each of the train, validation, and test splits and is freely available, with no licensing restrictions, for download at https://rloganiv.github.io/linked-wikitext-2. Each instance provides both surface tokens and richly aligned Wikidata annotations, enabling reproducible research at the intersection of knowledge-based language generation, structured prediction, and factual reasoning (Logan IV et al., 2019).
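Because each split is a JSONL file, it can be streamed one document at a time without loading the whole file into memory. The sketch below assumes the field names from the schema in Section 3; the helper names are illustrative, and any file path passed in (for example a local `train.jsonl`) is hypothetical, since the source does not state the released file names.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Iterator


def iter_documents(path: Path) -> Iterator[Dict]:
    """Yield one parsed document per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_mention_types(path: Path) -> Counter:
    """Count how many mentions of each type a split contains."""
    counts: Counter = Counter()
    for doc in iter_documents(path):
        counts.update(mention["type"] for mention in doc["mentions"])
    return counts
```

Calling `count_mention_types` on a downloaded split would return a `Counter` keyed by mention type, which gives a quick sanity check of the "new" versus "related" balance before training.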