WikiText Dataset Overview
- WikiText is a high-quality language modeling corpus derived from curated Wikipedia articles, ensuring factual reliability and linguistic richness.
- It employs systematic tokenization, rigorous preprocessing, and document-level splits to support robust evaluation with metrics such as perplexity and unknown-penalized perplexity (UPP).
- Extensions such as Linked WikiText and WikiGraphs integrate structured knowledge annotations to enhance rare-word prediction and factual recall in language models.
The WikiText dataset is a suite of large-scale, high-quality language modeling corpora derived from English-language Wikipedia. It is specifically curated to support research on rare-word prediction, long-range dependency modeling, and knowledge-augmented language models. Several aligned variants and extensions exist for multilingual settings and integration with structured knowledge graphs.
1. Corpus Construction and Variants
The core WikiText corpora, WikiText-2 and WikiText-103, were introduced in "Pointer Sentinel Mixture Models" (Merity et al., 2016). Articles were sourced from English Wikipedia and strictly filtered by editorial quality: only articles rated “Good” (23,805) or “Featured” (4,790) were included, ensuring factual reliability and linguistic quality. Text extraction removed complex MediaWiki macros and list-style tables of contents, and replaced all mathematical/LaTeX content with a single ‹formula› token.
Tokenization uses the Moses tokenizer, with additional numeric splitting (e.g., “8,600” → “8 @,@ 600”) and custom punctuation processing. Article-level splits into train/validation/test ensure zero overlap of content across splits. WikiText-2 comprises 600/60/60 articles across splits, while WikiText-103 includes 28,475/60/60 for train/validation/test, respectively.
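To make the numeric-splitting rule concrete, here is a minimal Python sketch; the function name and regex are illustrative, and the released corpora were produced with the Moses pipeline rather than this snippet.

```python
import re

def split_numbers(text: str) -> str:
    """WikiText-style numeric splitting: '8,600' -> '8 @,@ 600'.

    A comma or period sandwiched between digits is wrapped in '@' markers
    so that each run of digits becomes its own token.
    """
    return re.sub(r"(?<=\d)([,.])(?=\d)", r" @\1@ ", text)

print(split_numbers("The city had 8,600 residents."))
# The city had 8 @,@ 600 residents.
```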
Table 1: Core WikiText Dataset Sizes (Merity et al., 2016)
| Corpus | Train Articles | Train Tokens | Vocabulary Size | OOV Rate |
|---|---|---|---|---|
| WikiText-2 | 600 | 2,088,628 | 33,278 | 2.6% |
| WikiText-103 | 28,475 | 103,227,021 | 267,735 | 0.4% |
The vocabulary omits tokens with training set frequency below 3; these are uniformly mapped to the special ‹unk› token.
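A minimal sketch of this frequency cutoff follows; the function names (build_vocab, apply_vocab) are illustrative and not part of any released tooling.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(train_tokens, min_count=3):
    """Keep tokens seen at least `min_count` times in training;
    everything rarer is mapped to <unk>, per the WikiText cutoff."""
    counts = Counter(train_tokens)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    vocab.add(UNK)
    return vocab

def apply_vocab(tokens, vocab):
    """Replace out-of-vocabulary tokens with <unk>."""
    return [tok if tok in vocab else UNK for tok in tokens]
```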
2. Dataset Extensions: Linked WikiText and Multilingual Corpora
Linked WikiText-2
Linked WikiText-2 augments WikiText-2 with dense entity-level annotations aligned to Wikidata, providing a bridge between narrative text and structured KGs ["Barack's Wife Hillary: Using Knowledge-Graphs for Fact-Aware Language Modeling" (Logan et al., 2019)]. Raw HTML was used to extract and preserve internal Wikipedia hyperlinks. Entity linking proceeded via:
- Gold-standard hyperlinks mapped to Wikidata Q-IDs.
- Supplementary neural entity linking (Gupta et al., 2017).
- Coreference resolution using Stanford CoreNLP.
As each entity mention is encountered, one-hop Wikidata neighbors are dynamically added to the candidate set; when a previously seen entity recurs, the (parent, relation, child) triple linking it to an earlier mention is recorded as a plausible justification for the mention. Expansion via alias matching and explicit heuristics (for dates and quantities) further increases coverage. The per-token annotation format tracks mention type, Wikidata entity, possible parent, and relation (a sketch of such a record appears below). The corpus yields 600 train / 60 dev / 60 test documents, with 2,019,195 / 207,982 / 236,062 tokens, respectively. Roughly 10% of all tokens are annotated as entity mentions.
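A hypothetical per-token record mirroring these fields might look as follows; the released corpus uses its own serialization, so the class and field names here are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenAnnotation:
    """Illustrative per-token annotation for Linked WikiText-2."""
    token: str
    mention_type: Optional[str] = None  # e.g. "new", "related", or a literal type
    entity_id: Optional[str] = None     # Wikidata Q-ID, e.g. "Q76"
    parent_id: Optional[str] = None     # prior entity this mention relates to
    relation_id: Optional[str] = None   # Wikidata property, e.g. "P26"
```

Unlinked tokens would simply leave all optional fields as None.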
WikiText-TL-39 (Tagalog)
WikiText-TL-39 (Cruz et al., 2019) extends the WikiText methodology to Tagalog, built from all Tagalog Wikipedia articles (approximately 75,000). There is no featured-article filter; all article pages (titles A–Z) are included. Tokenization follows Unicode normalization and Moses rules. For BERT experiments, a SentencePiece BPE vocabulary of 290k or 30k was used (a training sketch follows the table below), while ULMFiT experiments used case-preserving markers and a top-30k word-level vocabulary.

Table: WikiText-TL-39 Data Statistics
| Split | Documents | Tokens | Unique Tokens |
|---|---|---|---|
| Train | 120,975 | 39,267,089 | 279,153 |
| Validation | 25,919 | 8,356,898 | 164,159 |
| Test | 25,921 | 8,333,288 | 175,999 |
OOV tokens in test constitute 0.102% of the split.
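As a sketch of how the 30k SentencePiece BPE vocabulary mentioned above could be built; the file names and flags are assumptions, since the exact command is not published.

```python
import sentencepiece as spm

# Train a BPE model on the raw training split (hypothetical file name).
spm.SentencePieceTrainer.train(
    input="wikitext-tl-39.train.txt",
    model_prefix="tl39_bpe",
    vocab_size=30000,
    model_type="bpe",
    character_coverage=1.0,  # keep all characters after Unicode normalization
)

sp = spm.SentencePieceProcessor(model_file="tl39_bpe.model")
print(sp.encode("Ang Pilipinas ay isang bansa.", out_type=str))
```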
3. Annotation, Preprocessing, and Quality Control
The dataset preserves punctuation, original case, and newlines (encoded as ‹eos›), applying only minimal preprocessing to maintain linguistic fidelity. In Linked WikiText-2, each token is annotated with:
- mention type
- linked Wikidata entity (Q-ID)
- parent (the prior entity, when the mention type is "related")
- relation (the Wikidata relation connecting parent and entity)

Unlinked tokens have all annotation fields set to a null value. All entity and relation embeddings for KG-aligned extensions are pretrained on a 2-hop subgraph with a TransE-style margin objective:

$$
\mathcal{L} = \sum_{(p,\,r,\,e)} \max\bigl(0,\; \gamma + \lVert \mathbf{p} + \mathbf{r} - \mathbf{e} \rVert - \lVert \mathbf{p}' + \mathbf{r}' - \mathbf{e}' \rVert\bigr),
$$

where $(p', r', e')$ is a corrupted (negative) triple and $\gamma$ is the margin.
Stale (never re-mentioned) entities may be pruned from the local KG. No manual correction was applied; thus occasional annotation errors (e.g., generic relations) persist, but coverage remains high.
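The margin objective above can be sketched in a few lines of NumPy; the names and the choice of L2 distance are illustrative, and the actual pretraining used the authors' own pipeline.

```python
import numpy as np

def transe_distance(h, r, t):
    """TransE distance d(h + r, t): smaller means a more plausible triple."""
    return np.linalg.norm(h + r - t)

def transe_margin_loss(pos, neg, gamma=1.0):
    """Hinge loss pushing a true (parent, relation, entity) triple at least
    `gamma` closer than a corrupted one."""
    (h, r, t), (hn, rn, tn) = pos, neg
    return max(0.0, gamma + transe_distance(h, r, t) - transe_distance(hn, rn, tn))
```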
4. Evaluation Protocols and Metrics
The standard evaluation metric is word-level perplexity (PPL), computed as

$$
\mathrm{PPL}(w_1, \dots, w_N) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right)
$$

for a held-out sequence $w_1, \dots, w_N$. All baselines report PPL inclusive of ‹unk› and sentence markers. For knowledge-enhanced datasets and models, additional metrics include unknown-penalized perplexity (UPP), which discounts each ‹unk› prediction by the number of out-of-vocabulary types it stands in for, and factual accuracy in cloze-style completion tasks (e.g., @5 accuracy for entity prediction).
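Both metrics follow directly from per-token log probabilities; a minimal sketch follows (the UPP normalization here, dividing each ‹unk› probability by the number of OOV types, is the standard formulation and an assumption about these papers' exact setup).

```python
import math

def perplexity(log_probs):
    """Word-level PPL from natural-log probabilities log p(w_i | w_<i),
    including <unk> and sentence markers."""
    return math.exp(-sum(log_probs) / len(log_probs))

def upp(log_probs, is_unk, num_oov_types):
    """Unknown-penalized PPL: discount each <unk> prediction by the
    number of out-of-vocabulary types it stands in for."""
    penalized = [lp - (math.log(num_oov_types) if u else 0.0)
                 for lp, u in zip(log_probs, is_unk)]
    return math.exp(-sum(penalized) / len(penalized))
```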
5. Benchmark Applications and Use Cases
WikiText corpora enable benchmarking of neural language models on tasks involving rare-word prediction, long-range dependencies, and open-domain factual recall. For instance, pointer sentinel mixture models (Merity et al., 2016), AWD-LSTM baselines, and the Knowledge Graph Language Model (KGLM) have been benchmarked on WikiText and its linked extensions. Notably, Linked WikiText-2 enables evaluation of fact-aware LMs: KGLM achieves PPL 44.1 versus 74.8 for AWD-LSTM, and up to 95% @5 factual completion accuracy given gold annotations (Logan et al., 2019).
WikiText-TL-39 supports evaluation of language modeling and downstream classification for Tagalog via BERT and ULMFiT, with recommendations for suitable hyperparameters. ULMFiT is highlighted as a cost-effective baseline, while BERT offers the best performance when ample compute is available. Both methods are robust to moderate reductions in training data (≤ 0.08 error increase down to 1k examples).
6. Integration with Knowledge Graphs
Several datasets further pair WikiText articles with external knowledge graphs for graph-to-text and text-to-graph tasks. WikiGraphs (Wang et al., 2021) aligns WikiText-103 articles with 1-hop Freebase subgraphs, providing 23,522 article-graph pairs (mean ≈39 nodes/graph, ≈48 edges/graph, ≈3.5k tokens/article). This facilitates benchmark tasks in graph-conditioned text generation and retrieval. Notably, conditioning on GNN-encoded graphs increases the topical fidelity of generated text (reverse BLEU roughly triples relative to the text-only baseline), though it does not yield perplexity gains.
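A hypothetical container for one WikiGraphs example, with field names invented for illustration (the released dataset ships its own format):

```python
from dataclasses import dataclass

@dataclass
class ArticleGraphPair:
    """One WikiText-103 article paired with the 1-hop Freebase subgraph
    of its central entity (illustrative structure only)."""
    title: str
    text: str                           # about 3.5k tokens on average
    nodes: list[str]                    # Freebase entities/literals, ~39 per graph
    edges: list[tuple[str, str, str]]   # (subject, predicate, object), ~48 per graph
```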
7. Impact, Licensing, and Limitations
The WikiText datasets, and their extensions, serve as intermediate-scale, realistic corpora bridging the gap between heavily preprocessed corpora (e.g., Penn Treebank) and web-scale datasets. Their design enables rigorous evaluation of linguistic modeling, rare-word handling, and structured knowledge integration—especially through split protocols preserving document-level context and annotation schemes mapping to real KGs.
All releases inherit the Creative Commons Attribution-ShareAlike 3.0 license from Wikipedia and Wikidata; redistribution must comply accordingly (Merity et al., 2016). A limitation is the static nature of text snapshots and occasional annotation noise in distantly supervised variants. Extensions such as WikiGraphs and Linked WikiText-2 point toward ongoing research in tightly coupled text-graph modeling and scalable knowledge-aware language understanding.
References:
- "Pointer Sentinel Mixture Models" (Merity et al., 2016)
- "Barack's Wife Hillary: Using Knowledge-Graphs for Fact-Aware Language Modeling" (IV et al., 2019)
- "Evaluating LLM Finetuning Techniques for Low-resource Languages" (Cruz et al., 2019)
- "WikiGraphs: A Wikipedia Text - Knowledge Graph Paired Dataset" (Wang et al., 2021)