WikiText Dataset Overview
- WikiText is a high-quality language modeling corpus derived from curated Wikipedia articles, ensuring factual reliability and linguistic richness.
- It employs systematic tokenization, rigorous preprocessing, and document-level splits to support robust evaluation with metrics such as perplexity and unknown-penalized perplexity (UPP).
- Extensions such as Linked WikiText and WikiGraphs integrate structured knowledge annotations to enhance rare-word prediction and factual recall in language models.
The WikiText dataset is a suite of large-scale, high-quality language modeling corpora derived from English-language Wikipedia. It is specifically curated to support research on rare-word prediction, long-range dependency modeling, and knowledge-augmented language models. Several aligned variants and extensions exist for multilingual settings and integration with structured knowledge graphs.
1. Corpus Construction and Variants
The core WikiText corpora, WikiText-2 and WikiText-103, were introduced in "Pointer Sentinel Mixture Models" (Merity et al., 2016). Articles were sourced from English Wikipedia and strictly filtered by editorial quality: only articles rated “Good” (23,805) or “Featured” (4,790) were included, ensuring factual reliability and linguistic quality. Text extraction removed complex MediaWiki macros and list-style tables of contents, and replaced all mathematical/LaTeX content with a single ‹formula› token.
Tokenization uses the Moses tokenizer, with additional numeric splitting (e.g., “8,600” → “8 @,@ 600”) and custom punctuation processing. Article-level splits into train/validation/test ensure zero overlap of content across splits. WikiText-2 comprises 600/60/60 articles across splits, while WikiText-103 includes 28,475/60/60 for train/validation/test, respectively.
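To make the numeric-splitting rule concrete, here is a minimal Python sketch; the function name and regex are illustrative, and the released corpora were produced with the Moses pipeline rather than this snippet.

```python
import re

def split_numbers(text: str) -> str:
    """WikiText-style numeric splitting: '8,600' -> '8 @,@ 600'.

    A comma or period sandwiched between digits is wrapped in '@' markers
    so that each run of digits becomes its own token.
    """
    return re.sub(r"(?<=\d)([,.])(?=\d)", r" @\1@ ", text)

print(split_numbers("The city had 8,600 residents."))
# The city had 8 @,@ 600 residents.
```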
Table 1: Core WikiText Dataset Sizes (Merity et al., 2016)
| Corpus | Train Articles | Train Tokens | Vocabulary Size | OOV Rate |
|---|---|---|---|---|
| WikiText-2 | 600 | 2,088,628 | 33,278 | 2.6% |
| WikiText-103 | 28,475 | 103,227,021 | 267,735 | 0.4% |
The vocabulary omits tokens with training set frequency below 3; these are uniformly mapped to the special ‹unk› token.
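A minimal sketch of this frequency cutoff follows; the function names (build_vocab, apply_vocab) are illustrative and not part of any released tooling.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(train_tokens, min_count=3):
    """Keep tokens seen at least `min_count` times in training;
    everything rarer is mapped to <unk>, per the WikiText cutoff."""
    counts = Counter(train_tokens)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    vocab.add(UNK)
    return vocab

def apply_vocab(tokens, vocab):
    """Replace out-of-vocabulary tokens with <unk>."""
    return [tok if tok in vocab else UNK for tok in tokens]
```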
2. Dataset Extensions: Linked WikiText and Multilingual Corpora
Linked WikiText-2
Linked WikiText-2 augments WikiText-2 with dense entity-level annotations aligned to Wikidata, providing a bridge between narrative text and structured KGs ["Barack's Wife Hillary: Using Knowledge-Graphs for Fact-Aware Language Modeling" (Logan et al., 2019)]. Raw HTML was used to extract and preserve internal Wikipedia hyperlinks. Entity linking proceeded via:
- Gold-standard hyperlinks mapped to Wikidata Q-IDs.
- Supplementary neural entity linking (Gupta et al., 2017).
- Coreference resolution using Stanford CoreNLP.
As each entity mention is encountered, one-hop Wikidata neighbors are dynamically added to the candidate set; when a previously seen entity recurs, the (parent, relation, child) triple linking it to an earlier mention is recorded as a plausible justification for the mention. Expansion via alias matching and explicit heuristics (for dates and quantities) further increases coverage. The per-token annotation format tracks mention type, Wikidata entity, possible parent, and relation (a sketch of such a record appears below). The corpus yields 600 train / 60 dev / 60 test documents, with 2,019,195 / 207,982 / 236,062 tokens, respectively. Roughly 10% of all tokens are annotated as entity mentions.
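A hypothetical per-token record mirroring these fields might look as follows; the released corpus uses its own serialization, so the class and field names here are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenAnnotation:
    """Illustrative per-token annotation for Linked WikiText-2."""
    token: str
    mention_type: Optional[str] = None  # e.g. "new", "related", or a literal type
    entity_id: Optional[str] = None     # Wikidata Q-ID, e.g. "Q76"
    parent_id: Optional[str] = None     # prior entity this mention relates to
    relation_id: Optional[str] = None   # Wikidata property, e.g. "P26"
```

Unlinked tokens would simply leave all optional fields as None.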
WikiText-TL-39 (Tagalog)
WikiText-TL-39 (Cruz et al., 2019) extends the WikiText methodology to Tagalog, built from all Tagalog Wikipedia articles (approximately 75,000). There is no featured-article filter; all article pages (titles A–Z) are included. Tokenization follows Unicode normalization and Moses rules. For BERT experiments, a SentencePiece BPE vocabulary of 290k or 30k was used (a training sketch follows the table below), while ULMFiT experiments used case-preserving markers and a top-30k word-level vocabulary.

Table: WikiText-TL-39 Data Statistics
| Split | Documents | Tokens | Unique Tokens |
|---|---|---|---|
| Train | 120,975 | 39,267,089 | 279,153 |
| Validation | 25,919 | 8,356,898 | 164,159 |
| Test | 25,921 | 8,333,288 | 175,999 |
OOV tokens in test constitute 0.102% of the split.
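As a sketch of how the 30k SentencePiece BPE vocabulary mentioned above could be built; the file names and flags are assumptions, since the exact command is not published.

```python
import sentencepiece as spm

# Train a BPE model on the raw training split (hypothetical file name).
spm.SentencePieceTrainer.train(
    input="wikitext-tl-39.train.txt",
    model_prefix="tl39_bpe",
    vocab_size=30000,
    model_type="bpe",
    character_coverage=1.0,  # keep all characters after Unicode normalization
)

sp = spm.SentencePieceProcessor(model_file="tl39_bpe.model")
print(sp.encode("Ang Pilipinas ay isang bansa.", out_type=str))
```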
3. Annotation, Preprocessing, and Quality Control
The dataset preserves punctuation, original case, and newlines (encoded as ‹eos›), applying only minimal preprocessing to maintain linguistic fidelity. In Linked WikiText-2, each token is annotated with:
- mention type
- linked Wikidata entity (Q-ID)
- parent (the prior entity, when the mention type is "related")
- relation (the Wikidata relation connecting parent and entity)

Unlinked tokens have all annotation fields set to a null value. All entity and relation embeddings for KG-aligned extensions are pretrained on a 2-hop subgraph with a TransE-style margin objective:

$$
\mathcal{L} = \sum_{(p,\,r,\,e)} \max\bigl(0,\; \gamma + \lVert \mathbf{p} + \mathbf{r} - \mathbf{e} \rVert - \lVert \mathbf{p}' + \mathbf{r}' - \mathbf{e}' \rVert\bigr),
$$

where $(p', r', e')$ is a corrupted (negative) triple and $\gamma$ is the margin.
Stale (never re-mentioned) entities may be pruned from the local KG. No manual correction was applied; thus occasional annotation errors (e.g., generic relations) persist, but coverage remains high.
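The margin objective above can be sketched in a few lines of NumPy; the names and the choice of L2 distance are illustrative, and the actual pretraining used the authors' own pipeline.

```python
import numpy as np

def transe_distance(h, r, t):
    """TransE distance d(h + r, t): smaller means a more plausible triple."""
    return np.linalg.norm(h + r - t)

def transe_margin_loss(pos, neg, gamma=1.0):
    """Hinge loss pushing a true (parent, relation, entity) triple at least
    `gamma` closer than a corrupted one."""
    (h, r, t), (hn, rn, tn) = pos, neg
    return max(0.0, gamma + transe_distance(h, r, t) - transe_distance(hn, rn, tn))
```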
4. Evaluation Protocols and Metrics
The standard evaluation metric is word-level perplexity (PPL), computed as

$$
\mathrm{PPL}(w_1, \dots, w_N) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right)
$$

for a held-out sequence $w_1, \dots, w_N$. All baselines report PPL inclusive of ‹unk› and sentence markers. For knowledge-enhanced datasets and models, additional metrics include unknown-penalized perplexity (UPP), which discounts each ‹unk› prediction by the number of out-of-vocabulary types it stands in for, and factual accuracy in cloze-style completion tasks (e.g., @5 accuracy for entity prediction).
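Both metrics follow directly from per-token log probabilities; a minimal sketch follows (the UPP normalization here, dividing each ‹unk› probability by the number of OOV types, is the standard formulation and an assumption about these papers' exact setup).

```python
import math

def perplexity(log_probs):
    """Word-level PPL from natural-log probabilities log p(w_i | w_<i),
    including <unk> and sentence markers."""
    return math.exp(-sum(log_probs) / len(log_probs))

def upp(log_probs, is_unk, num_oov_types):
    """Unknown-penalized PPL: discount each <unk> prediction by the
    number of out-of-vocabulary types it stands in for."""
    penalized = [lp - (math.log(num_oov_types) if u else 0.0)
                 for lp, u in zip(log_probs, is_unk)]
    return math.exp(-sum(penalized) / len(penalized))
```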
5. Benchmark Applications and Use Cases
WikiText corpora enable benchmarking of neural language models on tasks involving rare-word prediction, long-range dependencies, and open-domain factual recall. For instance, pointer sentinel mixture models (Merity et al., 2016), AWD-LSTM baselines, and the Knowledge Graph Language Model (KGLM) have been benchmarked on WikiText and its linked extensions. Notably, Linked WikiText-2 enables evaluation of fact-aware LMs: KGLM achieves PPL 44.1 versus 74.8 for AWD-LSTM, and up to 95% @5 factual completion accuracy given gold annotations (Logan et al., 2019).
WikiText-TL-39 supports evaluation of language modeling and downstream classification for Tagalog via BERT and ULMFiT, with recommendations for suitable hyperparameters. ULMFiT is highlighted as a cost-effective baseline, while BERT offers the best performance when ample compute is available. Both methods are robust to moderate reductions in training data (≤ 0.08 error increase down to 1k examples).
6. Integration with Knowledge Graphs
Several datasets further pair WikiText articles with external knowledge graphs for graph-to-text and text-to-graph tasks. WikiGraphs (Wang et al., 2021) aligns WikiText-103 articles with 1-hop Freebase subgraphs, providing 23,522 article-graph pairs (mean ≈39 nodes/graph, ≈48 edges/graph, ≈3.5k tokens/article). This facilitates benchmark tasks in graph-conditioned text generation and retrieval. Notably, conditioning on GNN-encoded graphs increases the topical fidelity of generated text (reverse BLEU roughly triples relative to the text-only baseline), though it does not yield perplexity gains.
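A hypothetical container for one WikiGraphs example, with field names invented for illustration (the released dataset ships its own format):

```python
from dataclasses import dataclass

@dataclass
class ArticleGraphPair:
    """One WikiText-103 article paired with the 1-hop Freebase subgraph
    of its central entity (illustrative structure only)."""
    title: str
    text: str                           # about 3.5k tokens on average
    nodes: list[str]                    # Freebase entities/literals, ~39 per graph
    edges: list[tuple[str, str, str]]   # (subject, predicate, object), ~48 per graph
```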
7. Impact, Licensing, and Limitations
The WikiText datasets, and their extensions, serve as intermediate-scale, realistic corpora bridging the gap between heavily preprocessed corpora (e.g., Penn Treebank) and web-scale datasets. Their design enables rigorous evaluation of linguistic modeling, rare-word handling, and structured knowledge integration—especially through split protocols preserving document-level context and annotation schemes mapping to real KGs.
All releases inherit the Creative Commons Attribution-ShareAlike 3.0 license from Wikipedia and Wikidata; redistribution must comply accordingly (Merity et al., 2016). A limitation is the static nature of text snapshots and occasional annotation noise in distantly supervised variants. Extensions such as WikiGraphs and Linked WikiText-2 point toward ongoing research in tightly coupled text-graph modeling and scalable knowledge-aware language understanding.
References:
- "Pointer Sentinel Mixture Models" (Merity et al., 2016)
- "Barack's Wife Hillary: Using Knowledge-Graphs for Fact-Aware Language Modeling" (IV et al., 2019)
- "Evaluating LLM Finetuning Techniques for Low-resource Languages" (Cruz et al., 2019)
- "WikiGraphs: A Wikipedia Text - Knowledge Graph Paired Dataset" (Wang et al., 2021)