
WikiText Dataset Overview

Updated 16 March 2026
  • WikiText is a high-quality language modeling corpus derived from curated Wikipedia articles, ensuring factual reliability and rich linguistic data.
  • It employs systematic tokenization, rigorous preprocessing, and document-level splits to support robust evaluation using metrics like perplexity and UPP.
  • Extensions such as Linked WikiText and WikiGraphs integrate structured knowledge annotations to enhance rare-word prediction and factual recall in language models.

The WikiText dataset is a suite of large-scale, high-quality language modeling corpora derived from English-language Wikipedia. It is specifically curated to support research on rare-word prediction, long-range dependency modeling, and knowledge-augmented LLMs. Several aligned variants and extensions exist for multilingual settings and integration with structured knowledge graphs.

1. Corpus Construction and Variants

The core WikiText corpora, WikiText-2 and WikiText-103, were introduced in "Pointer Sentinel Mixture Models" (Merity et al., 2016). Articles were sourced from English Wikipedia and strictly filtered by editorial quality: only articles rated “Good” (23,805) or “Featured” (4,790) were included, ensuring factual reliability and linguistic quality. Text extraction stripped complex MediaWiki markup and table-of-contents lists, and replaced all mathematical/LaTeX content with a single ‹formula› token.

Tokenization uses the Moses tokenizer, with additional numeric splitting (e.g., “8,600” → “8 @,@ 600”) and custom punctuation processing. Article-level splits into train/validation/test ensure zero overlap of content across splits. WikiText-2 comprises 600/60/60 articles across splits, while WikiText-103 includes 28,475/60/60 for train/validation/test, respectively.
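The numeric-splitting rule can be sketched as a small regex pass (an illustrative simplification, not the actual Moses-based pipeline; the function name is ours):

```python
import re

def escape_numbers(text: str) -> str:
    """Sketch of WikiText-style number escaping.

    Separators between two digits become ' @,@ ' / ' @.@ ' tokens so that
    "8,600" can be restored to a single number after detokenization.
    (Illustrative only; the real pipeline runs the Moses tokenizer first.)
    """
    text = re.sub(r"(?<=\d),(?=\d)", " @,@ ", text)
    text = re.sub(r"(?<=\d)\.(?=\d)", " @.@ ", text)
    return text
```

For example, `escape_numbers("8,600")` yields `"8 @,@ 600"`.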

Table 1: Core WikiText Dataset Sizes (Merity et al., 2016)

Corpus         Train Articles   Train Tokens   Vocabulary Size   OOV Rate
WikiText-2                600      2,088,628            33,278       2.6%
WikiText-103           28,475    103,227,021           267,735       0.4%

The vocabulary omits tokens with training set frequency below 3; these are uniformly mapped to the special ‹unk› token.
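The frequency cutoff can be sketched as follows (a minimal illustration; function names are ours):

```python
from collections import Counter

def build_vocab(train_tokens, min_freq=3, unk="<unk>"):
    """Keep tokens appearing at least `min_freq` times in training.

    Mirrors the WikiText rule above: rarer tokens fall out of the
    vocabulary and are mapped to the special unknown token.
    """
    counts = Counter(train_tokens)
    vocab = {tok for tok, c in counts.items() if c >= min_freq}
    vocab.add(unk)
    return vocab

def map_to_vocab(tokens, vocab, unk="<unk>"):
    """Replace out-of-vocabulary tokens with `unk`."""
    return [tok if tok in vocab else unk for tok in tokens]
```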

2. Dataset Extensions: Linked WikiText and Multilingual Corpora

Linked WikiText-2

Linked WikiText-2 augments WikiText-2 with dense entity-level annotations aligned to Wikidata, bridging narrative text and structured KGs ["Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling" (Logan et al., 2019)]. Raw HTML was used to extract and preserve internal Wikipedia hyperlinks. Entity linking proceeded via:

  1. Gold-standard hyperlinks mapped to Wikidata Q-IDs.
  2. Supplementary neural entity linking (Gupta et al., 2017).
  3. Coreference resolution using Stanford CoreNLP.

As each entity mention is encountered, its one-hop Wikidata neighbors are dynamically added to the candidate set; when a previously seen entity is linked again, the (parent, relation, child) triple connecting it to an earlier mention is recorded as a plausible justification for the mention. Alias matching and explicit heuristics (for dates and quantities) further increase coverage. Each token's annotation records its mention type, Wikidata entity, possible parent, and relation. The corpus comprises 600 train / 60 dev / 60 test documents with 2,019,195 / 207,982 / 236,062 tokens, respectively; roughly 10% of all tokens are annotated as entity mentions.
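A per-token annotation record of this form can be represented roughly as below (a hypothetical schema for illustration; field and class names are ours, not the release format, though the Q-IDs and property shown are real Wikidata identifiers):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MentionType(Enum):
    NEW = "new"          # first mention of an entity
    RELATED = "related"  # entity reachable from a prior mention via a KG relation
    NONE = "none"        # token is not part of any entity mention

@dataclass
class TokenAnnotation:
    token: str
    mention_type: MentionType = MentionType.NONE
    entity: Optional[str] = None    # Wikidata Q-ID, e.g. "Q6294"
    parent: Optional[str] = None    # Q-ID of the parent entity, if "related"
    relation: Optional[str] = None  # Wikidata property, e.g. "P26" (spouse)

# Example: "Hillary" linked as the spouse (P26) of a previously
# mentioned "Barack Obama" (Q76).
ann = TokenAnnotation(
    token="Hillary",
    mention_type=MentionType.RELATED,
    entity="Q6294",   # Hillary Clinton
    parent="Q76",     # Barack Obama
    relation="P26",   # spouse
)
```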

WikiText-TL-39 (Tagalog)

WikiText-TL-39 (Cruz et al., 2019) extends the WikiText methodology to Tagalog, built from all Tagalog Wikipedia articles (approximately 75,000). There is no featured-article filter; all pages with titles A–Z are included. Tokenization follows Unicode normalization and Moses rules. For BERT experiments, a SentencePiece BPE vocabulary of 290k or 30k was used, while ULMFiT experiments used a case-preserving marker and a top-30k word-level vocabulary.

Table: WikiText-TL-39 Data Statistics

Split            Documents        Tokens   Unique Tokens
Train              120,975    39,267,089         279,153
Validation          25,919     8,356,898         164,159
Test                25,921     8,333,288         175,999

OOV tokens in test constitute 0.102% of the split.

3. Annotation, Preprocessing, and Quality Control

The dataset preserves punctuation, original case, and newlines (as ‹eos›), and includes minimal preprocessing to maintain linguistic fidelity. In Linked WikiText-2, tokens are annotated with:

  • mention type $t_t \in \{\text{new}, \text{related}, \emptyset\}$
  • linked Wikidata entity $e_t$ (Q-ID) or $\emptyset$
  • parent $p_t$ (prior entity, if "related") or $\emptyset$
  • relation $r_t$ (Wikidata relation) or $\emptyset$

Unlinked tokens have all annotation fields set to $\emptyset$. All entity and relation embeddings for KG-aligned extensions are pretrained on a 2-hop subgraph with a TransE objective:

$$\delta(v_p, v_r, v_e) = \|v_p + v_r - v_e\|^2$$

$$\mathcal{L} = \max\left(0,\; \gamma + \delta(v_p, v_r, v_e) - \delta(v_p', v_r, v_e')\right)$$
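The objective can be exercised numerically as a quick sketch (pure Python, following the squared-norm scoring form used here; in a real setup the embeddings would be trained and corrupted triples sampled at random):

```python
def delta(v_p, v_r, v_e):
    """TransE score ||v_p + v_r - v_e||^2: low when parent + relation ≈ entity."""
    return sum((p + r - e) ** 2 for p, r, e in zip(v_p, v_r, v_e))

def transe_margin_loss(pos, neg, gamma=1.0):
    """Hinge loss: require the corrupted triple `neg` to score at least
    `gamma` worse than the true triple `pos`."""
    return max(0.0, gamma + delta(*pos) - delta(*neg))

# A triple that fits perfectly (v_p + v_r == v_e) scores 0 and, against a
# sufficiently bad corruption, incurs zero loss.
v_p, v_r, v_e = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
v_bad = [0.0, 0.0]
loss = transe_margin_loss((v_p, v_r, v_e), (v_p, v_r, v_bad))
```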

Stale (never re-mentioned) entities may be pruned from the local KG. No manual correction was applied; thus occasional annotation errors (e.g., generic relations) persist, but coverage remains high.

4. Evaluation Protocols and Metrics

The standard evaluation metric is word-level perplexity (PPL), computed as

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right)$$

for a held-out sequence $w_{1:N}$. All baselines report PPL inclusive of ‹unk› and sentence markers. For knowledge-enhanced datasets and models, additional metrics include unknown-penalized perplexity (UPP) and factual accuracy on cloze-style completion tasks (e.g., @5 accuracy for entity prediction).
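Given per-token model probabilities, the formula can be computed directly (a minimal sketch; the function name is ours):

```python
import math

def perplexity(token_probs):
    """Word-level perplexity from probabilities p(w_i | w_<i); lower is better.

    Implements PPL = exp(-(1/N) * sum_i log p(w_i | w_<i)).
    """
    n = len(token_probs)
    neg_log_likelihood = -sum(math.log(p) for p in token_probs)
    return math.exp(neg_log_likelihood / n)
```

A model that assigns uniform probability 1/V to every token has perplexity exactly V, which is why PPL is often read as an effective branching factor.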

5. Benchmark Applications and Use Cases

WikiText corpora enable benchmarking of neural language models on rare-word prediction, long-range dependencies, and open-domain factual recall. For instance, pointer sentinel mixture models (Merity et al., 2016), AWD-LSTM baselines, and knowledge graph language models (KGLM) have been benchmarked on WikiText and its linked extensions. Notably, Linked WikiText-2 enables evaluation of fact-aware LMs: KGLM achieves PPL 44.1 versus 74.8 for AWD-LSTM, and up to 95% @5 factual completion accuracy given gold annotations (Logan et al., 2019).

WikiText-TL-39 supports evaluation of language modeling and downstream classification for Tagalog via BERT and ULMFiT, with recommendations for optimal hyperparameters. ULMFiT is highlighted as a cost-effective baseline, while BERT achieves the best performance when ample compute is available. Both methods are robust to moderate reductions in training data (≤ 0.08 error increase down to 1k examples).

6. Integration with Knowledge Graphs

Several datasets further pair WikiText articles with external knowledge graphs for graph-to-text and text-to-graph tasks. WikiGraphs (Wang et al., 2021) aligns WikiText-103 articles with 1-hop Freebase subgraphs, providing 23,522 article-graph pairs (mean ≈39 nodes/graph, ≈48 edges/graph, ≈3.5k tokens/article). This facilitates benchmark tasks in graph-conditioned text generation and retrieval. Notably, conditioning on GNN-encoded graphs increases the topical fidelity of generated text (reverse BLEU roughly triples relative to the text-only baseline), though it does not yield perplexity gains.

7. Impact, Licensing, and Limitations

The WikiText datasets, and their extensions, serve as intermediate-scale, realistic corpora bridging the gap between heavily preprocessed corpora (e.g., Penn Treebank) and web-scale datasets. Their design enables rigorous evaluation of linguistic modeling, rare-word handling, and structured knowledge integration—especially through split protocols preserving document-level context and annotation schemes mapping to real KGs.

All releases inherit the Creative Commons Attribution-ShareAlike 3.0 license from Wikipedia and Wikidata; redistribution must comply accordingly (Merity et al., 2016). A limitation is the static nature of text snapshots and occasional annotation noise in distantly supervised variants. Extensions such as WikiGraphs and Linked WikiText-2 point toward ongoing research in tightly coupled text-graph modeling and scalable knowledge-aware language understanding.
