
Wikidata-TekGen: Scalable KG-to-Text Generation

Updated 2 April 2026
  • Wikidata-TekGen is a methodology that converts structured Wikidata triples into coherent, factual sentences using scalable neural pipelines.
  • It uses a fine-tuned T5-large generator and a BERT-based semantic filter to achieve broad coverage and semantic fidelity while limiting hallucinations.
  • The corpus comprises 18 million synthetic sentences, bolstering language model pre-training and enhancing performance in QA and knowledge probing tasks.

Wikidata-TekGen refers to a family of methodologies and corpora for generating natural-language text from structured Wikidata statements, designed to support large-scale knowledge graph (KG) verbalization for LLM pre-training, data-to-text evaluation, and downstream knowledge-intensive tasks. Central to Wikidata-TekGen is the use of advanced neural models, quality-controlled alignment methods, and scalable data processing pipelines to produce wide-coverage, factual, and semantically faithful renderings of Wikidata’s triples into coherent textual forms (Agarwal et al., 2020, Ta et al., 2022, Chisholm et al., 2017, Kaffee et al., 2018).

1. Corpus Construction Frameworks

Wikidata-TekGen corpora are constructed using pipelines that process Wikidata’s subject–relation–object triples, align them with Wikipedia or other external textual corpora, and generate synthetic sentences suitable for direct consumption by LLMs. The pipeline of Agarwal et al. (2020), which produced the “KeLM Corpus”, implements the following sequence:

  • Alignment: For every subject s, the root section of its Wikipedia article is retrieved; each sentence t is paired with Wikidata triples (s, r, o) by checking for appearances of object aliases and resolving pronouns.
  • Neural Verbalization: A T5-large model is fine-tuned in two phases—first on noisy aligned data for broad predicate coverage, and then on the human-annotated WebNLG 2017 set to minimize hallucinations.
  • Semantic Filtering: Each generated (triples → sentence) pair is scored for semantic fidelity by a BERT-based regressor trained on explicit human-fidelity ratings, with the lowest-scoring 1% discarded.
  • Subgraph Grouping: To avoid stilted text, up to five triples sharing subject s are selected based on highest relation co-occurrence for aggregation into a single “entity subgraph,” forming more fluent, information-dense sentences.
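
The alignment and subgraph-grouping steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (align, cooccurrence, group_subgraph), the case-insensitive substring match on aliases, and the greedy ranking around a seed triple are all assumptions, and pronoun resolution is omitted.

```python
from collections import Counter
from itertools import combinations

Triple = tuple[str, str, str]  # (subject, relation, object)


def align(sentence: str, triples: list[Triple],
          aliases: dict[str, list[str]]) -> list[Triple]:
    """Keep triples whose object, or one of its aliases, appears in the sentence."""
    text = sentence.lower()
    matched = []
    for s, r, o in triples:
        names = [o] + aliases.get(o, [])
        if any(name.lower() in text for name in names):
            matched.append((s, r, o))
    return matched


def cooccurrence(aligned_groups: list[list[Triple]]) -> Counter:
    """Count how often two relations were aligned to the same sentence."""
    counts = Counter()
    for group in aligned_groups:
        relations = sorted({r for _, r, _ in group})
        for a, b in combinations(relations, 2):
            counts[(a, b)] += 1
    return counts


def group_subgraph(seed: Triple, candidates: list[Triple],
                   counts: Counter, max_size: int = 5) -> list[Triple]:
    """Greedily add same-subject triples whose relation co-occurs most with the seed's."""
    def score(t: Triple) -> int:
        return counts[tuple(sorted((seed[1], t[1])))]

    others = [t for t in candidates if t != seed and t[0] == seed[0]]
    ranked = sorted(others, key=score, reverse=True)
    return [seed] + ranked[: max_size - 1]
```

The max_size default of 5 mirrors the "up to five triples" limit; a production pipeline would match aliases on token boundaries rather than raw substrings.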

In mapping pipelines such as (Ta et al., 2022), data is further filtered and clustered by redundancy and semantic similarity to maximize both faithfulness and coverage, with noise handled by density-based clustering and outlier removal.

2. Corpus Characteristics and Statistics

The Wikidata-TekGen (KeLM) corpus, as described in (Agarwal et al., 2020), is one of the largest and most comprehensive synthetic KG-to-text corpora:

  • Scale: ~18 million synthetic sentences verbalize ~45 million distinct triples, spanning 1,522 unique relations. The corpus comprises approximately 286 million tokens (~15.9 tokens/sentence on average).
  • Predicates and Coverage: The most frequent predicates are “country” (3.2M), “date of birth” (2.8M), “instance of” (2.4M), “occupation” (2.1M), and “publication date” (1.7M). The vocabulary covers approximately 1.1 million unique surface forms.
  • Sentence-Length Distribution: Mean sentence length is 15.9 tokens, standard deviation 4.7, with 95% of sentences between 8 and 24 tokens.
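
As a quick consistency check on the rounded figures above, the per-sentence averages can be recomputed directly. The variable names are illustrative, and the inputs are the approximate totals quoted in the list, so the results are only as precise as those totals.

```python
# Approximate corpus totals as quoted above (rounded figures).
sentences = 18_000_000
tokens = 286_000_000
triples = 45_000_000

tokens_per_sentence = tokens / sentences
triples_per_sentence = triples / sentences

print(round(tokens_per_sentence, 1))   # 15.9
print(round(triples_per_sentence, 1))  # 2.5
```

The token average agrees with the stated mean sentence length, and roughly 2.5 triples per sentence is consistent with grouping up to five triples per entity subgraph.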

The (Ta et al., 2022) mapping corpus, built for precision evaluation and sentence-level alignment, includes 18,510 high-confidence sentence–statement matches and details cluster statistics, type-matching scores (sc₂ up to 0.936 for qualifiers), and object linking success rates (TAGME sc₂ = 0.760).

3. Neural and Linguistic Methodologies

Wikidata-TekGen methodologies exploit both state-of-the-art sequence-to-sequence architectures and alignment heuristics. The T5-based approach (Agarwal et al., 2020) introduces:

  • Input Formatting: Concatenation of triples for each subject using “subject relation₁ object₁ ; ... ; relationₙ objectₙ,” facilitating multi-triple, context-aware generation.
  • Fine-tuning Regimes: Initial broad coverage is achieved via noisy, large-scale alignment; precision is reinforced by human-labeled data to reduce hallucinations.
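
The input format described above can be illustrated with a small helper. This is a sketch under assumptions: the function name linearize and the exact " ; " spacing are not taken from the source, which specifies only the general pattern.

```python
def linearize(subject: str, pairs: list[tuple[str, str]]) -> str:
    """Build 'subject relation1 object1 ; ... ; relationN objectN' for one subject."""
    if not pairs:
        raise ValueError("at least one (relation, object) pair is required")
    body = " ; ".join(f"{relation} {obj}" for relation, obj in pairs)
    return f"{subject} {body}"


example = linearize(
    "Edu",
    [
        ("member of sports team", "Tigres UANL"),
        ("Tigres UANL start time", "01 January 1978"),
        ("Tigres UANL end time", "01 January 1983"),
    ],
)
print(example)
```

Stating the subject once and appending the remaining relation-object pairs keeps the sequence short, which matters when several triples are verbalized together.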

Earlier neural approaches (Chisholm et al., 2017, Kaffee et al., 2018) highlight the evolution of this line of work:

  • GRU-based Encoder-Decoders: Linearization of slot-value sequences with global attention (or copy actions) yields strong factuality.
  • Copy and Attention Mechanisms: The insertion of explicit property placeholders and copy strategies ensures rare and OOV entities are robustly verbalized.
  • Autoencoding and Coverage Losses: Auxiliary reverse models are used to guarantee fact inclusion in outputs, directly optimizing for input–output factual alignment.
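
The property-placeholder idea can be shown with a toy delexicalization round trip. This is a simplified sketch, not the mechanism of any cited system: the "[[property]]" token format and the function names are assumptions, and the neural systems learn when to emit placeholders rather than relying on string replacement.

```python
def delexicalize(text: str, facts: dict[str, str]) -> str:
    """Replace each fact value in the text with a placeholder naming its property."""
    # Longest values first, so a short value cannot split a longer one.
    for prop, value in sorted(facts.items(), key=lambda kv: len(kv[1]), reverse=True):
        text = text.replace(value, f"[[{prop}]]")
    return text


def relexicalize(template: str, facts: dict[str, str]) -> str:
    """Fill placeholders back in by copying values from the input facts."""
    for prop, value in facts.items():
        template = template.replace(f"[[{prop}]]", value)
    return template


facts = {
    "member of sports team": "Tigres UANL",
    "start time": "1978",
    "end time": "1983",
}
sentence = "Edu played for Tigres UANL between 1978 and 1983."
template = delexicalize(sentence, facts)
print(template)
print(relexicalize(template, facts))
```

Because the model only has to produce the placeholder, a rare or out-of-vocabulary entity name is copied from the input instead of being generated token by token.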

Recommended technical extensions, informed by these works, include graph convolutional (GCN/GAT) encoders for subgraph-aware aggregation, pointer-generator decoders for more flexible copy/generation tradeoffs, and integration of pretrained transformers for multilingual and low-resource scenarios (Kaffee et al., 2018).

4. Pre-training Integration and Downstream Evaluation

Wikidata-TekGen has demonstrated tangible improvements when integrated into LLM pre-training, especially in retrieval-augmented settings (Agarwal et al., 2020):

  • Integration into REALM: Synthetic sentences are grouped into 5.7 million “KeLM documents,” co-indexed with Wikipedia and used as retrieval sources for masked-language modeling and retrieval-contrastive InfoNCE objectives.
  • Empirical Results: On the LAMA knowledge probing suite, REALM+Wikidata-TekGen achieves 80.30% (Google-RE) and 69.13% (T-REx), representing improvements of +12.94 and +0.95 points over REALM (Wiki only).
  • Open-domain QA: Combining Wikipedia and Wikidata-TekGen synthetic documents yields 41.47% NQ Exact Match (EM) and 43.90% WQ EM, +2.63 and +3.10 percentage points over Wikipedia-only retrieval.
  • Factuality and Toxicity: Average factual accuracy (LAMA+QA) improves by ≈9 points, and toxicity, as measured by the Jigsaw API, drops from 0.12% to 0.08% of tokens.
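
The Wikipedia-only baselines implied by the reported scores and gains can be recovered by subtraction. The snippet below is plain arithmetic on the numbers quoted above; the dictionary keys are illustrative labels, and the derived baselines are implied values rather than figures quoted from the paper.

```python
# (score with synthetic corpus, reported gain over the Wikipedia-only baseline)
reported = {
    "LAMA Google-RE": (80.30, 12.94),
    "LAMA T-REx": (69.13, 0.95),
    "NQ EM": (41.47, 2.63),
    "WQ EM": (43.90, 3.10),
}

baselines = {name: round(score - gain, 2) for name, (score, gain) in reported.items()}

for name, baseline in baselines.items():
    print(f"{name}: implied baseline {baseline:.2f}")
```

The gain is concentrated on Google-RE; the T-REx improvement is under one point.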

5. Evaluation Benchmarks and Dataset Release

Evaluation follows standard protocols from the data-to-text and QA communities:

  • Knowledge Probing: The LAMA suite evaluates precision in masked object prediction.
  • Open-Domain QA: NaturalQuestions and WebQuestions tasks measure answer retrieval and generation in the presence of augmented factual synthetic corpora.
  • Type and Entity Matching: Entity linking (AIDA, TAGME, OpenTapioca, WAT, Wikifier) and type-matching against Wikidata/DBpedia taxonomy are used for alignment validation (Ta et al., 2022).
  • Corpus Distribution: The Wikidata-TekGen/KeLM corpus is publicly available as JSONL per-entity documents: each contains the subject, all verbalized triples, and the generated synthetic text.
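
A per-entity JSONL file of this kind can be read one record per line. The sketch below uses an in-memory sample instead of the real release, and the field names "subject", "triples", and "text" are hypothetical: they follow the description above, not a confirmed schema.

```python
import io
import json

# Hypothetical record layout: the field names are assumptions for illustration,
# not the released schema. The sample reuses an example quoted in this article.
sample = io.StringIO(
    '{"subject": "Edu", '
    '"triples": [["Edu", "member of sports team", "Tigres UANL"]], '
    '"text": "Edu, who was born in 1949, played for Tigres UANL between 1978 and 1983."}\n'
)


def read_documents(handle):
    """Yield one parsed record per non-empty JSONL line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


docs = list(read_documents(sample))
for doc in docs:
    print(doc["subject"], len(doc["triples"]), doc["text"])
```

Reading line by line keeps memory use flat, which matters for a corpus with millions of records; for a file on disk, pass an open file handle in place of the StringIO sample.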

6. Applications and Extensions

Beyond pre-training, Wikidata-TekGen methodology supports a broader ecosystem:

  • Collaborative Data Curation: Datasets such as those in (Scharpf et al., 2022) leverage Wikidata-TekGen principles for multilingual, community-driven generation of structured question/answer datasets, auto-verification of formula-derived questions, and real-time educational content innovation.
  • Template-Based and Neural Data-to-Text: Both template-instantiated and neural network driven summarization, especially for underserved languages, draw from the Wikidata-TekGen paradigm by extending placeholder, copy, and morphologically-aware generation strategies (Kaffee et al., 2018).
  • Open Issues: Current challenges include enhancing handling of complex qualifiers, multi-sentence/paragraph stubs, and scaling entity-subgraph reasoning. Approaches such as chaining quads/TRIPLE-graphs and exploiting GCN/transformer architectures are highlighted as next steps (Ta et al., 2022, Kaffee et al., 2018).

7. Representative Example Outputs

Illustrative examples from (Agarwal et al., 2020) are as follows:

  • Three aligned triples:
    • (Anne Frank Diary, distributor, Universal Pictures)
    • (Anne Frank Diary, country, Germany)
    • (Anne Frank Diary, publication date, 03 March 2016)
    • → “The film was theatrically released in Germany on March 3, 2016, by Universal Pictures International.”
  • Multi-year sports biography:
    • (Edu, member of sports team, Tigres UANL)
    • (Edu, Tigres UANL start time, 01 January 1978)
    • (Edu, Tigres UANL end time, 01 January 1983)
    • → “Edu, who was born in 1949, played for Tigres UANL between 1978 and 1983.”

All code, data pipelines, and trained models for Wikidata-TekGen are released to enable reproducibility and broad experimentation with KG-to-text transfer and knowledge-enhanced language modeling (Agarwal et al., 2020, Ta et al., 2022, Chisholm et al., 2017, Kaffee et al., 2018).
