
Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training (2010.12688v2)

Published 23 Oct 2020 in cs.CL

Abstract: Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG, and discuss the unique challenges associated with a broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that have been developed to integrate these two sources, our approach converts the KG into natural text, allowing it to be seamlessly integrated into existing LLMs. It carries the further advantages of improved factual accuracy and reduced toxicity in the resulting LLM. We evaluate this approach by augmenting the retrieval corpus in a retrieval LLM and showing significant improvements on the knowledge intensive tasks of open domain QA and the LAMA knowledge probe.
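The core task the abstract describes, verbalizing knowledge graph triples into natural text, can be illustrated with a minimal sketch. The paper itself fine-tunes a sequence-to-sequence model for this; the template-based approach below is a simplified, hypothetical stand-in, and the relation names and templates are assumptions, not taken from the paper or Wikidata.

```python
# Illustrative sketch of triple verbalization: turning (subject, relation,
# object) KG triples into natural-language sentences. The paper uses a
# trained seq2seq verbalizer; templates here are a hypothetical stand-in.

# Hypothetical relation-to-template mapping (not from the paper).
TEMPLATES = {
    "date of birth": "{subject} was born on {object}.",
    "occupation": "{subject} works as a {object}.",
    "country of citizenship": "{subject} is a citizen of {object}.",
}

def verbalize(triples):
    """Render a list of (subject, relation, object) triples as one passage."""
    sentences = []
    for subject, relation, obj in triples:
        # Fall back to a naive concatenation for unmapped relations.
        template = TEMPLATES.get(relation, "{subject} {relation} {object}.")
        sentences.append(
            template.format(subject=subject, relation=relation, object=obj)
        )
    return " ".join(sentences)

# Example: an entity's triples become a short synthetic text passage,
# which could then augment a retrieval corpus as the paper proposes.
triples = [
    ("Ada Lovelace", "date of birth", "10 December 1815"),
    ("Ada Lovelace", "occupation", "mathematician"),
]
print(verbalize(triples))
```

A learned verbalizer replaces the fixed templates with generated text, which is what lets the approach scale to the full, open-domain Wikidata KG rather than a fixed domain-specific relation set.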

Authors (4)
  1. Oshin Agarwal (9 papers)
  2. Heming Ge (4 papers)
  3. Siamak Shakeri (29 papers)
  4. Rami Al-Rfou (34 papers)
Citations (39)
