Text2KGBench: Ontology-Guided KG Benchmark

Updated 2 April 2026

Text2KGBench is a benchmark that evaluates language models’ ability to extract RDF triples from text while strictly adhering to a given ontology.
It leverages diverse datasets, including Wikidata-TekGen and DBpedia-WebNLG, to test models across various domains and structured ontologies.
Results reveal high ontology conformance but highlight challenges in fact extraction accuracy and hallucination reduction in generated knowledge graphs.

Text2KGBench is a benchmark created to evaluate how well large language models (LLMs) generate ontology-compliant knowledge graphs (KGs) from raw text. Positioned at the intersection of natural language processing and the Semantic Web, it focuses on guided fact extraction: models must not only extract facts from sentences but also adhere strictly to a provided ontology, typically formalized in OWL or RDFS. Its design reflects the growing need to combine LLM-driven, open-ended text-to-structure mapping with the precision required for downstream symbolic reasoning and for applications such as fact-checking and explainability using knowledge graphs (Mihindukulasooriya et al., 2023).

1. Task Definition and Motivation

Text2KGBench operationalizes ontology-driven knowledge graph generation by presenting the model with an input ontology $\mathcal{O}$ and a natural language sentence $s$, and requiring it to output a set of RDF-style triples $\mathcal{T} = \{ r(\text{subject}, \text{object}) \}$ that (i) use only the relations defined in $\mathcal{O}$, (ii) respect the domain and range constraints of each relation, and (iii) faithfully encode only the information present or implied in $s$. The ontology is presented in a human-readable format: a list of concepts and a bulleted list of relations, each annotated with its domain and range. At evaluation time, only relation-level membership in $\mathcal{O}$ is checked; richer description-logic (DL) reasoning (class hierarchy, disjointness, RDFS/OWL restrictions) is reserved for future work.
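The task setup can be illustrated with a minimal Python sketch. The class names, the mini-ontology, and the exact verbalization layout are all assumptions made for illustration; they are not taken from the benchmark's released code. The sketch renders an ontology as human-readable text and performs the relation-level membership check described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str
    domain_type: str  # concept expected for the subject
    range_type: str   # concept expected for the object


@dataclass(frozen=True)
class Triple:
    relation: str
    subj: str
    obj: str


# Hypothetical mini-ontology; names are illustrative only.
CONCEPTS = {"Film", "Human", "Genre"}
RELATIONS = {
    "director": Relation("director", "Film", "Human"),
    "genre": Relation("genre", "Film", "Genre"),
}


def verbalize(concepts, relations):
    """Render the ontology as a concept list plus a bulleted relation list."""
    lines = ["Concepts: " + ", ".join(sorted(concepts)), "Relations:"]
    for rel in relations.values():
        lines.append(f"- {rel.name}({rel.domain_type}, {rel.range_type})")
    return "\n".join(lines)


def relation_in_ontology(triple, relations):
    """Relation-level membership: is the triple's relation defined in the ontology?"""
    return triple.relation in relations
```

Under these assumptions, `relation_in_ontology(Triple("director", "Inception", "Christopher Nolan"), RELATIONS)` is true, while a triple using an undefined relation such as `producer` fails the check.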

The creation of Text2KGBench was motivated by the simultaneous advances in instruction-tuned LLMs (e.g., GPT-3, LLaMA, Vicuna, Alpaca) and in structured knowledge representation (KGs). Where prior relation extraction focused on closed, fixed-schema tasks, Text2KGBench enables open-ended, ontology-guided extraction, thereby providing a realistic testbed for measuring LLMs’ symbolic reasoning and fact fidelity in a scenario closely mirroring actual KG construction (Mihindukulasooriya et al., 2023).

2. Datasets and Ontology Coverage

Text2KGBench comprises two comprehensive corpora, covering a diverse range of domains and ontology structures:

A. Wikidata-TekGen Corpus:

This corpus includes 13,474 sentence–triple alignments, each grounded in one of ten mini-ontologies derived from Wikidata. Domains include Movies, Music, Sport, Books, Military, Computers, Space, Politics, Nature, and Culture. Each domain-specific ontology defines a distinct set of concepts and relations. Sentences, mostly drawn from Wikipedia via distant supervision (the TekGen process), are aligned with triples according to the target ontology; a manually verified subset (939 test sentences) and a set of 174 newly authored “unseen” sentences provide robustness and generalization testing.

Example (Music ontology):

Sentence: “Beethoven’s Ninth Symphony was composed in two years and first performed in Vienna.”
Extracted triples: composer(Ninth Symphony, Beethoven), first_performance(Ninth Symphony, Vienna)

B. DBpedia-WebNLG Corpus:

This corpus contains 4,860 aligned instances across nineteen ontologies adapted from WebNLG categories, covering domains such as University, Airport, and Food. Because the WebNLG alignments were crowdsourced, minimal additional cleaning was required.

Ontologies are defined by sets of classes $C$ and binary relations $R$, with each $r \in R$ having domain and range constraints (e.g., director(film, human), genre(film, genre)). Formal correctness of an extracted triple $t = r(s,o)$ requires $s \in \mathrm{domain}(r)$ and $o \in \mathrm{range}(r)$ (here $s$ and $o$ denote the subject and object of the triple).
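A domain/range check of this kind could look like the following Python sketch. It assumes an entity-type lookup table (`ENTITY_TYPES`), which is hypothetical: the benchmark itself does not supply one, and as noted earlier its evaluation checks only relation membership.

```python
# Hypothetical relation signatures: relation -> (domain class, range class).
SIGNATURES = {
    "director": ("Film", "Human"),
    "genre": ("Film", "Genre"),
}

# Hypothetical entity typing; a real pipeline would need an entity linker
# or type classifier to obtain this information.
ENTITY_TYPES = {
    "Inception": "Film",
    "Christopher Nolan": "Human",
    "Science fiction": "Genre",
}


def is_well_typed(relation, subj, obj):
    """True if the relation is known and subject/object match its domain/range."""
    if relation not in SIGNATURES:
        return False
    domain, range_ = SIGNATURES[relation]
    return ENTITY_TYPES.get(subj) == domain and ENTITY_TYPES.get(obj) == range_
```

With this table, `director(Inception, Christopher Nolan)` is well typed, whereas the reversed triple `director(Christopher Nolan, Inception)` is not.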

3. Benchmark Inputs and Prompting Structure

In benchmark evaluation, each test case consists of:

The ontology in human-readable form (concepts, relations with domain/range)
The target sentence
Optionally, “few-shot” in-context examples, selected by SBERT or T5-XXL for maximal semantic similarity
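Similarity-based selection of in-context examples can be sketched as follows. As a loud assumption, this sketch swaps the sentence-embedding model for a simple bag-of-words cosine similarity so that it runs with the standard library alone; the function names and the example pool are hypothetical.

```python
import math
from collections import Counter


def embed(sentence):
    """Stand-in embedding: lowercase token counts (NOT a sentence-embedding model)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_examples(test_sentence, pool, k=1):
    """Return the k (sentence, triples) pairs most similar to the test sentence."""
    query = embed(test_sentence)
    ranked = sorted(pool, key=lambda ex: cosine(query, embed(ex[0])), reverse=True)
    return ranked[:k]


# Hypothetical pool of training examples.
POOL = [
    ("Casablanca was directed by Michael Curtiz.",
     ["director(Casablanca, Michael Curtiz)"]),
    ("Abbey Road was recorded by the Beatles.",
     ["performer(Abbey Road, the Beatles)"]),
]
```

Replacing `embed` with dense sentence embeddings would keep the ranking logic unchanged; only the vector representation and similarity function would differ.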

Prompts are standardized, comprising:

High-level instruction (extract triples per ontology constraints)
Ontology description (e.g., director(Film, Human))
1–2 demonstration pairs (input sentence and corresponding gold triples)
Test sentence with a “Test Output:” section for the model to fill
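The four prompt components listed above could be assembled roughly as in this Python sketch. The instruction wording is the one quoted in this article; the field labels `Example Sentence:` and `Example Output:` and the function name `build_prompt` are assumptions, and the released templates may be laid out differently.

```python
INSTRUCTION = (
    "Given the following ontology and sentences, please extract the triples "
    "from the sentence according to the relations in the ontology. "
    "In the output, only include the triples in the given format."
)


def build_prompt(ontology_text, demos, test_sentence):
    """Concatenate instruction, ontology, demonstrations, and the test sentence.

    demos is a list of (sentence, [triple strings]) pairs.
    """
    parts = [INSTRUCTION, "", ontology_text, ""]
    for sentence, triples in demos:
        parts.append(f"Example Sentence: {sentence}")  # label is an assumption
        parts.append(f"Example Output: {', '.join(triples)}")
        parts.append("")
    parts.append(f"Test Sentence: {test_sentence}")
    parts.append("Test Output:")  # left open for the model to complete
    return "\n".join(parts)
```

Ending the prompt on the open `Test Output:` field nudges the model to continue with triples only, which simplifies parsing afterwards.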

Example (Movie ontology):

Instruction: “Given the following ontology and sentences, please extract the triples from the sentence according to the relations in the ontology. In the output, only include the triples in the given format.”
Ontology: Film, Human, Award; relations …
Example: “Casablanca was directed by Michael Curtiz.” → director(Casablanca, Michael Curtiz)
Test Sentence: “Inception is a 2010 science fiction film directed by Christopher Nolan.”
Expected Model Output: director(Inception, Christopher Nolan)
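Model output in the `relation(subject, object)` format shown above has to be parsed back into triples before scoring. The following Python sketch uses a regular expression for this; the pattern and the function name `parse_triples` are assumptions, and the pattern does not handle subjects containing commas or entities containing parentheses.

```python
import re

# relation name, then "(subject, object)"; the subject may not contain a comma.
TRIPLE_PATTERN = re.compile(r"(\w+)\(\s*([^,()]+?)\s*,\s*([^()]+?)\s*\)")


def parse_triples(output):
    """Extract (relation, subject, object) tuples from raw model output."""
    return [tuple(match) for match in TRIPLE_PATTERN.findall(output)]
```

For example, `parse_triples("director(Inception, Christopher Nolan)")` yields `[("director", "Inception", "Christopher Nolan")]`, and text that contains no well-formed triple yields an empty list.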

4. Evaluation Metrics and Hallucination Analysis

Text2KGBench defines seven core evaluation metrics:

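Fact Extraction Accuracy:
- Precision $P = |\mathcal{T} \cap \mathcal{G}| / |\mathcal{T}|$
- Recall $R = |\mathcal{T} \cap \mathcal{G}| / |\mathcal{G}|$
- F1 $F_1 = 2PR / (P + R)$
- Where $\mathcal{T}$ is the model output and $\mathcal{G}$ the gold set. Only (possibly incomplete) gold triples affect recall; unannotated but correct triples do not penalize.

Ontology Conformance (OC):

$\mathrm{OC} = |\{ t \in \mathcal{T} : \mathrm{rel}(t) \in \mathcal{O} \}| / |\mathcal{T}|$. The proportion of extracted triples whose relation appears in the provided ontology $\mathcal{O}$.

Intrinsic Hallucinations:
- Subject Hallucination (SH): $\mathrm{SH} = |\{ t \in \mathcal{T} : \mathrm{subj}(t) \text{ is not found in the sentence or ontology} \}| / |\mathcal{T}|$
- Relation Hallucination (RH): $\mathrm{RH} = |\{ t \in \mathcal{T} : \mathrm{rel}(t) \notin \mathcal{O} \}| / |\mathcal{T}|$
- Object Hallucination (OH): Defined analogously to SH.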

A perfect system has SH = RH = OH = 0. A subject or object counts as hallucinated if it is not found in the sentence or the ontology.
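The seven metrics can be computed for a single sentence as in the Python sketch below. Several choices are assumptions rather than the benchmark's actual procedure: triples are compared by exact tuple equality, an entity counts as grounded if it appears as a case-insensitive substring of the sentence or matches an ontology concept name, and empty outputs score 0 on every metric. The released evaluation scripts may normalize and match differently.

```python
def evaluate(predicted, gold, sentence, relations, concepts=frozenset()):
    """Return P, R, F1, OC, SH, RH, OH for one sentence.

    predicted, gold: iterables of (relation, subject, object) tuples.
    relations: set of relation names defined in the ontology.
    concepts: set of ontology concept names (optional).
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0

    text = sentence.lower()

    def grounded(entity):
        # Assumption: substring match in the sentence or exact concept-name match.
        return entity.lower() in text or entity in concepts

    def share(flags):
        # Fraction of predicted triples for which the flag is true.
        flags = list(flags)
        return sum(flags) / len(flags) if flags else 0.0

    return {
        "P": p,
        "R": r,
        "F1": f1,
        "OC": share(rel in relations for rel, _, _ in predicted),
        "SH": share(not grounded(s) for _, s, _ in predicted),
        "RH": share(rel not in relations for rel, _, _ in predicted),
        "OH": share(not grounded(o) for _, _, o in predicted),
    }
```

As a usage example, scoring the predictions `director(Inception, Christopher Nolan)` and `producer(Inception, Emma Thomas)` against a gold set containing only the first triple, for the Inception test sentence and an ontology defining `director` and `genre`, gives P = 0.5 and R = 1.0, with OC = RH = OH = 0.5 and SH = 0.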

5. Baseline Performance and Experimental Findings

Text2KGBench provides baseline evaluations using Vicuna-13B and Alpaca-LoRA-13B under uniform prompt generation.

Summary of Results (Averages):

Corpus, Model	P	R	F1	OC	SH	RH	OH
Wikidata-TekGen, Vicuna-13B	0.38	0.34	0.35	0.83	0.17	0.17	0.17
Wikidata-TekGen, Alpaca-LoRA-13B	0.32	0.26	0.27	0.87	0.18	0.13	0.17
DBpedia-WebNLG, Vicuna-13B	0.34	0.27	0.30	0.93	0.12	0.07	0.28
DBpedia-WebNLG, Alpaca-LoRA-13B	0.32	0.23	0.25	0.91	0.16	0.09	0.38

Key observations include:

F1 scores rarely exceed 0.6 (even in best single-ontology cases)
Ontology conformance is generally high (0.83–0.93), indicating that most extracted relations come from the provided ontology
Hallucination rates (especially object hallucination) remain nontrivial, peaking at 0.28–0.38 in some cases
Generalization performance is reduced on "unseen" sentences, suggesting reliance on memorized extraction patterns
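One arithmetic relationship in the reported averages is worth making explicit. If relation hallucination is read as the share of extracted triples whose relation falls outside the ontology (an assumption about the metric's definition), then RH should be the complement of OC. The short Python check below, using the OC and RH averages from the summary table, is consistent with that reading; the helper name is hypothetical.

```python
# (corpus, model) -> reported average OC and RH, copied from the summary table.
REPORTED = {
    ("Wikidata-TekGen", "Vicuna-13B"): {"OC": 0.83, "RH": 0.17},
    ("Wikidata-TekGen", "Alpaca-LoRA-13B"): {"OC": 0.87, "RH": 0.13},
    ("DBpedia-WebNLG", "Vicuna-13B"): {"OC": 0.93, "RH": 0.07},
    ("DBpedia-WebNLG", "Alpaca-LoRA-13B"): {"OC": 0.91, "RH": 0.09},
}


def rh_is_complement_of_oc(row, digits=2):
    """Check RH == 1 - OC after rounding to the table's precision."""
    return round(1 - row["OC"], digits) == row["RH"]


all_consistent = all(rh_is_complement_of_oc(row) for row in REPORTED.values())
```

All four rows satisfy RH = 1 − OC at two decimal places, so the high conformance scores and the relation hallucination rates describe the same behavior from opposite sides.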

A plausible implication is that current LLMs, while capable of recognizing canonical relations and domain/range structure, struggle to maintain both high coverage and hallucination avoidance, especially when faced with sentences outside their observed training distribution (Mihindukulasooriya et al., 2023).

6. Significance, Limitations, and Future Directions

Text2KGBench is the first public benchmark devoted to ontology-driven KG construction from free text, emphasizing rigorous symbolic conformance as well as content fidelity. It exposes a concrete performance gap between ontology compliance and factual precision, especially under open-ended, previously unseen textual input.

The benchmark highlights several avenues for future research, including:

Automatic retrieval of sub-ontologies to enable scaling to large real-world schemas
Direct interaction with formal KG representations (RDF/OWL) rather than natural language verbalization
Neuro-symbolic validation (e.g., description-logic reasoner checks for deeper consistency)
Explicit hallucination reduction through external KG retrieval or fact-checking integration
Bias and fairness analyses via contrastive test cases
Systematic benchmarking of leading commercial and emerging open-source LLMs within a uniform framework

All data, ontologies, prompt templates, and evaluation scripts are released under CC BY 4.0 to facilitate continued methodological advances and comparative studies in ontology-guided knowledge graph extraction (Mihindukulasooriya et al., 2023).


References

Mihindukulasooriya et al. (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text.
