Text2KGBench: Ontology-Guided KG Benchmark
- Text2KGBench is a benchmark that evaluates language models’ ability to extract RDF triples from text while strictly adhering to a given ontology.
- It leverages diverse datasets, including Wikidata-TekGen and DBpedia-WebNLG, to test models across various domains and structured ontologies.
- Results reveal high ontology conformance but highlight challenges in fact extraction accuracy and hallucination reduction in generated knowledge graphs.
Text2KGBench is a benchmark created to evaluate how well large language models (LLMs) generate ontology-compliant knowledge graphs (KGs) from raw text. Positioned at the intersection of natural language processing and the Semantic Web, it focuses on guided fact extraction: models must not only extract facts from sentences but also adhere strictly to a provided ontology, typically formalized in OWL or RDFS. Its design responds to the growing need to combine LLM-driven, open-ended text-to-structure mapping with the precision required for downstream symbolic reasoning and for applications such as fact-checking and KG-based explainability (Mihindukulasooriya et al., 2023).
1. Task Definition and Motivation
Text2KGBench operationalizes the ontology-driven knowledge graph generation task by presenting an input ontology $O$ and a natural language sentence $S$ to the model, and requiring the output of a set of RDF-style triples that (i) strictly use the relations defined in $O$, (ii) respect the domain and range constraints for each relation, and (iii) faithfully encode only the information present or implied in $S$. The ontology is presented to the model in a human-readable format: a list of concepts and a bulleted list of relations, each annotated with its domain and range. At evaluation, only relation-level membership in $O$ is checked; richer description-logic (DL) reasoning (class hierarchy, disjointness, RDFS/OWL restrictions) is reserved for future work.
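The task interface can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ontology` and `Triple` containers and the `relation_conformant` helper are hypothetical names, not part of the benchmark's released code, and the `release_year` relation is invented to show a non-conformant triple.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, NamedTuple, Tuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


@dataclass
class Ontology:
    # Hypothetical container, not the benchmark's own data format.
    concepts: FrozenSet[str]
    relations: Dict[str, Tuple[str, str]]  # relation -> (domain, range)


def relation_conformant(triples: List[Triple], ontology: Ontology) -> List[bool]:
    """For each triple, report whether its relation is defined in the ontology."""
    return [t.relation in ontology.relations for t in triples]


movie = Ontology(
    concepts=frozenset({"Film", "Human", "Genre"}),
    relations={"director": ("Film", "Human"), "genre": ("Film", "Genre")},
)
output = [
    Triple("Inception", "director", "Christopher Nolan"),
    Triple("Inception", "release_year", "2010"),  # invented relation, not in the ontology
]
flags = relation_conformant(output, movie)
print(flags)  # [True, False]
```

The check deliberately looks only at the relation name, mirroring the relation-level membership test described above; it does not inspect domain or range.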
The creation of Text2KGBench was motivated by the simultaneous advances in instruction-tuned LLMs (e.g., GPT-3, LLaMA, Vicuna, Alpaca) and in structured knowledge representation (KGs). Where prior relation extraction focused on closed, fixed-schema tasks, Text2KGBench enables open-ended, ontology-guided extraction, thereby providing a realistic testbed for measuring LLMs’ symbolic reasoning and fact fidelity in a scenario closely mirroring actual KG construction (Mihindukulasooriya et al., 2023).
2. Datasets and Ontology Coverage
Text2KGBench comprises two comprehensive corpora, covering a diverse range of domains and ontology structures:
A. Wikidata-TekGen Corpus:
This corpus includes 13,474 sentence–triple alignments, each grounded in one of ten mini-ontologies derived from Wikidata. Domains include Movies, Music, Sport, Books, Military, Computers, Space, Politics, Nature, and Culture. Each domain-specific ontology defines a distinct set of concepts and relations. Sentences, mostly drawn from Wikipedia via distant supervision (the TekGen process), are aligned with triples according to the target ontology; a manually verified subset (939 test sentences) and a set of 174 newly authored “unseen” sentences provide robustness and generalization testing.
Example (Music ontology):
- Sentence: “Beethoven’s Ninth Symphony was composed in two years and first performed in Vienna.”
- Extracted triples: composer(Ninth Symphony, Beethoven), first_performance(Ninth Symphony, Vienna)
B. DBpedia-WebNLG Corpus:
This corpus contains 4,860 aligned instances across nineteen ontologies adapted from WebNLG categories, covering domains such as University, Airport, and Food. Because WebNLG alignments were crowdsourced, minimal additional cleaning was required.
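Ontologies are defined by a set of classes $\mathcal{C}$ and a set of binary relations $\mathcal{R}$, with each $r \in \mathcal{R}$ having domain and range constraints (e.g., director(film, human), genre(film, genre)). Formal correctness of an extracted triple $(s, r, o)$ requires $r \in \mathcal{R}$ and $(\mathrm{type}(s), \mathrm{type}(o)) \in \mathrm{domain}(r) \times \mathrm{range}(r)$.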
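A domain and range check of this kind might look as follows. This is a minimal sketch, assuming entity types are available from some external source; the `entity_types` mapping and the function name `is_formally_correct` are hypothetical, and class hierarchies are ignored (types must match exactly).

```python
from typing import Dict, Tuple


def is_formally_correct(
    triple: Tuple[str, str, str],
    relations: Dict[str, Tuple[str, str]],
    entity_types: Dict[str, str],
) -> bool:
    """Check relation membership, then the domain and range of the arguments."""
    subject, relation, obj = triple
    if relation not in relations:
        return False
    domain, range_ = relations[relation]
    # Unknown entities have no type and therefore fail the check.
    return entity_types.get(subject) == domain and entity_types.get(obj) == range_


relations = {"director": ("film", "human"), "genre": ("film", "genre")}
# Hypothetical type assignments, supplied by hand for the example.
entity_types = {
    "Inception": "film",
    "Christopher Nolan": "human",
    "science fiction": "genre",
}

ok = is_formally_correct(("Inception", "director", "Christopher Nolan"), relations, entity_types)
swapped = is_formally_correct(("Christopher Nolan", "director", "Inception"), relations, entity_types)
print(ok, swapped)  # True False
```

Swapping subject and object keeps the relation valid but violates both the domain and the range, which is exactly the case a relation-only check would miss.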
3. Benchmark Inputs and Prompting Structure
In benchmark evaluation, each test case consists of:
- The ontology in human-readable form (concepts, relations with domain/range)
- The target sentence
- Optionally, “few-shot” in-context examples, selected by SBERT or T5-XXL for maximal semantic similarity
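Similarity-based selection of in-context examples can be illustrated with a small sketch. The `embed` function below is a toy bag-of-words stand-in for a real sentence encoder, not the embedding model used by the benchmark; the function names are hypothetical, and the second pool entry and its triple are invented for the example.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy bag-of-words vector; a placeholder for a real sentence encoder."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_examples(test_sentence, pool, k=1):
    """Return the k (sentence, triples) pairs most similar to the test sentence."""
    query = embed(test_sentence)
    ranked = sorted(pool, key=lambda pair: cosine(query, embed(pair[0])), reverse=True)
    return ranked[:k]


pool = [
    ("The Nile flows through Egypt.", ["country(Nile, Egypt)"]),  # invented entry
    ("Casablanca was directed by Michael Curtiz.", ["director(Casablanca, Michael Curtiz)"]),
]
best = select_examples(
    "Inception is a 2010 science fiction film directed by Christopher Nolan.", pool
)
print(best[0][0])  # Casablanca was directed by Michael Curtiz.
```

Replacing `embed` with a dense sentence encoder would leave `select_examples` unchanged apart from the similarity function, which is the reason for keeping the two steps separate.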
Prompts are standardized, comprising:
- High-level instruction (extract triples per ontology constraints)
- Ontology description (e.g., director(Film, Human))
- 1–2 demonstration pairs (input sentence and corresponding gold triples)
- Test sentence with a “Test Output:” section for the model to fill
Example (Movie ontology):
- Instruction: “Given the following ontology and sentences, please extract the triples from the sentence according to the relations in the ontology. In the output, only include the triples in the given format.”
- Ontology: Film, Human, Award; relations …
- Example: “Casablanca was directed by Michael Curtiz.” → director(Casablanca, Michael Curtiz)
- Test Sentence: “Inception is a 2010 science fiction film directed by Christopher Nolan.”
- Expected Model Output: director(Inception, Christopher Nolan)
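The prompt layout and the handling of the model's answer can be sketched as below. The instruction text is taken from the example above; the field labels ("Ontology Concepts:", "Example Sentence:", and so on), the function names, and the regular expression are assumptions made for illustration and may differ from the released prompt templates.

```python
import re

INSTRUCTION = (
    "Given the following ontology and sentences, please extract the triples "
    "from the sentence according to the relations in the ontology. "
    "In the output, only include the triples in the given format."
)


def build_prompt(concepts, relations, examples, test_sentence):
    """Assemble instruction, ontology, demonstrations, and the test sentence."""
    lines = [INSTRUCTION, ""]
    lines.append("Ontology Concepts: " + ", ".join(concepts))
    lines.append(
        "Ontology Relations: "
        + ", ".join(f"{rel}({dom}, {rng})" for rel, (dom, rng) in relations.items())
    )
    for sentence, triples in examples:
        lines += ["", f"Example Sentence: {sentence}", "Example Output: " + ", ".join(triples)]
    lines += ["", f"Test Sentence: {test_sentence}", "Test Output:"]
    return "\n".join(lines)


# relation(subject, object); subjects may not contain commas or parentheses.
TRIPLE_RE = re.compile(r"(\w+)\(([^,()]+),\s*([^()]+)\)")


def parse_triples(model_output):
    """Turn 'rel(subj, obj), ...' text into (subject, relation, object) tuples."""
    return [(s.strip(), r, o.strip()) for r, s, o in TRIPLE_RE.findall(model_output)]


prompt = build_prompt(
    concepts=["Film", "Human", "Award"],
    relations={"director": ("Film", "Human")},
    examples=[("Casablanca was directed by Michael Curtiz.", ["director(Casablanca, Michael Curtiz)"])],
    test_sentence="Inception is a 2010 science fiction film directed by Christopher Nolan.",
)
parsed = parse_triples("director(Inception, Christopher Nolan)")
print(parsed)  # [('Inception', 'director', 'Christopher Nolan')]
```

The parser is intentionally strict: anything the model emits outside the `relation(subject, object)` pattern is ignored, and entity names containing parentheses are not handled.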
4. Evaluation Metrics and Hallucination Analysis
Text2KGBench defines seven core evaluation metrics:
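- Fact Extraction Accuracy:
- Precision: $P = |E \cap G| / |E|$
- Recall: $R = |E \cap G| / |G|$
- F1: $F_1 = 2PR / (P + R)$
- Here $E$ is the set of triples in the model output and $G$ the gold set. Recall is computed only against the (possibly incomplete) gold triples, so correct but unannotated triples do not reduce it.
- Ontology Conformance (OC):
$OC = |\{t \in E : \mathrm{rel}(t) \in \mathcal{R}\}| / |E|$, the proportion of extracted triples whose relation appears in the relation set $\mathcal{R}$ of the provided ontology $O$.
- Intrinsic Hallucinations:
- Subject Hallucination (SH): $SH = |\{t \in E : \mathrm{subj}(t) \text{ is found in neither } S \text{ nor } O\}| / |E|$, where $S$ is the input sentence
- Relation Hallucination (RH): $RH = |\{t \in E : \mathrm{rel}(t) \notin \mathcal{R}\}| / |E|$
- Object Hallucination (OH): Defined analogously to SH.
A perfect system has SH = RH = OH = 0. A subject or object counts as hallucinated when it appears in neither the sentence nor the ontology.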
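The seven metrics can be computed for a single sentence roughly as follows. This is a sketch rather than the released evaluation script: the `evaluate` function is hypothetical, grounding is approximated by case-insensitive substring matching (the official scripts may normalize text differently), and the `producer` triple is an invented model output.

```python
def evaluate(extracted, gold, relations, concepts, sentence):
    """Score one sentence; triples are (subject, relation, object) tuples."""
    E, G = set(extracted), set(gold)
    hits = len(E & G)
    precision = hits / len(E) if E else 0.0
    recall = hits / len(G) if G else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0

    text = sentence.lower()
    vocabulary = {c.lower() for c in concepts}

    def grounded(term):
        # Assumption: a term is grounded if it occurs in the sentence
        # or is the name of an ontology concept.
        term = term.lower()
        return term in text or term in vocabulary

    n = len(E) or 1  # an empty output scores 0 everywhere instead of dividing by zero
    return {
        "P": precision,
        "R": recall,
        "F1": f1,
        "OC": sum(r in relations for _, r, _ in E) / n,
        "SH": sum(not grounded(s) for s, _, _ in E) / n,
        "RH": sum(r not in relations for _, r, _ in E) / n,
        "OH": sum(not grounded(o) for _, _, o in E) / n,
    }


scores = evaluate(
    extracted=[
        ("Inception", "director", "Christopher Nolan"),
        ("Inception", "producer", "Emma Thomas"),  # relation and object not supported
    ],
    gold=[("Inception", "director", "Christopher Nolan")],
    relations={"director", "genre"},
    concepts={"Film", "Human", "Genre"},
    sentence="Inception is a 2010 science fiction film directed by Christopher Nolan.",
)
print(scores)
```

In this example one of the two extracted triples matches the gold set, so precision is 0.5 and recall is 1.0; the second triple contributes to both RH and OH because its relation is outside the ontology and its object does not occur in the sentence.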
5. Baseline Performance and Experimental Findings
Text2KGBench provides baseline evaluations using Vicuna-13B and Alpaca-LoRA-13B under uniform prompt generation.
Summary of Results (Averages):
| Corpus, Model | P | R | F1 | OC | SH | RH | OH |
|---|---|---|---|---|---|---|---|
| Wikidata-TekGen, Vicuna-13B | 0.38 | 0.34 | 0.35 | 0.83 | 0.17 | 0.17 | 0.17 |
| Wikidata-TekGen, Alpaca-LoRA-13B | 0.32 | 0.26 | 0.27 | 0.87 | 0.18 | 0.13 | 0.17 |
| DBpedia-WebNLG, Vicuna-13B | 0.34 | 0.27 | 0.30 | 0.93 | 0.12 | 0.07 | 0.28 |
| DBpedia-WebNLG, Alpaca-LoRA-13B | 0.32 | 0.23 | 0.25 | 0.91 | 0.16 | 0.09 | 0.38 |
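In every row of the table, OC and RH sum to 1, which is what one would expect if RH counts exactly those triples whose relation falls outside the ontology. The short check below uses only the averages reported above; the dictionary layout is an arbitrary choice for the example.

```python
# (corpus, model) -> reported averages for OC and RH
reported = {
    ("Wikidata-TekGen", "Vicuna"): {"OC": 0.83, "RH": 0.17},
    ("Wikidata-TekGen", "Alpaca"): {"OC": 0.87, "RH": 0.13},
    ("DBpedia-WebNLG", "Vicuna"): {"OC": 0.93, "RH": 0.07},
    ("DBpedia-WebNLG", "Alpaca"): {"OC": 0.91, "RH": 0.09},
}

# Round to two decimals to match the table's precision and absorb float noise.
sums = {key: round(m["OC"] + m["RH"], 2) for key, m in reported.items()}
all_complementary = all(total == 1.0 for total in sums.values())
print(all_complementary)  # True
```

SH and OH show no such relationship with OC, since they depend on whether the arguments are grounded in the sentence rather than on the relation vocabulary.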
Key observations include:
- F1 scores rarely exceed 0.6, even for the best-performing individual ontologies
- Ontology conformance is generally high (0.83–0.93 on average), indicating that most extracted relations are drawn from the provided ontology
- Hallucination rates remain nontrivial, especially object hallucination, which averages 0.28–0.38 on DBpedia-WebNLG
- Generalization performance is reduced on "unseen" sentences, suggesting reliance on memorized extraction patterns
A plausible implication is that current LLMs, while capable of recognizing canonical relations and domain/range structure, struggle to maintain both high coverage and hallucination avoidance, especially when faced with sentences outside their observed training distribution (Mihindukulasooriya et al., 2023).
6. Significance, Limitations, and Future Directions
Text2KGBench is the first public benchmark devoted to ontology-driven KG construction from free text, emphasizing rigorous symbolic conformance as well as content fidelity. It exposes a concrete performance gap between ontology compliance and factual precision, especially under open-ended, previously unseen textual input.
The benchmark highlights several avenues for future research, including:
- Automatic retrieval of sub-ontologies to enable scaling to large real-world schemas
- Direct interaction with formal KG representations (RDF/OWL) rather than natural language verbalization
- Neuro-symbolic validation (e.g., description-logic reasoner checks for deeper consistency)
- Explicit hallucination reduction through external KG retrieval or fact-checking integration
- Bias and fairness analyses via contrastive test cases
- Systematic benchmarking of leading commercial and emerging open-source LLMs within a uniform framework
All data, ontologies, prompt templates, and evaluation scripts are released under CC BY 4.0 to facilitate continued methodological advances and comparative studies in ontology-guided knowledge graph extraction (Mihindukulasooriya et al., 2023).