KGQAGen-10k Benchmark: KGQA & LLM Retrieval

Updated 4 January 2026
  • KGQAGen-10k is a comprehensive benchmark for evaluating knowledge-graph QA and long-context retrieval, enabling controlled multi-hop and compositional reasoning assessments.
  • It employs systematic LLM-in-the-loop and template-driven methodologies to extract QA pairs from Wikidata and financial documents for rigorous testing.
  • Empirical results highlight challenges in multi-hop reasoning and set operations, exposing current limitations in LLMs and KG retrieval-augmented methods.

KGQAGen-10k is an evaluation benchmark for knowledge-graph-based question answering (KGQA) and long-context LLM retrieval, constructed by systematic extraction of question–answer (QA) instances from a knowledge graph (KG) representation of domain-specific documents or Wikipedia-grounded entities. The term refers both to a 10,787-instance Wikidata-centric QA set for KGQA robustness testing (Zhang et al., 29 May 2025) and to the publicly released 4,418-instance QA subset of a larger 20,139-example long-context KG-based QA benchmark derived from SEC financial agreements (Tatarinov et al., 18 May 2025). Both variants share a principled approach to KG grounding, multi-hop and compositional reasoning, and systematic template- or LLM-in-the-loop generation, enabling controlled assessments of multi-hop reasoning, set operations, and semantic equivalence.

1. Construction Methodologies

KGQAGen-10k is the result of two related but distinct frameworks for KG-based QA.

For the Wikidata variant, KGQAGen employs a three-stage LLM-in-the-loop pipeline (a code sketch follows the list below):

  • Seed Subgraph Initialization: Seed entities (16,000) are drawn from Wikipedia’s Level-5 Vital Articles. For each seed $e$, a 1-hop Wikidata neighborhood (15 sampled triples) initializes $G_e^{(0)}$.
  • Iterative LLM-Guided Expansion: At each step $t$, an LLM examines $G_e^{(t)}$ to determine whether it suffices for a non-trivial, multi-hop question. If insufficient, it suggests entity sets $C_e^{(t)}$ for further 1-hop expansion (sampling 10–15 triples per entity), producing $G_e^{(t+1)}$. This proceeds until the subgraph supports a ≥2-hop question.
  • QA and Proof Generation: Upon sufficiency, the LLM produces a natural-language question $q_e$, an answer set $\mathcal{A}_e$, a minimal supporting subgraph $\mathcal{P}_e$, and a SPARQL query $\mathcal{Q}_e$, all serialized in strict JSON.
  • SPARQL Validation: The query $\mathcal{Q}_e$ is executed. If $\mathcal{A}_e = \tilde{\mathcal{U}}_e$ (the SPARQL result set), the instance is accepted. Otherwise, an automated LLM refinement cycle of up to three iterations attempts query correction; only aligned instances are retained.
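A minimal sketch of this generation loop is given below. All helpers (`sample_neighborhood`, `judge_sufficiency`, `generate_qa`, `run_sparql`, `refine_query`) and the expansion cap are hypothetical placeholders, not the released implementation; the subgraph is modeled simply as a set of triples.

```python
# Hedged sketch of the KGQAGen LLM-in-the-loop generation loop.
# Helper names and the expansion cap are assumptions, not the released code.

MAX_EXPANSIONS = 5          # assumption: some cap on expansion rounds
MAX_REFINEMENTS = 3         # per the description: up to three refinement iterations

def generate_instance(seed_entity, kg, llm):
    # Stage 1: seed subgraph = sampled 1-hop neighborhood of the seed entity
    subgraph = kg.sample_neighborhood(seed_entity, n_triples=15)

    # Stage 2: iterative LLM-guided expansion until a >=2-hop question is supported
    for _ in range(MAX_EXPANSIONS):
        verdict = llm.judge_sufficiency(subgraph)   # e.g. {"sufficient": bool, "expand": [entities]}
        if verdict["sufficient"]:
            break
        for entity in verdict["expand"]:
            subgraph |= kg.sample_neighborhood(entity, n_triples=12)  # 10-15 triples per entity
    else:
        return None  # never reached sufficiency within the cap

    # Stage 3: joint question / answer / proof / SPARQL generation (strict JSON)
    instance = llm.generate_qa(subgraph)  # keys: question, answers, proof_triples, sparql

    # Validation: accept only if SPARQL execution reproduces the stated answer set
    for _ in range(MAX_REFINEMENTS):
        if set(kg.run_sparql(instance["sparql"])) == set(instance["answers"]):
            return instance
        instance = llm.refine_query(subgraph, instance)  # ask the LLM to fix the query
    return None  # discard misaligned instances
```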

For the financial-domain variant (Tatarinov et al., 18 May 2025), 170 SEC credit-agreement documents are hand-annotated for seven key entity and relation types, each formalized as an RDF graph $(V, E, R)$. QA pairs are extracted by template-driven queries, parameterized by:

  • Hops ($H$): Path length through the KG (1, 2, or 3).
  • Plurality ($P$): Singleton or plural answer set.
  • Set Operations ($\#SO$): Union, intersection, or difference operations over answer sets, up to 2–3 per instance.

The composite difficulty level $L = H + P + \#SO$ partitions questions into easy ($L = 1$), medium ($2 \leq L \leq 4$), and hard ($L = 5$). Instance validity is enforced by graph traversal, operation cardinality, and schema constraints, with SPARQL-inspired pseudocode for reproducibility.
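As an illustration of this scoring, a minimal sketch of the bucketing (the function name is ours; the bucket boundaries are those stated above, with plurality encoded as 0/1 as in the table in Section 2):

```python
def difficulty_level(hops: int, plural: int, set_ops: int) -> tuple[int, str]:
    """Composite difficulty L = H + P + #SO, bucketed as in the benchmark."""
    level = hops + plural + set_ops
    if level == 1:
        bucket = "easy"
    elif 2 <= level <= 4:
        bucket = "medium"
    else:  # level >= 5
        bucket = "hard"
    return level, bucket

# e.g. a 2-hop question with a plural answer set and one set operation:
assert difficulty_level(2, 1, 1) == (4, "medium")
```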

2. Dataset Structure and Statistics

Wikidata variant (Zhang et al., 29 May 2025):

  • Final filtered instances: 10,787.
  • Question length: 7.5% short (<16 words), 61.1% moderate (16–30 words), 31.4% long (>30 words).
  • Answer cardinality: 84.5% single, 5.8% two, 9.7% three or more answers.
  • By construction: all questions require ≥2 hops.
  • Topic distribution: Arts (42.3%), Astronomy (17.3%), STEM (16%), with the remainder spread across other domains.

Financial variant (Tatarinov et al., 18 May 2025):

  • Total instances: 20,139; public “Dev” split (KGQAGen-10k): 4,418.
  • Difficulty distribution (Dev, 4,418): 1,499 easy ($L = 1$), 2,680 medium ($2 \leq L \leq 4$), 239 hard ($L = 5$).
  • QA pairs per document: 14.75 on average (Dev), up to 83 for a single document.
  • Complexity breakdown: explicit partitioning by hops, plurality, and set-operation count (see the table below).
Hops (H) | Plurality (P) | Set Ops (#SO) | Difficulty (L) | # Questions
1        | 0             | 0             | 1              | 3,100
2        | 1             | 1             | 4              | 1,250
3        | 0             | 2             | 5              | 500

Table: Illustrative complexity breakdown; full table in (Tatarinov et al., 18 May 2025).

3. Formal Knowledge Graph Schemas

Both benchmarks utilize an explicit KG formalism:

Wikidata variant:

  • Model: $\mathcal{G} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$, with entities $\mathcal{E}$ and relations $\mathcal{R}$.
  • Subgraphs $G_e^{(t)}$: built iteratively around seed $e$; expansions determined by LLM reasoning-sufficiency judgments.
  • Supporting evidence: each QA pair is linked to its minimal supporting triples and an executable SPARQL query.

Financial variant:

  • Model: $G = (V, E, R)$; $V$ = entity instances, $R$ = edge types (hasRole, hasSubRole, hasPosition, etc.), $E$ = directed triples.
  • Ontology layer: defines the type and subclass hierarchy (e.g., PersonPosition rdfs:subClassOf Person); a schema sketch follows this list.
  • Data layer: per-document instantiation; the global corpus aggregate is denoted $G_{\mathrm{global}}$.
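A minimal sketch of how such an ontology layer could be expressed with rdflib; the namespace URI and any class or instance names beyond those quoted above are illustrative assumptions, not the released RDF/Turtle schema.

```python
from rdflib import Graph, Namespace, RDF, RDFS, Literal

# Hypothetical namespace; the released schema defines its own URIs.
FIN = Namespace("http://example.org/credit-agreement#")

g = Graph()
g.bind("fin", FIN)

# Ontology layer: type and subclass hierarchy (e.g., PersonPosition rdfs:subClassOf Person).
g.add((FIN.Person, RDF.type, RDFS.Class))
g.add((FIN.PersonPosition, RDF.type, RDFS.Class))
g.add((FIN.PersonPosition, RDFS.subClassOf, FIN.Person))

# Data layer: per-document instances connected by typed edges such as hasRole / hasPosition.
g.add((FIN.borrower_acme, RDF.type, FIN.Person))           # hypothetical instance
g.add((FIN.borrower_acme, FIN.hasRole, Literal("Borrower")))

print(g.serialize(format="turtle"))
```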

4. Evaluation Protocols and Metrics

Evaluation emphasizes both exact-set accuracy and semantic robustness:

  • Splits: Train/Dev/Test (Wikidata: 8,629/1,079/1,079; Financial: 4,418 public dev, 15,721 held-out test).
  • Matching schemes:
  1. Exact Match (EM): String equality after normalization.
  2. LLM-Assisted Semantic Match (LASM): GPT-4o-mini determines semantic equivalence if EM fails.
  • Metrics (a scoring sketch follows this list):
    • Accuracy (set match),
    • Hit@1 (top answer),
    • Precision, Recall, F1 (set overlap),
    • Word-level F1,
    • Normalized Levenshtein (EditDist),
    • Cosine Similarity (embedding),
    • LLM-as-a-Judge (1–5 scale).
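A minimal sketch of the exact-match and set-overlap scoring above, assuming a simple whitespace/case normalization (the benchmark's exact normalization rules may differ, and the LASM fallback to an LLM judge is omitted):

```python
def normalize(s: str) -> str:
    """Naive normalization; the benchmark's exact rules may differ."""
    return " ".join(s.lower().strip().split())

def exact_match(predicted: list[str], gold: list[str]) -> bool:
    """EM: predicted answer set equals the gold set after normalization."""
    return {normalize(p) for p in predicted} == {normalize(g) for g in gold}

def set_prf(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Precision / Recall / F1 over the normalized answer sets."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: a partially correct plural prediction.
print(set_prf(["Johann Martin Schleyer", "Alfred Nobel"], ["Johann Martin Schleyer"]))
# -> (0.5, 1.0, 0.666...)
```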

Models evaluated include GPT-4.1, GPT-4o, DeepSeek-V3, PoG (GPT-4o), GCR (LLaMA-3.1 + GPT-4o), and others.

5. Empirical Results and Analysis

  • Wikidata KGQAGen-10k (Zhang et al., 29 May 2025):
    • Best LLMs (GPT-4.1): EM/LASM accuracy ≈47.4%/57.0%.
    • KG-RAG models: EM/LASM accuracy up to 51%/61% (PoG (GPT-4o)).
    • Upper bound (LLM-SP): LASM accuracy 84.9% with gold subgraph retrieval, confirming retrieval as primary limitation.
    • Failure patterns: Multi-hop (≥3), complex compositional, ambiguous retrieval, and SPARQL numeric/comparison constructs.
  • Financial KGQAGen-10k (Tatarinov et al., 18 May 2025):
    • Model F1: Drops from Easy ≈0.67 to Hard ≈0.12 (e.g., Gemini-2.0-Flash).
    • Set-ops and multi-hop: High failure (“not found” up to 77%), semantic drift, “lost-in-the-middle” phenomenon for long-context inputs.

Interpretation: These results indicate persistent failure modes in reasoning over multi-hop and compositional KG structures and expose the limitations of current LLMs and retrieval-augmented methods.

6. Quality Assurance, Pitfall Diagnosis, and Illustrative Cases

  • Problems in prior KGQA (WebQSP, CWQ): Incomplete answers, outdated facts, ambiguous queries, and rigid evaluation.
  • KGQAGen-10k safeguards:
  1. Grounding in latest Wikidata (April 2025).
  2. LLM-guided multi-hop generation to enforce question depth.
  3. Joint question–answer–proof output, ensuring direct KG evidence links.
  4. Symbolic SPARQL verification at generation time, with 96.3% factual accuracy in a manual audit.
  5. SPARQL re-execution enables periodic revalidation as Wikidata evolves (a revalidation sketch follows this list).
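A minimal sketch of such revalidation against the public Wikidata endpoint using SPARQLWrapper; the field names of the stored instance ('sparql', 'answers') are assumptions about the released JSON format, not its documented schema.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def revalidate(instance: dict) -> bool:
    """Re-execute the stored SPARQL query and check the stored answers are still retrievable.

    `instance` is assumed to carry 'sparql' (query string) and 'answers'
    (list of answer values); the released format may differ.
    """
    sparql = SPARQLWrapper(WIKIDATA_ENDPOINT)
    sparql.setQuery(instance["sparql"])
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    # Collect the values of every projected variable in every result row.
    live_answers = {
        binding[var]["value"]
        for binding in results["results"]["bindings"]
        for var in binding
    }
    # Accept if all stored gold answers are still among the live results.
    return set(instance["answers"]) <= live_answers
```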

Representative Instances

Example | Question | Supporting Triples | Answer
Johann Martin Schleyer | “Who among the nominees for the Nobel Peace Prize was also the founder of the International Volapük Academy?” | (Q12712) — P1411 — (Q35637); (Q3358168) — P112 — (Q12712) | Johann Martin Schleyer (Q12712)
Astronomy & Astrophysics | “What astronomical journal, published by EDP Sciences and edited by Thierry Forveille, succeeded Zeitschrift für Astrophysik as its immediate follower?” | Q752075 — P123 — Q114404; Q752075 — P98 — Q46260676; Q3575110 — P156 — Q752075 | Astronomy and Astrophysics (Q752075)

These structured samples illustrate the multi-hop and compositional demands encoded in KGQAGen-10k.

7. Impact and Recommendations

KGQAGen-10k establishes new standards for KGQA robustness, overcoming factual and methodological pitfalls prevalent in widely-used datasets. Both the Wikidata and financial-domain variants offer systematic multi-hop, compositional, and set-operation challenges, which reveal persistent upper limits for even state-of-the-art LLMs and KG-RAG methods. Recommendations include:

  • Integrating graph-reasoning neural modules or fine-tuning on template-driven multi-hop QA.
  • Developing retrieval-augmented and context-chunking approaches that preserve cross-chunk links for long-context LLMs.
  • Generalizing to new domains (e.g., legal contracts, global policy) for broader evaluation.

Public releases of code, KG schemas (RDF/Turtle), extraction queries, and leaderboard infrastructure for KGQAGen-10k facilitate standardized benchmarking and reproducibility in KGQA and long-context LLM evaluation research (Tatarinov et al., 18 May 2025, Zhang et al., 29 May 2025).
