KGQAGen-10k Benchmark: KGQA & LLM Retrieval

Updated 4 January 2026
  • KGQAGen-10k is a comprehensive benchmark for evaluating knowledge-graph QA and long-context retrieval, enabling controlled multi-hop and compositional reasoning assessments.
  • It employs systematic LLM-in-the-loop and template-driven methodologies to extract QA pairs from Wikidata and financial documents for rigorous testing.
  • Empirical results highlight challenges in multi-hop reasoning and set operations, exposing current limitations in LLMs and KG retrieval-augmented methods.

KGQAGen-10k is an evaluation benchmark for knowledge-graph-based question answering (KGQA) and long-context LLM retrieval, constructed by systematic extraction of question–answer (QA) instances from a knowledge graph (KG) representation of domain-specific documents or Wikipedia-grounded entities. The term refers both to a 10,787-instance Wikidata-centric QA set for KGQA robustness testing (Zhang et al., 29 May 2025) and to the publicly released 4,418-instance QA subset of a larger 20,139-example long-context KG-based QA benchmark derived from SEC financial agreements (Tatarinov et al., 18 May 2025). Both variants share a principled approach to KG grounding, multi-hop and compositional reasoning, and systematic template- or LLM-in-the-loop generation, enabling controlled assessments of multi-hop reasoning, set operations, and semantic equivalence.

1. Construction Methodologies

KGQAGen-10k is the result of two related but distinct frameworks for KG-based QA.

For the Wikidata variant, KGQAGen employs a three-stage LLM-in-the-loop pipeline (a code sketch follows the list below):

  • Seed Subgraph Initialization: Seed entities (16,000) are drawn from Wikipedia’s Level-5 Vital Articles. For each seed $e$, a 1-hop Wikidata neighborhood (15 sampled triples) initializes $G_e^{(0)}$.
  • Iterative LLM-Guided Expansion: At each step $t$, an LLM examines $G_e^{(t)}$ to determine whether it suffices for a non-trivial, multi-hop question. If insufficient, it suggests entity sets $C_e^{(t)}$ for further 1-hop expansion (sampling 10–15 triples per entity), producing $G_e^{(t+1)}$. This proceeds until the subgraph supports a ≥2-hop question.
  • QA and Proof Generation: Upon sufficiency, the LLM produces a natural-language question $q_e$, an answer set $\mathcal{A}_e$, a minimal supporting subgraph $\mathcal{P}_e$, and a SPARQL query $\mathcal{Q}_e$, all serialized in strict JSON.
  • SPARQL Validation: The query $\mathcal{Q}_e$ is executed. If $\mathcal{A}_e = \tilde{\mathcal{U}}_e$ (the SPARQL result set), the instance is accepted. Otherwise, an automated LLM refinement cycle of up to three iterations attempts query correction; only aligned instances are retained.
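A minimal sketch of this generation loop is given below. All helpers (`sample_neighborhood`, `judge_sufficiency`, `generate_qa`, `run_sparql`, `refine_query`) and the expansion cap are hypothetical placeholders, not the released implementation; the subgraph is modeled simply as a set of triples.

```python
# Hedged sketch of the KGQAGen LLM-in-the-loop generation loop.
# Helper names and the expansion cap are assumptions, not the released code.

MAX_EXPANSIONS = 5          # assumption: some cap on expansion rounds
MAX_REFINEMENTS = 3         # per the description: up to three refinement iterations

def generate_instance(seed_entity, kg, llm):
    # Stage 1: seed subgraph = sampled 1-hop neighborhood of the seed entity
    subgraph = kg.sample_neighborhood(seed_entity, n_triples=15)

    # Stage 2: iterative LLM-guided expansion until a >=2-hop question is supported
    for _ in range(MAX_EXPANSIONS):
        verdict = llm.judge_sufficiency(subgraph)   # e.g. {"sufficient": bool, "expand": [entities]}
        if verdict["sufficient"]:
            break
        for entity in verdict["expand"]:
            subgraph |= kg.sample_neighborhood(entity, n_triples=12)  # 10-15 triples per entity
    else:
        return None  # never reached sufficiency within the cap

    # Stage 3: joint question / answer / proof / SPARQL generation (strict JSON)
    instance = llm.generate_qa(subgraph)  # keys: question, answers, proof_triples, sparql

    # Validation: accept only if SPARQL execution reproduces the stated answer set
    for _ in range(MAX_REFINEMENTS):
        if set(kg.run_sparql(instance["sparql"])) == set(instance["answers"]):
            return instance
        instance = llm.refine_query(subgraph, instance)  # ask the LLM to fix the query
    return None  # discard misaligned instances
```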

For the financial-domain variant (Tatarinov et al., 18 May 2025), 170 SEC credit-agreement documents are hand-annotated for seven key entity and relation types, each formalized as an RDF graph $(V, E, R)$. QA pairs are extracted by template-driven queries, parameterized by:

  • Hops ($H$): Path length through the KG (1, 2, or 3).
  • Plurality ($P$): Singleton or plural answer set.
  • Set Operations ($\#SO$): Union, intersection, or difference operations over answer sets, up to 2–3 per instance.

The composite difficulty level $L = H + P + \#SO$ partitions questions into easy ($L = 1$), medium ($2 \leq L \leq 4$), and hard ($L = 5$). Instance validity is enforced by graph traversal, operation cardinality, and schema constraints, with SPARQL-inspired pseudocode for reproducibility.
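As an illustration of this scoring, a minimal sketch of the bucketing (the function name is ours; the bucket boundaries are those stated above, with plurality encoded as 0/1 as in the table in Section 2):

```python
def difficulty_level(hops: int, plural: int, set_ops: int) -> tuple[int, str]:
    """Composite difficulty L = H + P + #SO, bucketed as in the benchmark."""
    level = hops + plural + set_ops
    if level == 1:
        bucket = "easy"
    elif 2 <= level <= 4:
        bucket = "medium"
    else:  # level >= 5
        bucket = "hard"
    return level, bucket

# e.g. a 2-hop question with a plural answer set and one set operation:
assert difficulty_level(2, 1, 1) == (4, "medium")
```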

2. Dataset Structure and Statistics

Wikidata variant (Zhang et al., 29 May 2025):

  • Final filtered instances: 10,787.
  • Question length: 7.5% short (<16 words), 61.1% moderate (16–30 words), 31.4% long (>30 words).
  • Answer cardinality: 84.5% single, 5.8% two, 9.7% three or more answers.
  • By construction: all questions require ≥2 hops.
  • Topic distribution: Arts (42.3%), Astronomy (17.3%), STEM (16%), with the remainder spread across other domains.

Financial variant (Tatarinov et al., 18 May 2025):

  • Total instances: 20,139; public “Dev” split (KGQAGen-10k): 4,418.
  • Difficulty distribution (Dev, 4,418): 1,499 easy ($L = 1$), 2,680 medium ($2 \leq L \leq 4$), 239 hard ($L = 5$).
  • QA pairs per document: 14.75 on average (Dev), up to 83 for a single document.
  • Complexity breakdown: explicit partitioning by hops, plurality, and set-operation count (see the table below).
Hops (H) | Plurality (P) | Set Ops (#SO) | Difficulty (L) | # Questions
1        | 0             | 0             | 1              | 3,100
2        | 1             | 1             | 4              | 1,250
3        | 0             | 2             | 5              | 500

Table: Illustrative complexity breakdown; full table in (Tatarinov et al., 18 May 2025).

3. Formal Knowledge Graph Schemas

Both benchmarks utilize an explicit KG formalism:

Wikidata variant:

  • Model: $\mathcal{G} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$, with entities $\mathcal{E}$ and relations $\mathcal{R}$.
  • Subgraphs $G_e^{(t)}$: built iteratively around seed $e$; expansions determined by LLM reasoning-sufficiency judgments.
  • Supporting evidence: each QA pair is linked to its minimal supporting triples and an executable SPARQL query.

Financial variant:

  • Model: $G = (V, E, R)$; $V$ = entity instances, $R$ = edge types (hasRole, hasSubRole, hasPosition, etc.), $E$ = directed triples.
  • Ontology layer: defines the type and subclass hierarchy (e.g., PersonPosition rdfs:subClassOf Person); a schema sketch follows this list.
  • Data layer: per-document instantiation; the global corpus aggregate is denoted $G_{\mathrm{global}}$.
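A minimal sketch of how such an ontology layer could be expressed with rdflib; the namespace URI and any class or instance names beyond those quoted above are illustrative assumptions, not the released RDF/Turtle schema.

```python
from rdflib import Graph, Namespace, RDF, RDFS, Literal

# Hypothetical namespace; the released schema defines its own URIs.
FIN = Namespace("http://example.org/credit-agreement#")

g = Graph()
g.bind("fin", FIN)

# Ontology layer: type and subclass hierarchy (e.g., PersonPosition rdfs:subClassOf Person).
g.add((FIN.Person, RDF.type, RDFS.Class))
g.add((FIN.PersonPosition, RDF.type, RDFS.Class))
g.add((FIN.PersonPosition, RDFS.subClassOf, FIN.Person))

# Data layer: per-document instances connected by typed edges such as hasRole / hasPosition.
g.add((FIN.borrower_acme, RDF.type, FIN.Person))           # hypothetical instance
g.add((FIN.borrower_acme, FIN.hasRole, Literal("Borrower")))

print(g.serialize(format="turtle"))
```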

4. Evaluation Protocols and Metrics

Evaluation emphasizes both exact-set accuracy and semantic robustness:

  • Splits: Train/Dev/Test (Wikidata: 8,629/1,079/1,079; Financial: 4,418 public dev, 15,721 held-out test).
  • Matching schemes:
  1. Exact Match (EM): String equality after normalization.
  2. LLM-Assisted Semantic Match (LASM): GPT-4o-mini determines semantic equivalence if EM fails.
  • Metrics (a scoring sketch follows this list):
    • Accuracy (set match),
    • Hit@1 (top answer),
    • Precision, Recall, F1 (set overlap),
    • Word-level F1,
    • Normalized Levenshtein (EditDist),
    • Cosine Similarity (embedding),
    • LLM-as-a-Judge (1–5 scale).
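A minimal sketch of the exact-match and set-overlap scoring above, assuming a simple whitespace/case normalization (the benchmark's exact normalization rules may differ, and the LASM fallback to an LLM judge is omitted):

```python
def normalize(s: str) -> str:
    """Naive normalization; the benchmark's exact rules may differ."""
    return " ".join(s.lower().strip().split())

def exact_match(predicted: list[str], gold: list[str]) -> bool:
    """EM: predicted answer set equals the gold set after normalization."""
    return {normalize(p) for p in predicted} == {normalize(g) for g in gold}

def set_prf(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Precision / Recall / F1 over the normalized answer sets."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: a partially correct plural prediction.
print(set_prf(["Johann Martin Schleyer", "Alfred Nobel"], ["Johann Martin Schleyer"]))
# -> (0.5, 1.0, 0.666...)
```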

Models evaluated include GPT-4.1, GPT-4o, DeepSeek-V3, PoG (GPT-4o), GCR (LLaMA-3.1 + GPT-4o), and others.

5. Empirical Results and Analysis

  • Wikidata KGQAGen-10k (Zhang et al., 29 May 2025):
    • Best LLMs (GPT-4.1): EM/LASM accuracy ≈47.4%/57.0%.
    • KG-RAG models: EM/LASM accuracy up to 51%/61% (PoG (GPT-4o)).
    • Upper bound (LLM-SP): LASM accuracy 84.9% with gold subgraph retrieval, confirming retrieval as primary limitation.
    • Failure patterns: Multi-hop (≥3), complex compositional, ambiguous retrieval, and SPARQL numeric/comparison constructs.
  • Financial KGQAGen-10k (Tatarinov et al., 18 May 2025):
    • Model F1: Drops from Easy ≈0.67 to Hard ≈0.12 (e.g., Gemini-2.0-Flash).
    • Set-ops and multi-hop: High failure (“not found” up to 77%), semantic drift, “lost-in-the-middle” phenomenon for long-context inputs.

Interpretation: These results indicate persistent failure modes in reasoning over multi-hop and compositional KG structures and expose the limitations of current LLMs and retrieval-augmented methods.

6. Quality Assurance, Pitfall Diagnosis, and Illustrative Cases

  • Problems in prior KGQA (WebQSP, CWQ): Incomplete answers, outdated facts, ambiguous queries, and rigid evaluation.
  • KGQAGen-10k safeguards:
  1. Grounding in latest Wikidata (April 2025).
  2. LLM-guided multi-hop generation to enforce question depth.
  3. Joint question–answer–proof output, ensuring direct KG evidence links.
  4. Symbolic SPARQL verification at generation time, with 96.3% factual accuracy in a manual audit.
  5. SPARQL re-execution enables periodic revalidation as Wikidata evolves (a revalidation sketch follows this list).
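A minimal sketch of such revalidation against the public Wikidata endpoint using SPARQLWrapper; the field names of the stored instance ('sparql', 'answers') are assumptions about the released JSON format, not its documented schema.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def revalidate(instance: dict) -> bool:
    """Re-execute the stored SPARQL query and check the stored answers are still retrievable.

    `instance` is assumed to carry 'sparql' (query string) and 'answers'
    (list of answer values); the released format may differ.
    """
    sparql = SPARQLWrapper(WIKIDATA_ENDPOINT)
    sparql.setQuery(instance["sparql"])
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    # Collect the values of every projected variable in every result row.
    live_answers = {
        binding[var]["value"]
        for binding in results["results"]["bindings"]
        for var in binding
    }
    # Accept if all stored gold answers are still among the live results.
    return set(instance["answers"]) <= live_answers
```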

Representative Instances

Example | Question | Supporting Triples | Answer
Johann Martin Schleyer | “Who among the nominees for the Nobel Peace Prize was also the founder of the International Volapük Academy?” | (Q12712) — P1411 — (Q35637); (Q3358168) — P112 — (Q12712) | Johann Martin Schleyer (Q12712)
Astronomy & Astrophysics | “What astronomical journal, published by EDP Sciences and edited by Thierry Forveille, succeeded Zeitschrift für Astrophysik as its immediate follower?” | Q752075 — P123 — Q114404; Q752075 — P98 — Q46260676; Q3575110 — P156 — Q752075 | Astronomy and Astrophysics (Q752075)

These structured samples illustrate the multi-hop and compositional demands encoded in KGQAGen-10k.

7. Impact and Recommendations

KGQAGen-10k establishes new standards for KGQA robustness, overcoming factual and methodological pitfalls prevalent in widely-used datasets. Both the Wikidata and financial-domain variants offer systematic multi-hop, compositional, and set-operation challenges, which reveal persistent upper limits for even state-of-the-art LLMs and KG-RAG methods. Recommendations include:

  • Integrating graph-reasoning neural modules or fine-tuning on template-driven multi-hop QA.
  • Developing retrieval-augmented and context-chunking approaches that preserve cross-chunk links for long-context LLMs.
  • Generalizing to new domains (e.g., legal contracts, global policy) for broader evaluation.

Public releases of code, KG schemas (RDF/Turtle), extraction queries, and leaderboard infrastructure for KGQAGen-10k facilitate standardized benchmarking and reproducibility in KGQA and long-context LLM evaluation research (Tatarinov et al., 18 May 2025, Zhang et al., 29 May 2025).
