Knowledge Graph & Complex Query Datasets

Updated 15 October 2025
  • Knowledge Graph and Complex Query Answering (CQA) datasets pair structured graph representations of entities and relations with diverse query forms that require multi-hop logical reasoning.
  • Dataset construction combines abstract query graph enumeration, grounding over a base knowledge graph, and the incorporation of logical operators to yield non-trivial answer sets.
  • Benchmarks leverage detailed metrics and increasingly complex query types to drive robust, interpretable, and scalable KGQA system development.

A knowledge graph (KG) is a structured, graph-based data representation in which entities, concepts, and their relationships are modeled as nodes and edges. Complex query answering (CQA) over KGs involves retrieving or inferring answers to logically or structurally sophisticated queries, often requiring multi-hop reasoning or the composition of multiple relations, frequently in the presence of incomplete, ambiguous, or noisy data. The development of representative CQA datasets has been central to progress in this field, enabling benchmarking of algorithms addressing diverse query types, logical operators (conjunction, disjunction, negation, existential quantification), and graph features such as hyper-relationality or temporality.
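
As a minimal illustration of the data model and of multi-hop reasoning, the sketch below stores a toy KG as (head, relation, tail) triples and answers a two-hop conjunctive query by composing one-hop projections. The entities, relations, and query are hypothetical and chosen only for illustration.

```python
# Minimal sketch: a toy KG as (head, relation, tail) triples, and a two-hop
# conjunctive query answered by composing one-hop projections.
from collections import defaultdict

triples = [
    ("alan_turing", "born_in", "london"),
    ("london", "located_in", "england"),
    ("ada_lovelace", "born_in", "london"),
    ("grace_hopper", "born_in", "new_york"),
    ("new_york", "located_in", "usa"),
]

# Index (head, relation) -> set of tails for cheap one-hop traversal.
index = defaultdict(set)
for h, r, t in triples:
    index[(h, r)].add(t)

def project(entities, relation):
    """Project a set of entities through one relation (existential quantification)."""
    out = set()
    for e in entities:
        out |= index[(e, relation)]
    return out

# Query: "In which country was Alan Turing born?"
# Logical form: ?c . exists v : born_in(alan_turing, v) AND located_in(v, ?c)
answers = project(project({"alan_turing"}, "born_in"), "located_in")
print(answers)  # {'england'}
```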

1. Evolution and Taxonomy of CQA Benchmarks

Early KGQA and CQA datasets were grounded in triple-based knowledge graphs and primarily focused on atomic queries (single-relation or one-hop link prediction) as in FB15k or DBpedia. The introduction of multi-hop and multi-variable queries, compositional reasoning, and benchmarks targeting emerging logical challenges brought about a differentiation in dataset construction paradigms:

  • Pattern-based and template-based datasets (e.g., LC-QuAD, QALD) rely on a relatively small set of question/query templates covering simple to moderately complex operations, and often emphasize coverage or user relevance.
  • Combinatorial and generative benchmarks (e.g., EFO-1-QA (Wang et al., 2021), EFO_k-CQA (Yin et al., 2023), WD50K-NFOL (Luo et al., 2022), Spider4SPARQL (Kosten et al., 2023)) explicitly enumerate a large space of logical query types, operators, or graph topologies, extending far beyond pattern-based approaches.
  • Hyper-relational and event-centric datasets (e.g., WD50K-NFOL (Luo et al., 2022), JF17k-HCQA and M-FB15k-HCQA (Tsang et al., 23 Apr 2025), ASER-based CEQA (Bai et al., 2023)) sample or construct queries over KGs with n-ary facts, temporal/eventuality semantics, or implicit logical constraints.
  • NLQ–SPARQL pair datasets (e.g., Spider4SPARQL (Kosten et al., 2023), QALD, LC-QuAD) emphasize natural language interfaces for user queries, paired with executable logical forms.

These diverse benchmarks facilitate a layered evaluation: from pure logical form execution to full-stack question answering in natural language.

2. Dataset Design Principles and Construction Methodology

The technical methodology for generating complex query answering datasets involves several key stages:

  • Abstract query graph enumeration: Formalized in EFO_k-CQA (Yin et al., 2023), CQA queries are represented as constraint satisfaction problems (CSPs) over graphs whose nodes correspond to variables (free or existential) and constants, and whose edges encode positive or negated relational constraints. Enumeration routines ensure a tractable yet expressive coverage of query types (e.g., connected, non-redundant, non-decomposable graphs with variable arity and negation).
  • Grounding and answer computation: For each abstract graph or template (e.g., in EFO-CQA or EFO-1-QA (Wang et al., 2021)), queries are instantiated (“grounded”) over a given knowledge graph such that they yield non-trivial answer sets. Negative edges are handled by ensuring that negated constraints remove meaningful subsets of the answer set, often utilizing CSP or backtracking solvers (see the grounding sketch after this list).
  • Incorporation of logical operators and normal forms: Benchmarks such as EFO-1-QA and WD50K-NFOL (Luo et al., 2022) carefully vary logical operator systems (e.g., set versus binary operators, De Morgan’s normal form, DNF), enabling tests of model transferability across logically equivalent but structurally distinct queries (a normal-form rewriting sketch appears at the end of this section).
  • Implicit constraint modeling: CEQA datasets (Bai et al., 2023) incorporate not only explicit relation constraints but also implicit logical requirements (e.g., occurrence and temporal ordering of events), verified via theorem provers (e.g., Z3) to ensure logical consistency.
  • Dynamic NL/SPARQL pairing: Datasets such as Spider4SPARQL (Kosten et al., 2023) leverage automatic translation of NL/SQL resources into NL/SPARQL/KG triples while supporting a spectrum of query “hardness” from simple selection to multi-hop, aggregate, or nested queries.
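
The grounding bullet above can be made concrete with a small sketch. The snippet below answers one grounded query graph with a negated edge by naive enumeration over a toy fact set; the facts, relation names, and query are hypothetical, and real benchmark pipelines use far more efficient CSP solvers and sampling strategies.

```python
# Illustrative sketch: answering a grounded query graph by naive enumeration.
# Query (free variable ?y, existential variable ?x, one negated edge):
#   ?y . exists ?x : works_in(?x, ?y) AND NOT collaborates(?x, bob)
facts = {
    ("alice", "works_in", "nlp"),
    ("carol", "works_in", "vision"),
    ("alice", "collaborates", "bob"),
}
entities = {e for (h, _, t) in facts for e in (h, t)}

# Query graph edges: (subject, relation, object, negated?); strings starting
# with "?" are variables, everything else is a constant.
query_edges = [
    ("?x", "works_in", "?y", False),
    ("?x", "collaborates", "bob", True),
]

def satisfies(assignment):
    """Check every edge of the query graph under a full variable assignment."""
    for s, r, o, negated in query_edges:
        s = assignment.get(s, s)
        o = assignment.get(o, o)
        holds = (s, r, o) in facts
        if holds == negated:  # a positive edge must hold, a negated edge must not
            return False
    return True

# Naive grounding: enumerate assignments for all variables, project onto ?y.
answers = {y for x in entities for y in entities if satisfies({"?x": x, "?y": y})}
print(answers)  # {'vision'}: the negated edge removes 'nlp', reachable only via alice
```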

These methodologies are tailored to ensure both expressive logical coverage and meaningful, non-trivial answer sets, supporting the evaluation of both symbolic and neural CQA methods.
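
To illustrate the operator-system variation referenced above, the sketch below rewrites one Boolean query formula into a logically equivalent but structurally different form by pushing negation inward with De Morgan's laws. The formula encoding (nested tuples) is an assumption made for illustration, not the representation used by any particular benchmark.

```python
# Illustrative sketch: the same query in two logically equivalent forms.
# Formulas are nested tuples ("AND", f, g), ("OR", f, g), ("NOT", f), or atom strings.

def push_negation(f):
    """Push NOT inward using De Morgan's laws (toward negation normal form)."""
    if not isinstance(f, tuple):
        return f                                   # atom
    op = f[0]
    if op == "NOT":
        g = f[1]
        if isinstance(g, tuple) and g[0] == "AND":
            return ("OR", push_negation(("NOT", g[1])), push_negation(("NOT", g[2])))
        if isinstance(g, tuple) and g[0] == "OR":
            return ("AND", push_negation(("NOT", g[1])), push_negation(("NOT", g[2])))
        if isinstance(g, tuple) and g[0] == "NOT":
            return push_negation(g[1])             # double negation
        return f                                   # negation of an atom
    return (op, push_negation(f[1]), push_negation(f[2]))

# "Entities related to a by r1, but not related to b by r2 nor to c by r3."
query = ("AND", "r1(a,y)", ("NOT", ("OR", "r2(b,y)", "r3(c,y)")))
print(push_negation(query))
# ('AND', 'r1(a,y)', ('AND', ('NOT', 'r2(b,y)'), ('NOT', 'r3(c,y)')))
```

A model that only ever sees queries in one of these forms may fail on the other, which is the transferability gap such benchmarks are designed to expose.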

3. Representative Benchmarks: Scope and Metrics

A range of contemporary datasets illustrate the breadth and depth of modern CQA evaluation:

| Dataset | Key Features | Reference |
|---|---|---|
| EFO-1-QA | 301 EFO-1 query types, 7 operator systems, DNF/DM | (Wang et al., 2021) |
| EFO_k-CQA | 741 query types, multi-variable, beyond set ops | (Yin et al., 2023) |
| WD50K-NFOL | N-ary queries, existential, conjunction, negation | (Luo et al., 2022) |
| JF17k/M-FB15k-HCQA | Hypergraph, n-ary, 14 query types incl. negation | (Tsang et al., 23 Apr 2025) |
| CEQA (ASER) | Eventualities, implicit temporal/occurrence logic | (Bai et al., 2023) |
| Spider4SPARQL | 4,700+ SPARQL, 9,000+ NL, 166 KGs, 138 domains | (Kosten et al., 2023) |

To evaluate CQA methods, benchmarks employ a multiplicity of metrics aligned with the structure of the answer space:

  • Single free variable (EFO-1): Standard ranking metrics (MRR, Hits@K) on projected answers (a minimal scoring sketch follows this list).
  • Multiple free variables (EFO_k): Marginal (per-variable), multiply (product of individual ranks), and joint metrics (ranking entire answer tuples in the combinatorial space).
  • Execution-based accuracy: Used in full-stack KGQA, this metric compares the executed SPARQL query’s answer set to the ground truth; in Spider4SPARQL, this reduces to strict (exact match) set comparison.
  • Logical consistency and task-specific measures: In CEQA, logical consistency is ensured by theorem proving; Hits@1 and MRR are reported on non-contradictory answer sets.
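
The snippet below sketches the single-variable ranking metrics and the strict execution-accuracy comparison in their simplest form; it omits the filtered-ranking protocols and tie-breaking rules that individual benchmarks specify, and the ranked list and gold answers are hypothetical.

```python
# Illustrative sketch (not any benchmark's official scorer).

def mrr_and_hits(ranked_entities, gold_answers, k=10):
    """Mean reciprocal rank and Hits@k, averaged over a set of gold answers.
    ranked_entities: entities sorted by predicted score, best first."""
    rr, hits = [], []
    for gold in gold_answers:
        rank = ranked_entities.index(gold) + 1 if gold in ranked_entities else float("inf")
        rr.append(1.0 / rank)
        hits.append(1.0 if rank <= k else 0.0)
    return sum(rr) / len(rr), sum(hits) / len(hits)

def execution_accuracy(predicted_answers, gold_answers):
    """Strict exact-match set comparison, as used for executed SPARQL results."""
    return float(set(predicted_answers) == set(gold_answers))

ranked = ["england", "france", "usa"]
print(mrr_and_hits(ranked, {"england", "usa"}, k=2))  # (0.666..., 0.5)
print(execution_accuracy(["england"], {"england"}))   # 1.0
```

Joint metrics for multiple free variables apply the same ranking idea to whole answer tuples rather than to individual entities, which is why the candidate space, and hence the difficulty, grows combinatorially.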

The differentiation of metrics is necessitated by the combinatorial increase in answer set size as queries grow in complexity and dimensionality.

4. Impact on CQA Model Development and Analysis

The introduction of systematically constructed CQA datasets has driven several advances and insights in model development and analysis:

  • Combinatorial generalization: EFO-1-QA (Wang et al., 2021) demonstrates that model training on restricted operator combinations (e.g., up to 2-hop or simple intersections) may not guarantee generalization to deeper or more composite queries. The choice of operators and normal forms can markedly influence generalization.
  • Handling of negation, disjunction, and multi-variable queries: WD50K-NFOL (Luo et al., 2022) and EFO_k-CQA (Yin et al., 2023) expose the limitations of models designed only for set-operator queries and force the handling of true logical formulas with disjunction and negation beyond tree-like structures.
  • Benchmark-driven interpretability and transparency: QTO (Bai et al., 2022) and CKGC (Xiao et al., 30 Sep 2024) highlight benchmarks requiring not only answer prediction but also the ability to recover or interpret intermediate reasoning steps, necessitating models capable of providing explanations or variable assignments.
  • Efficiency and scalability constraints: As illustrated in (Fei et al., 13 May 2025), the explosion of the combinatorial search space in symbolic approaches (especially for cyclic queries) compels the use of domain-reduction heuristics and approximate search, which are best tested on benchmarks with large entity sets and complex query graphs (a minimal domain-reduction sketch follows this list).
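
The following sketch shows the kind of domain-reduction (constraint-propagation) step such symbolic approaches rely on: candidate entities that cannot satisfy some query edge are pruned from each variable's domain before any backtracking search. The fact set, relations, and variable names are hypothetical, and this is not the specific heuristic of any cited paper.

```python
# Illustrative sketch: arc-consistency style domain reduction for a conjunctive query.
facts = {("a", "r", "b"), ("a", "r", "c"), ("b", "s", "d")}
domains = {"?x": {"a", "b", "c", "d"}, "?y": {"a", "b", "c", "d"}}
edges = [("?x", "r", "?y"), ("?y", "s", "d")]  # query: r(?x, ?y) AND s(?y, d)

def reduce_domains(domains, edges, facts):
    changed = True
    while changed:  # iterate until a fixpoint is reached
        changed = False
        for s, r, o in edges:
            if s in domains:  # keep subjects that have at least one compatible object
                keep = {e for e in domains[s]
                        if any((e, r, t) in facts for t in domains.get(o, {o}))}
                if keep != domains[s]:
                    domains[s], changed = keep, True
            if o in domains:  # keep objects that have at least one compatible subject
                keep = {e for e in domains[o]
                        if any((h, r, e) in facts for h in domains.get(s, {s}))}
                if keep != domains[o]:
                    domains[o], changed = keep, True
    return domains

print(reduce_domains(domains, edges, facts))
# {'?x': {'a'}, '?y': {'b'}}: only the assignment a -r-> b -s-> d survives
```

Pruning of this kind shrinks the space that a subsequent exact or approximate search must explore, which is precisely what large-scale, cyclic query benchmarks stress.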

A plausible implication is that compositional and logic-rich datasets function not only as testbeds but also as design drivers for more modular, generalizable, and efficient model architectures.

5. Open Challenges, Dataset Biases, and Future Directions

Current research, as exemplified in (Yin et al., 2023), identifies systematic biases and open questions in the construction and use of CQA benchmarks:

  • Set operation bias: Many earlier datasets are dominated by set operation-based queries, masking difficulties presented by queries with cycles, multiple free variables, or interleaved negation/disjunction, often leading to overestimated model generalizability.
  • Balancing expressivity and tractability: Efforts to expand the query combinatorics (e.g., EFO_k-CQA, WD50K-NFOL) introduce computational and annotation complexity; mapping the right subset of query structures to operationally feasible but expressive datasets remains challenging.
  • Real-world complexity and domain diversity: Datasets such as Spider4SPARQL (Kosten et al., 2023) raise the bar for linguistic and logical complexity by covering multiple domains, ontologies, and query features (aggregation, subqueries), revealing that even state-of-the-art models lag far behind human performance in realistic, domain-heterogeneous settings.
  • Calibration and logical reasoning: Methods like CKGC (Xiao et al., 30 Sep 2024) demonstrate that score calibration is critical for fuzzy logic-based reasoning, but the calibration step depends on dataset coverage and the range of true/false facts in the benchmark.
  • Hyper-relationality and temporality: Continued extension of benchmarks to include n-ary facts, qualifiers, and event-specific logic (e.g., in JF17k-HCQA (Tsang et al., 23 Apr 2025), CEQA (Bai et al., 2023)) reflects the demand for query answering beyond traditional entity-relation modeling, including rigorous evaluation of temporal and occurrence constraints.

Future benchmarks are likely to address hybrid symbolic-neural reasoning, multi-modal KGQA, and interpretable logical inference across multi-domain, hyper-relational, and temporally enriched graphs.

6. Implications for Practical KGQA Systems

Comprehensive CQA datasets directly inform the development and deployment of practical KG and QA systems:

  • User-friendly interfaces: Approaches such as GQBE (Jayaram et al., 2013), which query by example tuples rather than structure, exemplify paradigms that enable usability for non-experts, complementing traditional query languages or keyword-based interfaces.
  • Systemic evaluation and error analysis: Datasets organized by template pattern, logical form, and query hardness (as in Spider4SPARQL (Kosten et al., 2023)) permit fine-grained diagnosis of strengths and weaknesses in KGQA systems, guiding targeted improvements in entity linking, graph completion, or answer ranking modules.
  • Benchmark-driven progress in ranking and transfer learning: The structure and size of datasets like LC-QuAD and QALD-7/8/9 support research into neural ranking of query graphs (Maheshwari et al., 2018), transfer learning across datasets, and template-based classification (Vollmers et al., 2021).
  • Dynamic and scalable knowledge graphs: Ongoing work on dynamic and event-centric KGs, as detailed in (Mohamed et al., 17 Dec 2024), envisions systems capable of real-time updates and reasoning, leveraging the rigor and design principles established in current CQA dataset practices.

In summary, the methodology, spectrum, and analytical depth of knowledge graph and complex query answering datasets are foundational in advancing both the theory and practice of KG-driven intelligent systems. They provide not only robust evaluation platforms, but also conceptual scaffolds for next-generation KG reasoning and question answering frameworks.
