CronQuestions: Temporal KGQA Benchmark
- CronQuestions is a benchmark dataset that tests temporal question answering over evolving knowledge graphs with 410,000 natural language questions derived from a time-scoped Wikidata subgraph.
- The dataset employs rigorous template-based synthesis and stratifies questions by reasoning complexity, including simple, before/after, first/last, and time join queries.
- Evaluation under Hits@1 and MRR shows strong results on simple queries but exposes persistent weaknesses in complex temporal reasoning, driving advances in KGQA methods.
CronQuestions is the largest benchmark to date for temporal question answering (QA) over knowledge graphs (KGs), providing a comprehensive testbed for temporal reasoning over facts situated within time-evolving, structured datasets. As introduced by Saxena et al. (2021), the dataset consists of 410,000 natural-language questions automatically generated from a filtered, temporally scoped subgraph of Wikidata. CronQuestions is recognized for its stratification by reasoning complexity, rigorous template-based synthesis, and explicit design to evaluate both entity prediction and temporal slot-filling in a non-forecasting, fully observable setting (Saxena et al., 2021; Ding et al., 2022; Liu et al., 2023). The dataset serves as the foundation for benchmarking state-of-the-art temporal KGQA methods and has precipitated critical discussion regarding evaluation protocols, generalization, and the distinction between interpolation and extrapolation in temporal knowledge graph QA.
1. Dataset Construction and Coverage
CronQuestions is constructed atop a time-evolving subgraph of Wikidata. The base knowledge graph contains only those facts for which subject, relation, object, and timestamps are all present and validated (Saxena et al., 2021).
- Underlying KG: The dataset employs a filtered, temporal subset of Wikidata, enriched by event extraction and timestamp normalization.
- Temporal Window: The main window spans 2002–2016 (inclusive), covering all Wikidata facts with complete temporal information within this interval (Ding et al., 2022).
- Preprocessing: Date-times are truncated to integer years; relations with outsized frequencies (notably “member of sports team”) are downsampled for balance; entity alias normalization is applied; additional “event” facts are constructed where start/end delimiters are explicit (Saxena et al., 2021).
- Final statistics (approximate):
- Distinct entities: ≈ 50,000 (Ding et al., 2022); the original paper reports ≈ 125,000 (Saxena et al., 2021)
- Relations: ≈ 200
- Timestamps: ≈ 300 (predominantly years between 2002 and 2016)
- Total quadruple facts: ≈ 1.6 million (Ding et al., 2022)
- Total time-scoped triples: 328,000 (Saxena et al., 2021); the discrepancy with the quadruple count reflects differing post-filtering definitions
The extracted QA pairs (total: 410,000) cover two principal slot-filling query types:
- Entity-prediction (≈ 85%): “Which ⟨relation⟩ did ⟨X⟩ ⟨in year⟩?”
- Time-prediction (≈ 15%): “When did ⟨X⟩ ⟨relation⟩ ⟨Y⟩?” (Approximate counts: 350,000 entity vs. 60,000 time questions (Ding et al., 2022).)
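To make the construction concrete, the following is a minimal sketch (illustrative, not the authors' pipeline) of representing time-scoped facts as 5-tuples, truncating date-times to integer years, and filtering to fully specified facts inside the 2002–2016 window; the `TemporalFact` class and helper names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TemporalFact:
    """One time-scoped Wikidata fact as a (s, r, o, start, end) 5-tuple."""
    subject: str   # e.g. "Q76" (a Wikidata QID)
    relation: str  # e.g. "P39" ("position held")
    obj: str
    start_year: int
    end_year: int

def truncate_to_year(datetime_str: str) -> Optional[int]:
    """Truncate an ISO-style date-time (e.g. '2008-01-20T00:00:00Z') to its year."""
    try:
        return int(datetime_str[:4])
    except (ValueError, TypeError):
        return None  # malformed or missing timestamp

def keep_fact(s, r, o, start, end, window=(2002, 2016)) -> bool:
    """Keep only fully specified facts whose validity overlaps the window."""
    if not all((s, r, o)) or start is None or end is None:
        return False  # subject, relation, object, and both timestamps must be present
    return start <= window[1] and end >= window[0]
```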
2. Question Synthesis and Structural Stratification
CronQuestions employs a template-driven question synthesis procedure, differentiating itself by both the magnitude of its template pool and the granularity of its stratification by reasoning type.
- Templates: Initially, 30 slot-fill templates per predicate-reasoning type combination, expanded via human paraphrasing (246 unique patterns) and machine paraphrasing (up to 654 distinct templates) (Saxena et al., 2021).
- Template filling: Entities and time values from KG facts are inserted directly into template slots (a hypothetical slot-filling sketch appears at the end of this section).
- Reasoning types: The dataset is stratified by both answer type (entity vs. time) and structural complexity:
- Simple Entity (“Who received X in 2001?”): 1-hop facts, entity answer.
- Simple Time (“When did X play for Y?”): 1-hop facts, time answer.
- Before/After: Temporal ordering between two facts (e.g., “Who was president before X?”).
- First/Last: Min/max temporal queries (e.g., “Who was the first/last Y to ...?”).
- Time Join: Intersection of time intervals across distinct facts (e.g., “Who played with X during Y's tenure?”).
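The time-join motif reduces to interval intersection over fact validity spans. Below is a minimal sketch, assuming a (subject, relation, object, start_year, end_year) fact layout, of retrieving entities whose intervals overlap an anchor entity's interval; the function names and signatures are illustrative, not the benchmark's generator.

```python
def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Two closed year intervals overlap iff neither ends before the other starts."""
    return a_start <= b_end and b_start <= a_end

def time_join(facts, anchor_subject: str, relation: str):
    """Return subjects whose fact intervals overlap the anchor subject's interval.

    `facts` is an iterable of (subject, relation, object, start_year, end_year)
    tuples; this layout is an assumption for illustration.
    """
    anchors = [(s0, e0) for (s, r, o, s0, e0) in facts
               if s == anchor_subject and r == relation]
    answers = set()
    for a_start, a_end in anchors:
        for (s, r, o, s0, e0) in facts:
            if r == relation and s != anchor_subject and overlaps(s0, e0, a_start, a_end):
                answers.add(s)
    return answers
```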
Language: Natural language is generated using canonical English aliases and paraphrased to increase linguistic realism, though Saxena et al. (2021) note that the questions remain predominantly synthetic in character.
The dataset is partitioned into explicit train (350,000), dev (30,000), and test (30,000) sets. Notably, entity overlap is excluded between train and test splits to assess true generalization.
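As a concrete illustration of the slot-filling step referenced above, a fact's entities and time values can be inserted directly into a template; the template strings and helper below are hypothetical, not the released template pool.

```python
# Hypothetical slot-fill templates keyed by reasoning type; the released
# dataset uses a much larger, human- and machine-paraphrased template pool.
TEMPLATES = {
    "simple_entity": "Which {relation} did {subject} hold in {year}?",
    "simple_time": "When did {subject} {relation} {object}?",
}

def fill_template(kind: str, subject: str, relation: str, obj: str, year: int) -> str:
    """Instantiate one question from a KG fact; the answer comes from the same fact."""
    return TEMPLATES[kind].format(subject=subject, relation=relation,
                                  object=obj, year=year)

# e.g. fill_template("simple_entity", "Barack Obama", "position", "President", 2009)
```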
3. Evaluation Protocols and Metrics
CronQuestions tasks models with answering temporal KGQA queries over the fully observed time window. All facts within [2002,2016] are available—no artificial temporal cutoffs restrict the information accessible at inference (Ding et al., 2022).
- Standard metrics: Hits@k and mean reciprocal rank (MRR), computed as

$$\text{Hits@}k = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[\operatorname{rank}_q \le k\right], \qquad \text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\operatorname{rank}_q},$$

where $\operatorname{rank}_q$ is the rank assigned to the correct answer (for entity or time slot-filling) among all candidates (a short computation sketch follows this list).
- Primary evaluation splits: Models are evaluated on both simple and complex buckets, as well as by answer type and reasoning motif (Saxena et al., 2021; Liu et al., 2023).
- No forecasting: Since models may use future facts (with respect to the query time), tasks only require interpolation, not extrapolation (Ding et al., 2022).
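A minimal sketch of both metrics, assuming each question yields a single 1-indexed rank for its gold answer:

```python
def hits_at_k(ranks: list[int], k: int = 1) -> float:
    """Fraction of questions whose correct answer is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks: list[int]) -> float:
    """Mean reciprocal rank of the correct answers (ranks are 1-indexed)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g. hits_at_k([1, 3, 2, 1], k=1) == 0.5 and mrr([1, 3, 2, 1]) ≈ 0.71
```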
4. Baseline Methods and Benchmarking Results
Multiple baseline architectures and benchmarking results, both contemporary with and subsequent to Saxena et al., have been reported on CronQuestions (Saxena et al., 2021; Liu et al., 2023).
- Baselines: Pure pretrained LLMs (BERT, RoBERTa, KnowBERT, T5-3B), static KGQA models (EmbedKGQA), temporal KG embedding models (T-EaE-*), and hybrid solutions (CronKGQA, which combines BERT with TComplEx; a scoring sketch follows the results below).
- Key results (test set, Hits@1):
| Model | Simple | Complex | Before/After | First/Last | Time Join | Overall |
|---|---|---|---|---|---|---|
| BERT (PLM) | 0.075 | — | — | — | — | — |
| EmbedKGQA | 0.290 | 0.199 | 0.199 | 0.324 | 0.223 | — |
| T-EaE-add | 0.313 | 0.256 | 0.256 | 0.285 | 0.175 | — |
| CronKGQA | 0.987 | 0.392 | 0.288 | 0.371 | 0.511 | 0.647 |
| TMA | 0.987 | 0.632 | 0.581 | 0.627 | 0.675 | 0.784 |
- Interpretation: While simple 1-hop temporal QA is essentially solved by strong hybrid approaches (CronKGQA and the subsequent TMA both achieve ≈0.99 Hits@1 on the simple bucket), complex multi-hop temporal reasoning (e.g., before/after) remains challenging: CronKGQA reaches ≈0.39 Hits@1 on complex queries, and TMA pushes this to 0.63, a gain of roughly 24 points on the complex bucket. Entity answers are generally easier than time answers (Saxena et al., 2021; Liu et al., 2023).
- MRR: CronKGQA achieves MRR ≈ 0.53 on entity-prediction questions; further improvements are reported in subsequent work (Ding et al., 2022).
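For orientation on the hybrid scoring mentioned above: CronKGQA substitutes a BERT-derived question embedding for TComplEx's relation embedding and scores candidate answers with TComplEx's time-aware trilinear product. The sketch below shows that scoring function only (following the TComplEx formulation of Lacroix et al.); CronKGQA's full integration is more involved, and the embedding shapes here are assumptions.

```python
import numpy as np

def tcomplex_score(e_s: np.ndarray, q: np.ndarray,
                   e_o: np.ndarray, e_t: np.ndarray) -> float:
    """Time-aware ComplEx score: Re(<e_s, q * e_t, conj(e_o)>).

    All arguments are complex-valued embedding vectors of equal dimension;
    `q` plays the role of the relation embedding (in CronKGQA, a question
    embedding derived from BERT is substituted here).
    """
    return float(np.real(np.sum(e_s * (q * e_t) * np.conj(e_o))))

# Rank every candidate entity as the answer slot for a given question:
# scores = [tcomplex_score(e_s, q, cand, e_t) for cand in entity_embeddings]
```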
5. Limitations and Identified Gaps
Several substantive limitations of CronQuestions are explicitly identified (Ding et al., 2022):
- Non-forecasting setting: Models have access to all facts in the observed window and may exploit knowledge of the future relative to the query time; this does not reflect real-world deployment settings, which require prediction or restricted access.
- Narrow question types: Only supports entity-prediction and time-prediction (slot filling); does not include yes/no, multiple-choice, or complex fact-reasoning queries.
- Synthetic language: Despite template paraphrasing, questions are ultimately derived from fixed patterns rather than authentically crowd-sourced or organically authored.
- Temporal reasoning bottlenecks: Performance on complex temporal queries, especially those involving interval arithmetic or multi-fact comparison, remains low and highlights the brittleness of current models.
6. Impact, Successors, and Research Directions
CronQuestions is widely adopted as the primary benchmark for temporal KGQA, driving the development of hybrid, temporal-aware QA systems and serving as the de facto testbed for evaluating cross-disciplinary methodologies at the intersection of LLMs and temporal KG embedding (Liu et al., 2023). Its limitations catalyzed the design of subsequent datasets, notably ForecastTKGQuestions, which imposes temporal cutoff restrictions (i.e., only permitting access to facts that precede the question timestamp), introduces true forecasting queries, and expands question types to yes/unknown and fact-reasoning settings (Ding et al., 2022).
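Mechanically, such a cutoff amounts to filtering the fact set by the question timestamp before inference; a minimal sketch under that assumption (the exact cutoff semantics in ForecastTKGQuestions may differ):

```python
def facts_visible_at(facts, query_year: int):
    """Forecasting-style cutoff: keep only facts that began before the query year.

    `facts` is an iterable of (subject, relation, object, start_year, end_year)
    tuples; this layout is an assumption for illustration.
    """
    return [f for f in facts if f[3] < query_year]
```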
A plausible implication is that, while CronQuestions ensures coverage and complexity within an interpolation paradigm, the need for authentic temporal extrapolation and broader QA formats remains a central open challenge. This motivates ongoing work in dataset construction, model design (e.g., interval logic, multi-interval query handling), and evaluation protocols targeting real-world applicability.
7. Data Access and Format
The full dataset, including the underlying temporal KG and question splits, is publicly available for research use at https://github.com/apoorvumang/CronKGQA (Saxena et al., 2021). Data is provided in standardized formats:
- KG: TSV/CSV files of 5-tuples (subject_QID, predicate, object_QID, start_year, end_year)
- Questions: JSON lines with fields “question” (string), “entities” (list of QIDs), “times” (list of years), and “answers” (list of QIDs or years)

Licensing follows open-source conventions; exact terms should be checked in the repository. Users are advised to cite Saxena et al. (2021) when employing the resource in published research.
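Given the formats described above, a minimal loading sketch (file names and the exact field layout are assumptions to be verified against the repository):

```python
import csv
import json

def load_kg(path: str):
    """Read the temporal KG as (subject_QID, predicate, object_QID, start_year, end_year) rows."""
    with open(path, newline="", encoding="utf-8") as f:
        for s, p, o, start, end in csv.reader(f, delimiter="\t"):
            yield s, p, o, int(start), int(end)

def load_questions(path: str):
    """Read JSON-lines questions with 'question', 'entities', 'times', 'answers' fields."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```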