TGQA Dataset for Temporal Reasoning
- TGQA is a fully synthetic dataset that converts unstructured stories into chronology-ordered temporal graphs for explicit temporal reasoning.
- Its pipeline uses GPT-3.5 for graph extraction, anonymization, and story generation, ensuring accurate alignment of events and QA pairs.
- Fine-tuning on TGQA yields substantial metric gains (e.g., +0.17 absolute EM on Llama-2-13B) and demonstrates robust transfer to other temporal QA benchmarks.
TGQA is a fully synthetic dataset of text–graph–question triples developed to teach LLMs two core temporal reasoning (TR) skills: (1) converting unstructured stories into explicit, chronology-ordered temporal graphs (TGs), and (2) answering temporally structured queries by explicit, step-wise reasoning over those graphs. The TGQA dataset is a key component of the TG-LLM framework for language-based temporal reasoning, designed for rapid, low-supervision adaptation and robust transfer to external temporal QA benchmarks (Xiong et al., 12 Jan 2024).
1. Formal Structure of TGQA and Temporal Graphs
The foundational data structure in TGQA is the temporal graph (TG), formally defined as a finite set of event tuples {(e₁, r, e₂; t_s, t_e)}, where each element consists of:
- e₁, e₂: entity strings (drawn from the entity set)
- r: relation type (drawn from the relation set, e.g., "born in," "owned," "married to")
- t_s, t_e: start and end timestamps (integer years, t_s ≤ t_e)
Example in plaintext list format:
[1] (John Thompson, was born in, Weston; 1921, –)
[2] (John Thompson, owned, Pearl Network; 1942, 1967)
[3] (Sophia Parker, married to, John Thompson; 1947, 1953)
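This tuple structure can be sketched as a small data type; the field names below are illustrative, not part of any official dataset schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class TGEvent:
    """One temporal-graph fact: (e1, r, e2; t_start, t_end)."""
    subject: str
    relation: str
    obj: str
    t_start: int
    t_end: Optional[int] = None  # None for open-ended facts (shown as "-")

def chronological(events: List[TGEvent]) -> List[TGEvent]:
    """Order events by start year, as in the plaintext list above."""
    return sorted(events, key=lambda e: e.t_start)

tg = [
    TGEvent("Sophia Parker", "married to", "John Thompson", 1947, 1953),
    TGEvent("John Thompson", "was born in", "Weston", 1921),
    TGEvent("John Thompson", "owned", "Pearl Network", 1942, 1967),
]
ordered = chronological(tg)
```

Sorting by start year reproduces the chronology-ordered listing shown above.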
2. Synthetic Construction Pipeline
TGQA employs a highly controlled synthetic construction process:
- Graph extraction and anonymization: Subgraphs (up to 25 events) are extracted around a seed entity from the YAGO11k temporal knowledge graph. Real-world names are globally mapped to anonymized labels using a GPT-3.5–generated mapping, making memorization of external world knowledge impossible and focusing the model on reasoning.
- Story generation: Each subgraph is rendered as a coherent story by prompting GPT-3.5 to write short paragraphs that enumerate every event, with explicit start/end dates.
- QA pair emission: For each anonymized graph, Python scripts generate approximately 25–35 question–answer pairs spanning eight temporal reasoning types (see §3 below).
- Semi-automatic quality control: Each story-question pair is rerun through GPT-3.5. Only questions answered incorrectly by the model are flagged for subsequent manual verification, ensuring correctness while minimizing human labor.
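The anonymization step amounts to applying one global name mapping consistently across every fact; in the actual pipeline the mapping comes from GPT-3.5, while the hand-written dictionary below is a stand-in:

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def anonymize(facts: List[Triple], name_map: Dict[str, str]) -> List[Triple]:
    """Apply one global entity mapping to (subject, relation, object) triples,
    so the same real-world name is replaced identically everywhere."""
    def swap(name: str) -> str:
        return name_map.get(name, name)
    return [(swap(s), r, swap(o)) for (s, r, o) in facts]

# Illustrative mapping only; real mappings are GPT-3.5-generated.
name_map = {"Albert Einstein": "John Thompson", "Ulm": "Weston"}
facts = [("Albert Einstein", "was born in", "Ulm")]
print(anonymize(facts, name_map))  # [('John Thompson', 'was born in', 'Weston')]
```

Because the mapping is global, co-referring mentions stay consistent across the story, the TG, and every QA pair derived from them.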
The synthetic construction yields partitioned splits of 400 training, 100 development, and 100 test samples.
3. Reasoning Types and QA Distributions
TGQA systematically covers eight classes of temporally-oriented queries:
- Sequencing: E.g., "Which event was first?"
- Duration: E.g., "How long did event E last?"
- Temporal-gap: "How much time elapsed between E₁’s start and E₂’s start?"
- Before/After: "What happened right before E started?"
- Factual extraction: "When did E occur?"
- Simultaneity: "True/false: E₁ and E₂ in same year?"
- Overlap: "True/false: E₁ was still happening when E₂ began?"
- Comparative: "True/false: E₁ lasted longer than E₂?"
Query type coverage is balanced at the dataset level: sequencing questions comprise approximately 15–20% of queries, comparative 10%, with similar proportions for other types.
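Since every QA pair is emitted by a deterministic template over the TG, each reasoning type reduces to a small generator. The sketch below covers only the duration type, with hypothetical question wording:

```python
from typing import List, Optional, Tuple

Event = Tuple[str, str, str, int, Optional[int]]

def duration_questions(events: List[Event]) -> List[Tuple[str, str]]:
    """Emit (question, answer) pairs for the 'Duration' reasoning type.
    Open-ended events (t_end is None) are skipped."""
    qas = []
    for subj, rel, obj, t_s, t_e in events:
        if t_e is None:
            continue
        q = f"How long did the event ({subj} {rel} {obj}) last?"
        qas.append((q, f"{t_e - t_s} years"))
    return qas

events = [("John Thompson", "owned", "Pearl Network", 1942, 1967),
          ("John Thompson", "was born in", "Weston", 1921, None)]
qas = duration_questions(events)  # one QA pair; the birth event is skipped
```

The other seven types (sequencing, temporal-gap, overlap, etc.) follow the same pattern of reading timestamps off the tuples, which is why no manual authoring is needed.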
4. Dataset Statistics and Structure
Key dataset specifications include:
- Samples: 400 train, 100 dev, 100 test
- QA pairs: ≈30 per story → ≈12,000 train, ≈3,000 test
- Event count: 5–25 events per story (peak at ~12)
- Entities: 10–15 per subgraph
Each question, together with the corresponding story and TG, is supplied with a machine-generated answer. All QA pairs are generated by deterministic templates and require no manual authoring.
| Split | # Stories | Avg. # Events | # QAs |
|---|---|---|---|
| Train | 400 | ~12 | ~12,000 |
| Dev | 100 | ~12 | – |
| Test | 100 | ~12 | ~3,000 |
5. Annotation, Verification, and Minimal Supervision
Annotation relies on the direct mapping of YAGO11k events to story text and TG representations. Story-to-TG alignments are guaranteed by construction, so human labelers never need to invent or verify temporal facts. Quality control consists of a light GPT-3.5–assisted QA check, supplemented by manual spot verification of flagged items. No human chain-of-thought annotations are required; CoTs are bootstrapped during downstream training.
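The semi-automatic check reduces to a filter: rerun each QA pair through the model and surface only disagreements for human review. In this sketch, `ask_model` is a placeholder for a GPT-3.5 call:

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str, str]  # (story, question, gold_answer)

def flag_for_review(qa_pairs: List[QA],
                    ask_model: Callable[[str, str], str]) -> List[QA]:
    """Return the subset of QA items the model answers incorrectly;
    only these are sent to a human verifier."""
    flagged = []
    for story, question, gold in qa_pairs:
        predicted = ask_model(story, question)
        if predicted.strip().lower() != gold.strip().lower():
            flagged.append((story, question, gold))
    return flagged

# Stub model that always answers "1942" (stands in for GPT-3.5).
stub = lambda story, q: "1942"
pairs = [("s", "When did E start?", "1942"), ("s", "When did E end?", "1967")]
flagged = flag_for_review(pairs, stub)  # only the second pair is flagged
```

This concentrates human effort on the small fraction of items where model and template disagree.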
6. Downstream Use: Model Training and Fine-Tuning
TGQA's design supports a two-stage model architecture for LLM fine-tuning, in which two LoRA adapters are trained on a shared base model:
- Adapter #1 (Text→TG): Converts narrative story to an ordered TG.
- Adapter #2 (TG→Answer): Consumes the full TG with the question and emits either a chain-of-thought (CoT) explanation or the final answer.
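The two adapters compose into a simple inference pipeline. Here the adapter calls are stubbed out; the real system swaps LoRA adapters on one base LLM:

```python
from typing import Callable, List, Tuple

Event = Tuple[str, str, str, int, int]

def answer_question(story: str, question: str,
                    text_to_tg: Callable[[str], List[Event]],
                    tg_to_answer: Callable[[List[Event], str], str]) -> str:
    """Two-stage TG-LLM inference: build the temporal graph from the story
    (Adapter #1), then reason over it with the question (Adapter #2)."""
    tg = text_to_tg(story)
    return tg_to_answer(tg, question)

# Stub adapters for illustration only.
to_tg = lambda story: [("John Thompson", "owned", "Pearl Network", 1942, 1967)]
to_ans = lambda tg, q: str(tg[0][4] - tg[0][3]) if "How long" in q else "?"
print(answer_question("...", "How long did the ownership last?", to_tg, to_ans))  # 25
```

Separating the stages lets the TG act as an explicit, inspectable intermediate representation between narrative and answer.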
Two central training mechanisms are applied:
- CoT Bootstrapping: Multiple CoTs are sampled from GPT-3.5 per question; those yielding incorrect answers are discarded, and the remainder are sampled with probability proportional to an exponential function of their LM likelihood combined with a “contrastive growth” term that scores answer plausibility.
- Graph-based Augmentation: Random deletion of irrelevant edges, synonym replacement on relation labels, global remapping of entity names, and uniform timestamp shifts counteract overfitting and rote memorization.
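Two of these augmentations, global entity remapping and a uniform timestamp shift, can be sketched directly on the tuple representation (edge deletion and relation-synonym replacement follow the same pattern):

```python
import random
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, str, str, int, Optional[int]]

def augment(events: List[Event], name_map: Dict[str, str],
            max_shift: int = 5, seed: int = 0) -> List[Event]:
    """Remap entity names globally and shift every timestamp by one
    uniformly sampled offset, preserving all durations and orderings."""
    rng = random.Random(seed)
    shift = rng.randint(-max_shift, max_shift)
    out = []
    for subj, rel, obj, t_s, t_e in events:
        out.append((name_map.get(subj, subj), rel, name_map.get(obj, obj),
                    t_s + shift, None if t_e is None else t_e + shift))
    return out

events = [("John Thompson", "owned", "Pearl Network", 1942, 1967)]
aug = augment(events, {"John Thompson": "Liam Clarke"})
# Durations and relative order are invariant under the shared shift.
```

Because every timestamp moves by the same offset, all relational answers (durations, gaps, orderings) stay valid, which is what makes the augmentation label-preserving.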
7. Empirical Performance and Impact on Temporal Reasoning
Fine-tuning on TGQA leads to substantial improvements in temporal reasoning metrics. On the TGQA test set (100 stories, ~3,000 QAs), a few-shot GPT-4 (ICL+CoT) baseline achieves EM=0.82, F1=0.87, Acc=0.85, while few-shot Llama-2-13B attains only EM=0.63 / F1=0.76 / Acc=0.67. After TGQA fine-tuning, Llama-2-13B reaches EM=0.80 / F1=0.85 / Acc=0.82, a 17-point absolute EM improvement over its few-shot baseline. Transfer to external datasets (TimeQA and TempReason) yields, with SFT, +0.10–0.15 EM and +0.10 F1 over vanilla fine-tuning or few-shot prompting.
TGQA thus functions as a compact, pedagogical synthetic curriculum for explicit timeline construction and reasoning, enabling models to transfer temporal reasoning proficiency to real-world text tasks (Xiong et al., 12 Jan 2024).