PAT-Questions: Temporal QA Benchmark
- PAT-Questions is a benchmark for temporal QA built around present-anchored queries, whose answers depend on dynamically evolving facts.
- It uses automated SPARQL queries over Wikidata snapshots to construct single-hop and multi-hop questions, and re-executes those queries to keep gold labels current.
- Evaluation shows current models struggle with knowledge staleness and multi-hop reasoning, highlighting the need for dedicated temporal modules and hybrid approaches.
Present-Anchored Temporal Question-Answering and the PAT-Questions Benchmark
Present-Anchored Temporal Question-Answering (PATQA) comprises a class of temporal QA problems where the query context or its answer depends upon a fact or event that is anchored to the present (now). Unlike traditional temporal QA, where the timestamp of reference is made explicit in the question text, PATQA requires the system to resolve temporal expressions like "currently" or "now," and relational quantifiers such as "previous" or "next," relative to the current time. Consequently, the field faces unique challenges in maintaining answer correctness over time, modeling temporally dependent relations, enabling multi-hop temporal reasoning, and sustaining the currency of QA benchmarks. The PAT-Questions benchmark provides the first large-scale, self-updating testbed for systematically evaluating LLMs and temporal reasoning models on PATQA problems (Meem et al., 2024).
1. Definition, Challenges, and Motivation
PATQA considers queries whose correct interpretation and gold answer depend on the present timestamp t_now. Formally, a PATQA instance is defined by:
- a subject entity s (denoted by a Wikidata Q-ID);
- a primary relation r (a Wikidata P-ID);
- a temporal constraint τ expressed relative to t_now (e.g., "current," "before the current," "previous," etc.);
- an objective: return the object o such that (s, r, o) holds in the knowledge graph for the time value(s) determined by τ and t_now.
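As a purely illustrative rendering of this schema, the sketch below encodes a PATQA instance as a small Python data structure; the class and field names, and the two constraint values shown, are assumptions for exposition rather than the benchmark's released format.

```python
from dataclasses import dataclass
from enum import Enum

class TemporalConstraint(Enum):
    """Constraint types resolved against the present time t_now (illustrative set)."""
    CURRENT = "current"                 # the fact active at t_now
    BEFORE_CURRENT = "before_current"   # the fact immediately preceding the current one

@dataclass
class PATQAInstance:
    subject_qid: str                  # Wikidata Q-ID of the subject entity s
    relation_pid: str                 # Wikidata P-ID of the primary relation r
    constraint: TemporalConstraint    # temporal constraint tau, relative to t_now
    question: str                     # natural-language rendering of the template
    sparql: str                       # parameterized query that retrieves the object o

# Hypothetical example: "Which team does Lionel Messi play for currently?"
example = PATQAInstance(
    subject_qid="Q615",
    relation_pid="P54",
    constraint=TemporalConstraint.CURRENT,
    question="Which team does Lionel Messi play for currently?",
    sparql="",  # see the SPARQL sketch in Section 2
)
```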
Four technical challenges are intrinsic to PATQA but largely absent from canonical TQA:
- Knowledge Staleness: Pre-trained LLMs and TQA models are anchored to a fixed data cutoff, quickly lagging for dynamic entities and contemporary facts.
- Complex Temporal Relation Reasoning: Expressions such as "before the current team," "previous president," or "next head coach" require explicit modeling of time intervals, successor/predecessor computations, and start/end qualifiers.
- Multi-hop Temporal Reasoning: Some PATQA queries require a chain of intermediate relations to be evaluated in the context of t_now, compounding temporal reasoning complexity.
- Benchmark Maintenance: Temporal facts in the knowledge graph change as world knowledge evolves. Maintaining the validity of benchmark gold labels necessitates continuous, automated updating to keep the benchmark from rapidly going stale.
2. Construction and Mechanics of the PAT-Questions Benchmark
PAT-Questions comprises 6,172 questions partitioned into single-hop and multi-hop present-anchored queries, each associated with a natural language template and a SPARQL query for answer retrieval from Wikidata (Meem et al., 2024).
- Template Generation: Queries are constructed from entity-relation pairs found in two Wikidata snapshots (Dec 31, 2021 and Dec 31, 2023).
- Single-hop: Templates like "Which team does {s} play for currently?" (relation P54) or "Which team did {s} play for before the current team?" are instantiated with appropriate entities and relations (see the SPARQL sketch after this list).
- Multi-hop: Extends single-hop by nesting another relation within the answer to obtain queries such as "Who is the head coach of the team that {s} plays for currently?"
- Automated SPARQL-driven Updating: Each question is paired with a parameterized SPARQL query. At each maintenance interval, the benchmark is refreshed by re-executing these queries against the latest KG dump to regenerate gold answers, aligning QA labels dynamically with world knowledge changes.
- Statistics:
- 2,882 single-hop ("current" or "before-current") and 3,290 multi-hop.
- 12 distinct Wikidata relations; second-hop relations commonly include P286 (head coach), P115 (home venue).
- Two answer sets per question (2021 and 2023), enabling analysis of LLM knowledge staleness.
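To make the template-query pairing concrete, the sketch below shows, as Python string constants, the style of Wikidata SPARQL that can realize the templates above; the subject Q615 (Lionel Messi) is an illustrative choice, and the benchmark's released queries may differ in detail.

```python
# "Which team does {s} play for currently?" -- a P54 (member of sports team)
# statement with no end-date qualifier (P582) is treated as active at t_now.
CURRENT_TEAM = """
SELECT ?team ?teamLabel WHERE {
  wd:Q615 p:P54 ?stmt .
  ?stmt ps:P54 ?team .
  FILTER NOT EXISTS { ?stmt pq:P582 ?end . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

# "Which team did {s} play for before the current team?" -- among ended
# memberships, take the latest end date via ORDER BY / LIMIT.
PREVIOUS_TEAM = """
SELECT ?team ?teamLabel WHERE {
  wd:Q615 p:P54 ?stmt .
  ?stmt ps:P54 ?team ;
        pq:P582 ?end .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?end)
LIMIT 1
"""

# Multi-hop: "Who is the head coach of the team that {s} plays for currently?"
# -- resolve the current team first, then follow P286 (head coach) from it.
CURRENT_TEAM_HEAD_COACH = """
SELECT ?coach ?coachLabel WHERE {
  wd:Q615 p:P54 ?stmt .
  ?stmt ps:P54 ?team .
  FILTER NOT EXISTS { ?stmt pq:P582 ?end . }
  ?team wdt:P286 ?coach .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
```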
3. Modeling Requirements and Evaluation Setup
PATQA demands explicit modeling of several factors absent or less critical in standard TQA:
- Temporal Expression Resolution: Parsing temporal constraints τ into SPARQL filters (e.g., filtering on P580 (start) and P582 (end)) and generating appropriate ORDER BY / LIMIT logic to select the fact active at t_now.
- Multi-hop Execution: For queries involving more than one relation, models must sequence extractions (e.g., resolve the "current team" and then query its "head coach," both at t_now).
- Evaluation: Baseline LLMs (Falcon-7B-Instruct, Flan-T5-XL, Llama-2-7B, Mistral-7B, GPT-3.5-Chat) and the state-of-the-art temporal model TEMPREASON-T5-SFT are assessed under two prompting regimes:
- Direct: Standard prompting with or without context.
- RAG: Retrieval-augmented generation, extracting up-to-date Wikipedia passages for LLM context.
- Metrics: Exact Match (EM), token-level F1, and error breakdowns (outdated answers, refusals, hallucinations).
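For reference, the sketch below gives the standard SQuAD-style definitions of EM and token-level F1; the paper's exact normalization rules are not reproduced here, so the `_normalize` helper is an assumption.

```python
import re
import string
from collections import Counter

def _normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(_normalize(prediction) == _normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_toks = _normalize(prediction).split()
    gold_toks = _normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```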
4. Core Empirical Findings
Evaluation against the Dec 2023 Wikidata snapshot, which lies beyond the training cutoff of most LLMs (capped around 2022), exposes the limitations of current methods:
- Absolute Performance: On single-hop queries, EM ranges from 1.5% to 15.5% and F1 from 2.9% to 16.5% across models. Multi-hop performance is even lower (EM: 1.5-9.3%).
- Effect of Retrieval Augmentation: RAG improves single-hop EM (e.g., Flan-T5-XL: 2.0 → 14.9), though absolute scores remain low, and it is far less effective for multi-hop queries.
- Knowledge Staleness: Even matching snapshot evaluation to LLM cutoff (2021 facts) does not close the performance gap (e.g., Llama-2-7B: 10.0% EM in 2021 vs. 8.4% in 2023).
- Error Analysis: Up to 60% of wrong answers are due to models giving outdated (in-cutoff, but now-incorrect) facts. RAG reduces hallucinations but increases refusals in some cases (e.g., GPT-3.5).
- Temporal Reasoning Difficulty: No statistically significant performance difference emerges between "before-current" and "current" questions, or between multi-hop and single-hop beyond the absolute gap noted above, indicating brittleness even on first-order temporal constructs.
5. Benchmark Self-Updating System and Reproducibility
PAT-Questions implements a robust, fully automated refresh mechanism:
- Every question is paired with a parameterized SPARQL query.
- Periodic (quarterly) cron jobs ingest the latest Wikidata dump, re-execute the query for every question, and replace the gold-answer field; a minimal sketch of this step follows at the end of this section.
- All templates, SPARQL schemas, and code are released for reproducibility and extension, enabling research teams to augment, repurpose, or align with other time-evolving KGs.
This design ensures that evaluations on PAT-Questions are never stale, reflecting the true present-anchored state of the world's knowledge at test time.
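The sketch below illustrates what one refresh step could look like in Python, querying the public Wikidata endpoint via SPARQLWrapper for simplicity (the benchmark itself re-executes against fresh KG dumps); the record fields `sparql` and `gold_answers` and the projected variable `?answerLabel` are hypothetical conventions.

```python
import json
from SPARQLWrapper import SPARQLWrapper, JSON

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def refresh_gold_answers(benchmark_path: str) -> None:
    """Re-execute each question's stored SPARQL query and overwrite its gold labels."""
    client = SPARQLWrapper(WIKIDATA_ENDPOINT)
    client.setReturnFormat(JSON)
    with open(benchmark_path) as f:
        questions = json.load(f)
    for q in questions:
        client.setQuery(q["sparql"])  # parameterized query stored with the question
        bindings = client.query().convert()["results"]["bindings"]
        # Assumes every query projects an ?answerLabel variable for its gold object.
        q["gold_answers"] = sorted({b["answerLabel"]["value"] for b in bindings})
    with open(benchmark_path, "w") as f:
        json.dump(questions, f, indent=2)
```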
6. Implications, Model Directions, and Future Extensions
Findings from the PAT-Questions benchmark underscore the necessity for explicit temporal modules, present-time reasoning objectives, and possibly hybrid neuro-symbolic approaches in LLMs and downstream QA systems:
- Explicit Temporal Modules: Augment LLMs with components capable of interval arithmetic, timestamp comparison, and sequence reasoning over temporal relational graphs (see the interval sketch after this list).
- Present-Anchoring Finetuning: Supervised or unsupervised objectives that explicitly enforce correct present anchoring in chain-of-thought and fact extraction.
- Extended Multi-hop and Temporal Logic: Extend benchmark question depth and complexity (e.g., "second successor," "first predecessor," long-range chaining) to more closely approximate real-world temporal queries.
- Hybrid Neuro-Symbolic Systems: Develop LLM-based agents capable of calling external SPARQL executors or logical solvers, followed by post-hoc natural language justification.
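As an illustration of the interval logic such a temporal module would need, the sketch below resolves "current" and "previous" over an assumed `(object, start, end)` fact representation; this is an expository sketch, not a design from the paper.

```python
from datetime import date
from typing import Optional

# A fact is an (object, start, end) triple; None marks an open-ended bound.
Fact = tuple[str, Optional[date], Optional[date]]

def active_at(facts: list[Fact], t_now: date) -> list[str]:
    """Objects whose validity interval contains t_now ("current ...")."""
    return [obj for obj, start, end in facts
            if (start is None or start <= t_now)
            and (end is None or end >= t_now)]

def previous_of(facts: list[Fact], t_now: date) -> Optional[str]:
    """Object whose interval ended most recently before t_now ("previous ...")."""
    ended = [(end, obj) for obj, _, end in facts if end is not None and end < t_now]
    return max(ended)[1] if ended else None

# Toy career history: "current team" is Team C, "previous team" is Team B.
career = [("Team A", date(2015, 7, 1), date(2021, 6, 30)),
          ("Team B", date(2021, 7, 1), date(2023, 6, 30)),
          ("Team C", date(2023, 7, 1), None)]
assert active_at(career, date(2024, 1, 1)) == ["Team C"]
assert previous_of(career, date(2024, 1, 1)) == "Team B"
```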
Furthermore, the authors emphasize that dynamic, self-updating evaluation methodologies are essential for any benchmark focused on temporal reasoning, as static test sets rapidly lose relevance in the PATQA context. The PAT-Questions benchmark sets a template for such self-maintaining datasets and for the rigorous, longitudinal evaluation of models tasked with present-anchored temporal QA (Meem et al., 2024).