Time-Series Question Answering (TSQA)
- Time-Series Question Answering (TSQA) is a research field concerned with answering natural language queries over time-stamped data, integrating temporal reasoning, forecasting, and multimodal analysis.
- It employs methods such as temporal expression extraction, multi-hop inference, and data fusion from texts, graphs, and sensor signals to address dynamic information needs.
- Applications span finance, healthcare, and logistics, where TSQA systems enhance decision support by ensuring timely, accurate retrieval of temporal facts.
Time-Series Question Answering (TSQA) is a research area concerned with designing, evaluating, and deploying models and systems that answer natural language questions involving temporal information, time-evolving facts, or reasoning directly over time-stamped data sources. TSQA spans a spectrum of methodologies, ranging from text-centric temporal QA and knowledge-graph-based temporal QA to multimodal systems that integrate numerical time series with contextual language, and to explicit time-series signal analysis via domain-specific agents. The field addresses core problems in dynamic settings, including event forecasting, factual temporal knowledge retrieval, scenario-driven planning, and robust multi-hop temporal reasoning.
1. Task Formalization and Subproblem Taxonomy
TSQA tasks are defined by their reliance on time-evolving inputs and require models to restrict retrieval or reasoning to evidence aligned with temporal constraints. This includes:
- Forecasting-oriented QA: Given a corpus of news articles or historical records with timestamps, answer questions about events that occur after the latest available information, using only past data (e.g., ForecastQA (Jin et al., 2020)); a minimal evidence-filtering sketch follows this list.
- Temporal KGQA: Answering questions over temporal knowledge graphs (KGs), inferring time intervals, entity roles, or factual transitions via timestamp estimation and temporal order modeling (e.g., (Shang et al., 2022)).
- Streaming and Continually Evolving QA: Models must adapt knowledge as new sources become available over time, balancing adaptation with retention (e.g., StreamingQA (Liška et al., 2022), CLTSQA (Yang et al., 17 Jul 2024)).
- Temporal Reasoning over Text and Multimodal Inputs: Handling cross-modal QA involving numerical time series with associated natural language (e.g., Chat-TS (Quinlan et al., 13 Mar 2025), MTBench (Chen et al., 21 Mar 2025), ITFormer (Wang et al., 25 Jun 2025)).
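The temporal-cutoff constraint in the forecasting setting above can be made concrete with a small evidence filter. The sketch below is illustrative rather than any benchmark's actual harness; the `Document` dataclass and its fields are assumptions.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    doc_id: str
    published: date  # timestamp attached to the evidence
    text: str

def evidence_before_cutoff(corpus: list[Document], cutoff: date) -> list[Document]:
    """Keep only documents published strictly before the question's
    temporal cutoff, as in forecasting-style TSQA."""
    return [d for d in corpus if d.published < cutoff]

# Usage: a question dated 2019-06-01 may only see earlier evidence.
corpus = [
    Document("a", date(2019, 1, 5), "Parliament approves the budget."),
    Document("b", date(2019, 7, 2), "Election results announced."),
]
visible = evidence_before_cutoff(corpus, date(2019, 6, 1))
assert [d.doc_id for d in visible] == ["a"]
```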
Critical subproblems include:
- Temporal expression extraction and normalization;
- Reasoning over explicit and implicit temporal constraints;
- Multi-hop temporal inference;
- Temporal rationale faithfulness in answers;
- Handling diachronic and multimodal corpora.
2. Datasets and Benchmarks
A rich variety of datasets drive TSQA research, reflecting diversity in data modality, question complexity, and temporal reasoning depth:
| Dataset/Benchmark | Data Types | Key Focus | Scale / Coverage |
|---|---|---|---|
| ForecastQA (Jin et al., 2020) | News text (time-stamped) | Event forecasting | 10,392 Q-A pairs, 5 years |
| StreamingQA (Liška et al., 2022) | News text, timelines | Adaptation, drift | 14 years, quarterly splits |
| ComplexTempQA (Gruber et al., 7 Jun 2024) | Wikipedia/Wikidata | Multi-hop, large scale | 100M+ pairs, 36 years |
| EngineMT-QA (Wang et al., 25 Jun 2025) | Sensor TS + text | Multimodal QA | 110K Q-A pairs, real-world |
| MTBench (Chen et al., 21 Mar 2025) | Financial/weather TS + text | Cross-modal QA | Multi-domain, labeled tasks |
| TDBench (Kim et al., 4 Aug 2025) | Temporal DB | Factual QA, evaluation | 6K+ pairs, 13 operators |
| CLTSQA-Data (Yang et al., 17 Jul 2024) | Wikidata/text | Continual learning | 50K Qs, ∼5K contexts, staged |
| UnSeenTimeQA (Uddin et al., 3 Jul 2024) | Synthetic scenarios | Reasoning-only | Unlimited, no web leakage |
Significant advances in dataset construction include:
- Systematic use of temporal SQL, temporal functional dependencies, and temporal joins for scalable QA generation (e.g., TDBench (Kim et al., 4 Aug 2025)), as sketched after this list;
- Synthetic, contamination-free settings to stress pure temporal reasoning (e.g., UnSeenTimeQA (Uddin et al., 3 Jul 2024));
- Massive coverage both in terms of modalities and reasoning depth (e.g., ComplexTempQA (Gruber et al., 7 Jun 2024), MTBench (Chen et al., 21 Mar 2025)).
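To make the temporal-SQL idea concrete, the toy sketch below fills a question template from a validity-interval table using an "AS OF"-style predicate. The schema, template, and helper function are hypothetical, in the spirit of TDBench's approach rather than its actual pipeline.

```python
import sqlite3
from typing import Optional

# Hypothetical validity-interval table: each row records a fact together
# with the period during which it held.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE positions
                (person TEXT, role TEXT, start_date TEXT, end_date TEXT)""")
conn.execute("INSERT INTO positions VALUES "
             "('Ada Lovelace', 'Chair', '2011-03-01', '2014-06-30')")

def as_of_question(person: str, as_of: str) -> tuple[str, Optional[str]]:
    """Generate a time-scoped QA pair: the answer is the role whose
    validity interval contains the `as_of` date."""
    row = conn.execute(
        "SELECT role FROM positions "
        "WHERE person = ? AND start_date <= ? AND end_date >= ?",
        (person, as_of, as_of),
    ).fetchone()
    question = f"What role did {person} hold on {as_of}?"
    return question, (row[0] if row else None)

print(as_of_question("Ada Lovelace", "2013-01-15"))
# -> ('What role did Ada Lovelace hold on 2013-01-15?', 'Chair')
```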
3. Methodologies and Model Architectures
Approaches in TSQA span several paradigms reflecting both linguistic and numerical aspects:
Temporal Text and Knowledge Graph QA
- Temporal Cutoff Enforcement: Strictly limiting accessible evidence to pre-specified time points to simulate real-world forecasting (e.g., ForecastQA (Jin et al., 2020)).
- Timestamp Estimation and Temporal Embeddings: Inferring latent timestamps from questions, employing multi-linear interactions and sinusoidal positional encodings (e.g., the TComplEx scoring function (Shang et al., 2022), reproduced below this list).
- Contrastive and Auxiliary Losses: Enforcing temporal order and contrastive learning over question pairs differing only in time expressions (Shang et al., 2022, Son et al., 2023, Yang et al., 17 Jul 2024).
- Temporal Graph Extraction and Fusion: Construction of event–time–relation graphs (via CAEVO, SUTime), with fusion by explicit edge representation or GNN modules in transformers (e.g., ERR fusion, RelGraphConv update) (Su et al., 2023).
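For reference, the TComplEx scoring function cited above assigns each temporal fact $(s, p, o, \tau)$ a plausibility score. In the standard notation, with complex embeddings $\mathbf{u}_s, \mathbf{v}_p, \mathbf{w}_\tau, \mathbf{u}_o \in \mathbb{C}^d$, $\odot$ the elementwise product, and the bar denoting complex conjugation:

$$
\phi(s, p, o, \tau) = \operatorname{Re}\bigl(\langle \mathbf{u}_s,\ \mathbf{v}_p \odot \mathbf{w}_\tau,\ \overline{\mathbf{u}}_o \rangle\bigr),
\qquad
\langle \mathbf{a}, \mathbf{b}, \mathbf{c} \rangle = \sum_{k=1}^{d} a_k\, b_k\, c_k .
$$

Temporal KGQA systems typically rank candidate answers or candidate timestamps by such a score, with the question embedding substituted into the relation slot.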
Multimodal and Time-Series Integration
- Time-Series Encoders Coupled to LLMs: Models like ITFormer (Wang et al., 25 Jun 2025) employ hierarchical position encoding (temporal, channel, segment), learnable instruction tokens, and instruct time attention to align/fuse time-series representations with frozen LLMs.
- Discrete Time-Series Tokenization: Methods such as Chat-TS (Quinlan et al., 13 Mar 2025) convert numerical series into discrete tokens, extending the LLM vocabulary for direct joint reasoning (see the sketch after this list).
- Program-Aided Decomposition: Domain agents such as TS-Reasoner (Ye et al., 5 Oct 2024) translate natural language into structured workflows, execute precise numeric/statistical computations, and incorporate domain knowledge, with adaptive self-refinement.
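A minimal tokenization sketch, assuming per-series min-max scaling followed by uniform binning; the bin count and the `<ts_k>` token names are illustrative stand-ins rather than Chat-TS's actual scheme. In practice the new tokens are added to the tokenizer and the LLM's embedding matrix is resized before joint fine-tuning.

```python
import numpy as np

def tokenize_series(values: np.ndarray, n_bins: int = 1024) -> list[str]:
    """Map a numeric series to discrete tokens: min-max scale to [0, 1],
    then assign each value to one of `n_bins` uniform bins."""
    lo, hi = float(values.min()), float(values.max())
    scaled = (values - lo) / (hi - lo + 1e-12)
    bins = np.minimum((scaled * n_bins).astype(int), n_bins - 1)
    return [f"<ts_{b}>" for b in bins]

series = np.array([0.1, 0.5, 0.9, 0.4])
print(tokenize_series(series, n_bins=8))
# -> ['<ts_0>', '<ts_4>', '<ts_7>', '<ts_3>']
```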
Learning with Noisy or Pseudo-Labels
- Pseudo-Labeling via VLMs: Large-scale TSQA models can be trained effectively on labels produced by VLMs (e.g., GPT-4o), exploiting the noise robustness of DNNs to reach accuracy above that of the pseudo-label generator (Fujimura et al., 30 Sep 2025); a minimal training-loop sketch follows.
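A minimal training-loop sketch for this recipe, assuming the pseudo-labels have already been collected from the VLM and are treated as ordinary (noisy) classification targets; the student model and data loader are placeholders.

```python
import torch
import torch.nn as nn

def train_on_pseudo_labels(student: nn.Module,
                           loader,  # yields (inputs, vlm_pseudo_labels)
                           epochs: int = 3,
                           lr: float = 1e-4) -> None:
    """Fit a student model to noisy VLM-produced pseudo-labels with plain
    cross-entropy; the noise robustness of DNNs is what allows the student
    to potentially exceed the pseudo-label generator."""
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    student.train()
    for _ in range(epochs):
        for inputs, pseudo_labels in loader:
            opt.zero_grad()
            loss = loss_fn(student(inputs), pseudo_labels)
            loss.backward()
            opt.step()
```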
4. Evaluation Methodologies and Metrics
Multiple tailored metrics and evaluation protocols have been introduced for TSQA:
- Traditional QA Metrics: Exact Match (EM), F1, set-level accuracy for multi-answer cases (Tan et al., 2023, Kong et al., 26 Feb 2025).
- Time Accuracy (T) and Answer-Time Accuracy (AT): Evaluating not only the returned answer but also the correctness of its temporal justification, with partial credit when only some of the required dates are correct (Kim et al., 4 Aug 2025).
- Brier Score: Calibration of probabilistic predictions (Jin et al., 2020).
- Domain-Specific Success Metrics: e.g., Absolute Average Profit, Relative Average Profit, and MAPE for time-series inference tasks (Ye et al., 5 Oct 2024). A minimal implementation sketch of the generic metrics follows this list.
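The dataset-agnostic metrics above have standard definitions; the sketch below gives a minimal NumPy version of the Brier score and a normalized exact-match check. The normalization choices are illustrative rather than taken from any benchmark's official scorer.

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between predicted probabilities and binary
    outcomes: BS = (1/N) * sum_i (p_i - o_i)^2; lower is better."""
    return float(np.mean((probs - outcomes) ** 2))

def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-normalized exact match."""
    return pred.strip().lower() == gold.strip().lower()

print(brier_score(np.array([0.9, 0.2]), np.array([1.0, 0.0])))  # 0.025
print(exact_match(" 1947 ", "1947"))                            # True
```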
5. Current Results, Robustness, and Open Challenges
Analysis across established benchmarks demonstrates:
- Even the best-performing BERT-based models on event forecasting trail human accuracy by 11–19 percentage points (60.1% vs. 71.2–79.4% in ForecastQA (Jin et al., 2020)).
- Systems that explicitly model temporal order (e.g., via GRU-based aggregation) and apply contrastive learning over question pairs differing only in time expressions show marked improvements, e.g., a 32% absolute error reduction in temporal KGQA (Shang et al., 2022).
- Multimodal models like ITFormer outperform adapted vision–language approaches and general LLMs, while using fewer than 1% additional trainable parameters (Wang et al., 25 Jun 2025).
- Robustness analysis (e.g., UnSeenTimeQA (Uddin et al., 3 Jul 2024)) reveals that LLMs excel at shallow or memorization tasks but degrade significantly for multi-step event dependencies and parallel events (up to 45% performance drop on hard splits).
- In factual, database-driven QA, significant time hallucination persists—average drops of 21.7% when correctness of temporal references is explicitly required alongside content (Kim et al., 4 Aug 2025).
6. Future Directions and Open Research Problems
Open challenges and future directions, as outlined across recent work, include:
- Automated Adaptation and Continual Learning: Frameworks combining temporal memory replay and contrastive learning (as in CLTSQA (Yang et al., 17 Jul 2024)) are necessary to cope with knowledge drift and catastrophic forgetting in dynamic environments.
- Faithfulness in Temporal Justification: Methods enforcing and evaluating the temporal consistency of answer rationales (e.g., the Faith framework (Jia et al., 23 Feb 2024), TDBench (Kim et al., 4 Aug 2025)) are critical for high-stakes domains.
- Fine-Grained Temporal and Multi-Hop Reasoning: Dataset design, augmentation strategies (e.g., pseudo-instruction tuning, temporal shifting (Tan et al., 2023)), and complex temporally stratified benchmarks (e.g., ComplexTempQA (Gruber et al., 7 Jun 2024)) are central for progress.
- Scalability, Efficiency, and Domain Adaptation: Efficient lightweight modules connecting structured TS encoders to LLMs, parameter-efficient fine-tuning, and domain-specific module generation are demonstrated to be effective (e.g., ITFormer (Wang et al., 25 Jun 2025), TS-Reasoner (Ye et al., 5 Oct 2024)).
- Evaluation Beyond Memorization: Synthetic, contamination-free settings (e.g., UnSeenTimeQA (Uddin et al., 3 Jul 2024)) and robust pseudo-labeling techniques (e.g., (Fujimura et al., 30 Sep 2025)) allow for stringent evaluation of true reasoning versus retrieval or memorization.
7. Practical Applications and Impacts
TSQA methods are foundational in:
- Policy and civil unrest forecasting from news streams (Jin et al., 2020)
- Fact-checking temporal claims from structured/unstructured sources (Jia et al., 23 Feb 2024, Kim et al., 4 Aug 2025)
- Healthcare and patient monitoring, finance, and IoT analysis via multimodal TSQA (Quinlan et al., 13 Mar 2025, Kong et al., 26 Feb 2025, Wang et al., 25 Jun 2025)
- Automated scenario planning and resource allocation in logistics; industrial monitoring (e.g., aeronautical engines, manufacturing processes) (Wang et al., 25 Jun 2025)
- Personalized assistants and decision support that combine narrative context and time series prediction (Ye et al., 5 Oct 2024).
The field’s continued innovation in scalable benchmarks, robust reasoning modules, cross-modal architectures, and faithfulness evaluation is steadily bridging the gap between machine and human capabilities in temporal reasoning and time-sensitive decision making.