
Time-Series Question Answering (TSQA)

Updated 8 October 2025
  • Time-Series Question Answering (TSQA) is a research field that processes natural language queries on time-stamped data, integrating temporal reasoning, forecasting, and multimodal analysis.
  • It employs methods such as temporal expression extraction, multi-hop inference, and data fusion from texts, graphs, and sensor signals to address dynamic information needs.
  • Applications span finance, healthcare, and logistics, where TSQA systems enhance decision support by ensuring timely, accurate retrieval of temporal facts.

Time-Series Question Answering (TSQA) is a research area concerned with designing, evaluating, and deploying models and systems that answer natural language questions involving temporal information, time-evolving facts, or reasoning directly from time-stamped data sources. TSQA encompasses a spectrum of methodologies—ranging from text-centric temporal QA and knowledge-graph-based temporal QA, to multi-modal systems integrating numerical time series and contextual language, and explicit time series signal analysis via domain-specific agents. The field addresses critical questions in dynamic domains such as forecasting, factual temporal knowledge retrieval, scenario-driven planning, and robust multi-hop temporal reasoning.

1. Task Formalization and Subproblem Taxonomy

TSQA tasks are defined by their reliance on time-evolving inputs and require models to restrict retrieval or reasoning to evidence aligned with temporal constraints. This includes:

  • Forecasting-oriented QA: Given a corpus of news articles or historical records with timestamps, answer questions about events that occur after the latest available information, using only past data (e.g., ForecastQA (Jin et al., 2020)); a minimal cutoff-filter sketch appears after this list.
  • Temporal KGQA: Answering questions over temporal knowledge graphs (KGs), inferring time intervals, entity roles, or factual transitions via timestamp estimation and temporal order modeling (e.g., (Shang et al., 2022)).
  • Streaming and Continually Evolving QA: Models must adapt knowledge as new sources become available over time, balancing adaptation with retention (e.g., StreamingQA (Liška et al., 2022), CLTSQA (Yang et al., 17 Jul 2024)).
  • Temporal Reasoning over Text and Multimodal Inputs: Handling cross-modal QA involving numerical time series with associated natural language (e.g., Chat-TS (Quinlan et al., 13 Mar 2025), MTBench (Chen et al., 21 Mar 2025), ITFormer (Wang et al., 25 Jun 2025)).
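To make the temporal-cutoff constraint concrete, the following minimal Python sketch (all names are illustrative and not drawn from any cited system) represents a forecasting-style question with a cutoff date and filters a timestamped corpus so that only pre-cutoff evidence is retrievable.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Evidence:
    published: date  # timestamp attached to the source document
    text: str

@dataclass
class TSQAExample:
    question: str
    cutoff: date     # only evidence dated before this point may be used
    gold_answer: str

def visible_evidence(corpus: list[Evidence], example: TSQAExample) -> list[Evidence]:
    """Enforce the temporal cutoff: keep only documents published strictly before the cutoff."""
    return [doc for doc in corpus if doc.published < example.cutoff]

# A forecasting-style question may only see pre-cutoff news.
corpus = [
    Evidence(date(2019, 3, 1), "Company X announces expansion plans."),
    Evidence(date(2019, 9, 15), "Company X opens its new plant."),
]
example = TSQAExample(
    question="Will Company X open a new plant by the end of 2019?",
    cutoff=date(2019, 6, 1),
    gold_answer="yes",
)
print([doc.text for doc in visible_evidence(corpus, example)])
# Only the March article is retrievable; the September article postdates the cutoff.
```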

Critical subproblems include:

  • Temporal expression extraction and normalization (a simplified extraction sketch follows this list);
  • Reasoning over explicit and implicit temporal constraints;
  • Multi-hop temporal inference;
  • Temporal rationale faithfulness in answers;
  • Handling diachronic and multimodal corpora.
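Temporal expression extraction and normalization is usually delegated to dedicated taggers such as SUTime; the deliberately simplified, regex-based sketch below is an illustrative stand-in (not any cited system) that maps a handful of surface forms to ISO-style dates, just to make the subproblem tangible.

```python
import re
from datetime import date, timedelta

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june",
     "july", "august", "september", "october", "november", "december"])}

def normalize_temporal_expressions(text: str, reference: date) -> list[tuple[str, str]]:
    """Map a few common temporal expressions to ISO-8601-style values relative to a reference date."""
    results = []
    # Absolute expressions such as "March 2019".
    for m in re.finditer(r"\b(" + "|".join(MONTHS) + r")\s+(\d{4})\b", text, re.IGNORECASE):
        month, year = MONTHS[m.group(1).lower()], int(m.group(2))
        results.append((m.group(0), f"{year:04d}-{month:02d}"))
    # Relative expressions such as "yesterday" or "last year".
    if re.search(r"\byesterday\b", text, re.IGNORECASE):
        results.append(("yesterday", (reference - timedelta(days=1)).isoformat()))
    if re.search(r"\blast year\b", text, re.IGNORECASE):
        results.append(("last year", str(reference.year - 1)))
    return results

print(normalize_temporal_expressions(
    "Sales rose in March 2019 compared with last year.", reference=date(2019, 6, 1)))
# [('March 2019', '2019-03'), ('last year', '2018')]
```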

2. Datasets and Benchmarks

A rich variety of datasets drive TSQA research, reflecting diversity in data modality, question complexity, and temporal reasoning depth:

| Dataset/Benchmark | Data Types | Key Focus | Scale / Coverage |
|---|---|---|---|
| ForecastQA (Jin et al., 2020) | News text (time-stamped) | Event forecasting | 10,392 Q-A pairs, 5 years |
| StreamingQA (Liška et al., 2022) | News text, timelines | Adaptation, drift | 14 years, quarterly splits |
| ComplexTempQA (Gruber et al., 7 Jun 2024) | Wikipedia/Wikidata | Multi-hop, large scale | 100M+ pairs, 36 years |
| EngineMT-QA (Wang et al., 25 Jun 2025) | Sensor TS + text | Multimodal QA | 110K Q-A pairs, real-world |
| MTBench (Chen et al., 21 Mar 2025) | Financial/weather TS + text | Cross-modal QA | Multi-domain, labeled tasks |
| TDBench (Kim et al., 4 Aug 2025) | Temporal DB | Factual QA, evaluation | 6K+ pairs, 13 operators |
| CLTSQA-Data (Yang et al., 17 Jul 2024) | WikiData/text | Continual learning | 50K Qs, ∼5K contexts, staged |
| UnSeenTimeQA (Uddin et al., 3 Jul 2024) | Synthetic scenarios | Reasoning-only | Unlimited, no web leakage |

Significant advances in dataset construction include massive scale (e.g., ComplexTempQA's 100M+ pairs), real-world multimodal coverage pairing sensor time series with text (e.g., EngineMT-QA), and contamination-free synthetic design that avoids web leakage (e.g., UnSeenTimeQA).

3. Methodologies and Model Architectures

Approaches in TSQA span several paradigms reflecting both linguistic and numerical aspects:

Temporal Text and Knowledge Graph QA

  • Temporal Cutoff Enforcement: Strictly limiting accessible evidence to pre-specified time points to simulate real-world forecasting (e.g., ForecastQA (Jin et al., 2020)).
  • Timestamp Estimation and Temporal Embeddings: Inferring latent timestamps from questions, employing multi-linear interactions and sinusoidal positional encodings, e.g., the TComplEx score S(s, r, t, o) = \operatorname{Re}(\langle e_s, e_r, e_o, e_t \rangle) (Shang et al., 2022); a small numerical sketch of this score follows this list.
  • Contrastive and Auxiliary Losses: Enforcing temporal order and contrastive learning over question pairs differing only in time expressions (Shang et al., 2022, Son et al., 2023, Yang et al., 17 Jul 2024).
  • Temporal Graph Extraction and Fusion: Construction of event–time–relation graphs (via CAEVO, SUTime), with fusion by explicit edge representation or GNN modules in transformers (e.g., ERR fusion, RelGraphConv update) (Su et al., 2023).
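As a numerical illustration of the TComplEx-style score above, the sketch below evaluates the real part of the multilinear product of complex embeddings; the embedding dimension and random initialization are placeholders, and some formulations additionally conjugate the object embedding.

```python
import numpy as np

def tcomplex_score(e_s: np.ndarray, e_r: np.ndarray, e_o: np.ndarray, e_t: np.ndarray) -> float:
    """Score a (subject, relation, object, timestamp) quadruple as the real part of the
    summed elementwise product of its complex embeddings, matching the formula above."""
    return float(np.real(np.sum(e_s * e_r * e_o * e_t)))

rng = np.random.default_rng(0)
dim = 32  # illustrative embedding dimension
e_s, e_r, e_o, e_t = (rng.normal(size=dim) + 1j * rng.normal(size=dim) for _ in range(4))
print(tcomplex_score(e_s, e_r, e_o, e_t))  # higher scores indicate more plausible quadruples
```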

Multimodal and Time-Series Integration

  • Time-Series Encoders Coupled to LLMs: Models like ITFormer (Wang et al., 25 Jun 2025) employ hierarchical position encoding (temporal, channel, segment), learnable instruction tokens, and instruct time attention to align/fuse time-series representations with frozen LLMs.
  • Discrete Time-Series Tokenization: Methods such as Chat-TS (Quinlan et al., 13 Mar 2025) convert numerical series to discrete tokens, extending the LLM vocabulary for direct joint reasoning; a simplified binning sketch appears after this list.
  • Program-Aided Decomposition: Domain agents such as TS-Reasoner (Ye et al., 5 Oct 2024) translate natural language into structured workflows, execute precise numeric/statistical computations, and incorporate domain knowledge, with adaptive self-refinement.
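A rough sketch of discrete time-series tokenization follows; the uniform min-max binning and token naming here are assumptions for illustration rather than the exact scheme used by Chat-TS.

```python
import numpy as np

def tokenize_series(values: np.ndarray, n_bins: int = 256) -> list[str]:
    """Quantize a numeric series into discrete bin tokens that could extend an LLM vocabulary.
    Values are min-max scaled to [0, 1], then mapped to one of n_bins uniform bins."""
    lo, hi = values.min(), values.max()
    scaled = (values - lo) / (hi - lo) if hi > lo else np.zeros_like(values)
    bins = np.minimum((scaled * n_bins).astype(int), n_bins - 1)
    return [f"<ts_{b}>" for b in bins]

series = np.array([101.2, 103.5, 99.8, 110.4, 108.0])
print(tokenize_series(series, n_bins=16))
# ['<ts_2>', '<ts_5>', '<ts_0>', '<ts_15>', '<ts_12>']
```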

Learning with Noisy or Pseudo-Labels

  • Pseudo-Labeling via VLMs: Large-scale TSQA models can be effectively trained with labels produced by VLMs (e.g., GPT-4o), exploiting the noise robustness of DNNs to achieve accuracy higher than the pseudo-label generator (Fujimura et al., 30 Sep 2025).
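A minimal sketch of this pseudo-labeling recipe is shown below; query_vlm is a hypothetical stand-in for an actual VLM call, and a logistic-regression student replaces the DNN used in practice, so this only illustrates the train-on-noisy-labels pattern rather than the cited pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_vlm(series: np.ndarray, question: str) -> int:
    """Hypothetical stand-in for a VLM call (e.g., rendering the series as a chart and
    prompting GPT-4o); a crude heuristic plus random flips simulates a noisy pseudo-label."""
    label = int(series[-1] > series[0])   # crude "did the series rise overall?" answer
    if np.random.rand() < 0.2:            # flip ~20% of labels to mimic pseudo-label noise
        label = 1 - label
    return label

rng = np.random.default_rng(0)
unlabeled = [rng.normal(size=50).cumsum() for _ in range(200)]  # unlabeled time series
question = "Does this series trend upward overall?"

# 1) Generate noisy pseudo-labels with the VLM stand-in.
pseudo_labels = [query_vlm(s, question) for s in unlabeled]

# 2) Train a small student on simple summary features; in the cited setting the noise
#    robustness of DNN training lets the student surpass the pseudo-label generator.
features = np.array([[s.mean(), s.std(), s[-1] - s[0]] for s in unlabeled])
student = LogisticRegression().fit(features, pseudo_labels)
print(student.predict(features[:5]))
```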

4. Evaluation Methodologies and Metrics

Multiple tailored metrics and evaluation protocols have been introduced for TSQA:

  • Traditional QA Metrics: Exact Match (EM), F1, set-level accuracy for multi-answer cases (Tan et al., 2023, Kong et al., 26 Feb 2025).
  • Time Accuracy (T) and Answer-Time Accuracy (AT): Evaluating not only the returned answer but the correctness of temporal justifications, with partial credit for cases where only some required dates are correct (Kim et al., 4 Aug 2025):

T(q) = \frac{\lvert \{\, t \in f(q) : t \text{ correctly predicted} \,\} \rvert}{\lvert f(q) \rvert} \times 100\%

Calibration of probabilistic answer predictions is commonly assessed with the Brier score (a worked computation of both metrics follows this list):

\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left( p_{ic} - y_{ic} \right)^2

  • Domain-Specific Success Metrics: E.g., Absolute Average Profit, Relative Average Profit, and MAPE for time-series inference tasks (Ye et al., 5 Oct 2024).
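Both formulas above translate directly into code; the sketch below computes Time Accuracy over a question's required temporal references and the multi-class Brier score, with data and names chosen purely for illustration.

```python
import numpy as np

def time_accuracy(required_dates: set[str], predicted_dates: set[str]) -> float:
    """T(q): percentage of the question's required temporal references that were predicted
    correctly, giving partial credit when only some dates are right."""
    if not required_dates:
        return 100.0
    return 100.0 * len(required_dates & predicted_dates) / len(required_dates)

def brier_score(probs: np.ndarray, labels: np.ndarray) -> float:
    """Multi-class Brier score: mean squared difference between predicted class
    probabilities (shape N x C) and one-hot encodings of the true labels (length N)."""
    n, c = probs.shape
    onehot = np.eye(c)[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

print(time_accuracy({"2019-03", "2020-07"}, {"2019-03"}))   # 50.0
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(brier_score(probs, np.array([0, 2])))                 # 0.8
```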

5. Current Results, Robustness, and Open Challenges

Analysis across established benchmarks demonstrates:

  • Even the best-performing BERT-based models on event forecasting trail human judgment by 10–19 percentage points of accuracy (e.g., 60.1% vs. 71.2–79.4% in ForecastQA (Jin et al., 2020)).
  • Systems that explicitly model temporal order (e.g., via GRU-based aggregation) and apply contrastive learning over time expressions show marked improvements, e.g., a 32% absolute error reduction in temporal KGQA (Shang et al., 2022).
  • Multimodal models like ITFormer outperform adapted vision–language approaches and general LLMs, while using fewer than 1% additional trainable parameters (Wang et al., 25 Jun 2025).
  • Robustness analysis (e.g., UnSeenTimeQA (Uddin et al., 3 Jul 2024)) reveals that LLMs excel at shallow or memorization tasks but degrade significantly for multi-step event dependencies and parallel events (up to 45% performance drop on hard splits).
  • In factual, database-driven QA, significant time hallucination persists—average drops of 21.7% when correctness of temporal references is explicitly required alongside content (Kim et al., 4 Aug 2025).

6. Future Directions and Open Research Problems

Open challenges and future directions, as outlined across recent work, include:

  • Automated Adaptation and Continual Learning: Frameworks combining temporal memory replay and contrastive learning (as in CLTSQA (Yang et al., 17 Jul 2024)) are necessary to cope with knowledge drift and catastrophic forgetting in dynamic environments.
  • Faithfulness in Temporal Justification: Methods enforcing and evaluating the temporal consistency of answer rationales (e.g., the Faith framework (Jia et al., 23 Feb 2024), TDBench (Kim et al., 4 Aug 2025)) are critical for high-stakes domains.
  • Fine-Grained Temporal and Multi-Hop Reasoning: Dataset design, augmentation strategies (e.g., pseudo-instruction tuning, temporal shifting (Tan et al., 2023)), and complex temporally stratified benchmarks (e.g., ComplexTempQA (Gruber et al., 7 Jun 2024)) are central for progress.
  • Scalability, Efficiency, and Domain Adaptation: Efficient lightweight modules connecting structured TS encoders to LLMs, parameter-efficient fine-tuning, and domain-specific module generation are demonstrated to be effective (e.g., ITFormer (Wang et al., 25 Jun 2025), TS-Reasoner (Ye et al., 5 Oct 2024)).
  • Evaluation Beyond Memorization: Synthetic, contamination-free settings (e.g., UnSeenTimeQA (Uddin et al., 3 Jul 2024)) and robust pseudo-labeling techniques (e.g., (Fujimura et al., 30 Sep 2025)) allow for stringent evaluation of true reasoning versus retrieval or memorization.

7. Practical Applications and Impacts

TSQA methods are foundational in decision-support settings such as finance, healthcare, and logistics, where accurate and timely retrieval of temporal facts underpins forecasting, monitoring, and planning.

The field’s continued innovation in scalable benchmarks, robust reasoning modules, cross-modal architectures, and faithfulness evaluation is steadily bridging the gap between machine and human capabilities in temporal reasoning and time-sensitive decision making.
