
Time-Series Question Answering (TSQA)

Updated 8 October 2025
  • Time-Series Question Answering (TSQA) is a research field that processes natural language queries on time-stamped data, integrating temporal reasoning, forecasting, and multimodal analysis.
  • It employs methods such as temporal expression extraction, multi-hop inference, and data fusion from texts, graphs, and sensor signals to address dynamic information needs.
  • Applications span finance, healthcare, and logistics, where TSQA systems enhance decision support by ensuring timely, accurate retrieval of temporal facts.

Time-Series Question Answering (TSQA) is a research area concerned with designing, evaluating, and deploying models and systems that answer natural language questions involving temporal information, time-evolving facts, or reasoning directly from time-stamped data sources. TSQA encompasses a spectrum of methodologies—ranging from text-centric temporal QA and knowledge-graph-based temporal QA, to multi-modal systems integrating numerical time series and contextual language, and explicit time series signal analysis via domain-specific agents. The field addresses critical questions in dynamic domains such as forecasting, factual temporal knowledge retrieval, scenario-driven planning, and robust multi-hop temporal reasoning.

1. Task Formalization and Subproblem Taxonomy

TSQA tasks are defined by their reliance on time-evolving inputs and require models to restrict retrieval or reasoning to evidence aligned with temporal constraints. This includes:

  • Forecasting-oriented QA: Given a corpus of news articles or historical records with timestamps, answer questions about events that occur after the latest available information, using only past data (e.g., ForecastQA (Jin et al., 2020)).
  • Temporal KGQA: Answering questions over temporal knowledge graphs (KGs), inferring time intervals, entity roles, or factual transitions via timestamp estimation and temporal order modeling (e.g., (Shang et al., 2022)).
  • Streaming and Continually Evolving QA: Models must adapt knowledge as new sources become available over time, balancing adaptation with retention (e.g., StreamingQA (Liška et al., 2022), CLTSQA (Yang et al., 17 Jul 2024)).
  • Temporal Reasoning over Text and Multimodal Inputs: Handling cross-modal QA involving numerical time series with associated natural language (e.g., Chat-TS (Quinlan et al., 13 Mar 2025), MTBench (Chen et al., 21 Mar 2025), ITFormer (Wang et al., 25 Jun 2025)).
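A defining constraint shared by these task settings is that retrieval must be restricted to evidence whose timestamps satisfy the question's temporal constraints. A minimal sketch of such a temporal cutoff filter (the corpus contents and function name here are illustrative, not from any cited system):

```python
from datetime import date

# Hypothetical corpus of (timestamp, text) evidence items.
corpus = [
    (date(2019, 3, 1), "Company A announced a merger."),
    (date(2019, 9, 15), "The merger was approved by regulators."),
    (date(2020, 1, 10), "Company A's stock rose after the merger closed."),
]

def evidence_before(corpus, cutoff):
    """Restrict retrieval to documents time-stamped strictly before the
    cutoff, simulating a forecasting setting (as in ForecastQA) where
    future evidence must be unavailable at question time."""
    return [text for ts, text in corpus if ts < cutoff]

# A question posed on 2019-10-01 may only see the first two documents.
visible = evidence_before(corpus, date(2019, 10, 1))
```

In forecasting-oriented QA this filter is applied at both training and evaluation time, so that a model can never condition on information published after the question's cutoff.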

Critical subproblems include:

  • Temporal expression extraction and normalization;
  • Reasoning over explicit and implicit temporal constraints;
  • Multi-hop temporal inference;
  • Temporal rationale faithfulness in answers;
  • Handling diachronic and multimodal corpora.

2. Datasets and Benchmarks

A rich variety of datasets drive TSQA research, reflecting diversity in data modality, question complexity, and temporal reasoning depth:

| Dataset/Benchmark | Data Types | Key Focus | Scale / Coverage |
|---|---|---|---|
| ForecastQA (Jin et al., 2020) | News text (time-stamped) | Event forecasting | 10,392 Q-A pairs, 5 years |
| StreamingQA (Liška et al., 2022) | News text, timelines | Adaptation, drift | 14 years, quarterly splits |
| ComplexTempQA (Gruber et al., 7 Jun 2024) | Wikipedia/Wikidata | Multi-hop, large scale | 100M+ pairs, 36 years |
| EngineMT-QA (Wang et al., 25 Jun 2025) | Sensor TS + text | Multimodal QA | 110K Q-A pairs, real-world |
| MTBench (Chen et al., 21 Mar 2025) | Financial/weather TS + text | Cross-modal QA | Multi-domain, labeled tasks |
| TDBench (Kim et al., 4 Aug 2025) | Temporal DB | Factual QA, evaluation | 6K+ pairs, 13 operators |
| CLTSQA-Data (Yang et al., 17 Jul 2024) | WikiData/text | Continual learning | 50K Qs, ∼5K contexts, staged |
| UnSeenTimeQA (Uddin et al., 3 Jul 2024) | Synthetic scenarios | Reasoning-only | Unlimited, no web leakage |

Significant advances in dataset construction include massive scale (100M+ pairs in ComplexTempQA), contamination-free synthetic scenarios that rule out web leakage (UnSeenTimeQA), and real-world multimodal pairings of sensor time series with text (EngineMT-QA).

3. Methodologies and Model Architectures

Approaches in TSQA span several paradigms reflecting both linguistic and numerical aspects:

Temporal Text and Knowledge Graph QA

  • Temporal Cutoff Enforcement: Strictly limiting accessible evidence to pre-specified time points to simulate real-world forecasting (e.g., ForecastQA (Jin et al., 2020)).
  • Timestamp Estimation and Temporal Embeddings: Inferring latent timestamps from questions, employing multi-linear interactions and sinusoidal positional encodings (e.g., the TComplEx score $S(s, r, t, o) = \operatorname{Re}(\langle e_s, e_r, e_o, e_t \rangle)$ (Shang et al., 2022)).
  • Contrastive and Auxiliary Losses: Enforcing temporal order and contrastive learning over question pairs differing only in time expressions (Shang et al., 2022, Son et al., 2023, Yang et al., 17 Jul 2024).
  • Temporal Graph Extraction and Fusion: Construction of event–time–relation graphs (via CAEVO, SUTime), with fusion by explicit edge representation or GNN modules in transformers (e.g., ERR fusion, RelGraphConv update) (Su et al., 2023).
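The TComplEx-style score referenced above is a four-way multilinear product over complex embeddings. A minimal NumPy sketch, following the form as written in this article (note that the original TComplEx model conjugates the object embedding, a detail that varies across presentations):

```python
import numpy as np

def tcomplex_score(e_s, e_r, e_o, e_t):
    """S(s, r, t, o) = Re(<e_s, e_r, e_o, e_t>): real part of the
    component-wise four-way product of complex embeddings, summed over
    embedding dimensions. (TComplEx proper uses conj(e_o); the convention
    differs across papers.)"""
    return float(np.real(np.sum(e_s * e_r * e_o * e_t)))

# Toy 4-dimensional complex embeddings for one (subject, relation, object, time) fact.
rng = np.random.default_rng(0)
emb = lambda: rng.standard_normal(4) + 1j * rng.standard_normal(4)
score = tcomplex_score(emb(), emb(), emb(), emb())
```

Candidate answers (or candidate timestamps) are then ranked by this score, so temporal QA reduces to scoring each candidate entity/time against the question-derived subject and relation embeddings.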

Multimodal and Time-Series Integration

  • Time-Series Encoders Coupled to LLMs: Models like ITFormer (Wang et al., 25 Jun 2025) employ hierarchical position encoding (temporal, channel, segment), learnable instruction tokens, and instruct time attention to align/fuse time-series representations with frozen LLMs.
  • Discrete Time-Series Tokenization: Methods such as Chat-TS (Quinlan et al., 13 Mar 2025) convert numerical series to discrete tokens, extending LLM vocabulary for direct joint reasoning.
  • Program-Aided Decomposition: Domain agents such as TS-Reasoner (Ye et al., 5 Oct 2024) translate natural language into structured workflows, execute precise numeric/statistical computations, and incorporate domain knowledge, with adaptive self-refinement.
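Discrete tokenization of the kind used by Chat-TS-style models can be illustrated with a simple normalize-and-bin scheme. This is a hedged stand-in, not the cited paper's exact tokenizer: the bin count, clipping range, and function name are assumptions for illustration.

```python
import numpy as np

def tokenize_series(series, n_bins=16):
    """Map a numeric series to discrete token ids by z-normalizing and
    uniformly binning into n_bins levels, so the ids can extend an LLM's
    vocabulary for joint text/time-series reasoning."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)       # per-series normalization
    edges = np.linspace(-3.0, 3.0, n_bins - 1)  # tails fall into edge bins
    return np.digitize(x, edges)                # token ids in [0, n_bins-1]

tokens = tokenize_series([1.0, 2.0, 3.0, 100.0])
```

Because binning is monotone, relative ordering of values survives tokenization, which is what lets the LLM reason about trends directly over the token sequence.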

Learning with Noisy or Pseudo-Labels

  • Pseudo-Labeling via VLMs: Large-scale TSQA models can be effectively trained with labels produced by VLMs (e.g., GPT-4o), exploiting the noise robustness of DNNs to achieve accuracy higher than the pseudo-label generator (Fujimura et al., 30 Sep 2025).

4. Evaluation Methodologies and Metrics

Multiple tailored metrics and evaluation protocols have been introduced for TSQA:

  • Traditional QA Metrics: Exact Match (EM), F1, set-level accuracy for multi-answer cases (Tan et al., 2023, Kong et al., 26 Feb 2025).
  • Time Accuracy (T) and Answer-Time Accuracy (AT): Evaluating not only the returned answer but the correctness of temporal justifications, with partial credit for cases where only some required dates are correct (Kim et al., 4 Aug 2025):

$$T(q) = \frac{|\{t \in f(q) \text{ correctly predicted}\}|}{|f(q)|} \times 100\%$$

Calibration of probabilistic predictions is assessed with the Brier score:

$$\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left(p_{ic} - y_{ic}\right)^2$$

  • Domain-Specific Success Metrics: E.g., Absolute Average Profit, Relative Average Profit, and MAPE for time-series inference tasks (Ye et al., 5 Oct 2024).
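These metrics are straightforward to compute; a minimal sketch (treating $f(q)$ as the set of required time references for question $q$, and taking MAPE in its standard form, since the cited papers' exact variants may differ):

```python
import numpy as np

def time_accuracy(pred_times, gold_times):
    """T(q): percentage of required time references f(q) predicted
    correctly, giving partial credit when only some dates are right."""
    gold = set(gold_times)
    return 100.0 * len(gold & set(pred_times)) / len(gold)

def brier_score(probs, labels):
    """Mean squared error between predicted class probabilities (N, C)
    and one-hot ground truth (N, C), averaged over N questions."""
    p = np.asarray(probs, dtype=float)
    y = np.asarray(labels, dtype=float)
    return float(np.mean(np.sum((p - y) ** 2, axis=1)))

def mape(pred, actual):
    """Mean absolute percentage error for numeric time-series answers."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return float(np.mean(np.abs((pred - actual) / actual)) * 100.0)
```

Lower is better for the Brier score and MAPE; higher is better for time accuracy. Answer-time accuracy (AT) additionally requires the content answer itself to be correct before temporal credit is awarded.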

5. Current Results, Robustness, and Open Challenges

Analysis across established benchmarks demonstrates:

  • Even the best-performing BERT-based models on event forecasting lag human judgment by 11–19 percentage points in accuracy (60.1% vs. 71.2–79.4% in ForecastQA (Jin et al., 2020)).
  • Systems that model temporal order explicitly (e.g., GRU-based aggregation) and contrastive learning on time-expressions show marked improvements, e.g., 32% absolute error reduction in temporal KGQA (Shang et al., 2022).
  • Multimodal models like ITFormer outperform adapted vision–language approaches and general LLMs, while using fewer than 1% additional trainable parameters (Wang et al., 25 Jun 2025).
  • Robustness analysis (e.g., UnSeenTimeQA (Uddin et al., 3 Jul 2024)) reveals that LLMs excel at shallow or memorization tasks but degrade significantly for multi-step event dependencies and parallel events (up to 45% performance drop on hard splits).
  • In factual, database-driven QA, significant time hallucination persists—average drops of 21.7% when correctness of temporal references is explicitly required alongside content (Kim et al., 4 Aug 2025).

6. Future Directions and Open Research Problems

Open challenges and future directions, as outlined across recent work, include:

  • Automated Adaptation and Continual Learning: Frameworks combining temporal memory replay and contrastive learning (as in CLTSQA (Yang et al., 17 Jul 2024)) are necessary to cope with knowledge drift and catastrophic forgetting in dynamic environments.
  • Faithfulness in Temporal Justification: Methods enforcing and evaluating the temporal consistency of answer rationales (e.g., the Faith framework (Jia et al., 23 Feb 2024), TDBench (Kim et al., 4 Aug 2025)) are critical for high-stakes domains.
  • Fine-Grained Temporal and Multi-Hop Reasoning: Dataset design, augmentation strategies (e.g., pseudo-instruction tuning, temporal shifting (Tan et al., 2023)), and complex temporally stratified benchmarks (e.g., ComplexTempQA (Gruber et al., 7 Jun 2024)) are central for progress.
  • Scalability, Efficiency, and Domain Adaptation: Efficient lightweight modules connecting structured TS encoders to LLMs, parameter-efficient fine-tuning, and domain-specific module generation are demonstrated to be effective (e.g., ITFormer (Wang et al., 25 Jun 2025), TS-Reasoner (Ye et al., 5 Oct 2024)).
  • Evaluation Beyond Memorization: Synthetic, contamination-free settings (e.g., UnSeenTimeQA (Uddin et al., 3 Jul 2024)) and robust pseudo-labeling techniques (e.g., (Fujimura et al., 30 Sep 2025)) allow for stringent evaluation of true reasoning versus retrieval or memorization.

7. Practical Applications and Impacts

TSQA methods are foundational in dynamic domains such as finance, healthcare, and logistics, where timely and accurate retrieval of temporal facts underpins forecasting, monitoring, and decision support.

The field’s continued innovation in scalable benchmarks, robust reasoning modules, cross-modal architectures, and faithfulness evaluation is steadily bridging the gap between machine and human capabilities in temporal reasoning and time-sensitive decision making.
