
Time Series Question Answering (TSQA)

Updated 3 January 2026
  • TSQA is a field focused on interpreting natural language queries over dynamic, time-evolving data, integrating both factual and numerical signal analysis.
  • It leverages advanced models that fuse temporal database techniques, multimodal signal processing, and chain-of-thought corrections to enhance reasoning and accuracy.
  • TSQA has broad applications in healthcare, finance, and sensor analytics, providing real-time insights and robust temporal adaptation for complex data environments.

Time Series Question Answering (TSQA) is the automated answering of natural-language questions about factual knowledge or technical patterns that vary over time, or about multivariate time series signals from arbitrary domains. TSQA subsumes both fact-centric temporal QA over dynamic databases (e.g., "Who was president of Brazil in 2018?") and multimodal QA involving direct analysis of numerical sequences (e.g., "Is there a trend reversal in this ECG trace?"). The field has expanded rapidly due to the ubiquity of temporal data and the increasing capabilities of LLMs, with specialized benchmarks, modeling approaches, and evaluation metrics now established for both factual and signal-based TSQA tasks (Kim et al., 4 Aug 2025, Chen et al., 21 Mar 2025, Wang et al., 25 Jun 2025, Kong et al., 26 Feb 2025, Divo et al., 7 Nov 2025, Su et al., 27 Dec 2025).

1. Problem Formulation and Taxonomy

TSQA tasks are formally defined across two principal settings:

  • Factual Time-Sensitive Question Answering: Given a context $C$ consisting of time-evolving facts (often represented as tuples $(s, r, o_i, [t_{s_i}, t_{e_i}])$), a natural-language question $Q$, and a required time or temporal expression, the goal is to output an answer $A$ whose validity depends explicitly on the temporal context; i.e., $A$ must change if any time reference in $Q$ is altered (Yang et al., 2024).
  • Multimodal Time Series QA: Given a sampled numerical time series $x(t)$ (e.g., $x \in \mathbb{R}^{T \times d}$) and a question $Q$, predict an answer $A$ (a label, a value, or a textual explanation) that reflects analytic or semantic properties of the series, optionally integrating structured context (e.g., technical metadata) or unstructured text (news, reports) (Chen et al., 21 Mar 2025, Wang et al., 25 Jun 2025, Divo et al., 7 Nov 2025, Kong et al., 26 Feb 2025). Both settings are sketched as simple data structures below.
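
To make the two settings concrete, a minimal Python sketch follows; all type and field names are illustrative rather than drawn from any cited benchmark:

```python
from dataclasses import dataclass

# Hypothetical container types for the two TSQA settings; field names are
# illustrative, not taken from any cited benchmark.

@dataclass
class TimeSensitiveFact:
    subject: str      # s
    relation: str     # r
    obj: str          # o_i
    valid_from: int   # t_{s_i}, e.g., a year
    valid_to: int     # t_{e_i}

@dataclass
class FactualTSQAInstance:
    context: list[TimeSensitiveFact]  # C: time-evolving facts
    question: str                     # Q, carrying an explicit time reference
    answer: str                       # A, valid only for that time

@dataclass
class MultimodalTSQAInstance:
    series: list[list[float]]  # x in R^{T x d}: T steps, d channels
    question: str              # Q
    answer: str                # A: label, value, or textual explanation

# A fact-centric instance whose answer flips if the year in Q changes:
fact = TimeSensitiveFact("Brazil", "head_of_state", "Michel Temer", 2016, 2018)
qa = FactualTSQAInstance([fact], "Who was president of Brazil in 2018?",
                         "Michel Temer")
```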

Taxonomies further subdivide TSQA into:

  • Attribute (Attr) Questions: Extraction of attribute values at specific times or intervals.
  • Comparison (Comp) Questions: Temporal or numerical comparisons between different time intervals, entities, or signals.
  • Counting (Count) Questions: Quantification over time (e.g., "How many years did X occur?").
  • Open-ended and Reasoning QA: Explanations, counterfactuals, causal inference over temporal data (Gruber et al., 2024, Divo et al., 7 Nov 2025).
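
As a toy illustration of this taxonomy (the question strings below are invented examples, not benchmark items):

```python
from enum import Enum

class QType(Enum):
    ATTRIBUTE = "attr"   # value of an attribute at a specific time
    COMPARISON = "comp"  # compare intervals, entities, or signals
    COUNTING = "count"   # quantify occurrences over time
    OPEN_ENDED = "open"  # explanation, counterfactual, causal inference

# Invented example questions mapped to their taxonomy class:
EXAMPLES = {
    "What was the unemployment rate in 2015?": QType.ATTRIBUTE,
    "Was the signal's mean higher in Q1 or Q2?": QType.COMPARISON,
    "How many years did X hold office?": QType.COUNTING,
    "Why did the trend reverse after the announcement?": QType.OPEN_ENDED,
}
```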

2. Benchmark Datasets and Data Generation

A variety of benchmarks cover both factual and signal-based TSQA tasks, each emphasizing particular technical challenges:

  • TDBench (Kim et al., 4 Aug 2025): Constructs TSQA pairs by executing temporal SQL queries over uni-temporal databases, employing temporal functional dependencies to ensure time-local uniqueness of answers. Covers all 13 interval relations from Allen’s taxonomy (before, after, meets, overlaps, contains, etc.), with data spanning Wikipedia, legal, environmental, and synthetic medical domains; a toy SQL-to-QA sketch follows this list.
  • ComplexTempQA (Gruber et al., 2024): Over 100M QA pairs derived from Wikipedia year pages and Wikidata, with explicit time-metadata, multi-hop event/entity relations, and a taxonomy capturing attributes, comparisons, and event counting.
  • StreamingQA (Liška et al., 2022): 11M news articles (2007–2020), 200K+ temporal QA pairs with question dates and supporting passages, designed for continual evaluation and streaming adaptation.
  • Multimodal TSQA:
    • MTBench (Chen et al., 21 Mar 2025): 20K financial/news and 2K weather/meteorological time series paired with textual reports, annotated for forecasting, trend analysis, and cross-modal causal QA.
    • EngineMT-QA (Wang et al., 25 Jun 2025): 110K time series + natural language QA pairs from multivariate aero-engine signals covering understanding, diagnosis, prediction, and prescription tasks.
    • TSQA Dataset (Time-MQA) (Kong et al., 26 Feb 2025): ~200K QA pairs across 12 domains (finance, healthcare, sensors) with forecasting, imputation, anomaly detection, classification, and open-ended QA.
  • Synthetic and motion analysis benchmarks:
    • QuAnTS (Divo et al., 7 Nov 2025): 150K QA pairs on human skeleton time series, spanning descriptive, temporal, and comparison queries about action sequences.
    • UnSeenTimeQA (Uddin et al., 2024): Synthetic logistics/event scenarios designed to explicitly probe LLM temporal reasoning without data leakage from pretraining corpora.
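
The TDBench-style generation idea can be sketched as follows; the schema, data, and question template are hypothetical stand-ins, and the actual pipeline is considerably more involved:

```python
import sqlite3

# Hypothetical uni-temporal table: every fact carries a validity interval.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE presidency
                (country TEXT, person TEXT, start_year INT, end_year INT)""")
conn.executemany("INSERT INTO presidency VALUES (?, ?, ?, ?)", [
    ("Brazil", "Dilma Rousseff", 2011, 2016),
    ("Brazil", "Michel Temer",   2016, 2018),
])

def make_qa(country: str, year: int):
    # Allen-style "contains": the queried instant lies inside the fact's
    # validity interval. A temporal functional dependency
    # (country, year -> person) would guarantee a unique answer.
    row = conn.execute(
        """SELECT person FROM presidency
           WHERE country = ? AND start_year <= ? AND ? <= end_year""",
        (country, year, year),
    ).fetchone()
    question = f"Who was president of {country} in {year}?"
    return question, (row[0] if row else None)

print(make_qa("Brazil", 2018))
# ('Who was president of Brazil in 2018?', 'Michel Temer')
```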

3. Model Architectures and Temporal Fusion

TSQA models adopt architectural variations tailored to temporal reasoning:

  • Temporal Database QA (TDBench) (Kim et al., 4 Aug 2025): LLMs are prompted to paraphrase SQL outputs and justify answers; multi-hop reasoning is synthesized via temporal joins.
  • Signal-based Time Series QA:
    • ITFormer (Wang et al., 25 Jun 2025): Fuses time series encoder (with time token, segment, channel position encodings) and a frozen LLM using learnable instruct tokens and cross-modal attention (Instruct Time Attention), replacing designated language tokens with fused temporal representations.
    • Informer-LLM Hybrid (Fujimura et al., 30 Sep 2025): Time series → Informer encoder → MLP → <time-series embedding> token injected into a frozen LLM (e.g., Mistral-7B), which then autoregressively decodes the answer; a minimal sketch of this embedding-injection pattern follows this list.
    • Neuro-symbolic Pipelines (QuAnTS) (Divo et al., 7 Nov 2025): Partition the task into sequence-level perception (action recognition with xLSTM-Mixers) followed by symbolic, instruction-tuned LLM reasoning.
  • Temporal Graph Fusion (Su et al., 2023): Extracts temporal event graphs from question/context, encodes as XML-style tags (ERR) or with R-GCNs, and fuses into Transformer-based QA models.
  • Chain-of-Thought Correction Frameworks:
    • T3LLM (Su et al., 27 Dec 2025): Employs a three-agent system (worker, reviewer, student); the reviewer inspects generated stepwise reasoning chains, truncates erroneous steps, and inserts reflection comments, iteratively refining multi-hop temporal/numerical reasoning.
  • Continual Learning and Temporal Contrast:
    • CLTSQA Framework (Yang et al., 2024): Integrates temporal memory replay (pruning hardest/outdated samples, adding context-matched distractors) and temporal contrastive learning (triplet loss and paraphrased/contrasted questions) to retain prior knowledge and sharpen sensitivity to temporal shifts.
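
The embedding-injection pattern shared by ITFormer and the Informer-LLM hybrid can be sketched as below; module names, dimensions, and the plain GRU encoder are illustrative stand-ins (the cited systems use Informer/ITFormer encoders and cross-modal attention):

```python
import torch
import torch.nn as nn

class TSToLLMBridge(nn.Module):
    """Project a time series into a soft token for a frozen LLM.

    Illustrative stand-in: the cited papers use Informer/ITFormer encoders
    and cross-modal attention rather than this plain GRU + MLP.
    """
    def __init__(self, n_channels: int, llm_dim: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(n_channels, hidden, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(hidden, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, T, n_channels) -> one soft token (batch, 1, llm_dim)
        _, h = self.encoder(series)           # h: (1, batch, hidden)
        return self.proj(h[-1]).unsqueeze(1)

# Usage sketch: splice the soft token into the frozen LLM's input embeddings
# at a reserved placeholder position (llm_dim=4096 is illustrative).
bridge = TSToLLMBridge(n_channels=3, llm_dim=4096)
soft_token = bridge(torch.randn(2, 128, 3))   # (2, 1, 4096)
text_embeds = torch.randn(2, 20, 4096)        # from the LLM's embedding layer
llm_inputs = torch.cat(
    [text_embeds[:, :10], soft_token, text_embeds[:, 10:]], dim=1)
```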

4. Evaluation Metrics and Protocols

TSQA evaluation protocols are distinguished by their focus on the interplay of correctness and temporal rationale:

  • Factual TSQA:
    • Answer Accuracy (A): Fraction of exact matches for target entities or labels.
    • Time Accuracy (T): Fraction of cases where model-provided time references (start, end, interval) match valid intervals; for a question $q$ with gold time references $f(q)$, $\text{TimeAccuracy}(q) = \frac{|\{\, t \in f(q) \mid t \text{ correctly mentioned} \,\}|}{|f(q)|} \times 100\%$.
    • Answer–Time Accuracy (AT): Cases correct on both entity and temporal constraint; highlights errors undetected by A alone (Kim et al., 4 Aug 2025). A computation sketch of these three metrics follows this list.
  • Numerical/Signal TSQA:
    • Regression: RMSE, MAE, MSE, MAPE for forecasting/imputation.
    • Classification/Trend Analysis: Accuracy, (macro) F1, confusion matrix, per-class precision/recall.
    • Open-ended QA: ROUGE-L, BLEU, METEOR, LLMJudge (human/LLM evaluation), and EM.
  • Other Protocols:
    • Continual adaptation: Measures EM/F1 over temporal slices, monitoring both adaptation (on most recent data) and forgetting (on earlier slices).
    • Chain-of-thought error correction: Tracks resolution and propagation of errors across reasoning steps, evaluating correction loop convergence and impact on downstream tasks (Su et al., 27 Dec 2025).
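
As a concrete illustration, the factual metrics above (A, T, AT) might be computed as in the following sketch; field names and normalization are hypothetical and may differ from TDBench's official evaluation:

```python
def evaluate(predictions: list[dict], gold: list[dict]) -> dict:
    """Compute Answer (A), Time (T), and Answer-Time (AT) accuracy.

    Each item: {"answer": str, "times": set[str]}, where "times" holds the
    time references mentioned with the answer (illustrative format).
    """
    n = len(gold)
    a_hits = t_sum = at_hits = 0.0
    for p, g in zip(predictions, gold):
        a_ok = p["answer"].strip().lower() == g["answer"].strip().lower()
        # TimeAccuracy(q): fraction of gold time references correctly mentioned.
        t_frac = (len(p["times"] & g["times"]) / len(g["times"])
                  if g["times"] else 1.0)
        a_hits += a_ok
        t_sum += t_frac
        at_hits += a_ok and t_frac == 1.0  # correct on entity AND time
    return {"A": a_hits / n, "T": t_sum / n, "AT": at_hits / n}

print(evaluate(
    [{"answer": "Michel Temer", "times": {"2016", "2018"}}],
    [{"answer": "Michel Temer", "times": {"2016", "2018"}}],
))  # {'A': 1.0, 'T': 1.0, 'AT': 1.0}
```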

5. Empirical Findings and Error Analysis

Comprehensive empirical analyses reveal characteristic strengths and limitations of TSQA models:

  • Temporal hallucinations: Models frequently output correct entities but incorrect or hallucinated time references, causing substantial A–T accuracy gaps (e.g., GPT-4o: A ≈ 74%, T ≈ 49%, AT drop ≈25 points on Wikipedia QA) (Kim et al., 4 Aug 2025).
  • Trend and Causality: LLMs capture short-term trends and basic event-time lookup, but struggle with long-range dependencies, multi-hop event chains, and compositional questions ("Who led country X during event Y?") (Gruber et al., 2024, Chen et al., 21 Mar 2025).
  • Temporal adaptation: Continual learning frameworks employing memory replay and contrastive learning mitigate catastrophic forgetting on early time-slices while retaining adaptability to new time periods (Yang et al., 2024, Liška et al., 2022).
  • Signal-based domain bottlenecks: Direct serial tokenization of high-dimensional numerical input (e.g., QuAnTS, 72-dimensional motion) drives LLMs and TS-LLMs toward near-random performance, whereas neuro-symbolic pipelines (action perception + LLM) match or surpass human performance (Divo et al., 7 Nov 2025).
  • Pseudo-labeling: Noise-robust architectures trained on VLM pseudo-labels can outperform the original VLM teacher, provided sample sizes are sufficiently large to mitigate random label errors (Rolnick effect) (Fujimura et al., 30 Sep 2025).
  • Chain-of-thought correction: Multi-agent correction architectures such as T3LLM yield large improvements in accuracy and error recovery when reasoning steps can be directly verified against input sequences (Su et al., 27 Dec 2025); a schematic of such a correction loop is sketched below.
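
The reviewer-style correction loop described for T3LLM can be schematized as follows; `worker` and `reviewer` are hypothetical placeholders for the paper's LLM agents, and the real system uses three agents with richer reflection prompts:

```python
def correct_reasoning(question, series, worker, reviewer, max_rounds=3):
    """Iteratively truncate and regenerate a chain of thought.

    Hypothetical stand-ins for T3LLM's agents:
      worker(question, series, prefix) -> new reasoning steps (list of str)
      reviewer(step, series)           -> (ok, comment), checking one step
                                          against the input sequence
    """
    prefix: list[str] = []
    chain: list[str] = []
    for _ in range(max_rounds):
        chain = prefix + worker(question, series, prefix)
        for i in range(len(prefix), len(chain)):
            ok, comment = reviewer(chain[i], series)
            if not ok:
                # Truncate at the first erroneous step, insert a reflection
                # comment, and regenerate the remainder in the next round.
                prefix = chain[:i] + [f"[reflection] {comment}"]
                break
        else:
            return chain  # every new step passed review
    return chain  # best effort after max_rounds
```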

6. Applications, Implications, and Future Directions

TSQA methodologies have immediate applicability to numerous domains where time-varying factual or signal data is crucial:

  • Domain-specialized QA: Medical records, policy documents, industrial sensors, environmental regulatory data—all require precise temporal grounding for accurate QA.
  • Updatable benchmarks: Temporal database-based TSQA (TDBench) allows for automatic generation of new QA pairs as source databases are refreshed, supporting real-time evaluation of LLM knowledge (Kim et al., 4 Aug 2025).
  • Multimodal integration: Advanced cross-modal fusion models (e.g., ITFormer) enable holistic QA over text, signals, and structured time series, supporting interactive analytic and diagnostic applications in engineering and health (Wang et al., 25 Jun 2025, Kong et al., 26 Feb 2025).
  • Continual and streaming adaptation: Models and semi-parametric pipelines capable of dynamic knowledge ingestion and revisable memory indexing are essential for deployment in domains characterized by fast-evolving information (e.g., financial trading, news analytics, emergent medical scenarios) (Liška et al., 2022, Yang et al., 2024).
  • Synthetic evaluation and reasoning: Synthetic, contamination-free datasets (UnSeenTimeQA) expose the true temporal reasoning capabilities of LLMs, enabling fine-grained error analysis and targeted curriculum design (Uddin et al., 2024).

Future research is expected to further explore end-to-end differentiable symbolic fusion, scalable neuro-symbolic reasoning, retrieval-augmented QA over temporally-indexed memories, streaming and continual adaptation protocols, and advanced multimodal/interactive TSQA interfaces (Divo et al., 7 Nov 2025, Gruber et al., 2024, Chen et al., 21 Mar 2025, Su et al., 27 Dec 2025, Kong et al., 26 Feb 2025).
