TSAIA Benchmark: LLM Time Series Reasoning
- TSAIA Benchmark is a comprehensive testbed that assesses LLMs’ ability to perform multi-step reasoning on practical time series tasks, spanning forecasting, diagnostics, analytics, and decision-making.
- It utilizes real-world datasets such as PSML, ERA5, MIT-BIH, and Yahoo Finance along with structured task templates to validate performance against metrics like MAPE and F1-score.
- The benchmark highlights LLM limitations in chained reasoning and recommends hybrid approaches and domain-specific tuning to improve operational analysis workflows.
The TSAIA Benchmark is a comprehensive evaluation suite for systematically assessing the capacity of LLMs to perform multi-step, compositional reasoning and inference over real-world time series analysis tasks. Introduced in "When LLM Meets Time Series: Can LLMs Perform Multi-Step Time Series Reasoning and Inference" (Ye et al., 1 Sep 2025), the TSAIA Benchmark constitutes the first wide-coverage testbed dedicated to characterizing how and to what extent state-of-the-art LLMs function as reasoning assistants for practical time series workflows, encompassing both data-centric and symbolic-computational subtasks.
1. Scope and Task Catalog
TSAIA formalizes 33 task types clustered into four functional categories, reflecting the breadth of contemporary time series analytics demands encountered in engineering, operational monitoring, and financial applications:
- Predictive (Constraint-Aware Forecasting; 12 tasks): Encompasses single- and multi-series forecasting with operational constraints, including ramp-rate-limited and variability-limited horizon-wide predictions. Example: maximum load prediction for grid operations subject to ramp constraints, using real datasets such as PSML (electricity grid load with weather covariates).
- Diagnostic (Anomaly Detection & Causal Discovery; 8 tasks): Includes detection of anomalies given labeled or reference (control) samples (e.g., extreme weather detection, ECG anomaly classification), and identification of causal relationships with quantitative or qualitative priors. Data sources range from ERA5 climate reanalysis to MIT-BIH ECG repositories.
- Analytical (Financial Time-Series Reasoning; 11 tasks): Comprises scalar and vectorial computations standard to portfolio management: future price/volatility prediction (Yahoo Finance), risk-return metrics (Sharpe Ratio, Max Drawdown, Calmar, Sortino, Information Ratio), and trading-signal generation with backtested returns.
- Decision-Making (Interpretive Analytics; 2 tasks): Multiple-choice reasoning over situated financial summaries, demanding selection of optimal portfolios or comparative statements (e.g., Jensen’s alpha/beta), typically in tabular/textual format.
This multidomain structure ensures coverage of both univariate and multivariate time series, blending purely numerical/signal-processing tasks with symbolic inference; the financial risk-return metrics listed above are illustrated in the sketch below.
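To ground the analytical category, the following is a minimal sketch of two of the risk-return metrics named above, the Sharpe ratio and maximum drawdown, computed from a daily price series. The function names, the 252-trading-day annualization convention, and the synthetic data are illustrative assumptions, not taken from the benchmark's released code.

```python
import numpy as np

def sharpe_ratio(prices: np.ndarray, risk_free_rate: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio from a daily price series (illustrative convention)."""
    returns = np.diff(prices) / prices[:-1]            # simple daily returns
    excess = returns - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

def max_drawdown(prices: np.ndarray) -> float:
    """Maximum peak-to-trough decline, reported as a positive fraction."""
    running_peak = np.maximum.accumulate(prices)
    drawdowns = (running_peak - prices) / running_peak
    return drawdowns.max()

# Toy usage with synthetic prices (not benchmark data)
prices = np.cumprod(1 + np.random.default_rng(0).normal(5e-4, 0.01, 252)) * 100
print(sharpe_ratio(prices), max_drawdown(prices))
```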
2. Dataset Sources and Input/Output Specifications
The benchmark amalgamates real-world datasets, including:
- PSML (electricity grid load and weather covariates for operational forecasting),
- ERA5 (daily station climate series for anomaly detection),
- MIT-BIH PhysioNet (clinical ECG time series),
- LEAD 2022/Kaggle (building energy usage),
- Yahoo Finance (historical, high-frequency OHLC price data for S&P 500 constituents).
Tasks specify rigorous input-output mappings:
- Forecasting tasks: variable-length context windows, explicit covariate sets, and operational constraint parameters (e.g., ramp rate R_max).
- Diagnostic tasks: reference window(s), known anomaly rates, or domain priors.
- Analytical/financial: 60–252-day input windows, with performance evaluated on scalar or vector outputs (e.g., a price scalar, binary trading signals).
The task definition pipeline explicitly records the data type, context, and expected form of both input and output for all instances.
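As a concrete illustration of such a record, the sketch below defines a hypothetical task-instance container for a ramp-constrained forecasting task; the field names mirror the specification in this section but are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class TaskInstance:
    """Hypothetical task record mirroring the input/output specification above."""
    task_type: str                      # one of the 33 task archetypes
    data_source: str                    # e.g. "PSML", "ERA5", "Yahoo Finance"
    context: np.ndarray                 # variable-length input window
    covariates: Optional[np.ndarray]    # optional covariate matrix (e.g. weather)
    constraints: dict = field(default_factory=dict)  # e.g. {"ramp_rate_max": 50.0}
    expected_output: str = "vector"     # "scalar", "vector", "binary_mask", ...
    horizon: int = 24                   # forecast/evaluation horizon where applicable

# Example: a ramp-constrained grid-load forecasting instance with synthetic values
instance = TaskInstance(
    task_type="constrained_forecasting",
    data_source="PSML",
    context=np.random.default_rng(1).uniform(800, 1200, size=168),  # past 168 hours of load (MW)
    covariates=None,
    constraints={"ramp_rate_max": 50.0},  # MW/h, the R_max bound from the prose
    expected_output="vector",
    horizon=24,
)
```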
3. Dynamic Task and Question Generation Pipeline
A dynamic question generator underpins TSAIA, operationalizing new evaluation tasks through an extensible template mechanism. This template mechanism takes as arguments:
- Task archetype (from the 33-type library)
- Data source selection (CSV or pickle)
- Contextual parameterization (randomized or user-specified window sizes, covariate subsets, constraint bounds)
- Domain constraint/prior injection (e.g., enforcing that hourly increments stay below a ramp-rate bound R_max)
- Ground truth computation (direct value lookup or procedural evaluation via code execution/backtesting)
Sample templates such as “You are given past {N} hours of grid load and weather covariates. Predict the next {H}-hour maximum load, ensuring that hourly increments do not exceed ramp-rate {R} MW/h.” become instantiable task instances for model evaluation.
This infrastructure enables scalable expansion to new domains and facilitates reproducible, programmatically generated evaluation datasets.
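A minimal sketch of how such a template could be instantiated with standard Python string formatting is shown below; the generator function, parameter ranges, and random choices are assumptions for illustration, not the benchmark's actual generation code.

```python
import random

TEMPLATE = (
    "You are given past {N} hours of grid load and weather covariates. "
    "Predict the next {H}-hour maximum load, ensuring that hourly increments "
    "do not exceed ramp-rate {R} MW/h."
)

def generate_question(template: str, rng: random.Random) -> dict:
    """Instantiate a task template with randomized contextual parameters (illustrative)."""
    params = {
        "N": rng.choice([72, 168, 336]),         # context window length in hours
        "H": rng.choice([24, 48]),               # forecast horizon
        "R": round(rng.uniform(20.0, 80.0), 1),  # ramp-rate bound in MW/h
    }
    return {"question": template.format(**params), "params": params}

rng = random.Random(42)
print(generate_question(TEMPLATE, rng)["question"])
```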
4. Task-Specific Evaluation Protocols and Metrics
TSAIA enforces task-appropriate success criteria, adapting its scoring to the heterogeneity of the tasks:
| Task Type | Success Criteria | Principal Metric |
|---|---|---|
| Constrained Forecasting | Correct output shape ∧ all constraints met ∧ nontrivial (MAPE < 1) | Mean Absolute Percentage Error (MAPE) |
| Anomaly Detection | Binary mask of correct length ∧ nontrivial F1 | F1-score |
| Causal Inference | Correct matrix shape ∧ domain knowledge incorporated | Accuracy |
| Financial Scalar | Scalar output ∧ error within tolerance | Absolute Error |
| Trading/Backtest | Correct prediction length ∧ backtest return > 0 | Cumulative and Annualized Return, Maximum Drawdown (MDD) |
Evaluation is three-stage:
- Structural validation (format, shape)
- Constraint/prior check
- Explicit metric computation against the ground truth
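For a ramp-constrained forecasting task, the three stages could be composed roughly as in the sketch below; the dict return format follows no particular convention, the MAPE < 1 nontriviality threshold comes from the table above, and everything else is an illustrative assumption.

```python
import numpy as np

def evaluate_constrained_forecast(pred: np.ndarray, truth: np.ndarray,
                                  ramp_rate_max: float) -> dict:
    """Three-stage scoring sketch: structural validation, constraint check, metric computation."""
    # Stage 1: structural validation (correct shape, finite values)
    if pred.shape != truth.shape or not np.all(np.isfinite(pred)):
        return {"success": False, "failure": "structural"}

    # Stage 2: constraint / prior check (hourly increments within the ramp bound)
    if np.any(np.abs(np.diff(pred)) > ramp_rate_max):
        return {"success": False, "failure": "constraint_violation"}

    # Stage 3: explicit metric against the ground truth (nontrivial means MAPE < 1)
    mape = float(np.mean(np.abs((truth - pred) / truth)))
    if mape >= 1.0:
        return {"success": False, "failure": "subthreshold_metric", "mape": mape}
    return {"success": True, "failure": None, "mape": mape}
```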
Statistical significance between model performances can be assessed via paired bootstrap resampling, although not all reported results include significance tests.
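A sketch of paired bootstrap resampling over per-instance success indicators for two models follows; the resample count and the one-sided convention (fraction of resamples in which model A fails to outperform model B) are assumptions, not the paper's exact procedure.

```python
import numpy as np

def paired_bootstrap_pvalue(success_a: np.ndarray, success_b: np.ndarray,
                            n_resamples: int = 10_000, seed: int = 0) -> float:
    """One-sided p-value for 'model A's success rate exceeds model B's' on shared task instances."""
    rng = np.random.default_rng(seed)
    n = len(success_a)
    count = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample task instances with replacement (paired)
        if success_a[idx].mean() - success_b[idx].mean() <= 0:
            count += 1                     # resampled difference does not favor model A
    return count / n_resamples

# Toy usage with synthetic 0/1 success indicators (not benchmark results)
a = np.random.default_rng(1).integers(0, 2, size=200)
b = np.random.default_rng(2).integers(0, 2, size=200)
print(paired_bootstrap_pvalue(a, b))
```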
5. Baseline Models and Unified Evaluation Protocol
The benchmark systematically evaluates eight LLMs:
- Proprietary API models (GPT-4o, Claude-3.5 Sonnet, Gemini-2.0)
- Large open-source instruct and specialized models (Qwen2.5-Max, Llama-3.1 Instruct 70B, DeepSeek, Codestral, DeepSeek-R)
Prompting is performed with "CodeAct" system templates, which decouple code execution from textual chain-of-thought, using deterministic decoding (temperature 0).
Model outputs pass through the unified protocol, which flags failures as structural (invalid format), execution errors, constraint (or prior) violation, or sub-threshold metric quality. All models are scored on aggregate success rate per (task × model), with canonical metrics reported for each subdomain.
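The failure taxonomy could be represented as a simple enumeration attached to each scored output, as in the sketch below; the identifiers paraphrase the categories in the prose and are not the benchmark's own names.

```python
from enum import Enum, auto

class FailureMode(Enum):
    """Failure categories for outputs that do not count as a success (names are illustrative)."""
    STRUCTURAL = auto()            # invalid format or output shape
    EXECUTION_ERROR = auto()       # generated code raised an exception or produced no output
    CONSTRAINT_VIOLATION = auto()  # operational constraint or domain prior not satisfied
    SUBTHRESHOLD_METRIC = auto()   # valid output, but metric quality below the task threshold

def success_rate(outcomes: list) -> float:
    """Aggregate success rate over (task x model) outcomes, where None marks a success."""
    return sum(outcome is None for outcome in outcomes) / len(outcomes)

# Toy usage
outcomes = [None, FailureMode.EXECUTION_ERROR, None, FailureMode.SUBTHRESHOLD_METRIC]
print(success_rate(outcomes))  # 0.5
```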
6. Empirical Results and Performance Analysis
Representative findings include:
- Forecasting: Success rate drops by >30% when constraints are introduced (e.g., ramp/variability limits). Best MAPE values cluster near 0.03–0.10 for leading models (Qwen2.5-Max, DeepSeek-R, Codestral).
- Diagnostics: F1-scores on ECG anomaly detection remain markedly below best-in-field non-LLM baselines.
- Analytical tasks (financial): Simple metrics such as Sharpe ratio and annualized return are computed correctly by top models in 73–100% of cases, with open-weight instruct models competitive with proprietary APIs.
- Decision-making MCQs: Accuracies cluster near the random guess baseline (25–40%), indicating a strong limitation in multi-step integrative reasoning.
All models demonstrate pronounced brittleness when assembling compositional pipelines: tasks involving chained steps such as data retrieval, calibration, constraint satisfaction, and code generation sharply reduce success rates to below 40%.
7. Limitations and Recommendations for Future Work
TSAIA reveals several failure modes and research directions:
- LLM Limitations: Transformer-based LLMs typically fail in multi-step operational workflows, especially when faced with domain-specific constraints, chained sub-tasks, or numeric precision demands (Ye et al., 1 Sep 2025).
- Execution Errors: Execution failures (Python code, anomalous zero outputs, trivial answers) are common in complex settings.
- Hybridization Need: Pure LLMs are insufficient for many tasks; hybrid approaches integrating symbolic solvers, code interpreters, or domain-specific rule modules are recommended.
- Domain Adaptation: Fine-tuning on time-series-centric code corpora, or directly on real-world workflow traces, is proposed for new model development.
- Protocol Enhancements: Expansion to spatio-temporal, irregularly sampled, streaming, and interactive (user-guided) tasks is explicitly targeted for future benchmark releases.
- Statistical Rigor: Increased adoption of rigorous statistical testing and reporting of significance levels is necessary.
Overall, these findings suggest that while LLMs are capable at single-step time series analytics, robust deployment for real-world, compositional, and operational analysis demands significant advances in prompt engineering, hybrid architectures, and domain adaptation. The TSAIA Benchmark provides a reproducible foundation for such research and is publicly available via Hugging Face and GitHub (Ye et al., 1 Sep 2025).