
TS-Instruct QA Gold Benchmark

Updated 4 December 2025
  • TS-Instruct QA Gold Benchmark is a multimodal evaluation framework integrating time-series analysis and natural language processing to assess LLM reasoning.
  • It features a balanced multiple-choice setup with human-verified questions covering trend detection, anomaly identification, causal inference, and quantitative calculations.
  • Empirical results highlight superior performance of GPT-based models, with Chat-TS variants (e.g., PreOrcaTS) substantially improving over their open-source base models, while also exposing limitations in precise quantitative forecasting.

The TS-Instruct QA Gold Benchmark is a rigorously curated evaluation set designed to measure a large language model's (LLM's) proficiency in multimodal reasoning over real-world time-series data in conjunction with natural-language questions. Developed within the Chat-TS framework, its primary purpose is to stress-test and compare state-of-the-art neural models in analytic domains requiring the integration of time-series analysis and textual comprehension, thus filling a gap in benchmarks for multimodal LLMs (Quinlan et al., 13 Mar 2025).

1. Design Objectives and Intended Scope

The benchmark targets comprehensive evaluation of multimodal reasoning whereby a model must jointly process, interpret, and draw inferences from time-series data and textual context. Main goals include:

  • Stress-testing LLMs on trend analysis, anomaly detection, causal inference, and quantitative calculation given natural language queries referencing complex time-series.
  • Establishing a gold-standard reference set with unambiguous correct answers and a uniform answer balance, supporting fair and robust comparisons across models.

The tasks cover real-world settings such as healthcare, finance, transportation, and energy, where time-series analysis is performed alongside contextual language inputs.

2. Dataset Composition and Construction

Each TS-Instruct QA Gold entry integrates two key modalities:

  • Time-Series Signals: Examples are drawn primarily from the LOTSA forecasting repository and the Time-Series Classification Archive. Each series is provided both as a 2D plot (PNG) for question generation and as a discrete token sequence for model input. Metadata such as the sequence length $L$ and channel count $M$ is always included in the preamble.
  • Natural-Language Context: Each example contains a textual description of the time-series (and its metadata), an English-language question prompt, four candidate answer choices (A–D), and a detailed rationale for the correct choice.
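As a concrete illustration, a single entry can be thought of as bundling the tokenized series, its metadata, and the question package. The sketch below is hypothetical: the field names and values are chosen only to mirror the description above, not taken from the released dataset.

```python
# Hypothetical schema for one TS-Instruct QA Gold entry (field names are
# illustrative, not the dataset's actual keys).
example_entry = {
    "series": [[0.12, 0.15, 0.11, 0.18],      # M channels x L timesteps
               [1.02, 0.98, 1.05, 1.10]],
    "metadata": {"length": 4, "channels": 2, "domain": "energy"},
    "description": "Two-channel series of hourly load and temperature readings.",
    "question": "Does channel 1 exhibit an overall upward trend?",
    "choices": {"A": "Yes, strictly increasing",
                "B": "No, strictly decreasing",
                "C": "Yes, upward with minor dips",
                "D": "No, it is flat"},
    "answer": "C",
    "rationale": "Channel 1 rises from 0.12 to 0.18 overall, with a brief dip at the third timestep.",
}
```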

The benchmark comprises 1,056 multiple-choice questions. Human verification by independent annotators ensures that each question has a unique correct answer and that no distractor is accidentally true. This process yields a tightly controlled answer distribution, with each of the four choices appearing as the correct answer exactly 264 times across the set. Questions rejected during review or failing correctness checks were pruned. This human-in-the-loop curation promotes high data fidelity and minimizes bias (Quinlan et al., 13 Mar 2025).

3. Question Types and Underlying Reasoning

Questions fall into four major categories designed to capture critical facets of time-series reasoning in context:

  • Trend & Slope Detection: Queries about monotonic patterns or slope, e.g., “Does the series exhibit an overall upward trend in channel 1?”
  • Anomaly & Outlier Diagnosis: Identification and localization of deviations/spikes, e.g., “Which interval contains a clear spike inconsistent with the overall pattern?”
  • Causal/Comparative Inference: Reasoning about inter-channel relationships, e.g., “If channel A rises sharply, what is the immediate effect on channel B?”
  • Quantitative Calculation: Discrete numeric estimation, e.g., “Estimate the average value of channel 1 over the last 10 timesteps.”

Formally, with input time-series $X$ (tokenized) and question $Q$, models are tasked with selecting the label $a^*$ among the choices $\{A, B, C, D\}$ that maximizes the output probability:

$$a^* = \arg\max_{a \in \{A,B,C,D\}} p_\theta(a \mid X, Q)$$

where $p_\theta$ is the model's predicted distribution. Task correctness is scored with the standard zero-one loss:

$$L_{\mathrm{QA}}(\theta) = 1 - \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(\hat{a}_i = a_i^{\mathrm{true}}\right)$$
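A minimal sketch of this selection rule and loss, assuming access to a per-choice scoring function (here a placeholder `score_choice` standing in for the model's log-probability of each option), might look like this:

```python
import numpy as np

def score_choice(series_tokens, question, choice):
    """Placeholder for p_theta(a | X, Q): in practice this would query the
    model for the probability of answer label `choice` given X and Q."""
    rng = np.random.default_rng(hash((question, choice)) % (2**32))
    return rng.random()  # stand-in score, for illustration only

def predict(series_tokens, question, choices=("A", "B", "C", "D")):
    # a* = argmax_a p_theta(a | X, Q)
    scores = {a: score_choice(series_tokens, question, a) for a in choices}
    return max(scores, key=scores.get)

def zero_one_loss(predictions, gold):
    # L_QA = 1 - (1/N) * sum(I(pred_i == gold_i))
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 1.0 - correct / len(gold)
```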

4. Annotation, Quality Assurance, and Balance

All questions are generated using GPT-4o-mini with a standardized system prompt that accepts the plot and metadata, and outputs the question, choices, and explanation. Human annotators independently review the generation, enforcing two filters:

  1. The correct answer must follow unequivocally from the provided time-series.
  2. No other option should be true. Failing either criterion results in question rejection.

After filtering and downsampling, a strictly balanced label distribution is enforced so that each option (A–D) is the correct answer in exactly 264 questions. This careful construction ensures minimal dataset bias and supports prompt-robust evaluations (Quinlan et al., 13 Mar 2025).
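One way such balance can be enforced, sketched below under the assumption that verified questions are stored as simple records with a `gold` label field, is to downsample each label group to the size of the smallest group (264 in the released set):

```python
import random
from collections import defaultdict

def balance_by_gold_label(questions, seed=0):
    """Downsample so every gold label (A-D) is the correct answer equally often.
    `questions` is assumed to be a list of dicts with a 'gold' key."""
    groups = defaultdict(list)
    for q in questions:
        groups[q["gold"]].append(q)
    n = min(len(items) for items in groups.values())  # e.g. 264 per label
    rng = random.Random(seed)
    balanced = []
    for label, items in groups.items():
        balanced.extend(rng.sample(items, n))
    rng.shuffle(balanced)
    return balanced
```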

5. Evaluation Protocols and Model Benchmarking

Performance on the TS-Instruct QA Gold set is quantified via accuracy (the percentage of questions answered correctly), given by

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(\hat{a}_i = a_i^{\mathrm{true}}\right)$$

No F1 score is reported, since each question has exactly one correct answer.

To assess model robustness, several open- and closed-source LLMs are evaluated in zero-shot mode over all examples, repeating inference five times under different system-prompt seeds and reporting the mean accuracy across runs. Baselines include Gemma-2 (2B, 9B), LLama 3.1-8B, Mistral-8B, Phi-3-medium-4k, GPT-4o-mini, GPT-4o, alongside Chat-TS variants OrcaTS and PreOrcaTS.
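A sketch of this protocol, assuming a hypothetical `evaluate_model` callable that runs one zero-shot pass over the dataset and returns per-question predictions, could average accuracy across the five prompt-seeded runs as follows:

```python
def accuracy(predictions, gold):
    # Accuracy = (1/N) * sum(I(pred_i == gold_i))
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def mean_accuracy_over_runs(evaluate_model, dataset, n_runs=5):
    """Run zero-shot inference n_runs times with different system-prompt
    seeds and report the mean accuracy, mirroring the protocol above.
    `evaluate_model(dataset, seed)` is a hypothetical callable that returns
    one predicted label per question."""
    gold = [q["gold"] for q in dataset]
    accs = [accuracy(evaluate_model(dataset, seed=s), gold) for s in range(n_runs)]
    return sum(accs) / len(accs)
```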

A summary of quantitative results:

| Model            | Accuracy (%) |
|------------------|--------------|
| LLama 3.1-8B     | 54.22        |
| Mistral-8B       | 58.72        |
| Phi-3-medium-4k  | 64.64        |
| PreOrcaTS        | 67.22        |
| GPT-4o-mini      | 74.55        |
| GPT-4o           | 78.50        |

The PreOrcaTS variant of Chat-TS achieves a notable improvement (~13 points over vanilla LLama 3.1-8B) in multimodal reasoning accuracy, with GPT-based models leading performance (Quinlan et al., 13 Mar 2025).

6. Empirical Insights and Observed Limitations

Key strengths of the benchmark include:

  • High-fidelity, human-verified problems ensuring answer uniqueness.
  • Balanced answer distribution and repeated-prompt testing, minimizing systematic and prompt-specific bias.
  • Diverse categorization of reasoning types, allowing granular assessment of multimodal model competence.

Recognized limitations:

  • Restriction to multiple-choice format, excluding open-ended generative and numeric prediction tasks.
  • Lack of direct assessment for forecasting precision; models often fail at future value estimation.
  • Possible distortions in quantitative answers due to normalization of series values during tokenization; a minimal illustration follows below.

Beyond the benchmark itself, Chat-TS variants such as PreOrcaTS preserve core NLP benchmark performance within ±2% of the baseline LLama 3.1-8B on MMLU-Pro, Big Bench Hard, and GPQA, indicating that extending LLMs with time-series capabilities does not penalize standard NLP metrics. PreOrcaTS also demonstrates higher robustness to prompt variation.
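The normalization caveat above can be made concrete with a small, purely illustrative sketch: once a channel is z-normalized for tokenization, the absolute statistics that quantitative questions ask about (e.g., the average value in original units) are no longer visible to the model.

```python
import numpy as np

# Hypothetical channel with readings around 50 in original units.
series = np.array([48.0, 51.0, 49.5, 52.0, 50.5])

# Typical tokenization pipelines z-normalize each series before discretizing.
normalized = (series - series.mean()) / series.std()

print(f"true average (original units): {series.mean():.2f}")     # ~50.20
print(f"average after normalization:   {normalized.mean():.2f}")  # 0.00
```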

Common failure modes include weak zero-shot inference on unseen classification categories and error-prone quantitative forecasting; a plausible implication is that models capture trends but struggle with precise value estimation. Human judges rate Chat-TS explanations as more helpful and accurate than plain LLM outputs, a salient property for scientific and practical deployment.

7. Research Significance and Future Directions

The TS-Instruct QA Gold Benchmark establishes a standardized pipeline for transparent and reproducible evaluation of multimodal LLM reasoning over time-series and natural language. It drives research toward developing models that better integrate trend detection, anomaly identification, causal inference, and quantitative analysis within naturalistic, language-grounded tasks.

Future work should extend evaluation to generative forecasting, open-ended numeric queries, and direct temporal extrapolation. Incorporating raw, denormalized series may enhance precision in quantitative calculation tasks. Expansion beyond multiple-choice paradigms and towards application-specific metric-driven scenarios will likely further advance the state of multimodal intelligence.
