
TS Instruct Training Dataset

Updated 20 November 2025
  • TS Instruct Training Dataset is a multimodal resource combining time-series data and contextual text instructions to enable LLMs to reason over temporal data in sectors like healthcare and finance.
  • It employs synthetic instruction–response pairs with preprocessing techniques such as normalization and tokenization to support tasks like forecasting, anomaly detection, and classification.
  • The dataset’s robust quality assurance and mixed tuning protocols enhance LLM performance while identifying opportunities for improvements in precision and real-world data diversity.

The TS Instruct Training Dataset is a multimodal instruction–response resource developed to enable and evaluate LLMs in temporal reasoning tasks involving both time-series data and natural language. It is constructed as part of the Chat-TS framework, representing a foundational step toward practical multi-modal LLMs in domains such as healthcare, finance, transportation, and energy. The dataset introduces a new approach to integrating real-world multivariate time-series and context-rich text instructions for LLM instruction tuning and benchmarking (Quinlan et al., 13 Mar 2025).

1. Purpose, Scope, and Conceptual Design

The TS Instruct Training Dataset is designed to overcome the scarcity of large, diverse resources combining time-series and contextual language prompts. Its main objectives are:

  • To instruction-tune LLMs for joint reasoning over numerical sequences and textual instructions.
  • To teach models to ingest raw or pre-tokenized time-series data and generate accurate, human-readable analyses, forecasts, or anomaly detections.
  • To preserve and assess the model's native NLP capabilities by blending multi-modal and standard instruction samples in the training mix.

The dataset is constructed to serve both as a training corpus and as a basis for robust evaluation in realistic multi-modal reasoning scenarios.

2. Dataset Composition and Statistical Properties

The collection comprises 18,412 multimodal samples. The domain distribution and relevant statistics are summarized below:

Domain          Count   Proportion (%)   Example Use Case
Healthcare      4,600   25               ECG/vital signs analysis
Finance         5,523   30               Equity price forecasting
Transportation  3,682   20               Traffic flow anomaly detection
Energy          4,607   25               Solar output prediction

Key sequence properties:

  • Sliding window lengths L \in [50, 200] timesteps.
  • Channel count M \in \{1, 3, 5\} (domain-dependent).
  • Each time-series window after tokenization yields T = M \cdot L + M - 1 tokens.
  • Instructions: mean 12 tokens, σ ≈ 5, range [5, 30].
  • Responses: mean 45 tokens, σ ≈ 20, range [10, 100].
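The token-count relation above (each of the M channels contributes L value tokens, plus M − 1 channel separators) can be checked with a short sketch; the example values are taken from the window-length and channel-count bounds listed above.

```python
def ts_token_count(L: int, M: int) -> int:
    """Tokens produced by one window: M channels of L value tokens
    plus M - 1 channel separators, i.e. T = M*L + M - 1."""
    return M * L + M - 1

# Bounds from the dataset statistics above.
assert ts_token_count(50, 1) == 50      # single channel, no separators
assert ts_token_count(200, 5) == 1004   # 5*200 value tokens + 4 separators
```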

3. Data Generation Workflow and Preprocessing

3.1 Data Sources

  • Time-series fragments are sampled from the LOTSA repository (public multivariate signals).
  • Labels for classification are drawn from the Time-Series Classification Archive.

3.2 Dialogue Synthesis

  • Synthetic instruction–response pairs are generated using GPT-4o-mini prompted with metadata, a user instruction (e.g., "Detect anomalies in this ECG trace."), and the expected assistant output (answer and rationale).

3.3 Preprocessing and Tokenization

  • Raw values x \in \mathbb{R} are clipped, z-score normalized, and quantized.
  • During tokenization, the [-s, s] interval is uniformly divided into K-1 bins. Token sequences are flattened channel-wise, with a separator between channels and indices offset to ensure disjoint vocabularies between text and time-series tokens.
  • Image renderings assist the generation process but are not supplied to the LLM.
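A minimal sketch of this preprocessing pipeline, assuming illustrative values for the clipping bound s, the bin count K, and the text-vocabulary offset (none of these constants are specified in the source):

```python
import numpy as np

TEXT_VOCAB = 32000   # illustrative: size of the text vocabulary (offset)
SEP = TEXT_VOCAB     # channel-separator token id
K = 1024             # illustrative: K-1 uniform quantization bins
S = 4.0              # illustrative: clipping bound s (in std units)

def tokenize_window(x: np.ndarray) -> np.ndarray:
    """Tokenize an (M, L) window: z-score normalize per channel, clip to
    [-S, S], quantize into K-1 uniform bins, then flatten channel-wise
    with a separator between channels, offsetting ids past the text
    vocabulary so the two token spaces stay disjoint."""
    z = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + 1e-8)
    z = np.clip(z, -S, S)
    bins = np.floor((z + S) / (2 * S) * (K - 2)).astype(int)  # ids 0 .. K-2
    ids = bins + TEXT_VOCAB + 1  # disjoint from both text tokens and SEP
    chunks = []
    for m, row in enumerate(ids):
        if m > 0:
            chunks.append([SEP])
        chunks.append(row.tolist())
    return np.concatenate(chunks)

M, L = 3, 100
tokens = tokenize_window(np.random.randn(M, L))
assert len(tokens) == M * L + M - 1  # matches T = M*L + M - 1
```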

4. Example Structure and Annotation Schema

Each JSON-formatted training instance contains:

{
  "metadata": {
    "domain": "finance",
    "window_length": L,
    "channels": M
  },
  "time_series_tokens": [ ... ], // integers in V_T
  "instruction": "Please forecast the next 10 steps.",
  "response": "Based on the trend, we predict the next ten values to be ..."
}

All user–assistant interactions are synthetic. Representative task types include:

  • Descriptive analysis (e.g., heart rate irregularity detection).
  • Forecasting (e.g., stock trend prediction).
  • Anomaly detection (e.g., outlier identification in traffic flow).
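One instance following the schema above can be assembled and sanity-checked as below; the field names come from the schema, while the validator itself is an illustrative addition, not part of the dataset tooling:

```python
import json

REQUIRED = {"metadata", "time_series_tokens", "instruction", "response"}

def validate_instance(raw: str) -> dict:
    """Parse one JSON training instance and check it against the schema,
    including the token-count relation T = M*L + M - 1."""
    sample = json.loads(raw)
    missing = REQUIRED - sample.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    meta = sample["metadata"]
    expected = meta["channels"] * meta["window_length"] + meta["channels"] - 1
    if len(sample["time_series_tokens"]) != expected:
        raise ValueError("token count does not match T = M*L + M - 1")
    return sample

example = json.dumps({
    "metadata": {"domain": "finance", "window_length": 2, "channels": 2},
    "time_series_tokens": [32001, 32002, 32000, 32003, 32004],  # T = 5
    "instruction": "Please forecast the next 10 steps.",
    "response": "Based on the trend, we predict the next ten values to be ...",
})
assert validate_instance(example)["metadata"]["domain"] == "finance"
```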

5. Quality Assurance and Validation

  • All 18,412 samples are semi-automatically validated. A random 10% subset underwent manual review for consistency, with a reported inter-annotator κ ≈ 0.82.
  • The TS-Instruct QA Gold Benchmark set is fully human-verified, ensuring label correctness and non-redundancy of distractor options via Mechanical Turk protocols.

6. Instruction Tuning Protocol and Loss Functions

Instruction tuning is performed on Llama 3.1-8B using a mixed dataset D = D_{\text{TS-Instruct}} \cup D_{\text{OpenOrca}}.

  • The token sequence y_{1:T} concatenates time-series, instruction, and response tokens.
  • The cross-entropy loss:

LCE(θ)=t=1T1logpθ(yt+1y1:t)\mathcal{L}_{CE}(\theta) = -\sum_{t=1}^{T-1} \log p_\theta(y_{t+1} \mid y_{1:t})

No explicit loss term is required for modality alignment due to unified tokenization.
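The next-token objective above can be written out directly; a numpy sketch under illustrative shapes (the logits would come from the decoder, and no modality-specific term is needed since all tokens share one sequence):

```python
import numpy as np

def next_token_ce(logits: np.ndarray, tokens: np.ndarray) -> float:
    """L_CE = -sum_{t=1}^{T-1} log p_theta(y_{t+1} | y_{1:t}).
    logits: (T, V) scores at each position; tokens: (T,) target ids."""
    # log-softmax over the vocabulary, numerically stabilized
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # the score at position t predicts token t+1
    return -log_p[np.arange(len(tokens) - 1), tokens[1:]].sum()

rng = np.random.default_rng(0)
T, V = 8, 16
loss = next_token_ce(rng.normal(size=(T, V)), rng.integers(0, V, size=T))
assert loss > 0  # negative log-probabilities sum to a positive loss
```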

  • Minibatches are sampled at a 1:5 ratio of TS-Instruct to OpenOrca samples to maintain natural language fluency while focusing on multi-modal reasoning capabilities.
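The 1:5 mixing ratio can be realized with a simple weighted sampler; a sketch in which the dataset contents, batch size, and seed are placeholders:

```python
import random

def mixed_batches(ts_instruct, openorca, batch_size=6, seed=0):
    """Yield minibatches mixing TS-Instruct and OpenOrca at a 1:5 ratio."""
    rng = random.Random(seed)
    n_ts = batch_size // 6            # 1 part TS-Instruct
    n_text = batch_size - n_ts        # 5 parts OpenOrca
    while True:
        yield rng.sample(ts_instruct, n_ts) + rng.sample(openorca, n_text)

ts = [f"ts_{i}" for i in range(100)]
text = [f"text_{i}" for i in range(100)]
batch = next(mixed_batches(ts, text))
assert len(batch) == 6
assert sum(s.startswith("ts_") for s in batch) == 1  # 1:5 mix per batch
```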

7. Scale, Limitations, and Prospective Extensions

Scale

  • 18k multimodal samples and 100k text-only prompts constitute a moderate corpus size compared to most pure-text tuning datasets.

Limitations

  • All instructions and responses are synthetic, limiting the diversity and authenticity relative to real user dialogues.
  • Quantization of time-series windows introduces a loss in absolute value fidelity, constraining tasks requiring high-precision numerical reasoning.
  • The approach exhibits degradation in zero-shot classification for previously unseen labels.

Future Directions

  • Incorporation of human-elicited dialogue, particularly in specialized fields.
  • Finer-grained (e.g., log-scale) tokenization to better preserve magnitude information.
  • Expansion to include additional modalities, such as text reports or images alongside time-series.
  • Integration of explicit auxiliary losses (e.g., MSE for numeric prediction).

This resource represents a critical step toward large-scale, multi-modal instruction tuning for LLMs in temporal reasoning, providing a reproducible and extensible platform for downstream research in multi-domain, time-dependent decision support (Quinlan et al., 13 Mar 2025).
