InstructTime++: Generative Time Series Classification
- InstructTime++ is a framework that transforms time series classification by integrating continuous temporal signals, contextual metadata, and implicit features into a unified instruction-based pipeline.
- It segments input data into patches, applies convolutional encoding with vector quantization, and aligns the resulting discrete tokens with language model embeddings via an MLP, so that time series and text can be processed in a single input sequence.
- Leveraging generative self-supervised pre-training and supervised fine-tuning, it achieves superior cross-domain generalization and state-of-the-art performance on diverse benchmark datasets.
InstructTime++ is a framework for time series classification that reformulates canonical discriminative tasks into a multimodal, generative paradigm centered on open-ended, instruction-based interaction. It treats continuous temporal data, contextual meta-information, and task instructions as multimodal textual and numerical inputs, and leverages language models (LMs), augmented with implicit feature representations, to generate semantically rich label texts. By integrating discrete temporal tokenization, explicit and implicit feature extraction, alignment into the LM embedding space, and unified generative training, InstructTime++ achieves superior cross-domain generalization and state-of-the-art accuracy across benchmark time series datasets (Cheng et al., 21 Jan 2026).
1. Architectural Foundations
InstructTime++ processes each input time series by segmenting it into P non-overlapping patches. Each patch is encoded by a convolutional subnetwork (e.g., a Temporal Convolutional Network) to yield a latent vector. Vector-quantized (VQ) encoding then maps each latent vector to its nearest centroid in a learnable codebook, producing a discrete TS-Token sequence k_1, k_2, ..., k_P with one codebook index per patch.
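The patch-then-quantize step can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_encoder` is a hypothetical stand-in for the convolutional encoder, the codebook is fixed by hand rather than learned, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence


def split_into_patches(series: Sequence[float], patch_len: int) -> List[List[float]]:
    """Split a 1-D series into non-overlapping patches, dropping any trailing remainder."""
    n_patches = len(series) // patch_len
    return [list(series[i * patch_len:(i + 1) * patch_len]) for i in range(n_patches)]


def toy_encoder(patch: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for the convolutional encoder: returns (mean, range)."""
    mean = sum(patch) / len(patch)
    return [mean, max(patch) - min(patch)]


def quantize(latent: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the nearest codebook centroid (squared Euclidean distance)."""
    def sq_dist(centroid: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(latent, centroid))
    return min(range(len(codebook)), key=lambda j: sq_dist(codebook[j]))


def tokenize(series: Sequence[float], patch_len: int,
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map a series to one discrete TS-Token (codebook index) per patch."""
    return [quantize(toy_encoder(p), codebook)
            for p in split_into_patches(series, patch_len)]


# Hand-picked codebook with three centroids in the 2-D latent space (assumption).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]]
series = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.3]
tokens = tokenize(series, patch_len=2, codebook=codebook)
print(tokens)  # [0, 1, 2]
```

In a trained system the codebook and encoder would be learned jointly; here they are fixed so that the nearest-centroid lookup is easy to follow by hand.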
The input prompt concatenates:
- The instruction string (arbitrary, domain- or task-specific)
- Explicit side-information (e.g., context such as age, sensor location, converted to text)
- The TS-Token stream
- Implicit features (statistical summaries and visual-language captions, see below)
All prompt elements are projected into a shared LM embedding space: discrete TS-Tokens via a small MLP, and natural-language tokens via the LM’s own tokenizer and embedding matrix. The full input is fed to a transformer-based generative LM (e.g., GPT-2, with future variants targeting Llama-class models), which generates the label text autoregressively (Cheng et al., 21 Jan 2026, Cheng et al., 2024).
2. Cross-Modal Alignment and Representation
A key challenge for unified multimodal processing lies in the modality gap between quantized temporal features (TS-Tokens) and semantic language representations. InstructTime++ mitigates this through an MLP "alignment projector" that maps the TS-Token of each patch into the LM embedding space. All input embeddings (discrete tokens, contextual text, captions) are concatenated and fed jointly to the LM without additional cross-attention mechanisms: fusion across modalities is handled implicitly by the self-attention in the transformer blocks.
Domain-agnostic pre-training over multiple datasets further tightens alignment, ensuring that latent spaces encode transferable features and that prompt templates generalize across application areas (Cheng et al., 21 Jan 2026).
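As a rough illustration of the alignment step, the sketch below projects codebook vectors into a hypothetical LM embedding dimension with a two-layer MLP and concatenates the result with text embeddings. The layer sizes, the ReLU activation, the random initialization, and the class name `AlignmentProjector` are all assumptions; the source only states that a small MLP is used.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector, b: Vector) -> Vector:
    """Affine map: w @ x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class AlignmentProjector:
    """Two-layer MLP mapping a codebook vector to the LM embedding dimension."""

    def __init__(self, code_dim: int, hidden_dim: int, lm_dim: int, seed: int = 0):
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w1, self.b1 = init(hidden_dim, code_dim), [0.0] * hidden_dim
        self.w2, self.b2 = init(lm_dim, hidden_dim), [0.0] * lm_dim

    def __call__(self, code_vec: Vector) -> Vector:
        return matvec(self.w2, relu(matvec(self.w1, code_vec, self.b1)), self.b2)


def build_input_embeddings(text_embeds: List[Vector], token_ids: List[int],
                           codebook: Matrix, projector: AlignmentProjector) -> List[Vector]:
    """Concatenate text embeddings with projected TS-Token embeddings (no cross-attention)."""
    ts_embeds = [projector(codebook[k]) for k in token_ids]
    return list(text_embeds) + ts_embeds


codebook = [[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]]
projector = AlignmentProjector(code_dim=2, hidden_dim=8, lm_dim=4)
text_embeds = [[0.1, 0.2, 0.3, 0.4]] * 3          # placeholder text embeddings
inputs = build_input_embeddings(text_embeds, [0, 2, 1], codebook, projector)
print(len(inputs), len(inputs[0]))  # 6 4
```

Because every element ends up as a vector of the same width, the LM can consume the sequence directly and let self-attention mix the modalities.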
3. Generative Self-Supervised Training and Fine-Tuning
InstructTime++ employs a two-stage generative learning paradigm:
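- Self-supervised pre-training: The LM is exposed to prompts composed of multimodal elements across multiple domains and optimized with the next-token prediction objective, which in its standard form is
  $$\mathcal{L}_{\text{pre}} = -\sum_{t=1}^{T} \log p_\theta\left(s_t \mid s_{<t}\right),$$
  where $s_1, \dots, s_T$ is the token sequence of a prompt and $\theta$ denotes the model parameters. At this stage, the targets are simply the subsequent tokens of the prompt itself.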
- Supervised fine-tuning: After pre-training, the LM is fine-tuned on labeled target-domain tasks, with prompts paired to label texts. The loss is standard negative log-likelihood over output label sequences.
This process enhances domain transferability and enables robust instruction-based classification (Cheng et al., 21 Jan 2026, Cheng et al., 2024).
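The two stages differ mainly in which positions contribute to the loss. The sketch below is a simplified illustration under that reading: it computes a mean negative log-likelihood from per-step logits, with a binary mask selecting every position (pre-training) or only label positions (fine-tuning). The function names and the mask convention are assumptions, not the paper's code.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def masked_nll(logits_per_step: Sequence[Sequence[float]],
               target_ids: Sequence[int],
               loss_mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over the positions where loss_mask is 1."""
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_step, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 4, three prediction steps, uniform logits.
logits = [[0.0, 0.0, 0.0, 0.0]] * 3
targets = [1, 2, 3]

pretrain_loss = masked_nll(logits, targets, loss_mask=[1, 1, 1])  # all positions
finetune_loss = masked_nll(logits, targets, loss_mask=[0, 0, 1])  # label token only
print(round(pretrain_loss, 4), round(finetune_loss, 4))  # 1.3863 1.3863
```

With uniform logits both values equal log 4; in practice the two losses diverge because fine-tuning ignores the prompt positions.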
4. Implicit Feature Enhancement
To address the limited temporal inductive bias of generic LMs, InstructTime++ incorporates implicit feature enhancement, expanding the prompt with:
- Statistical feature extraction: Computes distributional (mean, variance, skewness, kurtosis), complexity (sample/approximate entropy), and long-term shape (trend slope, spectral period) statistics over the input series and renders them to text, e.g., “Mean=0.12, variance=1.73, sample-entropy=0.85, trend-slope=+0.002…”.
- Vision-language image captioning: Renders the input series as a 2D plot (e.g., a line plot), passes the image through a pretrained vision-language model, and inserts the generated caption (“The waveform shows a rising trend...”) into the prompt.
Both implicit summaries are injected as natural language in fixed fields (“Statistical features: ...”, “Visual description: ...”), seamlessly integrating with other prompt elements without dedicated heads or adapters (Cheng et al., 21 Jan 2026).
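A minimal sketch of the statistical branch, assuming population moments and an ordinary least-squares trend slope against the time index. It covers only a subset of the listed statistics (the entropy and spectral-period features are omitted), and the field names and number formatting are illustrative choices rather than the paper's exact template.

```python
from typing import Dict, Sequence


def statistical_features(x: Sequence[float]) -> Dict[str, float]:
    """Compute a few distributional and trend statistics of a 1-D series."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / n            # population variance
    std = var ** 0.5
    if std > 0:
        skew = sum(c ** 3 for c in centered) / (n * std ** 3)
        kurt = sum(c ** 4 for c in centered) / (n * var ** 2) - 3.0  # excess kurtosis
    else:
        skew = kurt = 0.0
    # Least-squares slope of x against the time index 0..n-1.
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * c for t, c in enumerate(centered)) / denom if denom else 0.0
    return {"mean": mean, "variance": var, "skewness": skew,
            "kurtosis": kurt, "trend_slope": slope}


def render_features(features: Dict[str, float]) -> str:
    """Render the statistics into the fixed natural-language prompt field."""
    body = ", ".join(f"{name}={value:.3f}" for name, value in features.items())
    return f"Statistical features: {body}"


feats = statistical_features([1.0, 2.0, 3.0, 4.0, 5.0])
stats_text = render_features(feats)
print(stats_text)
```

The rendered string can be placed directly into the prompt, which is why no dedicated head or adapter is needed for this branch.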
5. Model Integration and Inference
The full model input comprises the sequence:
[INSTRUCTION] [Contextual text] [Statistical features] [Visual description] [TS-Tokens: k_1 k_2 ... k_P] [<BET>] [Candidate labels] [<EET>]
Inference proceeds via standard autoregressive decoding, mapping output text to class labels (Cheng et al., 21 Jan 2026).
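The prompt layout and the final text-to-label step can be sketched as follows. This is an illustration under several assumptions: TS-Tokens are rendered as placeholder strings such as `<ts_3>` (in the described model they enter as projected embeddings rather than text), the `<BET>` and `<EET>` markers are copied literally from the template without interpreting them, and the substring-based label matching is a hypothetical heuristic, not the documented decoding rule.

```python
from typing import List, Optional, Sequence


def build_prompt(instruction: str, context: str, stats_text: str, caption: str,
                 token_ids: Sequence[int], candidate_labels: Sequence[str]) -> str:
    """Assemble the prompt fields in the order given by the template."""
    ts_field = "TS-Tokens: " + " ".join(f"<ts_{k}>" for k in token_ids)
    fields: List[str] = [
        instruction,
        context,
        f"Statistical features: {stats_text}",
        f"Visual description: {caption}",
        ts_field,
        "<BET>",
        ", ".join(candidate_labels),
        "<EET>",
    ]
    return " ".join(fields)


def map_output_to_label(generated: str, candidate_labels: Sequence[str]) -> Optional[str]:
    """Map generated text to a candidate label; longest labels are checked first
    so that a label contained in another (e.g. 'REM' in 'non-REM') does not win."""
    text = generated.strip().lower()
    for label in sorted(candidate_labels, key=len, reverse=True):
        if label.lower() in text:
            return label
    return None


labels = ["REM", "non-REM", "wake"]
prompt = build_prompt(
    instruction="Classify the sleep stage of this EEG segment.",
    context="Age: 34. Sensor location: Fpz-Cz.",
    stats_text="mean=0.120, variance=1.730",
    caption="The waveform shows a rising trend.",
    token_ids=[0, 1, 2],
    candidate_labels=labels,
)
print(prompt)
print(map_output_to_label("The stage is Non-REM.", labels))  # non-REM
```

Returning `None` for unmatched text makes it explicit when the generated output falls outside the candidate set, which a caller can then count as an error or retry.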
6. Empirical Performance
Validation across five benchmarks shows that InstructTime++ surpasses the best baseline on EEG (sleep), ECG (multilabel), bearing fault (FD), and animal vocalization (RWC), and remains competitive on HAR. Notable metrics include:
| Dataset | Acc (InstructTime++-Adapt) | Macro F1 (InstructTime++-Adapt) | Best Baseline Acc | Best Baseline Macro F1 |
|---|---|---|---|---|
| EEG | 0.8776 | 0.6735 | 0.8452 | 0.6240 |
| ECG | 0.4693 | 0.6489 | 0.4121 | 0.5547 |
| HAR | 0.9257 | 0.9284 | 0.9298* | 0.9307* |
| FD | 0.9934 | 0.9952 | 0.9901 | 0.9917 |
| RWC | 0.8048 | 0.8041 | 0.7803 | 0.7796 |
(*On HAR, the best baseline scores marginally higher than InstructTime++-Adapt on both metrics.)
Key contributions to performance include the integration of implicit features and multimodal alignment, yielding consistent improvements in accuracy and F1 across settings (Cheng et al., 21 Jan 2026).
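Accuracy and macro F1 can differ substantially on imbalanced data, as in the EEG row above. The sketch below shows how the two metrics are conventionally computed for single-label predictions; it is a generic illustration with made-up labels, not the evaluation code behind the table, and it does not cover the multilabel ECG setting.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up imbalanced example: the minority class is never predicted.
y_true = ["W"] * 8 + ["N1"] * 2
y_pred = ["W"] * 10
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(round(acc, 4), round(f1, 4))  # 0.8 0.4444
```

Because macro F1 weights every class equally, a model that ignores a rare class is penalized far more than accuracy alone would suggest.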
7. Contributions, Limitations, and Future Prospects
InstructTime++ establishes a comprehensive generative, instruction-driven approach to time series classification. Its major contributions are:
- Generative reformulation: Replaces direct one-hot mapping by multimodal generation conditioned on instructions, context, and implicit features.
- Explicit/implicit fusion: Jointly leverages quantized sequence tokens and derived statistical/visual features to guide LM reasoning.
- Unified prompt-centric interface: All information is composed as a single text-augmented prompt with language-model-compatible embeddings.
Limitations include the computational cost of large LM and vision-language captioner inference, rigidity of the fixed template prompt, and open questions on extending implicit patterns to richer representations (e.g., graph features, recurrence plots). Future work envisions more adaptive prompt structures, efficient adaptation methods (e.g., LoRA, adapters), and richer implicit feature extraction pipelines (Cheng et al., 21 Jan 2026).