InstructTime++: Generative Time Series Classification

Updated 28 January 2026

InstructTime++ is a framework that transforms time series classification by integrating continuous temporal signals, contextual metadata, and implicit features into a unified instruction-based pipeline.
It segments input data into patches, applies convolutional encoding with vector quantization, and aligns the resulting discrete tokens with language model embeddings via an MLP, so that temporal and textual inputs can be processed together.
Leveraging generative self-supervised pre-training and supervised fine-tuning, it achieves superior cross-domain generalization and state-of-the-art performance on diverse benchmark datasets.

InstructTime++ is a framework for time series classification that recasts the usual discriminative task as multimodal generation driven by open-ended instructions. It treats continuous temporal data, contextual meta-information, and task instructions as a single multimodal input of text and numerical signals, and uses a language model (LM), augmented with implicit feature representations, to generate label text. By combining discrete temporal tokenization, explicit and implicit feature extraction, alignment into the LM embedding space, and unified generative training, InstructTime++ reports stronger cross-domain generalization and state-of-the-art accuracy on benchmark time series datasets (Cheng et al., 21 Jan 2026).

1. Architectural Foundations

InstructTime++ processes each input time series $X \in \mathbb{R}^{L \times H}$ by segmenting it into $P$ non-overlapping patches $(s_1, \dots, s_P)$. Each patch $s_p$ is encoded by a convolutional subnetwork (e.g., a Temporal Convolutional Network) to yield a latent vector $z_p$. Vector-quantized (VQ) encoding maps each $z_p$ to its nearest centroid $e_{k^\star(p)}$ in a learnable codebook, producing a discrete TS-Token sequence $\hat Z = [e_{k^\star(1)}, \dots, e_{k^\star(P)}]$.
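
The patching and quantization steps can be illustrated with a minimal sketch. It assumes $L$ is the number of time steps and $H$ the number of channels, and it replaces the convolutional encoder with a per-channel mean purely as a stand-in; the function names, the toy series, and the two-entry codebook are all hypothetical and are not taken from the paper.

```python
import math

def split_into_patches(series, patch_len):
    """Split a series (list of L rows, each a list of H channel values)
    into non-overlapping patches; a trailing remainder is dropped."""
    n_patches = len(series) // patch_len
    return [series[p * patch_len:(p + 1) * patch_len] for p in range(n_patches)]

def encode_patch(patch):
    """Stand-in for the convolutional encoder: per-channel mean of the patch."""
    n_channels = len(patch[0])
    return [sum(row[c] for row in patch) / len(patch) for c in range(n_channels)]

def quantize(z, codebook):
    """Return the index k* of the codebook entry nearest to z (Euclidean)."""
    return min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))

# Toy input: L = 6 time steps, H = 2 channels, patch length 2 -> P = 3 patches.
X = [[0.0, 0.1], [0.2, 0.1], [1.0, 0.9], [1.2, 1.1], [0.1, 0.0], [-0.1, 0.2]]
codebook = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical centroids

patches = split_into_patches(X, patch_len=2)
token_ids = [quantize(encode_patch(p), codebook) for p in patches]
print(token_ids)  # one codebook index per patch
```

In a trained model the encoder and the codebook would both be learned; only the nearest-centroid lookup is meant to carry over from this sketch.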

The input prompt concatenates:

The instruction string (arbitrary, domain- or task-specific)
Explicit side-information (e.g., context such as age, sensor location, converted to text)
The TS-Token stream
Implicit features (statistical summaries and visual-language captions, see below)

All prompt elements are projected into a shared LM embedding space: discrete TS-Tokens via a small MLP, natural language tokens via the LM’s own tokenizer and embedding matrix. The full input is fed to a transformer-based generative LM (e.g., GPT-2, with future variants targeting Llama-class models), which generates the label text autoregressively (Cheng et al., 21 Jan 2026, Cheng et al., 2024).

2. Modality Alignment

A key challenge for unified multimodal processing is the modality gap between quantized temporal features (TS-Tokens) and semantic language representations. InstructTime++ mitigates this with an MLP "alignment projector" that maps each quantized patch embedding into the LM embedding space, $h_p^{(X)} = W_X\, e_{k^\star(p)} + b_X$. All input embeddings (discrete tokens, contextual text, captions) are concatenated and fed jointly to the LM with no additional cross-attention mechanism; fusion across modalities is handled implicitly by the transformer's self-attention.
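
As a rough illustration of the projection and concatenation described here, the sketch below applies a single affine layer to each selected codebook entry and appends the results to a list of text embeddings. The dimensions, weights, and placeholder text embeddings are made up for the example, and a deeper MLP would stack several such layers with nonlinearities in between.

```python
def project(e, W, b):
    """Affine map h = W e + b, with W given as a list of rows."""
    return [sum(w_ij * e_j for w_ij, e_j in zip(row, e)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical sizes: codebook dimension 2, LM embedding dimension 3.
W_X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_X = [0.0, 0.0, 0.1]

codebook = [[0.0, 0.0], [1.0, 1.0]]
token_ids = [0, 1, 0]  # TS-Token indices for P = 3 patches

# One LM-space embedding per patch.
ts_embeddings = [project(codebook[k], W_X, b_X) for k in token_ids]

# Placeholder for text tokens already embedded by the LM's own embedding matrix.
text_embeddings = [[0.2, 0.3, 0.4], [0.1, 0.0, 0.5]]

# Joint input: plain concatenation along the sequence axis, no cross-attention.
lm_input = text_embeddings + ts_embeddings
print(len(lm_input), len(lm_input[0]))
```

The point of the design is that, once every element lives in the same embedding space, the LM needs no modality-specific machinery: it sees one sequence of vectors.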

Domain-agnostic pre-training over multiple datasets further tightens alignment, ensuring that latent spaces encode transferable features and that prompt templates generalize across application areas (Cheng et al., 21 Jan 2026).

3. Generative Self-Supervised Training and Fine-Tuning

InstructTime++ employs a two-stage generative learning paradigm:

Self-supervised pre-training: The LM is exposed to prompts composed of multimodal elements across multiple domains, optimized via the next-token prediction objective:

$\mathcal{L}_{\rm gen} = -\sum_{t=1}^{T} \log p\bigl(w_t\mid w_{<t},\,h^{(X)},\,\text{context},\,\text{instructions}\bigr)$

At this stage, target tokens $w_t$ are simply subsequent prompt elements.

Supervised fine-tuning: After pre-training, the LM is fine-tuned on labeled target-domain tasks, with prompts paired to label texts. The loss is standard negative log-likelihood over output label sequences.

This process enhances domain transferability and enables robust instruction-based classification (Cheng et al., 21 Jan 2026, Cheng et al., 2024).
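
The next-token objective used in both stages can be made concrete with a small numeric sketch. It assumes the LM has already produced one logit vector per target position; the three-word vocabulary and the logit values are invented for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def generative_nll(step_logits, target_ids):
    """Sum over t of -log p(w_t | ...), given one logit vector per step."""
    return -sum(math.log(softmax(logits)[t])
                for logits, t in zip(step_logits, target_ids))

# Hypothetical 3-word vocabulary and a 2-token target sequence.
step_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
target_ids = [0, 2]

loss = generative_nll(step_logits, target_ids)
print(round(loss, 4))
```

Under this reading, the two stages differ only in what the targets are: subsequent prompt elements during pre-training, and label-text tokens during fine-tuning.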

4. Implicit Feature Enhancement

To address the limited temporal inductive bias of generic LMs, InstructTime++ incorporates implicit feature enhancement, expanding the prompt with:

Statistical feature extraction: Computes distributional (mean, variance, skewness, kurtosis), complexity (sample/approximate entropy), and long-term shape (trend slope, spectral period) statistics over the input $X$, rendered to text, e.g., “Mean=0.12, variance=1.73, sample-entropy=0.85, trend-slope=+0.002…”.
Vision-language image captioning: Renders $X$ as a 2D plot $V$ (e.g., a line plot), passes $V$ through a pretrained vision-language model, and inserts the generated caption (“The waveform shows a rising trend...”) into the prompt.

Both implicit summaries are injected as natural language in fixed fields (“Statistical features: ...”, “Visual description: ...”), seamlessly integrating with other prompt elements without dedicated heads or adapters (Cheng et al., 21 Jan 2026).
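
A minimal sketch of the statistical branch is given below for a univariate series. It covers only the moment-based statistics and a least-squares trend slope; the entropy and spectral-period features are omitted, and the exact formulas and output formatting used by the authors are not specified in the source, so the choices here (population variance, non-excess kurtosis, two-decimal rendering) are assumptions.

```python
def describe(x):
    """Render a few summary statistics of a univariate series as text."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in x) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in x) / n   # fourth central moment
    # Standardised moments; defined as 0 for a constant series.
    skew = m3 / var ** 1.5 if var > 0 else 0.0
    kurt = m4 / var ** 2 if var > 0 else 0.0
    # Least-squares slope of x against the time index 0..n-1 (needs n >= 2).
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - mean) for t, v in enumerate(x)) / denom
    return (f"Statistical features: mean={mean:.2f}, variance={var:.2f}, "
            f"skewness={skew:.2f}, kurtosis={kurt:.2f}, trend-slope={slope:+.3f}")

text = describe([0.0, 1.0, 2.0, 3.0, 4.0])
print(text)
```

Because the result is an ordinary string, it can be dropped into the "Statistical features" field of the prompt and tokenized like any other text.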

5. Model Integration and Inference

The full model input comprises the sequence:

[INSTRUCTION] [Contextual text] [Statistical features] [Visual description] [TS-Tokens: k_1 k_2 ... k_P] [<BET>] [Candidate labels] [<EET>]

The transformer attends jointly to these inputs, enabling the LM to reason across modalities for label generation. No parameterized cross-modal fusion modules are required, as the transformer’s standard attention suffices for alignment.
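
The template can be assembled with plain string handling, as in the sketch below. The function name, separators, and example field values are hypothetical; `<BET>` and `<EET>` are copied from the template as literal markers. In the actual model the TS-Tokens enter as projected embeddings rather than as text, so the index string here is only a readable placeholder.

```python
def build_prompt(instruction, context, stats, caption, token_ids, labels):
    """Join the prompt fields in template order, separated by single spaces."""
    fields = [
        instruction,
        context,
        f"Statistical features: {stats}",
        f"Visual description: {caption}",
        "TS-Tokens: " + " ".join(str(k) for k in token_ids),
        "<BET>",
        ", ".join(labels),
        "<EET>",
    ]
    return " ".join(fields)

prompt = build_prompt(
    instruction="Classify the sleep stage of this EEG segment.",
    context="Subject age: 34.",
    stats="mean=0.12, variance=1.73",
    caption="The waveform shows a rising trend.",
    token_ids=[0, 1, 0],
    labels=["Wake", "N1", "N2", "REM"],
)
print(prompt)
```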

Inference proceeds via standard autoregressive decoding, mapping output text to class labels (Cheng et al., 21 Jan 2026).
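
How the generated text is mapped back to class labels is not detailed in the source. One simple possibility, shown here only as an assumption, is to search the output for each candidate label as a whole word, which also handles multilabel outputs.

```python
import re

def parse_labels(generated, candidates):
    """Return the candidates that occur as whole words in the generated text,
    matched case-insensitively and listed in candidate order."""
    return [c for c in candidates
            if re.search(r"\b" + re.escape(c) + r"\b", generated, re.IGNORECASE)]

candidates = ["Wake", "N1", "N2", "REM"]
single = parse_labels("The sleep stage is REM.", candidates)
multi = parse_labels("n1, n2", candidates)
print(single, multi)
```

An empty result signals that the model produced text outside the candidate set, which a caller would need to handle explicitly.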

6. Empirical Performance

Reported results show InstructTime++ surpassing the best baseline on the EEG (sleep), ECG (multilabel), bearing fault (FD), and animal vocalization (RWC) benchmarks, and coming within half a percentage point of it on HAR. Notable metrics include:

Dataset	Acc (InstructTime++-Adapt)	Macro F1 (InstructTime++-Adapt)	Best Baseline Acc	Best Baseline Macro F1
EEG	0.8776	0.6735	0.8452	0.6240
ECG	0.4693	0.6489	0.4121	0.5547
HAR	0.9257	0.9284	0.9298*	0.9307*
FD	0.9934	0.9952	0.9901	0.9917
RWC	0.8048	0.8041	0.7803	0.7796

(*On HAR, the best baseline slightly exceeds InstructTime++-Adapt on both accuracy and macro F1.)
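
The margins over the best baseline can be read off the table with a few lines of arithmetic. The numbers below are copied from the table; the variable names are arbitrary.

```python
# (accuracy, macro F1, best baseline accuracy, best baseline macro F1)
results = {
    "EEG": (0.8776, 0.6735, 0.8452, 0.6240),
    "ECG": (0.4693, 0.6489, 0.4121, 0.5547),
    "HAR": (0.9257, 0.9284, 0.9298, 0.9307),
    "FD":  (0.9934, 0.9952, 0.9901, 0.9917),
    "RWC": (0.8048, 0.8041, 0.7803, 0.7796),
}

# Positive values mean InstructTime++-Adapt is ahead of the best baseline.
deltas = {name: (acc - b_acc, f1 - b_f1)
          for name, (acc, f1, b_acc, b_f1) in results.items()}

for name, (d_acc, d_f1) in deltas.items():
    print(f"{name}: accuracy {d_acc:+.4f}, macro F1 {d_f1:+.4f}")
```

The largest gains are on ECG, while HAR is the only dataset where both differences are negative.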

Key contributions to performance include the integration of implicit features and multimodal alignment, yielding consistent improvements in accuracy and F1 across settings (Cheng et al., 21 Jan 2026).

7. Contributions, Limitations, and Future Prospects

InstructTime++ establishes a comprehensive generative, instruction-driven approach to time series classification. Its major contributions are:

Generative reformulation: Replaces direct one-hot label prediction with multimodal generation conditioned on instructions, context, and implicit features.
Explicit/implicit fusion: Jointly leverages quantized sequence tokens and derived statistical/visual features to guide LM reasoning.
Unified prompt-centric interface: All information is composed as a single text-augmented prompt with language-model-compatible embeddings.

Limitations include the computational cost of large LM and vision-language captioner inference, rigidity of the fixed template prompt, and open questions on extending implicit patterns to richer representations (e.g., graph features, recurrence plots). Future work envisions more adaptive prompt structures, efficient adaptation methods (e.g., LoRA, adapters), and richer implicit feature extraction pipelines (Cheng et al., 21 Jan 2026).

References

Cheng et al. (2026). InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement.

Cheng et al. (2024). Advancing Time Series Classification with Multimodal Language Modeling.