TS-GPT: A Time Series Transformer

Updated 3 October 2025
  • TS-GPT is a generative pre-trained transformer model that employs innovations representation theory to enable causal and probabilistic time series forecasting.
  • The architecture uses a causal autoencoder with masked attention and adversarial regularization to reliably quantify uncertainty and generate future samples.
  • Benchmarked on real-time locational marginal price forecasting, TS-GPT outperforms traditional models in CRPS and other probabilistic metrics for engineering applications.

Time Series GPT (TS-GPT) denotes a class of generative pre-trained transformer architectures designed for probabilistic time series modeling in domains governed predominantly by physical laws and requiring strict causal structure. Unlike conventional LLMs, which are optimized for natural language, TS-GPT models are grounded in innovations representation theory and embed mechanisms suitable for real-time operational forecasting and control, particularly in engineering settings. The model is constructed to produce future time series samples from conditional probability distributions given observed history, enabling reliable uncertainty quantification and robust decision support in complex dynamical environments.

1. Theoretical Foundation: Innovations Representation

The central theoretical underpinning of TS-GPT is the innovations representation theory of Wiener, Kallianpur, and Rosenblatt. For a stationary random time series process $x = (x_t)$, this theory establishes that there exist measurable, causal mappings, a causal encoder $G$ and a causal decoder $H$, such that the observed process can be transformed into a latent innovations process $v = (v_t)$, where each $v_t$ is independent and uniformly distributed:

$$v_t = G(x_t, x_{t-1}, \dots)$$

The inverse mapping reconstructs $x_t$ causally as

$$\hat{x}_t = H(v_t, v_{t-1}, \dots)$$

There are two principal forms:

  • Strong Innovations Representation (SIR): $x$ is reconstructed exactly sample-wise via $H$; the innovations $v$ are a sufficient (lossless) statistic.
  • Weak Innovations Representation (WIR): $x$ and $\hat{x}$ are matched only in distribution; $v$ still induces the correct conditional statistics for forecasting.

This separation explicitly enforces causality in the autoencoding (past-to-future) direction, mirroring the operational requirements of online forecasting and control in physical systems.
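
As a concrete point of reference, the innovations representation is available in closed form for simple models. The sketch below (an illustrative assumption, not the learned TS-GPT encoder) shows the strong innovations representation of a Gaussian AR(1) process: the causal encoder standardizes the one-step prediction residual and applies the Gaussian CDF to obtain IID Uniform(0,1) innovations, and the causal decoder inverts the mapping exactly. This is the structure the trained encoder and decoder approximate for general processes.

```python
# Minimal sketch: closed-form strong innovations representation (SIR) of a
# Gaussian AR(1) process x_t = a * x_{t-1} + e_t, e_t ~ N(0, sigma^2).
# The coefficients below are assumed for illustration only.
import numpy as np
from scipy.stats import norm

a, sigma = 0.8, 1.0

def encode(x):
    """Causal encoder G: x_{0:t} -> v_{0:t}, each v_t ~ Uniform(0,1)."""
    x_prev = np.concatenate(([0.0], x[:-1]))   # x_{-1} taken as 0
    residual = (x - a * x_prev) / sigma        # standardized innovation e_t / sigma
    return norm.cdf(residual)                  # probability-integral transform

def decode(v):
    """Causal decoder H: v_{0:t} -> x_{0:t}, exact sample-wise (SIR)."""
    x = np.zeros_like(v)
    for t in range(len(v)):
        e_t = sigma * norm.ppf(v[t])           # invert the CDF transform
        x[t] = a * (x[t - 1] if t > 0 else 0.0) + e_t
    return x

rng = np.random.default_rng(0)
x = np.zeros(200)
for t in range(1, 200):
    x[t] = a * x[t - 1] + rng.normal(0.0, sigma)

v = encode(x)
assert np.allclose(decode(v), x)               # lossless reconstruction
```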

2. Model Architecture and Causal Structure

TS-GPT instantiates a generative pre-trained transformer framework, but with architectural and procedural adaptations specific to time series dynamics and engineering priorities:

  • Causal Autoencoder: The encoder $G_e$ is strictly causal, mapping the observed past and current values $(x_t, x_{t-1}, \dots)$ to the latent innovations $v_t$. No future information is accessible during this step, preserving the real-time feasibility required for monitoring and control.
  • Decoder $H_n$: The decoder reconstructs (or generates) time series samples from the latent innovations $(v_t, v_{t-1}, \dots)$, yielding either pointwise matches or distributional matches ($\hat{x}_t \overset{d}{=} x_t$).
  • Attention Mechanisms: Transformer-based attention modules are tailored to handle the characteristic temporal dependencies, sharp transitions ("spikiness"), and long-range interactions common in engineering time series. This differentiates TS-GPT from NLP transformers, where such attributes are less prevalent.
  • Adversarial Regularization: Dual discriminator networks (one for the innovations, one for the reconstructions) are employed. These drive a min–max optimization (in the form of Wasserstein GAN losses) that requires the learned innovations sequence to approach an IID Uniform distribution and the reconstructions to match the real data's distribution.

The aggregate architecture integrates innovations theory, causal modeling, and modern generative transformer design, ensuring the latent representation is statistically principled and operationally causal.
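
A minimal structural sketch of this causal autoencoder is given below, with assumed dimensions and a generic PyTorch transformer encoder standing in for the paper's attention modules; only the causal masking and the Uniform-range latent are intended to mirror the description above.

```python
# Minimal sketch (assumed architecture, not the paper's exact code): a causal
# encoder G_e and decoder H_n built from transformer blocks whose attention is
# masked so position t never attends to positions > t. The encoder output is
# squashed by a sigmoid so each latent v_t lies in (0, 1), matching the target
# IID Uniform(0,1) innovations.
import torch
import torch.nn as nn

def causal_mask(T: int) -> torch.Tensor:
    # Upper-triangular -inf mask: position t attends only to positions 0..t.
    return nn.Transformer.generate_square_subsequent_mask(T)

class CausalTransformer(nn.Module):
    def __init__(self, d_in: int, d_model: int = 64, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(d_in, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=128, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.proj_out = nn.Linear(d_model, d_in)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        T = seq.size(1)
        h = self.blocks(self.proj_in(seq), mask=causal_mask(T).to(seq.device))
        return self.proj_out(h)

class TSGPTAutoencoder(nn.Module):
    """Causal autoencoder: x_{0:t} -> v_{0:t} -> x_hat_{0:t}."""
    def __init__(self, d_in: int = 1):
        super().__init__()
        self.encoder = CausalTransformer(d_in)   # G_e
        self.decoder = CausalTransformer(d_in)   # H_n

    def forward(self, x: torch.Tensor):
        v = torch.sigmoid(self.encoder(x))       # innovations constrained to (0, 1)
        x_hat = self.decoder(v)
        return v, x_hat

model = TSGPTAutoencoder(d_in=1)
x = torch.randn(8, 96, 1)                        # (batch, time, features)
v, x_hat = model(x)
```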

3. Probabilistic Generative Forecasting

TS-GPT is inherently designed for Generative Probabilistic Forecasting (GPF). This approach enables the sampling of entire future trajectories consistent with the conditional probability distribution given past observations, rather than producing a single deterministic forecast. Concretely:

  • The past observed series $x_{0:t}$ is encoded into $v_{0:t}$.
  • For forecasting $T$ steps ahead, pseudo-innovations $\tilde{v}_{t+1:t+T}$ are drawn independently from $U(0,1)$.
  • The decoder produces samples as

$$\hat{x}_{t+T} = H_n(v_{0:t}, \tilde{v}_{t+1:t+T})$$

The generative capability arises from the independence of $\tilde{v}$ and the functional sufficiency of the latent representation. This enables sampling from the full conditional distribution:

$$\mathbb{P}(x_{t+T} \leq x \mid x_{0:t}) = \mathbb{P}\bigl(H_n(v_{0:t}, \tilde{v}_{t+1:t+T}) \leq x\bigr)$$

This property supports direct uncertainty quantification and probabilistic decision-making.
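
The sampling procedure can be sketched directly in terms of the hypothetical causal autoencoder from the previous section: encode the observed history, append independently drawn pseudo-innovations, decode, and repeat to build an empirical predictive distribution. The function and variable names below are assumptions for illustration.

```python
# Minimal sketch of generative probabilistic forecasting, reusing the
# hypothetical TSGPTAutoencoder defined in the previous sketch.
import torch

def generative_forecast(model, x_hist: torch.Tensor, horizon: int, n_samples: int = 200):
    """x_hist: (1, t, d). Returns (n_samples, horizon, d) future trajectories."""
    model.eval()
    with torch.no_grad():
        v_hist = torch.sigmoid(model.encoder(x_hist))        # v_{0:t}
        d = x_hist.size(-1)
        samples = []
        for _ in range(n_samples):
            v_future = torch.rand(1, horizon, d)             # pseudo-innovations ~ U(0,1)
            v_full = torch.cat([v_hist, v_future], dim=1)    # (v_{0:t}, v~_{t+1:t+T})
            x_full = model.decoder(v_full)                   # H_n applied causally
            samples.append(x_full[:, -horizon:, :])
        return torch.cat(samples, dim=0)

# Usage: median forecast and an 80% predictive interval, 12 steps ahead.
x_hist = torch.randn(1, 96, 1)
traj = generative_forecast(model, x_hist, horizon=12)
median = traj.quantile(0.5, dim=0)
lo, hi = traj.quantile(0.1, dim=0), traj.quantile(0.9, dim=0)
```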

4. Architectural and Algorithmic Details

Several architectural innovations distinguish TS-GPT from both classical time series models and transformer-based LLMs:

  • Training Objective: For SIR, the objective includes a reconstruction loss ensuring $\hat{x}_t = x_t$ sample-wise. For WIR, the objective is adversarial, matching the conditional distributions.
  • Discriminator Networks: Two discriminators, $D^{(1)}$ and $D^{(2)}$, enforce, respectively, that $v$ is IID Uniform and that reconstructions $\hat{x}$ are indistinguishable from $x$ in distribution.
  • Adversarial Minimax Optimization:

    • For $v$:

    $$\min_{G_e} \max_{D^{(1)}} L_{W}(v, D^{(1)})$$

    • For $\hat{x}$:

    $$\min_{H_n} \max_{D^{(2)}} L_{W}(\hat{x}, D^{(2)})$$

    where $L_{W}$ is the Wasserstein GAN loss. This setup ensures that the innovations have the desired stochastic properties and that generated samples (both reconstructions and forecasts) are distributionally faithful.

  • Attention and Causal Masking: Attention weights are masked strictly to prevent future leakage, supporting full causality.

These design choices are essential for engineering applications, as they guarantee causal sample generation and principled probabilistic reasoning directly from the network.
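
A minimal sketch of this dual-discriminator objective follows, again built on the hypothetical autoencoder above. It uses plain Wasserstein critic losses and omits the Lipschitz constraint (weight clipping or gradient penalty) and optimizer bookkeeping that a real training loop would need; the paper's exact losses and critic architectures may differ.

```python
# Minimal sketch of the dual-discriminator Wasserstein objective: D^(1) scores
# innovation sequences against IID Uniform(0,1) references, D^(2) scores
# reconstructions against real series. Names and shapes are assumptions.
import torch
import torch.nn as nn

class SeqCritic(nn.Module):
    """Scores a (batch, time, features) sequence with a single real value."""
    def __init__(self, flat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(flat_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        return self.net(seq).mean()

def critic_loss(critic, real, fake):
    # Critic maximizes E[D(real)] - E[D(fake)]; return the negation to minimize.
    return -(critic(real) - critic(fake))

T, d = 96, 1
d1 = SeqCritic(T * d)    # D^(1): innovations vs. IID Uniform
d2 = SeqCritic(T * d)    # D^(2): reconstructions vs. real data

def training_step(model, x):
    v, x_hat = model(x)
    u_ref = torch.rand_like(v)                       # IID Uniform(0,1) reference
    # Critic losses (update d1, d2 with these; fakes are detached).
    loss_d1 = critic_loss(d1, u_ref, v.detach())
    loss_d2 = critic_loss(d2, x, x_hat.detach())
    # Generator loss: encoder/decoder try to make both critics score fakes highly.
    loss_g = -(d1(v) + d2(x_hat))
    return loss_d1, loss_d2, loss_g
```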

5. Benchmark Application: Real-Time Locational Marginal Price Forecasting

The paper demonstrates TS-GPT on the task of real-time locational marginal price (LMP) forecasting for electricity markets—a domain characterized by highly volatile, spiky time series regulated by physical and regulatory constraints:

  • Dataset: 5-minute LMP intervals from U.S. independent system operators, spanning multiple days and multiple locations.
  • Forecast Horizon: $T = 12$ (i.e., 1 hour ahead).
  • Procedure: The weak innovations autoencoder (WIAE) variant of TS-GPT encodes historical LMPs into innovations, appends pseudo-innovations, and decodes multiple samples per forecast window.
  • Baselines: TLAE, DeepVAR, and a leading LLM-based model (BWGVT).
  • Evaluation: TS-GPT outperforms these baselines with respect to probabilistic forecasting metrics (sample-based estimators of these metrics are sketched after this list):

    • Continuous Ranked Probability Score (CRPS):

    $$\mathrm{CRPS}(F, x) = \int_{-\infty}^{\infty} \left( F(z) - \mathbb{1}\{x \leq z\} \right)^2 \, dz$$

    • Coverage Probability Error (CPE) and Absolute CPE (ACPE): these metrics measure how well empirical coverage matches nominal confidence intervals.

    Results show lower CRPS and ACPE for TS-GPT, indicating improved calibration and sharper predictive uncertainty.
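
As referenced in the list above, these evaluation metrics can be estimated directly from an ensemble of generated trajectories. The sketch below uses standard sample-based estimators (the energy form of CRPS and empirical interval coverage); the paper's exact estimators may differ.

```python
# Minimal sketch of sample-based CRPS and coverage-error estimators. CRPS uses
# the energy-form identity CRPS = E|X - x| - 0.5 * E|X - X'|; CPE compares
# empirical interval coverage with the nominal level. Data below are stand-ins.
import numpy as np

def crps_from_samples(samples: np.ndarray, obs: float) -> float:
    """samples: 1-D array of forecast draws for a single target value."""
    term1 = np.mean(np.abs(samples - obs))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

def coverage_probability_error(samples: np.ndarray, obs: np.ndarray,
                               nominal: float = 0.9) -> float:
    """samples: (n_draws, n_targets); obs: (n_targets,). CPE = empirical - nominal."""
    alpha = (1.0 - nominal) / 2.0
    lo = np.quantile(samples, alpha, axis=0)
    hi = np.quantile(samples, 1.0 - alpha, axis=0)
    covered = np.mean((obs >= lo) & (obs <= hi))
    return covered - nominal                         # ACPE is the absolute value

draws = np.random.default_rng(1).normal(size=(500, 12))  # stand-in forecast ensemble
obs = np.zeros(12)
print(crps_from_samples(draws[:, 0], obs[0]))
print(abs(coverage_probability_error(draws, obs)))
```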

6. Broader Applications and Impact

TS-GPT’s architecture and theoretical guarantees make it suitable for real-time forecasting, anomaly detection, and operational support across a range of engineering and scientific domains:

  • Energy Systems: Forecasting demand, renewable generation, and volatility in power grids where real-time accuracy and uncertainty quantification are required.
  • Industrial Process Control: Real-time sensor monitoring and predictive maintenance, leveraging both deterministic and stochastic structure.
  • Anomaly Detection: Innovations representation naturally isolates deviations from expected behavior, improving interpretability and responsiveness in settings such as cybersecurity, sensor networks, and physical monitoring.
  • Financial Time Series: Models with similar architecture can handle market behavior where volatility, heavy tails, and rapid structural change are pronounced.
  • Environmental and Climate Modeling: Probabilistic multi-horizon forecasting respecting underlying physical causal structure.

A distinctive feature is the explicit design for adherence to physical rather than linguistic structure, ensuring the foundation model is operationally relevant in scientific, engineering, and control-oriented workflows.

7. Limitations and Future Research Directions

The TS-GPT approach, as presented, is architected for stationary or locally stationary stochastic processes with clearly defined innovations. Open challenges and avenues for extension include:

  • Nonstationarity Handling: Adaptation to strongly nonstationary or regime-switching time series where innovations structure is less clear.
  • Multi-variate and Multimodal Integration: Extension to high-dimensional, multi-variate, or multimodal settings (including structured exogenous variables) while maintaining the efficient innovations encoding.
  • Robustness and Interpretability: Improved interpretability of transformer attention patterns and latent innovations for complex engineering systems.
  • Scalability: Efficient training and autoregressive sampling for massive, high-frequency datasets.
  • Integration with Control Systems: Embedding TS-GPT within feedback or closed-loop control architectures for autonomous decision making.

Such directions may extend the reach of TS-GPT to even broader operational and decision-theoretic regimes.


TS-GPT synthesizes classical stochastic process theory with modern generative transformer models, achieving a foundation model with robust causal structure, uncertainty quantification, and strong empirical performance in probabilistic forecasting, as evidenced by improved CRPS and CPE/ACPE benchmarks in demanding real-time applications (Tong et al., 2 Oct 2025).
