TS-GPT: A Time Series Transformer
- TS-GPT is a generative pre-trained transformer model that employs innovations representation theory to enable causal and probabilistic time series forecasting.
- The architecture uses a causal autoencoder with masked attention and adversarial regularization to reliably quantify uncertainty and generate future samples.
- Benchmarked on real-time locational marginal price forecasting, TS-GPT outperforms traditional models in CRPS and other probabilistic metrics for engineering applications.
Time Series GPT (TS-GPT) denotes a class of generative pre-trained transformer architectures designed to address probabilistic time series modeling in domains governed predominantly by physical laws and requiring strict causal structure. Unlike conventional LLMs, which are optimized for natural language, TS-GPT models are grounded in the innovations representation theory and embed mechanisms suitable for real-time operational forecasting and control, particularly in engineering settings. The model is constructed to produce future time series samples from conditional probability distributions given observed history, enabling reliable uncertainty quantification and robust decision support in complex dynamical environments.
1. Theoretical Foundation: Innovations Representation
The central theoretical underpinning of TS-GPT is the innovations representation theory of Wiener, Kallianpur, and Rosenblatt. For a stationary random time series process $\{x_t\}$, this theory establishes that there exist measurable, causal mappings, a causal encoder $G$ and a causal decoder $H$, such that the observed process can be transformed into a latent innovations process $\{\nu_t\}$, where each $\nu_t$ is independent and uniformly distributed:

$$\nu_t = G(x_t, x_{t-1}, \ldots), \qquad \{\nu_t\} \overset{\text{i.i.d.}}{\sim} \mathcal{U}[0,1].$$

The inverse mapping reconstructs $x_t$ causally as

$$\hat{x}_t = H(\nu_t, \nu_{t-1}, \ldots).$$
There are two principal forms:
- Strong Innovations Representation (SIR): $x_t$ is reconstructed exactly sample-wise via $\hat{x}_t = H(\nu_t, \nu_{t-1}, \ldots) = x_t$; the innovations are a sufficient (lossless) statistic.
- Weak Innovations Representation (WIR): $\{x_t\}$ and $\{\hat{x}_t\}$ are matched only in distribution; $\{\nu_t\}$ still induces the correct conditional statistics for forecasting.
This separation explicitly enforces causality in the autoencoding (past-to-future) direction, mirroring the operational requirements of online forecasting and control in physical systems.
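As a concrete grounding of this representation (an illustration, not taken from the paper), a Gaussian AR(1) process admits a closed-form innovations pair: the causal encoder is the probability integral transform through the one-step conditional CDF, and the causal decoder inverts that transform. The sketch below assumes this simple linear-Gaussian setting; TS-GPT instead learns $G$ and $H$ with neural networks for general processes.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
phi, sigma = 0.8, 1.0                      # AR(1): x_t = phi * x_{t-1} + e_t,  e_t ~ N(0, sigma^2)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = phi * x[t - 1] + sigma * rng.standard_normal()

# Causal encoder G: probability integral transform of the one-step conditional CDF.
# nu_t = F(x_t | x_{t-1}) is i.i.d. Uniform[0, 1] for this linear-Gaussian process.
nu = norm.cdf(x[1:], loc=phi * x[:-1], scale=sigma)

# Causal decoder H: invert the conditional CDF given the reconstructed past (strong/lossless form).
x_hat = np.zeros_like(x)
for t in range(1, 500):
    x_hat[t] = norm.ppf(nu[t - 1], loc=phi * x_hat[t - 1], scale=sigma)

print(np.allclose(x[1:], x_hat[1:]))       # exact sample-wise reconstruction (SIR)
```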
2. Model Architecture and Causal Structure
TS-GPT instantiates a generative pre-trained transformer framework, but with architectural and procedural adaptations specific to time series dynamics and engineering priorities:
- Causal Autoencoder: The encoder $G$ is strictly causal, mapping the observed past and current values $(x_t, x_{t-1}, \ldots)$ to the latent innovations $\nu_t$. No future information is accessible during this step, preserving the real-time feasibility required for monitoring and control.
- Decoder $H$: The decoder reconstructs (or generates) time series samples from the latent innovations $(\nu_t, \nu_{t-1}, \ldots)$, yielding either pointwise matches ($\hat{x}_t = x_t$, SIR) or distributional matches ($\{\hat{x}_t\} \overset{d}{=} \{x_t\}$, WIR).
- Attention Mechanisms: Transformer-based attention modules are tailored to handle the characteristic temporal dependencies, sharp transitions ("spikiness"), and long-range interactions common in engineering time series. This differentiates TS-GPT from NLP transformers, where such attributes are less prevalent.
- Adversarial Regularization: Dual discriminator networks (for innovations and reconstructions) are employed. These drive min–max optimization (in the form of Wasserstein GAN losses), imposing the requirement that the learned innovations sequence approaches an IID Uniform distribution, and the reconstructions match the real data’s distribution.
The aggregate architecture integrates innovations theory, causal modeling, and modern generative transformer design, ensuring the latent representation is statistically principled and operationally causal.
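A minimal PyTorch sketch of the causal masking idea follows; the module name, layer sizes, and sigmoid output squashing are illustrative assumptions rather than the paper's exact architecture. The essential element is the lower-triangular attention mask, which guarantees that the output at time $t$ depends only on inputs up to $t$.

```python
import torch
import torch.nn as nn

class CausalInnovationsEncoder(nn.Module):
    """Illustrative causal encoder G: maps (x_1, ..., x_t) to nu_t. Not the paper's exact layers."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.input_proj = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, 1) -- scalar series; a multivariate series would widen the input.
        seq_len = x.size(1)
        # Strictly causal mask: position t may only attend to positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.encoder(self.input_proj(x), mask=mask)
        # Squash to (0, 1) so the adversarial loss can push nu toward IID Uniform.
        return torch.sigmoid(self.out(h))

x = torch.randn(8, 96, 1)                 # batch of 8 windows of length 96
nu = CausalInnovationsEncoder()(x)        # (8, 96, 1) pseudo-innovations
```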
3. Probabilistic Generative Forecasting
TS-GPT is inherently designed for Generative Probabilistic Forecasting (GPF). This approach enables the sampling of entire future trajectories consistent with the conditional probability distribution given past observations, rather than producing a single deterministic forecast. Concretely:
- The past observed series $(x_t, x_{t-1}, \ldots)$ is encoded into innovations $(\nu_t, \nu_{t-1}, \ldots)$.
- For forecasting $T$ steps ahead, pseudo-innovations $\tilde{\nu}_{t+1}, \ldots, \tilde{\nu}_{t+T}$ are drawn independently from the uniform distribution.
- The decoder produces samples as
  $$(\hat{x}_{t+1}, \ldots, \hat{x}_{t+T}) = H(\tilde{\nu}_{t+T}, \ldots, \tilde{\nu}_{t+1}, \nu_t, \nu_{t-1}, \ldots).$$
The generative capability arises from the independence of $\{\nu_t\}$ and the functional sufficiency of the latent representation. This enables sampling from the full conditional distribution:
$$(\hat{x}_{t+1}, \ldots, \hat{x}_{t+T}) \sim P(x_{t+1}, \ldots, x_{t+T} \mid x_t, x_{t-1}, \ldots).$$
This property supports direct uncertainty quantification and probabilistic decision-making.
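A hedged sketch of this sampling loop is given below, assuming trained `encoder` and `decoder` modules and uniform pseudo-innovations; the names, shapes, and helper function are illustrative, not the paper's API.

```python
import torch

@torch.no_grad()
def generative_forecast(encoder, decoder, history, horizon: int, n_samples: int = 100):
    """Sample future trajectories from P(x_{t+1:t+H} | history) via pseudo-innovations.

    history: (1, L, 1) observed window; returns (n_samples, horizon) forecast samples.
    """
    nu_past = encoder(history)                                  # causal innovations of the past
    samples = []
    for _ in range(n_samples):
        # Pseudo-innovations for the future are drawn i.i.d. from the uniform reference distribution.
        nu_future = torch.rand(1, horizon, 1)
        nu_full = torch.cat([nu_past, nu_future], dim=1)
        x_hat = decoder(nu_full)                                # causal decode of past + future
        samples.append(x_hat[0, -horizon:, 0])
    return torch.stack(samples)                                 # empirical conditional distribution

# Point forecasts and intervals then follow from the sample set, e.g.:
# median = samples.quantile(0.5, dim=0); lo, hi = samples.quantile(0.05, dim=0), samples.quantile(0.95, dim=0)
```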
4. Architectural and Algorithmic Details
Several architectural innovations distinguish TS-GPT from both classical time series models and transformer-based LLMs:
- Training Objective: For SIR, the objective includes a reconstructive loss ensuring $\hat{x}_t = x_t$ samplewise. For WIR, the objective is adversarial, matching the conditional distributions.
- Discriminator Networks: Two discriminators, $D_\nu$ and $D_x$, enforce, respectively, that $\{\nu_t\}$ is IID-Uniform and that reconstructions $\{\hat{x}_t\}$ are indistinguishable from $\{x_t\}$ in distribution.
- Adversarial Minimax Optimization:
  - For $D_\nu$: $\min_{G} \max_{D_\nu} \mathcal{L}_W(\{\nu_t\}, \{u_t\})$
  - For $D_x$: $\min_{G,H} \max_{D_x} \mathcal{L}_W(\{\hat{x}_t\}, \{x_t\})$
  where $\mathcal{L}_W(\cdot,\cdot)$ is the Wasserstein GAN loss and $\{u_t\}$ is an IID-Uniform reference sequence. This setup ensures innovations have the desired stochastic properties and that generated samples (both reconstructions and forecasts) are distributionally faithful.
- Attention and Causal Masking: Attention weights are masked strictly to prevent future leakage, supporting full causality.
These design choices are essential for engineering applications, as they guarantee causal sample generation and principled probabilistic reasoning directly from the network.
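The sketch below spells out the two Wasserstein critic terms under simplifying assumptions (sequence-level scalar critics, a Uniform[0, 1] reference, and no Lipschitz constraint or loss weighting shown); it illustrates the structure of the objective rather than reproducing the paper's exact training procedure.

```python
import torch

def wgan_losses(D_nu, D_x, x, nu, x_hat):
    """Illustrative Wasserstein losses for the two critics and the autoencoder (G, H).

    x, nu, x_hat: (batch, seq_len, 1); D_nu, D_x: critics returning a scalar score per sequence.
    """
    u = torch.rand_like(nu)                           # IID-Uniform reference sequence

    # Critic losses (maximized by the discriminators -> minimized here as negatives).
    loss_D_nu = -(D_nu(u).mean() - D_nu(nu).mean())   # push nu toward IID Uniform
    loss_D_x = -(D_x(x).mean() - D_x(x_hat).mean())   # push x_hat toward the data distribution

    # Generator (encoder + decoder) loss: fool both critics.
    loss_G = -(D_nu(nu).mean() + D_x(x_hat).mean())
    return loss_D_nu, loss_D_x, loss_G
```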
5. Benchmark Application: Real-Time Locational Marginal Price Forecasting
The paper demonstrates TS-GPT on the task of real-time locational marginal price (LMP) forecasting for electricity markets—a domain characterized by highly volatile, spiky time series regulated by physical and regulatory constraints:
- Dataset: 5-minute LMP intervals from U.S. independent system operators, spanning multiple days and multiple locations.
- Forecast Horizon: one hour ahead (12 steps at 5-minute resolution).
- Procedure: The weak innovations autoencoder (WIAE) variant of TS-GPT encodes historical LMPs into innovations, appends pseudo-innovations, and decodes multiple samples per forecast window.
- Baselines: TLAE, DeepVAR, and a leading LLM-based model (BWGVT).
- Evaluation: TS-GPT outperforms these baselines with respect to probabilistic forecasting metrics:
- Continuous Ranked Probability Score (CRPS), Coverage Probability Error (CPE), and Absolute CPE (ACPE): CRPS measures the overall accuracy of the predictive distribution, while CPE/ACPE measure how well empirical coverage matches nominal confidence intervals. Results show lower CRPS and ACPE for TS-GPT, indicating improved calibration and sharper predictive uncertainty.
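Both families of metrics can be estimated directly from the generated sample trajectories. A minimal sketch (sample-based CRPS, and coverage error for an assumed 90% nominal interval) is:

```python
import numpy as np

def crps_from_samples(samples: np.ndarray, y: np.ndarray) -> float:
    """Empirical CRPS: E|X - y| - 0.5 * E|X - X'|, averaged over forecast steps.

    samples: (n_samples, horizon) generated trajectories; y: (horizon,) realized values.
    """
    term1 = np.mean(np.abs(samples - y), axis=0)
    term2 = 0.5 * np.mean(np.abs(samples[:, None, :] - samples[None, :, :]), axis=(0, 1))
    return float(np.mean(term1 - term2))

def coverage_probability_error(samples: np.ndarray, y: np.ndarray, nominal: float = 0.9) -> float:
    """CPE: empirical coverage of the central nominal-level interval minus the nominal level."""
    alpha = (1.0 - nominal) / 2.0
    lo = np.quantile(samples, alpha, axis=0)
    hi = np.quantile(samples, 1.0 - alpha, axis=0)
    empirical = np.mean((y >= lo) & (y <= hi))
    return float(empirical - nominal)      # ACPE is the absolute value of this quantity
```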
6. Broader Applications and Impact
TS-GPT’s architecture and theoretical guarantees make it suitable for real-time forecasting, anomaly detection, and operational support across a range of engineering and scientific domains:
- Energy Systems: Forecasting demand, renewable generation, and volatility in power grids where real-time accuracy and uncertainty quantification are required.
- Industrial Process Control: Real-time sensor monitoring and predictive maintenance, leveraging both deterministic and stochastic structure.
- Anomaly Detection: Innovations representation naturally isolates deviations from expected behavior, improving interpretability and responsiveness in settings such as cybersecurity, sensor networks, and physical monitoring (see the sketch following this list).
- Financial Time Series: Models with similar architecture can handle market behavior where volatility, heavy tails, and rapid structural change are pronounced.
- Environmental and Climate Modeling: Probabilistic multi-horizon forecasting respecting underlying physical causal structure.
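As an illustration of the anomaly-detection use noted above (a sketch under stated assumptions, not from the paper): because the innovations of nominal data should be close to i.i.d. Uniform[0, 1], a rolling goodness-of-fit test on the encoder output can flag departures from learned behavior.

```python
import numpy as np
from scipy.stats import kstest

def flag_anomalous_windows(nu: np.ndarray, window: int = 64, alpha: float = 0.01) -> np.ndarray:
    """Flag windows whose innovations depart from Uniform[0, 1] (nominal behavior).

    nu: (T,) innovations produced by the trained causal encoder on streaming data.
    Returns one boolean per non-overlapping window.
    """
    flags = []
    for start in range(0, len(nu) - window + 1, window):
        p_value = kstest(nu[start:start + window], "uniform").pvalue
        flags.append(p_value < alpha)      # small p-value -> innovations not uniform -> anomaly
    return np.array(flags)
```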
A distinctive feature is the explicit grounding in physical law rather than linguistic structure, ensuring the foundation model is operationally relevant in scientific, engineering, and control-oriented workflows.
7. Limitations and Future Research Directions
The TS-GPT approach, as presented, is architected for stationary or locally stationary stochastic processes with clearly defined innovations. Open challenges and avenues for extension include:
- Nonstationarity Handling: Adaptation to strongly nonstationary or regime-switching time series where innovations structure is less clear.
- Multi-variate and Multimodal Integration: Extension to high-dimensional, multi-variate, or multimodal settings (including structured exogenous variables) while maintaining the efficient innovations encoding.
- Robustness and Interpretability: Improved interpretability of transformer attention patterns and latent innovations for complex engineering systems.
- Scalability: Efficient training and autoregressive sampling for massive, high-frequency datasets.
- Integration with Control Systems: Embedding TS-GPT within feedback or closed-loop control architectures for autonomous decision making.
Such directions may extend the reach of TS-GPT to even broader operational and decision-theoretic regimes.
TS-GPT synthesizes classical stochastic process theory with modern generative transformer models, achieving a foundation model with robust causal structure, uncertainty quantification, and strong empirical performance in probabilistic forecasting, as evidenced by improved CRPS and CPE/ACPE benchmarks in demanding real-time applications (Tong et al., 2 Oct 2025).