Time Series Foundation Models
- Time Series Foundation Models (TSFMs) are large neural architectures pre-trained on heterogeneous time series data, capturing key temporal primitives like seasonality, trends, and frequency shifts.
- They employ transformer-based designs—including encoder–decoder, encoder-only, masked-encoder, and decoder-only variants—with pretraining strategies such as autoregressive prediction and masked token reconstruction.
- Evaluations reveal TSFMs deliver 15–30% forecasting error reductions via zero-shot and fine-tuning approaches, while hybrid ensemble techniques enhance uncertainty calibration and model robustness.
Time Series Foundation Models (TSFMs) are large neural architectures pre-trained on massive, heterogeneous corpora of time series with diverse statistical properties, enabling robust zero-shot and transferable forecasting, classification, anomaly detection, and other temporal inference tasks. Unlike traditional, bespoke models narrowly tailored to specific domains, TSFMs leverage generic temporal primitives—seasonality, trends, frequency shifts—learned from mixed-domain corpora, supporting cross-domain adaptation and data-efficient usage. Transformer-based variants dominate this category, with encoder–decoder, encoder-only, masked-encoder, and decoder-only formulations designed to generalize temporal structure for broad applicability.
1. TSFM Architectures and Pretraining Paradigms
TSFMs predominantly employ Transformer-based backbones, each adapted to encode long-range dependencies and various input modalities (Kottapalli et al., 5 Apr 2025). The principal architectural motifs include:
- Encoder–Decoder Transformers (e.g., Chronos-T5, Chronos-Bolt): Real-valued series are tokenized by scaling and quantization or window patching. The encoder consumes token sequences using bidirectional self-attention, while the autoregressive decoder generates future patches (point, quantile, or probabilistic outputs). Models such as Chronos-T5 and Chronos-Bolt utilize patch-wise quantization to facilitate domain transfer; a tokenization sketch follows this list.
- Encoder-Only Transformers (e.g., Chronos-2): Specialized for direct quantile forecasting and context sharing across related series via group attention.
- Masked-Encoder Transformers (e.g., MOIRAI): Employ any-variate masked self-attention to handle multiresolution input and output mixture distributions, supporting robust multivariate and varying-length series (Yu et al., 8 Dec 2025).
- Decoder-Only Transformers (e.g., TimesFM): Enable scalable context windows for long-range dependencies by using patch-based autoregressive generation over multi-step horizons.
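To make the scale-and-quantize tokenization step concrete, the following is a minimal sketch of mean scaling followed by uniform binning into a discrete vocabulary, in the spirit of Chronos-style preprocessing. The bin count, clipping range, and function names are illustrative assumptions, not the released implementation.

```python
import numpy as np

def tokenize_series(y: np.ndarray, n_bins: int = 4096, clip: float = 15.0):
    """Sketch of scale-and-quantize tokenization for a real-valued series.

    1) Mean-scale the series so magnitudes are comparable across domains.
    2) Map scaled values onto uniformly spaced bins in [-clip, clip];
       each bin index becomes a discrete token for the Transformer.
    """
    scale = np.mean(np.abs(y)) + 1e-8           # mean absolute scaling (assumed)
    y_scaled = np.clip(y / scale, -clip, clip)  # bound extreme values
    edges = np.linspace(-clip, clip, n_bins - 1)
    tokens = np.digitize(y_scaled, edges)       # integer token ids in [0, n_bins - 1]
    return tokens, scale

def detokenize(tokens: np.ndarray, scale: float, n_bins: int = 4096, clip: float = 15.0):
    """Approximately invert tokenization by mapping token ids back to bin centers."""
    centers = np.linspace(-clip, clip, n_bins)
    return centers[tokens] * scale

# Usage: round-trip a toy seasonal series through the tokenizer.
t = np.arange(200)
series = 10 * np.sin(2 * np.pi * t / 24) + 0.5 * t
toks, s = tokenize_series(series)
recovered = detokenize(toks, s)
```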
Pretraining objectives are matched to each architecture variant. Common paradigms include:
- Autoregressive next-value/patch prediction: Models learn to generate the next time step or patch from the preceding context, a direct analogue of next-token prediction in causal language modeling.
- Direct quantile regression: Loss functions minimize the weighted quantile (pinball) error $\rho_q(y, \hat{y}) = \max\{q\,(y - \hat{y}),\ (q - 1)(y - \hat{y})\}$, yielding calibrated probabilistic outputs; a loss sketch follows this list.
- Masked token reconstruction: Inputs are stochastically masked and the model reconstructs the missing values, encouraging robust abstraction (Kottapalli et al., 5 Apr 2025).
- Mixture-of-expert (MoE) routing or contrastive loss: Enriches contextual diversity and expertise routing.
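As a reference point for the direct quantile regression objective, here is a minimal sketch of the weighted quantile (pinball) loss averaged over a set of target quantiles; the tensor shapes and quantile grid are illustrative assumptions rather than any specific model's configuration.

```python
import torch

def pinball_loss(y_true: torch.Tensor, y_pred: torch.Tensor, quantiles) -> torch.Tensor:
    """Weighted quantile (pinball) loss.

    y_true:    (batch, horizon)               observed future values
    y_pred:    (batch, n_quantiles, horizon)  predicted quantiles
    quantiles: iterable of levels in (0, 1), e.g. (0.1, 0.5, 0.9)
    """
    q = torch.tensor(quantiles, dtype=y_pred.dtype).view(1, -1, 1)
    err = y_true.unsqueeze(1) - y_pred                # positive when under-predicting
    loss = torch.maximum(q * err, (q - 1.0) * err)    # rho_q applied elementwise
    return loss.mean()

# Usage: three quantile heads over a 7-step horizon.
y = torch.randn(8, 7)
preds = torch.randn(8, 3, 7)
print(pinball_loss(y, preds, (0.1, 0.5, 0.9)))
```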
Training uses large real and synthetic corpora, on the order of tens to hundreds of billions of time steps, spanning financial, IoT, web, energy, and weather domains (Chronos: ∼84–100B tokens; MOIRAI: ∼27B; TimesFM: >100B) (Yu et al., 8 Dec 2025).
2. Zero-Shot Inference and Adaptation Strategies
TSFMs are explicitly evaluated in zero-shot regimes (direct deployment, frozen parameters) and fine-tuned for domain adaptation:
- Zero-shot forecasting: A pre-trained model with frozen parameters is applied directly to unseen series; its forecasts often outperform traditional and specialized models on process-derived time series (Yu et al., 8 Dec 2025).
- Parameter-efficient fine-tuning: LoRA adapts key weight matrices with low-rank adapters ($W' = W + BA$, with rank $r$ far smaller than the weight dimensions); typically only 0.1–0.5% of model parameters are tuned, balancing improved accuracy against computational cost (see the sketch after this list).
- Full fine-tuning: All parameters are updated, yielding maximal adaptation at the risk of overfitting, especially on small or complex datasets.
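A minimal sketch of the LoRA idea applied to a single linear projection, assuming a frozen base weight and a rank-r update; the class name, initialization, and scaling factor are illustrative choices, not a specific library's API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W' = W + (alpha/r) * B A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Usage: only A and B (a small fraction of parameters) receive gradients.
layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")
```

In a full TSFM, adapters of this kind are usually attached only to selected projection matrices, which is what keeps the tuned-parameter fraction in the sub-percent range.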
Transferability is high: even when trained on non-process domains, TSFMs generalize temporal primitives effectively to process model forecasting (PMF), evidenced by lower forecasting errors relative to baselines (Yu et al., 8 Dec 2025).
3. Temporal Representation and Internal Semantics
Recent work probes the semantic organization of TSFM representations (Pandey et al., 19 Nov 2025):
- Layer-wise concept localization: Early layers encode local, time-domain structure (AR(1), trends, level shifts); mid-layers capture spectral and warping content; late layers specialize in dispersion and change-point signals.
- Linear probe tests: Recovery of generative process parameters (e.g., trend slope, spectral frequency, variance shifts) is feasible in early or late layers for atomic phenomena but degrades with compositional/interacting concepts (a probe sketch is given at the end of this section).
- Representational geometry: Centered kernel alignment (CKA) and concept disentanglement scores quantify cluster separability across depth. Layers form high-similarity "blocks," and concept disentanglement increases from input to output.
- Compositionality limits: TSFMs struggle to linearly disentangle mixtures of distinct temporal concepts; probe transfer error can increase 2–5× on composite data.
These findings imply that TSFMs produce a hierarchy of temporal abstractions analogous to the edge-to-object progression in vision transformers, yet fall short in robustly representing interacting temporal phenomena.
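As a concrete illustration of the linear-probe protocol above, here is a minimal sketch that fits a ridge regression from pooled layer activations to a known generative parameter (trend slope) on synthetic series; the activation-extraction function is a stand-in for whatever hidden states a given TSFM exposes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def synth_trend_series(n_series=500, length=128):
    """Synthetic random-walk-plus-trend series with known slopes as probe targets."""
    slopes = rng.uniform(-0.5, 0.5, size=n_series)
    series = [np.cumsum(rng.normal(0, 1, length)) * 0.1 + m * np.arange(length) for m in slopes]
    return np.stack(series), slopes

def layer_activations(series, layer_idx):
    """Stand-in for mean-pooled hidden states of a TSFM layer (layer_idx unused here);
    a random projection keeps the script runnable end to end."""
    proj = rng.normal(size=(series.shape[1], 64))
    return series @ proj

X_series, slopes = synth_trend_series()
acts = layer_activations(X_series, layer_idx=3)
X_tr, X_te, y_tr, y_te = train_test_split(acts, slopes, test_size=0.3, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("probe R^2 on held-out series:", probe.score(X_te, y_te))
```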
4. Statistical Ensembles and Robustness Enhancements
TSFM predictions, while accurate, may exhibit domain-specific biases and limited uncertainty calibration. Hybrid enhancements—ensemble and statistical techniques—improve reliability and interpretability (Modi et al., 18 Aug 2025):
- Bootstrap aggregation (bagging): Independent Monte Carlo draws from the TSFM are resampled; bagged means reduce forecast variance by up to 54% in three-week context scenarios.
- Regression stacking: Multiple base forecasters (TSFM + classical/statistical) are linearly combined via cross-validated regression weights, yielding the lowest mean squared errors among the evaluated configurations.
- Residual statistical modeling: Systematic TSFM errors are corrected with secondary regressors, reducing mid-horizon bias (67% MSE drop in certain benchmarks).
- Prediction interval construction: Stacked ensembles yield empirical coverage close to the desired nominal level, with interval widths shrinking as context length increases.
These strategies preserve modeling capacity while injecting robust uncertainty quantification and domain correction.
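A minimal sketch of the bagging and stacking ideas in this section, assuming the TSFM exposes Monte Carlo sample paths; draw_tsfm_samples and the toy baselines are hypothetical placeholders, not a real model interface.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

def draw_tsfm_samples(context, horizon=7, n_samples=100):
    """Hypothetical stand-in for Monte Carlo forecast paths from a TSFM."""
    base = np.mean(context[-7:])
    return base + rng.normal(0, 1.0, size=(n_samples, horizon))

def bagged_forecast(context, horizon=7, n_bags=20):
    """Bootstrap aggregation: resample the sample paths with replacement and average bag means."""
    paths = draw_tsfm_samples(context, horizon)
    bag_means = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(paths), size=len(paths))
        bag_means.append(paths[idx].mean(axis=0))
    return np.mean(bag_means, axis=0)

def stack_forecasts(base_preds, y_true):
    """Regression stacking: learn combination weights over base forecasters on held-out data."""
    X = np.column_stack(base_preds)           # (n_points, n_models)
    return LinearRegression().fit(X, y_true)

# Usage on a toy validation window with two base forecasters.
context = np.sin(np.arange(200) / 10.0)
tsfm_pred = bagged_forecast(context)
naive_pred = np.repeat(context[-1], 7)
y_val = np.sin(np.arange(200, 207) / 10.0)
stacker = stack_forecasts([tsfm_pred, naive_pred], y_val)
print("stacking weights:", stacker.coef_)
```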
5. Applicability: Process Model Forecasting and Beyond
Empirical evaluation on process-derived datasets—event logs partitioned into windows, producing directly-follows (DF) graphs mapped to multivariate time series—underscores TSFM utility for PMF (Yu et al., 8 Dec 2025):
- Transformation approach: Event logs are windowed, DF graphs are constructed, relation counts extracted, and multivariate series formed for forecasting.
- Forecasting objectives: Typically univariate forecasting per DF edge, using rolling contexts of 48 days and a forecast horizon of 7 days.
- Benchmark results: Zero-shot TSFMs reduce mean absolute error (MAE) and root mean squared error (RMSE) by 15–30% relative to seasonal-naive and XGBoost baselines; LoRA fine-tuning provides marginal extra gains (1–5%), sometimes negative on small/sparse logs (Sepsis).
- Process-aware evaluation: TSFMs maintain or surpass baseline entropic relevance, except for highly sparse logs.
Despite superior accuracy and efficiency, limitations remain: accurate forecasting of extremely sparse logs remains challenging, and multivariate forecasting did not surpass univariate approaches in current PMF benchmarks.
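To illustrate the event-log transformation behind these PMF experiments, here is a minimal sketch that pairs consecutive events per case into directly-follows edges, buckets them into daily windows, and assembles a multivariate series with one column per edge; the toy log schema and column names are assumptions.

```python
import pandas as pd

def directly_follows_series(log: pd.DataFrame, freq: str = "D") -> pd.DataFrame:
    """Turn an event log (case_id, activity, timestamp) into a multivariate
    series of directly-follows (DF) edge counts per time window."""
    log = log.sort_values(["case_id", "timestamp"]).reset_index(drop=True)
    nxt = log.groupby("case_id")["activity"].shift(-1)       # next activity in the same case
    keep = nxt.notna()                                        # drop the last event of each case
    edges = pd.DataFrame({
        "edge": log.loc[keep, "activity"] + "->" + nxt[keep],
        "window": log.loc[keep, "timestamp"].dt.floor(freq),  # bucket by day (or other freq)
    })
    counts = edges.groupby(["window", "edge"]).size().unstack(fill_value=0)
    return counts  # rows: time windows, columns: DF edges (one univariate series each)

# Usage on a tiny synthetic log; each column can then be forecast per edge.
log = pd.DataFrame({
    "case_id": [1, 1, 1, 2, 2],
    "activity": ["A", "B", "C", "A", "C"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 10:00", "2024-01-02 08:00",
        "2024-01-01 12:00", "2024-01-03 09:00",
    ]),
})
print(directly_follows_series(log))
```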
6. Insights, Trade-offs, and Limits
Key outcomes from recent systematic evaluations (Yu et al., 8 Dec 2025, Pandey et al., 19 Nov 2025, Modi et al., 18 Aug 2025):
- Transferability and data efficiency: TSFMs pretrained on generic time series encode useful primitives even for highly heterogeneous or sparse domains.
- Zero-shot models are strong defaults: Data efficiency, rapid deployment, and competitive accuracy make zero-shot use appealing, especially for limited datasets.
- Parameter-efficient fine-tuning yields small improvements: LoRA achieves marginal gains without significant parameter overhead; full fine-tuning often risks overfitting.
- Conceptual representation limitations: Compositionality and interactions between distinct time series phenomena challenge current TSFM decodability.
- Open challenges: Richer graph-structural models, compositional curriculum pretraining, spectral and warping-aware architectures, and expanded event-log corpora are critical for further advances.
TSFMs have established themselves as robust, generalizable, and highly data-efficient engines for temporal learning, exceeding classical baselines in diverse contexts, yet require architectural and algorithmic advances for full compositional and structural generalization.
7. Outlook and Future Directions
Research priorities for the TSFM field include:
- Graph-aware and multimodal extensions: Integrate process structure, heterogeneous modalities, and richer context for event logs and spatiotemporal data.
- Inter-domain benchmarks and rigorous evaluation: Adoption of contamination-checked, out-of-sample evaluation protocols, principled cross-domain splits, and public model cards to ensure robust generalization and comparability (Meyer et al., 15 Oct 2025).
- Efficient architectures for compositional structure: Design of non-linear, causal, or spectral probes, hybrid models integrating statistical and deep-latent elements, and curriculum learning over composite signals.
- Interpretability and steering: Internal semantic probing, block-wise pruning, and latent space steering offer avenues for user-guided model behavior and efficiency (Wiliński et al., 19 Sep 2024).
With the maturation of TSFMs, ongoing attention to compositional generalization, process structure, and responsible benchmarking will be necessary for safe and reliable deployment in real-world temporal analytics.