Time-Series Foundation Models (TSFMs)

Updated 7 December 2025
  • Time-Series Foundation Models (TSFMs) are large-scale, pre-trained neural networks that capture seasonal patterns, trends, and regime changes using transformer-based attention mechanisms.
  • They leverage self-supervised learning with dynamic patching and advanced positional encoding to deliver zero-shot and few-shot forecasting, classification, and anomaly detection.
  • Empirical studies show TSFMs can reduce forecasting error by up to 33% in RMSE and 39% in MAE, while challenges remain in covariate integration and in modeling complex pathwise dependencies.

Time-Series Foundation Models (TSFMs) are a class of large-scale, pre-trained neural architectures designed to encode, predict, and analyze sequential temporal phenomena across highly diverse application domains. At their core, TSFMs learn generic, transferable representations of temporal patterns (e.g., seasonality, trend, bursts, regime shifts) in a self-supervised manner on corpora encompassing millions to hundreds of billions of time-series sequences. By leveraging attention-based architectures (typically Transformer variants), TSFMs offer modular, zero-shot and few-shot solutions to forecasting, imputation, classification, and anomaly detection tasks, often surpassing traditional and deep task-specific models in scalability, generalization, and flexibility, albeit with critical limitations in certain regimes.

1. Architectural Principles and Pretraining Paradigms

Contemporary TSFMs primarily adopt Transformer-based backbones—encoder-only, decoder-only, or encoder–decoder—with architectural modifications to handle the idiosyncrasies of temporal data. Core architectural elements include:

  • Tokenization and Patch Embedding: Univariate or multivariate series are divided into patches of varying granularity (e.g., Moirai, TimesFM), with each patch mapped to d-dimensional embeddings via MLPs or quantization approaches (as in Chronos); a minimal patch-embedding sketch appears after this list. Approaches such as Kairos introduce dynamic patching, routing each patch through "experts" corresponding to different granularities based on local entropy and information density (Feng et al., 30 Sep 2025).
  • Positional Encoding: Temporal ordering is modeled through learned, sinusoidal, or rotary positional encodings (RoPE). Kairos extends this with instance-adaptive rotary embeddings (IARoPE) that tailor positional frequencies to the specific low-frequency spectrum of each input series (Feng et al., 30 Sep 2025).
  • Self-Attention and Cross-Scale Aggregation: Multi-head self-attention layers capture long-range dependencies, while advanced architectures (e.g., UniTS, MSFT) incorporate cross-scale and cross-variable aggregation to explicitly model multi-resolution patterns (Qiao et al., 17 Jun 2025).
  • Output Heads: TSFMs output point forecasts, quantiles, parametric predictive marginals (Gaussian, Student-t), or generative trajectory ensembles, with the majority of published models supporting either deterministic point-readout or probabilistic univariate heads (Perez-Diaz et al., 22 Oct 2025). Only a minority support direct joint trajectory simulation.
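
The patch-tokenization step referenced above can be made concrete with a short sketch. The snippet below is a minimal, illustrative implementation of fixed-length patching with an MLP-style projection, assuming a univariate input series; the class and parameter names (PatchEmbed, patch_len, d_model) are placeholders and not taken from any specific TSFM.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Illustrative patch tokenizer: split a univariate series into
    fixed-length patches and project each patch to a d_model-dim token."""

    def __init__(self, patch_len: int = 32, d_model: int = 256):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)  # MLP-style patch projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length); truncate so length is a multiple of patch_len
        b, t = x.shape
        t = (t // self.patch_len) * self.patch_len
        patches = x[:, :t].reshape(b, -1, self.patch_len)  # (batch, n_patches, patch_len)
        return self.proj(patches)                          # (batch, n_patches, d_model)

tokens = PatchEmbed()(torch.randn(8, 512))  # -> shape (8, 16, 256)
```

Models such as Chronos instead quantize values into a discrete vocabulary before embedding, and dynamic-patching approaches such as Kairos make the patch length data-dependent rather than fixed.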

Pretraining objectives include maximum likelihood on next-step or next-token (cross-entropy after quantization), mean squared error (MSE) over patches, quantile (pinball) loss, or negative log-likelihood for probabilistic heads. "Masking" (inspired by BERT) and contrastive objectives are also employed, especially for imputation and classification tasks (Xie et al., 4 Aug 2025).
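
As a concrete instance of one of these objectives, the quantile (pinball) loss for quantile level τ is shown below; this is the standard textbook formulation rather than any one model's exact training code.

```python
import torch

def pinball_loss(y_true: torch.Tensor, y_pred: torch.Tensor, tau: float) -> torch.Tensor:
    """Quantile (pinball) loss: under-prediction is weighted by tau,
    over-prediction by (1 - tau)."""
    diff = y_true - y_pred
    return torch.mean(torch.maximum(tau * diff, (tau - 1.0) * diff))
```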

2. Core Functionalities: Forecasting, Zero-Shot Transfer, and Covariate Adaptation

TSFMs achieve strong performance in zero-shot forecasting, where the pretrained model is directly applied to new, unseen series or tasks without any parameter updates. For instance, Moirai and TimesFM achieve substantial improvements (up to 33% lower RMSE, 39% lower MAE, and 49% higher CPC) on zero-shot crowd flow prediction over large mobility datasets, with no need for spatial graph construction or labeled retraining (Luca et al., 1 Jul 2025).

The formal zero-shot protocol is as follows: a historical context (length L) is encoded via the pretrained model f_θ to yield a future horizon forecast of length H, without any gradient updates. This contrasts with classical models, which require retraining per series (Meyer et al., 15 Oct 2025, Luca et al., 1 Jul 2025).
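
Written out, and assuming a univariate series y observed up to time T, the protocol is a single forward pass through the frozen model:

```latex
\hat{y}_{T+1:T+H} \;=\; f_{\theta}\!\left(y_{T-L+1:T}\right),
\qquad \theta \text{ held fixed (no gradient updates)}.
```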

Covariate adaptation remains a challenge due to the predominant univariate focus of pretraining. The CoRA (Covariate-awaRe Adaptation) framework attaches modular adapters to a frozen TSFM backbone; the adapters integrate exogenous variables (time series, text, images) using a learned Granger Causality Embedding (GCE) and zero-initialized condition-injection, yielding up to 31% MSE reduction compared to state-of-the-art covariate-aware deep forecasters (Qin et al., 14 Oct 2025). CoRA achieves principled covariate selection and effective adaptation across modalities, domain heterogeneity, and few-shot settings, while preventing catastrophic forgetting.
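
The general pattern of zero-initialized condition-injection onto a frozen backbone can be sketched as follows. This is not the CoRA implementation itself, only a minimal illustration of the idea under the assumption that covariates are injected additively into backbone hidden states; all names are hypothetical.

```python
import torch
import torch.nn as nn

class ZeroInitCovariateAdapter(nn.Module):
    """Illustrative adapter: project covariate features and add them to frozen
    backbone hidden states through a zero-initialized layer, so training starts
    from the unmodified pretrained forecast."""

    def __init__(self, cov_dim: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(cov_dim, d_model)
        nn.init.zeros_(self.proj.weight)  # zero-initialized injection
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden: torch.Tensor, covariates: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, n_tokens, d_model); covariates: (batch, n_tokens, cov_dim)
        return hidden + self.proj(covariates)

# Only the adapter is trained; the TSFM backbone stays frozen, e.g.:
# for p in backbone.parameters():
#     p.requires_grad_(False)
```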

3. Theoretical Biases, Scaling, and Internal Semantics

TSFM performance and transferability depend critically on architectural "knobs"—patch size, embedding formulation, and loss function—which introduce implicit biases. Increasing patch size leads to a temporal low-pass filter effect, favoring low-frequency (trend, seasonality) components, but impeding accurate modeling of high-frequency, outlier, or chaotic dynamics. Choice of embedding (quantized vs. continuous) trades off geometric preservation and robustness to outliers. Loss choice (MSE, quantile, cross-entropy) governs regression-to-the-mean bias, with MSE collapsing multimodal or chaotic dynamics towards mean trajectories (Yu et al., 22 Oct 2025).
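
The low-pass effect of larger patches can be checked numerically: averaging within a patch preserves a slow component while progressively averaging away a fast one. The toy NumPy illustration below is not tied to any particular model.

```python
import numpy as np

t = np.arange(1024)
slow = np.sin(2 * np.pi * t / 512)  # low-frequency component (trend/seasonality)
fast = np.sin(2 * np.pi * t / 8)    # high-frequency component
series = slow + fast

def patch_mean(x, patch_len):
    """Average the series within non-overlapping patches of length patch_len."""
    return x.reshape(-1, patch_len).mean(axis=1)

for p in (4, 32):
    coarse = patch_mean(series, p)
    # Correlation with the patch-averaged slow component rises toward 1 as the
    # patch grows, because the fast component is increasingly averaged out.
    print(p, np.corrcoef(coarse, patch_mean(slow, p))[0, 1].round(3))
```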

Synthetic pretraining data generated via Gaussian Process kernel composition and structural causal models (as in CauKer) enables systematic scaling law analysis. Zero-shot accuracy follows power-law scaling in dataset size N (exponent α ≈ 0.07–0.1) and in model capacity P (exponent β ≈ 0.02–0.04), with synthetic-only pretraining matching or outperforming real-data variants for classification TSFMs (Xie et al., 4 Aug 2025).
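
Read as power laws in each factor separately (the standard functional form assumed in such scaling analyses; the exponents are those reported above), this amounts to

```latex
\mathrm{Acc}(N) \propto N^{\alpha},\quad \alpha \approx 0.07\text{--}0.1;
\qquad
\mathrm{Acc}(P) \propto P^{\beta},\quad \beta \approx 0.02\text{--}0.04.
```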

Interpretability studies relying on concept probing and Centered Kernel Alignment (CKA) have revealed a hierarchical organization of temporal semantics: early layers encode local autoregressive and trend features, while deep layers capture variance change-points and partial spectral structure. However, probe recoverability for spectral or warping concepts remains limited, especially in compositional regimes with superposed patterns (Pandey et al., 19 Nov 2025). Block-wise neuron redundancy is substantial, enabling aggressive pruning with negligible loss (Wiliński et al., 19 Sep 2024).
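
The linear CKA similarity used in such probing studies has a simple closed form; a minimal sketch of the standard definition, comparing two layers' activations on the same inputs, is:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between representations
    X (n_samples, d1) and Y (n_samples, d2) of the same inputs."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```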

4. Forecast Types, Expressiveness, and Operational Considerations

TSFMs differ in the form of forecast produced, which constrains operational applicability:

Forecast Type | Definition / Output | Sufficient For | Limitations
Point | Deterministic vector (ŷ₁, ..., ŷ_H) | Mean prediction, pointwise accuracy (e.g., MSE) | No native uncertainty; cannot answer pathwise questions
Quantile | Marginal quantiles Q_k(τ) for k ≤ H | Pointwise intervals, marginal calibration | Cannot determine joint/temporal events
Parametric | Marginal p(y_k ∣ ·) via fitted distribution | Probabilistic intervals, calibration | Marginal only; joint/pathwise events require extra dependence assumptions
Trajectory | Joint trajectories Y_{1:H}^{(m)} ∼ p(· ∣ X_{1:T}) | All pathwise questions; event/run-length analysis | Natively supported by only a minority of current TSFMs

Trajectory ensembles are strictly more expressive: marginals can always be recovered from the joint distribution, but the reverse requires unverifiable copula or dependence assumptions. Many operational tasks (simultaneous pathwise intervals, path-dependent event probabilities, or scenario generation) require trajectory-level output (Perez-Diaz et al., 22 Oct 2025). Most current TSFMs produce only point or parametric marginals; only a minority directly sample paths.
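
A small illustration of the gap: a path-dependent event probability (here, exceeding a threshold at any step of the horizon) falls out directly from sampled trajectories, whereas per-step quantiles alone cannot answer it. The NumPy sketch below fabricates correlated paths purely for illustration; any TSFM with a trajectory head could supply them instead.

```python
import numpy as np

rng = np.random.default_rng(0)
M, H = 1000, 24  # number of sampled trajectories, forecast horizon
trajectories = np.cumsum(rng.normal(size=(M, H)), axis=1)  # stand-in for model samples

# Marginal quantities are recoverable from the joint samples...
q90_per_step = np.quantile(trajectories, 0.9, axis=0)  # per-step 90% quantile

# ...but pathwise questions need the joint samples themselves.
threshold = 5.0
p_exceed_any = np.mean((trajectories > threshold).any(axis=1))
print(q90_per_step[:3].round(2), p_exceed_any)
```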

5. Empirical Capabilities and Limitations Across Domains

Extensive benchmarking reveals both the breadth and the bounds of TSFM generalization:

  • Mobility and Flow Forecasting: Moirai and TimesFM outperform all baselines on multivariate crowd flow tasks under strict zero-shot settings, reducing RMSE and MAE by up to 33% and 39%, respectively, and achieving up to 49% CPC improvement (Luca et al., 1 Jul 2025). Transforming spatially coupled origin-destination (OD) flows into collections of purely temporal series enables scalable forecasting without spatial graphs.
  • Financial Time Series: Pretrained, fine-tuned TSFMs (e.g., Tiny Time Mixers) achieve 25–50% error reduction in small-data regimes and reach comparable performance with 3–10 fewer years of data than untrained baselines. However, in certain tasks (volatility, yield spread), specialized econometric models (e.g., GARCH, ECM) can still outperform TSFMs, highlighting the need for further domain-specific innovation (Marconi, 9 Jul 2025).
  • Astronomical Data: Chronos and Chronos-Bolt, without any astronomy pretraining, deliver state-of-the-art out-of-distribution source detection and near-best unsupervised classification of variable stars on ZTF light curves, significantly outperforming domain-specific transformers and traditional hand-crafted features in several OOD settings (Li et al., 7 Oct 2025).
  • Building Energy Management: In zero-shot settings, TSFMs only marginally outperform or even underperform test-time-fitted statistical models on unseen modalities and covariate-rich environments. Their covariate integration remains a bottleneck, though zero-shot representations enable competitive downstream classification (Mulayim et al., 12 Jun 2025). Fine-tuning (full or LoRA) brings TSFMs to state-of-the-art on inference-limited data, as demonstrated by 50% error reduction in RMSSE/MASE on building energy forecasts (Park et al., 31 May 2025).
  • Anomaly Detection: TSFMs in their current form are consistently outperformed by baselines such as XGBoost or autoencoders in both detection and prediction tasks for rare events, due to limited interpretability, high resource requirements, and the lack of explicit anomaly modules (Shyalika et al., 26 Dec 2024).
  • Macroeconomic and Regime-Shifted Domains: In zero-shot macroeconomic forecasting, TSFMs like Moirai and Chronos match or exceed ARIMA, VAR, and central bank nowcast benchmarks in stable regimes; however, performance degrades post-structural break, owing to over-reliance on pretraining priors and absence of local calibration (Jetwiriyanon et al., 30 May 2025).

6. Ensemble Enhancement, Evaluation, and Benchmarking Best Practices

Statistical ensemble and hybrid techniques can substantially enhance TSFM reliability, uncertainty quantification, and bias correction:

  • Bagging and Stacking: Bootstrap aggregation of TSFM outputs reduces variance, especially on long-context sequences; regression-based stacking with traditional forecasters yields the lowest MSE (Modi et al., 18 Aug 2025).
  • Residual Modeling: Boosting-style residual correction and iterative feedback address systematic model bias, often yielding up to 67% MSE reduction.
  • Prediction Intervals: Both parametric (variance head) and non-parametric conformal methods (split conformal prediction) paired with TSFMs deliver well-calibrated (≈95% nominal) intervals, with notable benefits in low-data regimes (Achour et al., 9 Jul 2025); a split-conformal sketch appears after this list.
  • Evaluation Protocols: Robust TSFM assessment mandates strict data-lineage tracking (to avoid pretrain-test leakage), spatiotemporal holdouts, rolling time splits, and use of pathwise metrics (e.g., Energy Score, CRPS) tailored to forecast type (Meyer et al., 15 Oct 2025, Perez-Diaz et al., 22 Oct 2025). Recent work calls for public data registers, cryptographic data hashes, and future-split benchmarks to mitigate leakage, cross-domain contamination, and "memorization of global patterns."
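
The split-conformal recipe mentioned above is short enough to sketch end to end: compute absolute residuals of the frozen forecaster on a held-out calibration window and use their (1 − α) empirical quantile as a symmetric interval half-width. This is the generic procedure, not the exact implementation from the cited work; the function and argument names are illustrative.

```python
import numpy as np

def split_conformal_interval(cal_true, cal_pred, test_pred, alpha=0.05):
    """Symmetric split-conformal intervals from calibration residuals."""
    residuals = np.abs(np.asarray(cal_true) - np.asarray(cal_pred))
    n = len(residuals)
    # Finite-sample-corrected quantile level, capped at 1.0 for small n.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    half_width = np.quantile(residuals, level)
    test_pred = np.asarray(test_pred)
    return test_pred - half_width, test_pred + half_width

lo, hi = split_conformal_interval(cal_true=[1.0, 2.0, 3.0],
                                  cal_pred=[1.1, 1.8, 3.3],
                                  test_pred=[4.0, 5.0])
```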

7. Open Challenges and Prospects

Several unresolved challenges define the TSFM research frontier:

  • Scalability and Efficiency: Redundant representational blocks can be pruned to halve model size and reduce inference time with negligible loss, but further progress awaits sparsity-inducing regularization and adaptive patching (Wiliński et al., 19 Sep 2024, Feng et al., 30 Sep 2025).
  • Representation and Interpretability: Current TSFMs capture atomic (trend, mean, AR) concepts linearly and hierarchically, but composition of multiple concepts degrades representation and parameter recoverability (Pandey et al., 19 Nov 2025). Richer, nonlinear, or causal probes may be required to fully expose internal semantics.
  • Forecast Type-Task Alignment: Alignment of forecast form and downstream application remains incomplete, as marginal-only models cannot address path-dependent queries natively (Perez-Diaz et al., 22 Oct 2025).
  • Benchmarks and Data: Existing datasets are contaminated by pretrain/evaluation overlap and lack global out-of-sample splits. Methodologically robust benchmarks—spanning multiple domains, truly future data, and public registers of usage—are essential for valid progress (Meyer et al., 15 Oct 2025).
  • Covariate, Multimodal, and Multi-task Integration: Modular, causally-aware frameworks (e.g., CoRA) provide the template for integrating arbitrary external information—covariates, text, vision—but further research is needed to generalize adaptive gating and temporal aggregation for joint multimodal adaptation (Qin et al., 14 Oct 2025).

TSFMs thus represent a unifying paradigm for time series analysis: highly scalable, transferable, and adaptive, but with fundamental limitations in anomaly handling, pathwise reasoning, covariate integration, and realistic benchmarking that demand ongoing methodological innovation.
