Time-Series Foundation Models

Updated 26 April 2026

Time-Series Foundation Models (TSFMs) are neural sequence models that use large-scale pretraining and Transformer architectures to generalize across diverse temporal tasks.
They employ strategies such as masked reconstruction and autoregressive forecasting to capture long-range dependencies and support zero-shot or few-shot learning.
TSFMs are applied in domains like finance and energy for forecasting and anomaly detection, though they face challenges with interpretability, covariate integration, and computational cost.

Time-Series Foundation Model (TSFM)

Time-Series Foundation Models (TSFMs) constitute a class of neural sequence models leveraging large-scale, heterogeneous pretraining to generalize across diverse time-series tasks. These models are architecturally inspired by the Transformer paradigm, incorporating domain-adapted design choices for handling continuous, multivariate, and often nonstationary temporal data. TSFMs are engineered to enable zero-shot or few-shot generalization, supporting downstream tasks such as forecasting, anomaly detection, imputation, and classification across application domains, including finance, energy, mobility, and industrial process monitoring (Shyalika et al., 2024, Achour et al., 9 Jul 2025, Mulayim et al., 12 Jun 2025, Marconi, 9 Jul 2025, Chen et al., 8 Apr 2026).

1. Architecture, Pretraining, and Adaptation

TSFMs typically employ a Transformer backbone, comprising stacks of multi-head self-attention and feed-forward layers. Temporal order is encoded using rotary or sinusoidal positional embeddings. Some architectures support dynamic patching or tokenization strategies, segmenting raw series into continuous or discretized patches (Feng et al., 30 Sep 2025, Shyalika et al., 2024). Self-attention is computed via projections to queries, keys, and values, followed by a softmax-weighted sum: $\text{Attention}(Q, K, V) = \mathrm{softmax}(QK^T / \sqrt{d_k}) V.$ Multi-head attention aggregates $h$ such heads, concatenated and linearly projected.
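The patching and attention steps described above can be illustrated with a minimal NumPy sketch. It assumes fixed-length patches, a random linear patch embedding (`W_in`), no positional encoding, and no causal mask; the names `patchify` and `multi_head_attention` are illustrative and do not correspond to any particular TSFM implementation.

```python
import numpy as np

def patchify(series, patch_len, stride):
    """Cut a 1-D series into (possibly overlapping) fixed-length patches."""
    starts = range(0, len(series) - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, applied over the last two axes."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k))
    return weights @ V, weights

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Run h heads in parallel, concatenate them, and apply the output projection."""
    T, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # (T, d_model) -> (h, T, d_k)
        return M.reshape(T, h, d_k).transpose(1, 0, 2)

    heads, _ = attention(split(X @ W_q), split(X @ W_k), split(X @ W_v))
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)  # back to (T, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
series = np.sin(np.arange(64) / 4.0)                # toy input series
patches = patchify(series, patch_len=16, stride=8)  # 7 patches of length 16
d_model, h = 8, 2
W_in = rng.normal(size=(16, d_model))               # assumed patch -> token embedding
tokens = patches @ W_in
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(tokens, W_q, W_k, W_v, W_o, h)
print(patches.shape, out.shape)
```

Each row of the attention weight matrix sums to one, so every output token is a convex combination of the value vectors within its head.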

Pretraining objectives include:

Masked reconstruction: Randomly mask input segments and train the model to reconstruct them, typically by minimizing mean squared error (MSE).
Autoregressive forecasting: Predict future points given the preceding context, typically by minimizing MSE.
Distributional modeling: Parameterize a predictive distribution over future values (e.g., Gaussian, Student-t) and maximize its likelihood, or predict quantiles by minimizing quantile loss (Shyalika et al., 2024, Achour et al., 9 Jul 2025, Feng et al., 30 Sep 2025).
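The three objectives listed above can be written as loss functions. The following is a hedged NumPy sketch: the function names are illustrative, the Gaussian case stands in for any parametric likelihood, and the quantile loss is given in its standard pinball form.

```python
import numpy as np

def masked_mse(y_true, y_pred, mask):
    """Reconstruction error computed only at masked positions (mask == True)."""
    mask = np.asarray(mask, dtype=bool)
    return float(np.mean((y_true[mask] - y_pred[mask]) ** 2))

def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of y under N(mu, sigma^2)."""
    return float(np.mean(0.5 * np.log(2 * np.pi * sigma ** 2)
                         + (y - mu) ** 2 / (2 * sigma ** 2)))

def pinball_loss(y, q_pred, q):
    """Quantile (pinball) loss for a predicted q-quantile, 0 < q < 1."""
    diff = y - q_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 5.0, 4.0])
mask = np.array([False, False, True, False])

print(masked_mse(y, y_hat, mask))                      # only the masked point counts
print(gaussian_nll(y, y_hat, np.ones_like(y)))
print(pinball_loss(y, y_hat, q=0.9))
```

The pinball loss is asymmetric: for q = 0.9, under-prediction is penalized nine times more heavily than over-prediction, which pushes the prediction toward the 90th percentile.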

TSFMs are pretrained on corpora comprising billions of time-stamped values spanning domains and frequencies, enabling the capture of long-range temporal dependencies, trend, seasonality, and cross-series correlations. Architectures range from decoder-only (e.g., TimesFM) and encoder-decoder (e.g., Chronos) designs to masked-encoder models (e.g., MOIRAI) and models with adaptive tokenization and positional encoding (e.g., Kairos) (Feng et al., 30 Sep 2025).

Adaptation to tasks relies on either direct zero-shot inference or fine-tuning. The fine-tuning process may use full model updates or parameter-efficient modules (e.g., LoRA adapters), and advanced schemes explicitly leverage multi-scale or sub-domain structure for improved transfer (Qiao et al., 17 Jun 2025, Lee et al., 3 Mar 2026).
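As an illustration of parameter-efficient fine-tuning, the sketch below wraps a frozen weight matrix with a LoRA-style low-rank update, $W + (\alpha / r) BA$. It is a forward-pass-only NumPy sketch under assumed hyperparameters (`r`, `alpha`); the class name `LoRALinear` is hypothetical and no training loop is shown.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                       # pretrained weights, kept frozen
        self.A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
        self.B = np.zeros((d_out, r))                    # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # Base projection plus the low-rank correction.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))
layer = LoRALinear(W, r=4)
x = rng.normal(size=(2, 64))
print(layer.trainable_params(), W.size)  # adapter parameters vs. full matrix
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only `A` and `B` would be updated during fine-tuning.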

2. Core Methodologies and Evaluation Regimes

TSFMs enable several prototypical usage modes:

Zero-shot forecasting: Direct inference on a target series without task-specific training (Achour et al., 9 Jul 2025, Shyalika et al., 2024).
Few-shot or transfer learning: Light adaptation or in-context learning for new domains or tasks (Tokic et al., 19 Nov 2025, Xu et al., 23 Feb 2026).
Parameter-efficient adaptation: Specialization through sub-domain modularity (e.g., MixFT) or multi-scale finetuning (e.g., MSFT) (Lee et al., 3 Mar 2026, Qiao et al., 17 Jun 2025).

Benchmark evaluation focuses on standardized datasets (e.g., ETT, M4, and Weather for forecasting; MSL and SMD for anomaly detection), with primary metrics including MSE, MAE, RMSE, and scale-normalized scores such as MASE. Anomaly detection/prediction is evaluated via F1 score, precision, and recall on labeled events (Shyalika et al., 2024). For probabilistic forecasting, CRPS and coverage rates of prediction intervals are used (Achour et al., 9 Jul 2025, Feng et al., 30 Sep 2025). An important caveat is the need for rigorous dataset partitioning to avoid information leakage from pretraining to evaluation splits (Meyer et al., 15 Oct 2025).
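Three of the metrics named above can be computed as follows. This is a minimal NumPy sketch: MASE is scaled by the in-sample error of a (seasonal) naive forecast with period `m`, CRPS uses the standard sample-based estimator $\mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$, and the function names are illustrative.

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean absolute scaled error: forecast MAE over in-sample naive-forecast MAE."""
    y_true, y_pred, y_train = (np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train))
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return float(np.mean(np.abs(y_true - y_pred)) / scale)

def crps_samples(samples, y):
    """Sample-based CRPS estimate for one observation y."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term1 - term2)

def coverage(y_true, lower, upper):
    """Fraction of observations falling inside their prediction intervals."""
    y_true, lower, upper = (np.asarray(a, dtype=float) for a in (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

print(mase([6, 7], [5, 7], y_train=[1, 2, 3, 4, 5]))
print(crps_samples([2.0, 2.0, 2.0], 3.0))   # degenerate ensemble: reduces to absolute error
print(coverage([1, 2, 3], [0, 0, 0], [1.5, 1.5, 1.5]))
```

A MASE below 1 indicates that the forecast beats the naive baseline on the training scale, which makes the score comparable across series with different units.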

The forecast output format (point, quantile, parametric, trajectory ensemble) is operationally decisive: trajectory ensembles natively support path-dependent tasks, while point or marginal quantile/parametric outputs cannot answer joint or scenario-based queries without extra assumptions (e.g., copulas) (Perez-Diaz et al., 22 Oct 2025).
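The difference between trajectory ensembles and marginal outputs can be made concrete with a path-dependent query such as "what is the probability that the series exceeds a threshold at any point in the horizon?". The sketch below uses synthetic Gaussian random-walk paths as a stand-in for a model's trajectory ensemble; the threshold, horizon, and independence assumption are illustrative choices, not taken from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, horizon, threshold = 5000, 24, 3.0

# Synthetic stand-in for a trajectory ensemble: each row is one sampled future path.
paths = np.cumsum(rng.normal(size=(n_paths, horizon)), axis=1)

# Path-dependent query answered directly from the joint samples.
p_path = float(np.mean(paths.max(axis=1) > threshold))

# What a marginal-only forecast exposes: per-step exceedance probabilities.
p_marginal = np.mean(paths > threshold, axis=0)

# Answering the same query from marginals requires an extra assumption;
# here the steps are (wrongly) treated as independent.
p_indep = float(1.0 - np.prod(1.0 - p_marginal))

print(f"from trajectories:        {p_path:.3f}")
print(f"marginals + independence: {p_indep:.3f}")
print(f"largest single-step prob: {p_marginal.max():.3f}")
```

The marginals only bound the answer from below (by the largest single-step probability), and the independence assumption overstates it because the steps of a random walk are strongly positively dependent.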

3. Algorithmic Innovations and Model Design

Recent advances introduce mechanisms to address the inherent heterogeneity and information density of time series:

Dynamic patching (MoS-DP): Adaptive tokenization per-instance, enabling finer granularity in regions of high information (Kairos) (Feng et al., 30 Sep 2025).
Instance-adaptive positional encoding (IARoPE): Per-series adaptation of positional signals, exploiting Fourier features (Feng et al., 30 Sep 2025).
Multi-scale finetuning (MSFT): Joint training across temporal resolutions, with masked attention and scale-adaptive adapters to target scale confounding and improve generalization (Qiao et al., 17 Jun 2025).
Federated pretraining with domain-aware aggregation (FedTRL): Bilevel regularization and prototype-based aggregation to mitigate intra- and inter-domain heterogeneity across distributed clients (Chen et al., 8 Apr 2026).
Data-driven sub-domain adaptation (MixFT): Bayesian Gaussian mixtures on pretrained embeddings partition fine-tuning data, yielding specialized adapters for improved OOD generalization (Lee et al., 3 Mar 2026).
Distillation: Horizon-weighted losses and temporal alignment between teacher and student latent states compress large TSFMs while maintaining long-range forecasting performance (Li et al., 19 Jan 2026).

For interpretability and efficiency, pruning strategies leverage representational redundancy (e.g., block-wise CKA similarity), and direct logit attribution traces output contributions to specific model components (Wiliński et al., 2024, Bao et al., 2 Feb 2026). Latent space steering, via interventions in embedding space, enables controlled manipulation of time-series features post hoc.
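The representational similarity underlying such pruning criteria can be measured with linear CKA. The following NumPy sketch implements the standard linear CKA formula on two activation matrices of shape (samples, features); how blocks are then selected for pruning is not specified here, and the random activations are purely illustrative.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices with the same number of rows."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
acts_a = rng.normal(size=(200, 16))                 # activations of one layer
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
acts_b = 2.0 * acts_a @ rotation                    # same information, rotated and rescaled
acts_c = rng.normal(size=(200, 16))                 # unrelated activations

print(linear_cka(acts_a, acts_b))   # close to 1: representations are equivalent
print(linear_cka(acts_a, acts_c))   # much lower: representations are unrelated
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so consecutive layers with scores near 1 carry largely the same information and are natural candidates for removal.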

4. Empirical Performance, Strengths, and Limitations

TSFMs have demonstrated state-of-the-art zero-shot or transfer performance in several domains, including:

Long-horizon forecasting on diverse benchmarks, with significant gains in data-constrained or calibration-heavy settings (e.g., conformal prediction) (Achour et al., 9 Jul 2025).
Crowd flow and mobility: TSFMs outperform statistical and deep learning baselines, achieving up to 33% lower RMSE and 49% higher CPC without spatial inputs (Luca et al., 1 Jul 2025).
Financial forecasting: Substantial sample efficiency and transfer, but task-specific models often surpass TSFMs except when domain-specific pretraining and adaptation are employed (Marconi, 9 Jul 2025, Rahimikia et al., 23 Nov 2025).

However, limitations are well documented:

Anomaly detection/prediction: TSFMs exhibit low interpretability and poor sample efficiency for rare events, and may be outperformed by classical models (weighted XGBoost, autoencoders) in both accuracy and compute cost (Shyalika et al., 2024).
Handling of covariates: Ad hoc approaches often fail to capture joint structure; classical physical or regression models remain superior in tasks requiring covariate integration (e.g., building energy thermal modeling) (Mulayim et al., 12 Jun 2025).
Heterogeneity and representational collapse: Without specialized federated, multi-scale, or mixture methodologies, naive finetuning on mixed domains leads to representation degradation and gradient conflict (Chen et al., 8 Apr 2026, Lee et al., 3 Mar 2026).
Output form constraints: Two-thirds of TSFMs produce only point or parametric forecasts, restricting operational utility for path-dependent or scenario-based risk analysis (Perez-Diaz et al., 22 Oct 2025).
Computational cost: State-of-the-art TSFMs comprise millions to billions of parameters; distillation and redundancy-aware pruning can alleviate this burden but not eliminate it (Bao et al., 2 Feb 2026, Li et al., 19 Jan 2026).
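The conformal prediction setting mentioned among the strengths above can be sketched with split conformal calibration on absolute residuals. This is a minimal NumPy illustration, not the procedure of the cited work: it assumes exchangeable calibration and test points, which time series generally violate, and the function name is hypothetical.

```python
import numpy as np

def split_conformal_interval(cal_true, cal_pred, test_pred, alpha=0.1):
    """Symmetric prediction intervals from held-out absolute residuals."""
    scores = np.abs(np.asarray(cal_true, dtype=float) - np.asarray(cal_pred, dtype=float))
    n = scores.size
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.inf if k > n else np.sort(scores)[k - 1]
    test_pred = np.asarray(test_pred, dtype=float)
    return test_pred - q, test_pred + q

# Toy calibration set: the point forecasts are all 0, so residuals are 1..9.
cal_true = np.arange(1, 10)
cal_pred = np.zeros(9)
lower, upper = split_conformal_interval(cal_true, cal_pred, test_pred=[10.0], alpha=0.2)
print(lower, upper)
```

The procedure wraps any point forecaster, including a zero-shot TSFM, without retraining; only a held-out calibration window is needed.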

5. Interpretability, Internal Semantics, and Theoretical Perspectives

Mechanistic analyses reveal:

Layer redundancy: Many intermediate layers can be ablated with negligible performance loss, indicating overparameterization and suggesting pruning opportunities (Bao et al., 2 Feb 2026, Wiliński et al., 2024).
Semantic progression: Early layers specialize in local, time-domain concepts (e.g., AR(1), trend, level shifts), while deeper layers encode higher-order dispersion and change points. Spectral and time-warped features remain challenging to recover linearly or disentangle in depth, and compositional concepts introduce representation interference (Pandey et al., 19 Nov 2025).
Design-induced biases: Choices of patch size, embedding (quantized vs. continuous), and loss function induce temporal, geometric, and regression-to-the-mean biases, as captured by theory and controlled experiments. Patch size governs frequency bias (smoothing vs. high-frequency retention), embedding determines representation geometry and motif copying, and loss function controls mean/median/mode bias (Yu et al., 22 Oct 2025).
Latent space steering and intervention: Synthetic interventions enable control over learned features (e.g., adding trend or periodicity), offering a low-cost alternative to retraining for manipulating model behavior (Wiliński et al., 2024).
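One simple way to realize such an intervention is to add a steering vector, computed as the difference of mean activations between inputs with and without a concept, to the hidden states. The sketch below uses synthetic activations and is an illustrative assumption; it is not necessarily the exact procedure of the cited work, and `steering_vector` and `steer` are hypothetical names.

```python
import numpy as np

def steering_vector(acts_with, acts_without):
    """Difference of mean activations: concept present minus concept absent."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)

def steer(hidden, vector, strength=1.0):
    """Shift hidden states along the steering direction."""
    return hidden + strength * vector

rng = np.random.default_rng(0)
d = 16
concept_dir = np.zeros(d)
concept_dir[0] = 2.0                                      # synthetic "trend" direction

acts_without = rng.normal(size=(100, d))                  # e.g., series without trend
acts_with = rng.normal(size=(100, d)) + concept_dir       # e.g., series with trend

v = steering_vector(acts_with, acts_without)
steered = steer(acts_without, v, strength=1.0)
print(np.argmax(np.abs(v)))                               # dominant steering coordinate
```

In a real model the steered hidden states would be passed through the remaining layers, so that the decoded forecast exhibits the injected feature; the `strength` parameter controls how pronounced the effect is.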

6. Benchmarking, Evaluation Challenges, and Best Practices

Evaluating TSFMs at scale poses acute challenges:

Dataset representativeness: Many popular benchmarks are narrow in scope (e.g., ETT covers only two electricity transformers), limiting claims of generalization (Meyer et al., 15 Oct 2025).
Information leakage and global pattern memorization: Overlap between pretraining and benchmark datasets, and shared exogenous shocks (e.g., COVID-19), risk inflating performance estimates (Meyer et al., 15 Oct 2025).
Recommendations: Enforce global time-based cutoffs, publish explicit data splits and hashes, adopt rolling cross-validation and domain-wise splits, report a minimal core set of robust error metrics, scale-invariant where possible (MSE, SMAPE, CRPS), and avoid per-series retraining in zero-shot evaluations (Meyer et al., 15 Oct 2025).
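Several of these recommendations are straightforward to operationalize. The sketch below shows a global time-based cutoff, a content hash for a published split, and rolling-origin evaluation folds; it uses only NumPy and the standard library, and the function names and fold parameters are illustrative assumptions.

```python
import hashlib
import numpy as np

def time_cutoff_split(timestamps, values, cutoff):
    """Split at a global cutoff: strictly earlier points go to the first part."""
    timestamps = np.asarray(timestamps, dtype="datetime64[s]")
    values = np.asarray(values)
    before = timestamps < np.datetime64(cutoff)
    return values[before], values[~before]

def split_hash(values):
    """SHA-256 of the raw float64 bytes, so that a split can be published and verified."""
    arr = np.ascontiguousarray(values, dtype=np.float64)
    return hashlib.sha256(arr.tobytes()).hexdigest()

def rolling_origin_folds(n, initial, horizon, step):
    """Expanding-window folds: (context indices, evaluation indices) per origin."""
    folds = []
    origin = initial
    while origin + horizon <= n:
        folds.append((np.arange(0, origin), np.arange(origin, origin + horizon)))
        origin += step
    return folds

timestamps = np.arange("2024-01-01", "2024-01-11", dtype="datetime64[D]")  # 10 days
values = np.arange(10, dtype=float)
pre, post = time_cutoff_split(timestamps, values, "2024-01-08")
folds = rolling_origin_folds(len(values), initial=4, horizon=2, step=2)

print(len(pre), len(post), len(folds))
print(split_hash(post)[:16])
```

Applying one cutoff across all series, rather than per-series splits, is what prevents shared shocks that occur after the cutoff from leaking into the context data.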

7. Outlook and Research Directions

Ongoing work seeks to address documented limitations and extend TSFMs to new frontiers:

Context- and task-aware architectures: Combining structured/unstructured metadata with time-series tokens, modular multitask heads, and prompt-based paradigms for flexible task specification (Mulayim et al., 12 Jun 2025).
Covariate-sensitive and multimodal pretraining: Joint modeling of exogenous signals and time series, and integration of domain-specific knowledge (physics, constraints) to enforce plausible generative behavior (Shyalika et al., 2024).
Federated and decentralized large-scale pretraining: Prototype-level alignment and domain-aware aggregation enable scalable and privacy-preserving TSFM training across organizational boundaries (Chen et al., 8 Apr 2026).
Efficient, explainable, and task-aligned output forms: Increased use of trajectory ensembles, scenario generation, and conformal interval quantification for operational decision-making (Perez-Diaz et al., 22 Oct 2025, Achour et al., 9 Jul 2025).
Pruning, distillation, and adaptation: Systematic removal of redundancy and model compression for broad applicability in resource-constrained settings (Li et al., 19 Jan 2026, Wiliński et al., 2024, Bao et al., 2 Feb 2026).

The consolidation of best practices for benchmarking, interpretability, and adaptation, together with advances in domain-aligned pretraining and output calibration, is likely to determine the trajectory of TSFMs in both academic research and high-stakes industrial deployments.