
Time-Series Foundation Models

Updated 26 November 2025
  • Time-Series Foundation Models (TSFMs) are large, Transformer-based architectures pre-trained on massive, heterogeneous time series data to enable zero-shot forecasting.
  • They leverage self-supervised objectives and attention mechanisms to learn generic temporal dynamics without needing additional feature engineering or spatial inputs.
  • Empirical evaluations of models such as Moirai and TimesFM report up to 33% lower RMSE and 49% higher CPC than state-of-the-art statistical and deep learning baselines on mobility forecasting tasks.

A Time-Series Foundation Model (TSFM) is a large, pre-trained neural architecture—typically Transformer-based—trained on massive heterogeneous time series corpora using self-supervised objectives, with the goal of enabling robust zero-shot or few-shot generalization across diverse forecasting tasks and domains. TSFMs encode generic temporal dynamics through distributed representations, enabling practitioners to deploy a single model across new settings, often with minimal adaptation or feature engineering. The current research frontier integrates advances in architecture, pretraining, efficient adaptation, and interpretability, while empirical evaluations benchmark TSFMs for accuracy, robustness, and generalization under varying conditions.

1. Model Architectures and Pretraining Paradigms

TSFMs are built predominantly upon Transformer variants—either encoder-only, decoder-only, or encoder–decoder architectures. Notable implementations include Moirai and TimesFM, which encapsulate the main contemporary design paradigms (Luca et al., 1 Jul 2025):

  • Moirai is structured as a multi-layer Transformer encoder with standard scaled dot-product self-attention, linearly projected real-valued inputs, and standard sinusoidal or learned positional encoding (precise variant unspecified). Pretraining leverages the massive LOTSA corpus of both univariate and multivariate time series, optimizing a next-step forecasting objective in a self-supervised manner. The architecture’s specifics (e.g., number of layers and hidden dimension size) are not fully disclosed, but the embedding dimensionality is in the few-hundreds range.
  • TimesFM employs a decoder-only (autoregressive) Transformer with causal masked attention. Tokenization projects multivariate OD flow vectors into an embedding space using a learned linear mapping with positional encoding. Pretraining uses a large, heterogeneous set of time series benchmarks for standard language-model–style autoregressive prediction.

Both models use pure temporal inputs—no spatial graphs or coordinate information is provided—even for inherently spatiotemporal tasks like mobility flow. This “zero-shot” protocol relies solely on the evolution of each individual series or OD flow, allowing foundation models to act as generic temporal predictors.
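
To make the decoder-only paradigm concrete, the following is a minimal PyTorch sketch in the spirit of the TimesFM description above: causal masked self-attention over a learned linear projection of the raw values plus a positional encoding, with a linear head producing next-step forecasts. The class name, layer counts, and all hyperparameters are illustrative assumptions, not the published configuration.

```python
# Minimal decoder-only (autoregressive) Transformer forecaster sketch.
# All hyperparameters and the tokenization are illustrative assumptions.
import torch
import torch.nn as nn


class AutoregressiveTSForecaster(nn.Module):
    def __init__(self, n_series: int, d_model: int = 256, n_heads: int = 4,
                 n_layers: int = 4, max_len: int = 512):
        super().__init__()
        # Learned linear projection of the raw (multivariate) values at each step.
        self.value_embed = nn.Linear(n_series, d_model)
        # Learned positional embedding (a sinusoidal variant would also fit the description).
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Predict the next-step values from the representation at each position.
        self.head = nn.Linear(d_model, n_series)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, n_series) raw real-valued history.
        batch, T, _ = x.shape
        pos = torch.arange(T, device=x.device)
        h = self.value_embed(x) + self.pos_embed(pos)[None, :, :]
        # Causal mask so each position attends only to current and past steps.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        h = self.blocks(h, mask=mask)
        # Output at position t is the forecast for step t + 1.
        return self.head(h)


# One-step-ahead forecast for a batch of 8 histories of length 96 with 12 channels.
model = AutoregressiveTSForecaster(n_series=12)
history = torch.randn(8, 96, 12)
next_step = model(history)[:, -1, :]   # (8, 12): predicted values for step 97
```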

2. Zero-Shot Forecasting Formulation and Evaluation

Zero-shot forecasting with a TSFM proceeds by directly deploying the frozen pre-trained model on novel test series, without any further fine-tuning or gradient updates. Formally, for a test OD pair S^{(i)} with history length T:

\hat{\mathbf{s}}_{T+1}^{(i)} = f_\theta\left(\mathbf{s}_1^{(i)}, \ldots, \mathbf{s}_T^{(i)}\right)

where f_\theta denotes the pre-trained, frozen model; this is performed independently for each OD series, with no adaptation (Luca et al., 1 Jul 2025).
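
A minimal sketch of this protocol is shown below, reusing the AutoregressiveTSForecaster stand-in from Section 1 in place of a real pretrained checkpoint (in practice a released Moirai or TimesFM checkpoint would be loaded through its own tooling); the data and shapes are toy values for illustration.

```python
# Zero-shot application of a frozen model: no fine-tuning, no gradient updates,
# each OD series handled independently. The model below is only a stand-in for
# a pre-trained f_theta; weights here are random.
import torch

model = AutoregressiveTSForecaster(n_series=12)            # stand-in for pretrained f_theta
model.eval()                                                # frozen: inference only

test_od_series = [torch.randn(96, 12) for _ in range(100)]  # toy histories, shape (T, n_dest)

forecasts = []
with torch.no_grad():                                       # no adaptation of theta
    for s in test_od_series:
        out = model(s.unsqueeze(0))                         # add batch dim -> (1, T, n_dest)
        forecasts.append(out[0, -1, :])                     # \hat{s}_{T+1} for this OD pair
forecasts = torch.stack(forecasts)                          # (num_series, n_dest)
```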

Evaluation metrics include:

  • Root Mean Squared Error (RMSE):

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}

  • Mean Absolute Error (MAE):

\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|

  • Common Part of Commuters (CPC):

\mathrm{CPC}(\hat{T}, T) = \frac{2\sum_{i,j}\min(\hat{T}_{ij}, T_{ij})}{\sum_{i,j}\hat{T}_{ij} + \sum_{i,j}T_{ij}}

This protocol enforces a strict test of transfer learning and expressivity, revealing the capacity of TSFMs to capture relevant forecasting dynamics solely from pretraining.
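
The three metrics follow directly from the definitions above; a minimal NumPy sketch is given below (function names and the toy OD matrices are illustrative).

```python
# Straightforward NumPy implementations of RMSE, MAE, and CPC as defined above.
import numpy as np


def rmse(y_hat: np.ndarray, y: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))


def mae(y_hat: np.ndarray, y: np.ndarray) -> float:
    return float(np.mean(np.abs(y_hat - y)))


def cpc(t_hat: np.ndarray, t: np.ndarray) -> float:
    # t_hat, t: predicted and observed OD matrices (non-negative flows).
    return float(2.0 * np.minimum(t_hat, t).sum() / (t_hat.sum() + t.sum()))


# Example on a toy 3x3 OD matrix.
rng = np.random.default_rng(0)
t_true = rng.poisson(10, size=(3, 3)).astype(float)
t_pred = t_true + rng.normal(0, 1, size=(3, 3))
print(rmse(t_pred, t_true), mae(t_pred, t_true), cpc(np.clip(t_pred, 0, None), t_true))
```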

3. Empirical Performance and Comparative Analysis

On real-world mobility datasets (Bike NYC, Taxi Beijing, Spanish National OD), Moirai and TimesFM substantially outperform both state-of-the-art statistical models and deep learning baselines (e.g., MSAGGN), achieving up to 33% lower RMSE, 39% lower MAE, and 49% higher CPC (Luca et al., 1 Jul 2025). A quantitative summary is provided below:

Model    | RMSE (Bike NYC / Taxi BJ / Spain OD) | CPC (Bike NYC / Taxi BJ / Spain OD) | MAE (Bike NYC / Taxi BJ / Spain OD)
MSAGGN   | 8.02 / 14.12 / 28.05                 | 0.68 / 0.58 / 0.45                  | 3.59 / 11.95 / 13.26
TimesFM  | 6.18 / 9.46 / 23.29                  | 0.70 / 0.62 / 0.61                  | 3.04 / 7.88 / 9.97
Moirai-L | 6.09 / 9.32 / 21.74                  | 0.72 / 0.62 / 0.67                  | 3.04 / 7.34 / 9.94

Moirai reduces RMSE by up to 32.9% and MAE by up to 39.9% (Taxi BJ), and improves CPC by up to 48.9% (Spain OD), relative to the deep learning baselines. These results demonstrate that even in the absence of explicit spatial context or fine-tuning, foundation models can recover generic OD flow structures.
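
These relative gains are simple ratios over the baseline figures; the short computation below reproduces them approximately from the table above (the paper's exact percentages may be derived from unrounded values or additional baselines, so small differences are expected).

```python
# Illustrative computation of relative improvements from the table above.
baseline_cpc_spain, moirai_cpc_spain = 0.45, 0.67
cpc_gain = (moirai_cpc_spain - baseline_cpc_spain) / baseline_cpc_spain
print(f"CPC improvement on Spain OD: {cpc_gain:+.1%}")    # approximately +48.9%

baseline_rmse_bj, moirai_rmse_bj = 14.12, 9.32
rmse_change = (moirai_rmse_bj - baseline_rmse_bj) / baseline_rmse_bj
print(f"RMSE change on Taxi BJ: {rmse_change:+.1%}")      # roughly -34% from rounded table values
```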

Key properties enabling this include:

  • Broad pretraining: TSFMs internalize temporal structure spanning thousands of domains.
  • Attention mechanism: Flexible, unconstrained re-weighting of past lags.
  • Inductive bias: Transformers natively model long-range dependencies present in spatiotemporal data.

Qualitative sensitivity analysis in (Luca et al., 1 Jul 2025) notes that performance degrades gracefully with shortened context length and remains robust under removal of spatial features.

4. Architectural, Representational, and Training Insights

Architecturally, Moirai and TimesFM rely on standard Transformer blocks with learned linear input projections and positional encoding. Explicit hyperparameters remain unspecified, but the reported results support the generality and scalability of the underlying architectures for time series. Pretraining is distinguished by its self-supervised, domain-agnostic paradigm, with Moirai trained on the LOTSA corpus and TimesFM on a mixture of real-world series.

  • Input encoding: Each OD flow is structured as an origin-wise multivariate sequence—the dimension of each time step equals the number of possible destinations—but no graph structure or coordinates are injected (see the sketch after this list).
  • Universal applicability: TSFMs can be repurposed for any per-origin univariate or multivariate forecasting task via a simple linear embedding of raw values, maximally leveraging pretraining without the need for hand-crafted features.
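
As a concrete illustration of the origin-wise input encoding, the sketch below splits an OD tensor into one multivariate series per origin, with the per-step dimension equal to the number of destinations and no graph structure or coordinates attached. Shapes, names, and the toy data are assumptions for illustration only.

```python
# Origin-wise decomposition of an OD tensor into per-origin multivariate series.
import numpy as np

T, n_origins, n_destinations = 168, 50, 50
od_flows = np.random.poisson(5, size=(T, n_origins, n_destinations)).astype(float)

# One multivariate series per origin: shape (T, n_destinations).
per_origin_series = [od_flows[:, o, :] for o in range(n_origins)]
assert per_origin_series[0].shape == (T, n_destinations)
```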

These design principles—eschewing over-engineered domain priors in favor of massive, heterogeneous pretraining—are the central pillars of strong zero-shot performance (Luca et al., 1 Jul 2025).

5. Generalization and Deployment Recommendations

The authors advocate recasting applications into independent time series forecasting problems whenever possible to exploit TSFM pretraining. Instead of modeling explicit spatio-temporal interactions, decomposing inputs into per-series or per-sensor time series activates the full breadth of foundation model generalization.

Guidelines include:

  • Use simple linear input projections and default positional encodings.
  • Deploy the largest available pretrained checkpoint, as scaling up parameter count (e.g., larger Moirai variants) consistently improves zero-shot accuracy.
  • In domains with scarce annotation, TSFMs can be used directly.
  • When some domain-specific labeling is available, light fine-tuning can drive further gains on top of robust out-of-the-box results.

This approach enables practical, scalable forecasting in operational contexts where curated features, spatial adjacency, or local fine-tuning data are unavailable.
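
A minimal sketch of such light fine-tuning is given below, assuming the AutoregressiveTSForecaster stand-in from Section 1 is in scope: the pre-trained backbone is frozen and only a small output head is trained on the scarce domain-specific labels. This is an illustrative recipe under those assumptions, not the fine-tuning procedure of Moirai or TimesFM.

```python
# Light fine-tuning sketch: frozen backbone, trainable output head, toy data.
import torch
import torch.nn as nn

model = AutoregressiveTSForecaster(n_series=12)     # stand-in for a pretrained checkpoint
for p in model.parameters():
    p.requires_grad = False                         # freeze the backbone...
for p in model.head.parameters():
    p.requires_grad = True                          # ...but keep the output head trainable

opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Toy labelled batch: histories (B, T, n_series) and next-step targets (B, n_series).
x = torch.randn(16, 96, 12)
y = torch.randn(16, 12)

for _ in range(10):                                 # a few light adaptation steps
    opt.zero_grad()
    pred = model(x)[:, -1, :]                       # forecast for step T + 1
    loss = loss_fn(pred, y)
    loss.backward()
    opt.step()
```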

6. Limits, Open Questions, and Implications

While (Luca et al., 1 Jul 2025) establishes TSFMs as state-of-the-art for zero-shot OD flow forecasting, certain limitations are noted:

  • No ablation is provided on hyperparameters, input length, or impact of missing data, though qualitative remarks indicate robustness to shortened contexts.
  • Exact architectural and optimization details (e.g., number of heads, learning rates, embedding dimensionality) remain undisclosed.
  • Applicability is maximized for settings where forecasting can be formulated as a time series problem; it is less clear how far this paradigm transfers to tasks that fundamentally require explicit modeling of inter-series dependencies.

Nonetheless, the paper provides a transferable blueprint for TSFM deployment: prioritize generic signal embeddings, leverage model scale, minimize domain-specific engineering, and rely on broad pretraining for generalization.

7. Broader Context and Research Trajectory

This work positions TSFMs as direct analogues of foundation models in NLP and vision, but for sequential data. The approach demonstrates the reach of unsupervised, domain-agnostic pretraining and attention-based architectures for general-purpose prediction tasks. By delivering strict zero-shot state-of-the-art performance on mobility tasks, Moirai and TimesFM highlight the emergence of a genuine “universal forecasting prior” within the foundation model framework.

Further progress hinges on wider access to detailed architectural configurations, benchmarking on broader and more diverse tasks, and the development of evaluation suites to probe generalization, robustness, and interpretability at greater depth (Luca et al., 1 Jul 2025).

