Pre-trained TSFMs: Unified Time Series Modeling

Updated 11 November 2025
  • Pre-trained TSFMs are foundational models that learn from large, diverse time series datasets to generate generalized, transferable representations.
  • They leverage advanced transformer architectures and specialized tokenization schemes to capture temporal patterns, handle multimodal inputs, and improve efficiency.
  • These models enable zero-shot and few-shot adaptation across tasks like forecasting, classification, anomaly detection, and decision support, often outperforming traditional methods.

A Pre-trained Time Series Foundation Model (TSFM) is a machine learning model trained on large, heterogeneous time series corpora using self-supervised or cross-modal objectives, with the explicit goal of developing generalized, transferable representations that can be rapidly adapted—often zero-shot or few-shot—for diverse forecasting, classification, anomaly detection, and decision-support tasks. TSFMs aim to unify historically specialized approaches within a single, versatile paradigm, leveraging architectural choices and pre-training strategies developed for sequence modeling at scale. This entry synthesizes the technical and practical state of pre-trained TSFMs, including their design, training regimes, adaptation mechanisms, evaluation, efficiency strategies, and interpretability, as well as emerging challenges and research problems.

1. Paradigms and Unifying Principles

TSFMs supersede task-specific models by exploiting massive cross-domain datasets and transferring pre-trained knowledge to new time series problems with minimal adaptation. Two main strategies are established:

  • From-scratch TSFM Pre-training: Models such as ForecastPFN, TimeGPT, TimesFM, GTT, and Lag-Llama are trained exclusively on time series, often exceeding 100 billion observations and hundreds of thousands of series (Ye et al., 3 May 2024, Kottapalli et al., 5 Apr 2025). Pre-training adopts encoder-only, decoder-only, or encoder–decoder Transformer architectures.
  • Foundation LLM Adaptation: This encompasses two approaches:

    1. Embedding-visible adaptation—embedding numeric time series patches into the LLM's representation space and performing partial or full fine-tuning (e.g., FPT, TEMPO, LLM4TS, TimeLLM).
    2. Text-visible adaptation—serializing time series as prompts, then leveraging in-context learning or prompt-based prediction in a mostly frozen LLM (e.g., PromptCast, TWSN, LLMST) (Ye et al., 3 May 2024).
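As a concrete illustration of the text-visible route, the sketch below serializes a numeric history window into a natural-language forecasting prompt for a mostly frozen LLM. The prompt template, rounding precision, and `unit` argument are illustrative assumptions, not the exact PromptCast format.

```python
# Hypothetical text-visible serialization: turn a numeric window into a
# forecasting prompt for a (mostly frozen) LLM. Template details are
# illustrative, not the exact PromptCast format.

def serialize_window(values, horizon=1, unit="units"):
    """Render a numeric history as a natural-language forecasting prompt."""
    history = ", ".join(f"{v:.2f}" for v in values)
    return (
        f"The last {len(values)} observations were {history} {unit}. "
        f"Predict the next {horizon} value(s)."
    )

prompt = serialize_window([12.1, 13.4, 12.9, 14.2], horizon=2, unit="kWh")
print(prompt)
# The prompt is sent to the LLM's text interface and the numeric answer is
# parsed back out of the generated completion; little or no fine-tuning of
# the LLM is performed in this regime.
```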

Distinguishing attributes of TSFMs in contrast to classical models include:

  • Unified representation across diverse tasks (forecasting, classification, anomaly detection, etc.).

  • Generalization and transferability: robust zero-shot/few-shot adaptation across domains, frequencies, and modalities (Liu et al., 14 Mar 2025).

  • Cross-modal and multimodal integration: Incorporation of textual, visual, or other exogenous metadata via prompt engineering or auxiliary modules (Qin et al., 14 Oct 2025).

  • Explainability: Foundation models can generate rationales (e.g., chain-of-thought) or support attribution, offering interpretability beyond traditional black-box models (Ye et al., 3 May 2024).

2. Architectures and Tokenization

The dominant TSFM backbones are Transformer variants, with emerging alternatives based on state-space models and vision transformers:

  • Transformer Backbones: Encoder-only, decoder-only, and encoder–decoder designs are all in use (as noted in Section 1), with state-space models (FlowState) and vision transformers (VisionTS++) as emerging alternatives.
  • Tokenization Schemes:
    • Patch slicing: Partition length-T time series into consecutive, possibly overlapping patches (size p). Each patch is embedded via MLPs or convolutions, which is essential for long-context efficiency (cf. PatchTST, MOIRAI, TimesFM); a patchification sketch follows this list.
    • Lag vectors: Stack values at multiple lags into tokens preserving temporal causality (Lag-Llama).
    • Dynamic patching: Kairos introduces MoS-DP, where patch size is adaptively selected per region via a gating mechanism (Feng et al., 30 Sep 2025).
    • Continuous-time modeling: FlowState leverages linear SSM encoders and a functional-basis decoder for flexible time-scale and resolution invariance (Graf et al., 7 Aug 2025).
    • Multivariate-to-image: VisionTS++ colorizes multivariate series into images for MAE-ViT architectures; subfigure-to-color assignment preserves inter-variate dependencies (Shen et al., 6 Aug 2025).
  • Position/Time Encoding:
    • Standard sinusoidal encoding:

      $PE_{i,2k} = \sin\left(\frac{i}{10000^{2k/d}}\right), \qquad PE_{i,2k+1} = \cos\left(\frac{i}{10000^{2k/d}}\right)$
    • Instance-adaptive rotary encoding: Kairos infers the dominant frequencies of the input and modulates the RoPE angular frequencies accordingly.
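
A minimal NumPy sketch of patch slicing and the sinusoidal encoding above; the patch size, stride, and model width below are illustrative choices rather than values prescribed by any particular TSFM.

```python
import numpy as np

def patchify(x, p, stride=None):
    """Slice a length-T series into consecutive (possibly overlapping) patches of size p."""
    stride = stride or p  # non-overlapping by default
    starts = range(0, len(x) - p + 1, stride)
    return np.stack([x[s:s + p] for s in starts])   # shape: (num_patches, p)

def sinusoidal_pe(num_positions, d):
    """PE[i, 2k] = sin(i / 10000^(2k/d)), PE[i, 2k+1] = cos(i / 10000^(2k/d))."""
    i = np.arange(num_positions)[:, None]           # position index
    k = np.arange(d // 2)[None, :]                  # frequency index
    angles = i / (10000.0 ** (2.0 * k / d))
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

x = np.sin(np.linspace(0, 12 * np.pi, 512))         # toy length-512 series
patches = patchify(x, p=16)                         # 32 non-overlapping patch tokens
pe = sinusoidal_pe(len(patches), d=64)              # one encoding per patch position
```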

3. Pre-training Objectives and Losses

TSFMs are trained primarily using self-supervised objectives tailored to the time series domain:

  • Masked reconstruction (encoder pre-training, autoencoding): $\mathcal{L}_{mask} = \mathbb{E}_{\mathbf{x}}\left[\|\mathbf{x}_{mask} - f_\theta(\mathbf{x}_{corrupt})\|_2^2\right]$
  • Next-step forecasting (autoregressive decoder pre-training): $\mathcal{L}_{AR} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$
  • Contrastive learning (patch-level or instance-level invariance): $\mathcal{L}_{contra} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\mathbf{z}_i\cdot\mathbf{z}_i^{+}/\tau)}{\sum_{j=1}^{N}\exp(\mathbf{z}_i\cdot\mathbf{z}_j/\tau)}$
  • Knowledge distillation (efficient model transfer): $\mathcal{L}_{KD} = \alpha\,\mathcal{L}_{CE}(y, f_\theta(x)) + (1-\alpha)\,D_{KL}\big(\sigma(z/\tau)\,\|\,\sigma(z^T/\tau)\big)$
  • Quantile/pinball loss (probabilistic TSFM training): $L_\alpha(y, q) = (\alpha - \mathbb{I}\{y < q\})(y - q)$

Self-supervised mask/denoise, next-step, and contrastive objectives enable the learning of robust representations without labels (Liu et al., 14 Mar 2025, Ye et al., 3 May 2024). For probabilistic forecasting, models such as MOIRAI and VisionTS++ directly minimize quantile or CRPS losses (Das et al., 7 Nov 2025, Shen et al., 6 Aug 2025).
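
To make two of the objectives above concrete, the sketch below implements the masked-reconstruction and quantile (pinball) losses in NumPy; the masking convention (1 means "reconstruct this position") and the toy values are assumptions for illustration.

```python
import numpy as np

def masked_mse(x_true, x_pred, mask):
    """Masked reconstruction: MSE computed only over masked positions (mask == 1)."""
    return np.sum(mask * (x_true - x_pred) ** 2) / np.sum(mask)

def pinball_loss(y, q_pred, alpha):
    """Quantile/pinball loss: L_alpha(y, q) = (alpha - 1{y < q}) * (y - q), averaged."""
    indicator = (y < q_pred).astype(float)
    return np.mean((alpha - indicator) * (y - q_pred))

y = np.array([1.0, 2.0, 3.0])
q = np.array([1.5, 1.5, 1.5])
print(pinball_loss(y, q, alpha=0.9))   # under-prediction is penalized 9x more than over-prediction
```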

4. Data Construction: Real, Synthetic, and Mixed Regimes

Diversity and scale in pre-training data are essential for robust TSFM generalization. Notably:

  • Foundational Repositories: Monash, Google Trends, Wiki Pageviews, and GTT’s internal set, each covering more than 100 million points (Ye et al., 3 May 2024, Kottapalli et al., 5 Apr 2025).

  • Synthetic Data: TSFMs often augment real corpora with statistically or causally plausible synthetic series. Strategies include:

    • Statistical trend/seasonality/noise generation (ForecastPFN, TimesFM).
    • Gaussian process kernel composition with structured causal models (CauKer) (Xie et al., 4 Aug 2025).
    • GANs, VAEs, or closed-form GP sampling for arbitrary scale; statistical approaches dominate in TSFM practice (Liu et al., 14 Mar 2025).
  • Pre-training pipelines: Real and synthetic series are mixed at a tunable ratio during pre-training (e.g., Chronos reports an optimal synthetic fraction $\alpha_{synt} \approx 10\%$), with data composition affecting both in-distribution and zero-shot generalization. Purely synthetic pre-training enables robust scaling laws in both dataset size and model capacity (CauKer); a minimal generation-and-mixing sketch follows this list.
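
A minimal sketch of the statistical trend/seasonality/noise recipe and of mixing real and synthetic series at a tunable fraction; the parameter ranges are illustrative, and the 10% default simply echoes the Chronos finding cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

def synth_series(T=512):
    """One synthetic series: random linear trend + sinusoidal seasonality + Gaussian noise."""
    t = np.arange(T)
    trend = rng.normal(0.0, 0.01) * t
    season = rng.uniform(0.5, 2.0) * np.sin(2 * np.pi * t / rng.integers(8, 64))
    noise = rng.normal(0.0, 0.1, T)
    return trend + season + noise

def sample_batch(real_corpus, batch_size=32, alpha_synt=0.10):
    """Draw a pre-training batch in which roughly alpha_synt of the series are synthetic."""
    batch = []
    for _ in range(batch_size):
        if rng.random() < alpha_synt:
            batch.append(synth_series())
        else:
            batch.append(real_corpus[rng.integers(len(real_corpus))])
    return np.stack(batch)

real_corpus = [np.cumsum(rng.normal(size=512)) for _ in range(100)]  # stand-in for real data
batch = sample_batch(real_corpus)                                    # shape: (32, 512)
```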

5. Adaptation, Efficiency, and Fine-tuning

  • Adaptation Protocols:
    • Zero-shot/few-shot transfer: Pre-trained TSFMs can forecast on unseen series or tasks without additional training, sometimes outperforming task-specific models trained on abundant data (Ye et al., 3 May 2024, Marconi, 9 Jul 2025).
    • Parameter-efficient techniques: Adapters, prompt-tuning, and LoRA reuse most backbone weights, updating only small modules per downstream task (Ye et al., 3 May 2024); a minimal LoRA sketch follows this list.
    • Prune-then-fine-tune: Systematic structured pruning (of attention heads, FFN channels) to shrink the TSFM to a task-specific subnetwork before fine-tuning outperforms tuning the unpruned model and competes with small task-specific baselines, while reducing inference cost (Zhao et al., 29 May 2025).
    • Arbitration/ensemble methods: Synapse combines multiple specialized TSFMs, assigning predictive weights based on recent context-aware performance and adaptively sampling from their quantile outputs, systematically outperforming static ensembles and single models on benchmark forecasting (Das et al., 7 Nov 2025).
    • Covariate-aware adaptation: CoRA integrates exogenous covariates (regardless of modality) atop frozen TSFM backbones, employing zero-initialized adapters and a learned Granger-Causality-Embedding for interpretable, sample-efficient multivariate forecasting (Qin et al., 14 Oct 2025).
  • Domain-Specific Adaptations:
    • Financial forecasting: TSFMs such as Tiny Time Mixers provide leading sample-efficiency in data-sparse markets, though domain-specialized models (e.g., GARCH, ECM) may still lead for well-understood series (Marconi, 9 Jul 2025).
    • Cross-modal: VisionTS++ converts multivariate TS into images processed by ViT, using parallel quantile heads to provide full probabilistic forecasts and robust uncertainty estimates (Shen et al., 6 Aug 2025).
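
A minimal PyTorch sketch of a LoRA-style adapter around a frozen linear layer, illustrating the parameter-efficient adaptation mentioned above; the rank, scaling, and initialization are illustrative, and production implementations typically add dropout, weight merging, and per-module targeting.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # freeze backbone weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # frozen path W x plus low-rank correction scale * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(256, 256))
out = layer(torch.randn(4, 256))   # only A and B receive gradients during fine-tuning
```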

6. Benchmarks, Evaluation Metrics, and Empirical Findings

  • Benchmark Suites: TSFM-Bench (FoundTS), GIFT-Eval, Monash Archive, and domain-specific collections such as UCR/UEA, ETT, Traffic, Electricity, Weather, and specialized energy/financial datasets are often used (Li et al., 15 Oct 2024, Sartipi et al., 9 Jun 2025).
  • Evaluation Metrics:
    • Forecasting: MSE, MAE, RMSE, sMAPE, MAPE, MASE, CRPS, pinball loss (sMAPE and MASE are sketched after this list).
    • Classification: accuracy, F1-score.
    • Robustness: DTW distance, pattern-matching/peak recall metrics, dataset-specific probabilistic scoring rules (Liu et al., 14 Mar 2025, Shen et al., 6 Aug 2025).
  • Empirical Observations:
    • Zero-shot/few-shot TSFMs match or outperform classical LSTM, TCN, and even specialized Transformer baselines (often by 5–20%) on most benchmarks (Ye et al., 3 May 2024, Kottapalli et al., 5 Apr 2025).
    • Statistical models (e.g., MSTL) remain highly competitive in domains with strong periodicity and limited structural drift (e.g., electricity price), limiting TSFM advantage (Sartipi et al., 9 Jun 2025).
    • Transfer gains are largest in data-sparse settings with strong noise or nonstationarity (finance, healthcare, traffic), especially with appropriate synthetic augmentation (Marconi, 9 Jul 2025, Liu et al., 14 Mar 2025).
    • Structured pruning, retrieval-augmented generation (TS-RAG), and arbitration (Synapse) can lift TSFM performance in challenging domains and long-horizon scenarios (Ning et al., 6 Mar 2025, Das et al., 7 Nov 2025).
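
For reference, minimal NumPy sketches of two point-forecast metrics from the list above (sMAPE, and MASE with a seasonal naive scaling period m); the toy series is illustrative.

```python
import numpy as np

def smape(y, yhat):
    """Symmetric MAPE, in percent."""
    return 100.0 * np.mean(2.0 * np.abs(yhat - y) / (np.abs(y) + np.abs(yhat)))

def mase(y, yhat, y_train, m=1):
    """Forecast MAE scaled by the in-sample MAE of the seasonal naive forecast (period m)."""
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y - yhat)) / naive_mae

y_train = np.sin(np.arange(100) / 5.0)          # toy training history
y_true = np.sin(np.arange(100, 112) / 5.0)      # 12-step test horizon
y_pred = y_true + 0.05                          # deliberately biased forecast
print(smape(y_true, y_pred), mase(y_true, y_pred, y_train, m=1))
```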

7. Domains, Explainability, and Emerging Directions

  • Domain Taxonomy:
    • General purpose (cross-domain): ForecastPFN, TimesFM, GTT, TTM.
    • Finance: TWSN, TDML, CIGN; major challenges include cross-sequence dependencies and look-ahead bias.
    • Traffic/Mobility: LLM-Mob, LLMST, AuxMobLCast; focus on multimodal fusion and trajectory interactions.
    • Healthcare: METS, LLMFS; privacy and medical language integration.
    • Energy/IoT: UMEF, PromptCast; seasonalities and sensor drift.
  • Explainability:
    • Prompt-based chain-of-thought and local rationale generation in LLM-adapted TSFMs (Ye et al., 3 May 2024).
    • Post-hoc attribution via generalized additive surrogates and gradient-based techniques.
    • Retrieval-augmented generation yields explicit retrieved contexts as actionable rationales (TS-RAG, Synapse) (Ning et al., 6 Mar 2025, Das et al., 7 Nov 2025).
  • Current Open Challenges: Open research increasingly emphasizes the fusion of scale, adaptability, multimodality, and explainability, with future models anticipated to jointly optimize for universality, contextual adaptation, efficiency, and transparency in time series analytics.
