ForecastPFN: Zero-Shot Time Series Forecasting

Updated 31 January 2026
  • ForecastPFN is a zero-shot neural forecasting model that uses synthetic data to approximate Bayesian inference for time series prediction.
  • It employs an encoder-only transformer trained on diverse synthetic data, enabling efficient predictions from limited observations with a single forward pass.
  • ForecastPFN outperforms traditional models in short-data regimes and has inspired extensions to multivariate and latent-space forecasting frameworks.

ForecastPFN is a neural forecasting model founded on the Prior-data Fitted Network (PFN) paradigm, which enables zero-shot time series prediction via synthetic data pretraining. It is designed to approximate Bayesian posterior inference for time series, allowing for immediate application to new datasets without model retraining or fine-tuning. The core innovation is the use of a highly diverse synthetic data generator to train a transformer architecture, equipping ForecastPFN with the ability to generalize across real-world patterns and resource-constrained regimes. Subsequent developments integrate ForecastPFN mechanisms into multivariate, foundation, and latent-space architectures.

1. The PFN Framework and ForecastPFN Formulation

ForecastPFN exploits the PFN framework, wherein a neural network is trained entirely on data sampled from a known parametric prior to emulate Bayesian inference in a single pass. For univariate time series, the PFN aims to compute the posterior predictive distribution for future values $y_*$ given observed data $D_\text{in}$:

p(y_* \mid D_\text{in}) = \int p(y_* \mid \phi)\, p(\phi \mid D_\text{in})\, d\phi

The model $q_\theta$ is fitted such that

q_\theta(y_* \mid D_\text{in}) \approx \mathbb{E}[y_* \mid D_\text{in}]

ForecastPFN operates strictly in zero-shot mode: the weights $\theta$ are fixed after pretraining, and for any new input, prediction is performed with a single transformer forward pass. This design supports forecasting from extremely limited observations (down to 36 points), enabling robust, fast inference in data-scarce scenarios (Dooley et al., 2023).

2. Synthetic Data Generation

Central to ForecastPFN is a high-diversity, parametric generator for synthetic time series. Each synthetic sample $y_t$ is constructed as the product of an underlying smooth signal $\psi(t)$ and multiplicative noise $z_t$:

y_t = \psi(t) \cdot z_t

The signal comprises an additive-exponential trend, multiple periodicities (weekly, monthly, yearly, with harmonics), and individualized seasonal amplitudes. Fourier coefficients for each frequency are sampled from a zero-mean Gaussian and normalized such that their sum of squares is unity. Noise is generated via a Weibull distribution centered at one:

z \sim \mathrm{Weibull}(1, k), \quad z_t = 1 + (z - \bar{z}), \quad \bar{z} = (\ln 2)^{1/k}

Key global and local hyperparameters are sampled per series, with full details enumerated for reproducibility. This broad prior captures trends, seasonality, non-stationarity, and stochastic variation, essential for generalization across real application domains (Dooley et al., 2023).
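As an illustration only, the following NumPy sketch mirrors the structure described above (trend times seasonality, multiplied by Weibull noise centered at one). The function name, the chosen periods and harmonics, and all hyperparameter ranges are assumptions for readability, not the exact prior of Dooley et al. (2023).

```python
import numpy as np

def sample_synthetic_series(T=200, k=2.0, rng=None):
    """Sketch of a ForecastPFN-style synthetic series: smooth signal x noise.

    Assumed simplifications: one linear-plus-exponential trend, a small fixed
    set of calendar periods with two harmonics each, and Gaussian Fourier
    coefficients normalized to unit energy. Ranges are illustrative only.
    """
    rng = rng or np.random.default_rng()
    t = np.arange(T, dtype=float)

    # Trend: additive linear component plus a mild exponential component.
    a, b = rng.uniform(-0.01, 0.01), rng.uniform(-0.001, 0.001)
    trend = 1.0 + a * t + (np.exp(b * t) - 1.0)

    # Seasonality: weekly / monthly / yearly periods, two harmonics each.
    seasonal = np.zeros(T)
    for period in (7.0, 30.5, 365.25):
        coeffs = rng.normal(size=4)                 # sin/cos for 2 harmonics
        coeffs /= np.sqrt(np.sum(coeffs ** 2))      # normalize to unit energy
        for h in (1, 2):
            seasonal += coeffs[2 * (h - 1)] * np.sin(2 * np.pi * h * t / period)
            seasonal += coeffs[2 * h - 1] * np.cos(2 * np.pi * h * t / period)
    amplitude = rng.uniform(0.05, 0.5)              # per-series seasonal amplitude
    psi = trend + amplitude * seasonal              # noiseless signal psi(t)

    # Multiplicative Weibull noise centered at one: z_t = 1 + (z - median).
    z = rng.weibull(k, size=T)                      # scale 1, shape k
    z_t = 1.0 + (z - np.log(2) ** (1.0 / k))
    return psi * z_t, psi                           # noisy series and clean target
```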

3. Bayesian Training Objective

ForecastPFN is trained via empirical risk minimization over the synthetic prior, minimizing mean squared error:

L(\theta) = \mathbb{E}_{\phi \sim p(\phi),\, D \sim p(D \mid \phi)}\left[ \sum_{t=\ell+1}^{\ell+H} \bigl(y_t - q_\theta(t, D_{1:\ell})\bigr)^2 \right]
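This objective is what connects training to the Bayesian approximation of Section 1. By the standard bias-variance decomposition (stated here for completeness rather than quoted from the paper), for any predicted value $q$,

\mathbb{E}\left[(y_t - q)^2 \mid D_{1:\ell}\right] = \mathrm{Var}(y_t \mid D_{1:\ell}) + \bigl(q - \mathbb{E}[y_t \mid D_{1:\ell}]\bigr)^2

so the minimizer of the expected squared error is the posterior mean, and driving $L(\theta)$ down pushes $q_\theta(t, D_{1:\ell})$ toward $\mathbb{E}[y_t \mid D_{1:\ell}]$.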

During training, noise is omitted from the targets (loss computed against $\psi(t)$), expediting convergence. The transformer employs robust input normalization (outlier removal, z-score clipping at $3\sigma$), mitigating numeric instability across variable real-world scales.
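The robust normalization is described only at a high level, so the following is a hedged sketch of one plausible implementation: z-score the history, clip values beyond the quoted $3\sigma$, and re-standardize. The helper name and the re-standardization step are assumptions.

```python
import numpy as np

def robust_normalize(history, clip_sigma=3.0):
    """Sketch of the robust input scaling described above (assumed details)."""
    x = np.asarray(history, dtype=float)
    mu, sigma = x.mean(), x.std() + 1e-8
    z = (x - mu) / sigma
    z = np.clip(z, -clip_sigma, clip_sigma)          # treat extreme points as outliers
    z = (z - z.mean()) / (z.std() + 1e-8)            # re-standardize after clipping
    return z, (mu, sigma)                            # stats needed to invert at inference
```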

4. Transformer Architecture and Implementation

ForecastPFN uses an encoder-only transformer consisting of two blocks with four attention heads each. Input tokens embed timepoint features (year, month, day, weekday, day-of-year), robustly scaled values, and an explicit query token for each future prediction. The output is a scalar for each prediction. The embedding dimension is typically set to $128$, and the feed-forward layers expand to $32 \times d_{\text{emb}}$, then $8 \times d_{\text{emb}}$.
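A compact PyTorch sketch with these dimensions is given below. The calendar-feature embedding, the placement of the $32\times$ and $8\times$ expansions, and the query-token handling are one possible reading of the description, not the released implementation.

```python
import torch
import torch.nn as nn

class ForecastPFNStyleEncoder(nn.Module):
    """Illustrative encoder-only model: 2 blocks, 4 heads, d_emb = 128.

    Assumptions: calendar features arrive as 5 pre-scaled numbers per step
    (year, month, day, weekday, day-of-year), values are already robustly
    scaled, and query positions use a learned placeholder value embedding.
    """

    def __init__(self, d_emb=128, n_heads=4, n_layers=2):
        super().__init__()
        self.time_embed = nn.Linear(5, d_emb)                 # 5 calendar features per step
        self.value_embed = nn.Linear(1, d_emb)                # robustly scaled value
        self.query_value = nn.Parameter(torch.zeros(d_emb))   # stands in for unknown future values
        layer = nn.TransformerEncoderLayer(
            d_model=d_emb, nhead=n_heads,
            dim_feedforward=32 * d_emb,                       # expand to 32 x d_emb
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(                            # contract via 8 x d_emb to a scalar
            nn.Linear(d_emb, 8 * d_emb), nn.ReLU(), nn.Linear(8 * d_emb, 1)
        )

    def forward(self, hist_time, hist_value, query_time):
        # hist_time: (B, L, 5), hist_value: (B, L, 1), query_time: (B, H, 5)
        hist_tok = self.time_embed(hist_time) + self.value_embed(hist_value)
        query_tok = self.time_embed(query_time) + self.query_value
        tokens = torch.cat([hist_tok, query_tok], dim=1)
        encoded = self.encoder(tokens)
        return self.head(encoded[:, hist_tok.size(1):]).squeeze(-1)   # (B, H) forecasts
```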

Training is performed on 300,000 series of length $T = 200$, yielding $\approx 30$ million sliding-window tasks. The optimizer is Adam with a learning rate of $10^{-4}$ over $600$ epochs; the batch size is $1024$ (Dooley et al., 2023).
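For concreteness, a schematic training loop under a scaled-down version of these settings might look as follows. It reuses the hypothetical sample_synthetic_series and ForecastPFNStyleEncoder sketches above, fabricates daily calendar features, and omits the robust value normalization, so it is illustrative rather than faithful.

```python
import numpy as np
import pandas as pd
import torch

def calendar_features(dates):
    """Stack (year, month, day, weekday, day-of-year), coarsely scaled to ~[0, 1]."""
    f = np.stack([dates.year / 2000.0, dates.month / 12.0, dates.day / 31.0,
                  dates.weekday / 7.0, dates.dayofyear / 366.0], axis=-1)
    return torch.tensor(f, dtype=torch.float32)

model = ForecastPFNStyleEncoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)        # learning rate from the text
L, H = 100, 10                                               # window and horizon (assumed)
dates = pd.date_range("2000-01-01", periods=200, freq="D")   # fabricated daily index
feats = calendar_features(dates)

for step in range(1000):                                     # far shorter than 600 epochs
    xs, ys, qs, targets = [], [], [], []
    for _ in range(32):                                      # small batch (paper: 1024)
        series, clean = sample_synthetic_series(T=200)       # robust scaling omitted here
        start = np.random.randint(0, 200 - L - H)            # random sliding window
        xs.append(feats[start:start + L])
        ys.append(torch.tensor(series[start:start + L, None], dtype=torch.float32))
        qs.append(feats[start + L:start + L + H])
        targets.append(torch.tensor(clean[start + L:start + L + H], dtype=torch.float32))
    pred = model(torch.stack(xs), torch.stack(ys), torch.stack(qs))
    loss = ((pred - torch.stack(targets)) ** 2).mean()       # MSE against the noiseless signal
    optim.zero_grad(); loss.backward(); optim.step()
```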

5. Inference and Zero-Shot Procedure

At inference, ForecastPFN requires only the most recent $\ell$ observations and corresponding time indices:

  • For a desired forecast horizon $H$, provide query tokens for each $t^* = \ell+1, \ldots, \ell+H$
  • One forward pass through the fixed transformer generates $\{\hat{y}_{t^*}\}_{t^*=\ell+1}^{\ell+H}$

This approach yields deterministic predictions, with no need for retraining when applied to a new dataset. In practical deployments, input length $\ell$ can range from 36 to 1000 (Dooley et al., 2023).
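A minimal sketch of this procedure, reusing the hypothetical robust_normalize, calendar_features, and ForecastPFNStyleEncoder sketches above (all names assumed), could look like:

```python
import pandas as pd
import torch

@torch.no_grad()
def zero_shot_forecast(model, values, dates, horizon):
    """Single forward pass: no gradient steps, no per-dataset fitting.

    `values` is a 1-D array of recent observations, `dates` the matching daily
    DatetimeIndex. All helper names are illustrative, not the released API.
    """
    scaled, (mu, sigma) = robust_normalize(values)
    hist_time = calendar_features(dates).unsqueeze(0)                      # (1, L, 5)
    hist_value = torch.tensor(scaled[:, None], dtype=torch.float32).unsqueeze(0)
    future = pd.date_range(dates[-1], periods=horizon + 1, freq="D")[1:]   # next H timestamps
    query_time = calendar_features(future).unsqueeze(0)                    # (1, H, 5)
    pred = model(hist_time, hist_value, query_time)                        # one forward pass
    return pred.squeeze(0).numpy() * sigma + mu                            # approximately undo scaling
```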

6. Empirical Performance and Comparison

ForecastPFN delivers accuracy competitive with or superior to classical models such as ARIMA and to transformer-based forecasters, especially under restricted data budgets; the table below reports MSE under a 50-observation budget:

Dataset      ARIMA   FEDformer   ForecastPFN (50 pts)
ECL (50)     1.84    0.68        1.08
ETTh1 (50)   0.34    0.40        0.13

ForecastPFN consistently achieves the highest number of MSE-win counts on standard datasets (ECL, ETTh1/2, Exchange, Illness, Traffic, Weather), particularly in the regime where competitors are restricted to $50$–$250$ points or $1$–$30$ s of training time. Inference for a new dataset requires approximately $0.2$ s, with competitors needing $\sim 100\times$ longer to train (Dooley et al., 2023).

7. Strengths, Limitations, and Extensions

ForecastPFN is:

  • Fully zero-shot (requires no real data for pretraining or adaptation)
  • Robust to a range of real-world trends and periodicities due to the synthetic prior
  • Fast at inference

Known limitations:

  • Univariate only (no multivariate modeling without modification)
  • Trained on human-timescale (weekly, monthly, yearly) seasonalities; performance may degrade on series with exotic frequencies
  • Produces point forecasts; does not model uncertainty intervals
  • Transformer input length limited to approximately 1000 timesteps

Proposed extension avenues include multivariate generalization (see TimePFN (Taga et al., 22 Feb 2025)), exogenous covariate integration, probabilistic heads, sparse attention for longer sequences, and explicit handling of missing/irregular data.

8. Integration Into Latent and Foundation Architectures

Recent work integrates ForecastPFN paradigms into broader architectures such as LaT-PFN (Verdenius et al., 2024) and TimePFN (Taga et al., 22 Feb 2025):

  • LaT-PFN combines PFN and Joint Embedding Predictive Architecture (JEPA), operating in latent spaces with context aggregation and abstract time normalization, yielding enhanced zero-shot generalization and the emergence of discrete latent patch tokens representing local structure.
  • TimePFN extends the generative prior to multivariate series via Gaussian-process kernel banks and the Linear Model of Coregionalization, training a channel-mixed transformer for MTS zero- and few-shot forecasting.
  • TempoPFN applies PFN principles to linear RNN architectures, scaling synthetic pretraining to long sequence lengths and yielding robust zero-shot performance on benchmarks such as Gift-Eval (Moroshan et al., 29 Oct 2025).

This convergence of approaches broadens the applicability of PFN-style forecasting to foundation models, multivariate contexts, and latent representation learning.
