SeqFusion: Zero-Shot Time-Series Forecasting

Updated 7 March 2026

SeqFusion is a zero-shot time-series forecasting framework that fuses predictions from a curated set of specialized pre-trained models without needing task-specific training.
The architecture uses a shared embedding space to match target series with PTMs, performing sequential prediction with similarity-weighted aggregation.
Empirical results demonstrate competitive MSE scores across benchmarks while ensuring privacy, resource efficiency, and modular adaptability.

SeqFusion is a framework for zero-shot time-series forecasting that bypasses the need for task-specific training data by sequentially fusing predictions from a curated zoo of specialized pre-trained models (PTMs). Unlike conventional methods, which aggregate vast and diverse datasets for generalized pre-training—raising privacy and logistical concerns—SeqFusion shifts the focus to acquiring a collection of compact, specialized PTMs and adaptively combining them based on the temporal characteristics of each target time series (Huang et al., 4 Mar 2025).

1. Zero-Shot Time-Series Forecasting: Problem Formulation and Motivation

Zero-shot time-series forecasting seeks to predict future values for a target time series without using any additional in-task training data. Formally, given a multivariate input $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_C] \in \mathbb{R}^{T \times C}$ with $T$ historical observations of $C$ variates, the objective is to forecast $\mathbf{Y} = [\mathbf{y}_1, \dots, \mathbf{y}_C] \in \mathbb{R}^{H \times C}$ over horizon $H$ using only pre-existing models or representations.

Traditional forecasting methods—statistical or deep learning—require substantial in-task data for effective generalization. Recent zero-shot approaches mitigate this via generalized PTMs, but performance is contingent on the diversity of pre-training data and often incurs privacy and storage trade-offs. SeqFusion introduces a paradigm shift by collecting a zoo of lightweight PTMs, each usually trained on a distinct dataset or domain, and assembling predictions by strategically fusing PTMs most relevant to the target temporal dynamics.

2. Architectural Components and Inference Workflow

SeqFusion's pipeline consists of four distinct stages:

Model Zoo Construction: Accumulate $M$ one-variate PTMs, $\mathcal{M} = \{\phi_m: \mathbb{R}^{T \times 1} \to \mathbb{R}^{h \times 1}\}_{m=1}^M$, each pre-trained on a different dataset $\mathbf{X}_{(m)}$.
Representation Extraction and Matching: Both the target time series variates and the PTMs are mapped into a shared embedding space using a general extractor $\psi$. The similarity between embeddings (usually cosine similarity) determines the affinity between a target variate and each PTM.
Sequential Prediction: For every target variate $c$, select the top-$K$ PTMs whose embeddings are closest to the variate's embedding. Each selected PTM performs recursive multi-block prediction, chunking the prediction horizon $H$ into $\lceil H/h \rceil$ steps.
Fusion and Post-Processing: The $K$ outputs from the selected PTMs are fused using a similarity-weighted average for each forecasted time point. Finally, normalization procedures ensure data is returned in the original scale.

Stage	Description	Key Technical Detail
Zoo	Collect diverse, one-variate PTMs	Each PTM trained on unique dataset
Matching	Embed series and PTMs in shared space	Use general extractor $\psi$ with transferability loss
Prediction	Select top-$K$ PTMs per variate, predict sequentially	Recursive horizon division per PTM
Fusion	Weighted ensemble of top-$K$ PTM outputs	Similarity/temperature-based weighting, normalization
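The post-processing step can be pictured with a small round-trip sketch. The source only says that "normalization procedures" return forecasts to the original scale; the per-series z-score used here, and the helper names `normalize`, `denormalize`, `forecast_in_original_scale`, and `last_value_ptm`, are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def normalize(series, eps=1e-8):
    """Z-score a series; return the scaled values plus the statistics used."""
    mu = fmean(series)
    sigma = pstdev(series) + eps  # eps guards against constant series
    return [(v - mu) / sigma for v in series], mu, sigma


def denormalize(series, mu, sigma):
    """Invert normalize() using the stored mean and scale."""
    return [v * sigma + mu for v in series]


def forecast_in_original_scale(history, ptm):
    """Normalize the history, forecast in normalized space, then rescale."""
    scaled, mu, sigma = normalize(history)
    return denormalize(ptm(scaled), mu, sigma)


def last_value_ptm(window, h=3):
    """Hypothetical stand-in for a PTM: repeat the last observation h times."""
    return [window[-1]] * h


print(forecast_in_original_scale([10.0, 12.0, 14.0, 16.0], last_value_ptm))
```

Storing the mean and scale of the input window is what lets any model in the zoo operate on comparably scaled inputs while the final forecast is still reported in the units of the target series.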

3. Embedding Learning and PTM Transferability

The general extractor $\psi$ is an encoder-decoder architecture trained on a dataset distinct from the PTM pre-training sources. Its objective combines:

Reconstruction loss: Penalizes the error between the decoder's reconstruction and the original series.
Series-wise similarity loss: Pulls masked/unmasked augmentations of each series together in embedding space.
Transferability loss: Uses a pairwise transferability score, measuring how well PTM $\phi_m$ generalizes to the data of another PTM $\phi_{m'}$, so that similarity in embedding space is meaningful for PTM selection.

Embeddings are computed as follows:

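For each target variate $c$: $\mathbf{e}_c = \psi(\mathbf{x}_c)$
For each PTM $\phi_m$: $\mathbf{e}_m = \psi(\tilde{\mathbf{X}}_{(m)})$, with $\tilde{\mathbf{X}}_{(m)}$ a small batch from the PTM's original dataset.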

Distance or similarity is quantified via cosine similarity, though other distance metrics are also considered, enabling the selection of PTMs whose representations align most closely with those of the target variates.
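As a rough illustration of the matching step, the sketch below scores one target embedding against a list of PTM embeddings with cosine similarity. The embeddings are plain Python lists standing in for the extractor's outputs, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_scores(target_embedding, ptm_embeddings):
    """Score one target variate against every PTM embedding in the zoo."""
    return [cosine_similarity(target_embedding, e) for e in ptm_embeddings]


scores = similarity_scores([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
print(scores)
```

Because cosine similarity ignores vector magnitude, a PTM embedding that points in the same direction as the target scores highest regardless of its norm.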

4. Sequential Prediction and Multi-PTM Aggregation

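For each variate $c$, the top-$K$ PTMs are selected by maximizing similarity:

$\mathcal{M}_c = \operatorname*{arg\,top\text{-}K}_{m \in \{1, \dots, M\}} \; \mathrm{sim}(\mathbf{e}_c, \mathbf{e}_m)$

where $\mathbf{e}_c$ and $\mathbf{e}_m$ are the embeddings of variate $c$ and PTM $\phi_m$ under the extractor $\psi$, and $\mathrm{sim}$ is the similarity measure (typically cosine).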
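Selection by maximal similarity amounts to ranking the zoo by score and keeping the best few entries. A minimal sketch, assuming the similarity scores for one variate are already available as a list indexed by PTM (the function name `select_top_k` is hypothetical):

```python
def select_top_k(scores, k):
    """Return the indices of the k highest-scoring PTMs, best first."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the zoo size")
    ranked = sorted(range(len(scores)), key=lambda m: scores[m], reverse=True)
    return ranked[:k]


# Four PTMs scored against one target variate; keep the best two.
print(select_top_k([0.2, 0.9, -0.1, 0.7], k=2))
```

The selection runs independently for each variate, so different variates of the same multivariate series may be served by different subsets of the zoo.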

Recursive forecasting proceeds as follows: at each step, each selected PTM produces a block of $h$ predictions, the most recent $T$ points (history plus the predictions made so far) are re-assembled into a new input window, and the next block is forecast until the full horizon $H$ is covered.
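The recursion can be sketched as a rolling window over observed and predicted values. This is an illustrative sketch under assumptions, not the authors' implementation: `ptm` is any callable mapping a look-back window to a block of predictions, and `linear_ptm` is a toy stand-in for a pre-trained model.

```python
import math


def recursive_forecast(history, ptm, lookback, block, horizon):
    """Roll a block-wise predictor forward until `horizon` points exist."""
    if len(history) < lookback:
        raise ValueError("history is shorter than the look-back window")
    window = list(history[-lookback:])
    forecast = []
    for _ in range(math.ceil(horizon / block)):
        preds = list(ptm(window))[:block]
        forecast.extend(preds)
        # Slide the window: drop the oldest points, append the new block.
        window = (window + preds)[-lookback:]
    return forecast[:horizon]  # trim any overshoot from the last block


def linear_ptm(window):
    """Toy predictor: continue the last observed step for two points."""
    step = window[-1] - window[-2]
    return [window[-1] + step * (i + 1) for i in range(2)]


print(recursive_forecast([1, 2, 3, 4], linear_ptm, lookback=4, block=2, horizon=5))
```

Because later blocks are conditioned on earlier predictions, errors can compound as the horizon grows, which is one reason to combine several PTMs rather than rely on a single one.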

Fusion of the $K$ forecasted series is accomplished via a similarity-weighted average:

$\hat{\mathbf{y}}_c = \sum_{m \in \mathcal{M}_c} w_{c,m} \, \hat{\mathbf{y}}_c^{(m)}$

where

$w_{c,m} = \frac{\exp(s_{c,m} / \tau)}{\sum_{m' \in \mathcal{M}_c} \exp(s_{c,m'} / \tau)}$

and $\tau$ is a temperature hyperparameter controlling weight sharpness. Here $\mathcal{M}_c$ is the set of $K$ PTMs selected for variate $c$, $\hat{\mathbf{y}}_c^{(m)}$ is the forecast produced by PTM $\phi_m$, and $s_{c,m}$ is the similarity between the embeddings of variate $c$ and PTM $\phi_m$.
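A small sketch of the fusion step, assuming the weights take the usual softmax-with-temperature form over the similarity scores (similarity divided by the temperature, then normalized). The function names and the toy numbers are illustrative assumptions.

```python
import math


def fusion_weights(similarities, tau=1.0):
    """Softmax over similarity scores; smaller tau gives sharper weights."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [s / tau for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_forecasts(forecasts, similarities, tau=1.0):
    """Weighted average of equal-length forecasts, one per selected PTM."""
    weights = fusion_weights(similarities, tau)
    horizon = len(forecasts[0])
    return [
        sum(w * f[t] for w, f in zip(weights, forecasts))
        for t in range(horizon)
    ]


# Two PTM forecasts over a horizon of 2; the first PTM is the closer match.
print(fuse_forecasts([[1.0, 2.0], [3.0, 4.0]], similarities=[0.9, 0.1], tau=0.5))
```

With equal similarities the fusion reduces to a plain average, while a very small temperature approaches picking the single best-matching PTM.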

5. Empirical Evaluation and Performance Analysis

SeqFusion is evaluated on a suite of benchmark datasets in both multivariate and univariate settings. For multivariate scenarios, datasets include ETTh1/2, Exchange-Rate, Electricity (ECL), Traffic, Weather, and ILI, with a typical look-back window of 36 and horizons ranging from 6 to 48.

Baselines: Naïve (Last, Mean, SeasonalNaive), ARIMA, Prophet, deep networks (e.g., Transformer, PatchTST, iTransformer) trained on as few as 50 in-task points, and zero-shot methods such as Meta-N-BEATS, ForecastPFN, and GPT4TS.
PTM zoo: Comprises 10 PatchTST models trained on diverse one-variate subsets (M3, M4, Tourism), using an extractor trained on approximately 300k subseries with transferability loss and top-$K$ fusion for aggregation.

Key results (MSE):

Dataset	SeqFusion MSE	State-of-the-art comparison
ECL	0.603	Best or 2nd best
ETTh1	0.600	Matches/inferior by small margin
ETTh2	0.245	Matches/inferior by small margin
Exchange	0.0217	Matches/inferior by small margin
ILI	3.496	Matches/inferior by small margin
Traffic	1.489	Matches/inferior by small margin
Weather	1.449	Matches/inferior by small margin

In univariate zero-shot benchmarks (M3, M4, Tourism), SeqFusion with 15 DLinear PTMs achieves SMAPE of approximately 11–13%, ranking consistently second behind GPT4TS, while requiring only about 0.05 MB per PTM.
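For readers less familiar with the two metrics quoted in this section, the sketch below computes MSE and a symmetric MAPE on toy values. The source does not state which SMAPE variant is used; the percentage form with denominator $|y| + |\hat{y}|$ common in the M-competition literature is assumed here, and the function names are hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error over paired points."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """Symmetric MAPE in percent: 200/H * sum(|y - yhat| / (|y| + |yhat|))."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        if denom > 0.0:  # a term with both values zero contributes nothing
            total += abs(a - p) / denom
    return 200.0 * total / len(actual)


print(mse([1.0, 2.0], [1.0, 4.0]))
print(smape([100.0], [110.0]))
```

MSE depends on the scale of the data, which is why values differ so much across datasets in the table above, whereas SMAPE is a relative measure expressed in percent.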

A large-scale experiment employing PTMs from Chronos, Moirai, and TimesFM demonstrates competitive accuracy (ECL 0.5263, Weather 1.3323 MSE) with total zoo storage under 1.4 GB, compared to hundreds of GB for monolithic generalist models.

6. Ablations, Architectural Insights, and Practical Significance

Generalist vs. Specialist PTMs: A single "general" PatchTST model trained on all zoo data achieves an MSE of approximately 0.83 on ECL. However, using domain-specialized PTMs with selective fusion yields 0.603, indicating the value of model specialization and strategic selection.
Zoo Composition: Adding more PTMs (10→20, including Hospital-trained models) marginally benefits domains like Illness. However, mixing architectural types (PatchTST+DLinear) can degrade performance in some domains, particularly Weather. This suggests that domain diversity is more critical than architectural diversity for optimal aggregation.
Transferability Loss: Omitting the transferability loss from the representation extractor reduces accuracy on multiple benchmarks, confirming its utility for meaningful similarity assessment.
Embedding Schemes: The use of SimMTM-based embeddings with transferability loss marginally outperforms TS2Vec-based alternatives.
Aggregation Hyperparameter ($K$): Increasing $K$ from 1 to 5 provides a monotonic reduction in average MSE, corroborating the ensemble effect.

7. Privacy, Resource Efficiency, and Methodological Impact

SeqFusion is inherently privacy-preserving, as it requires only sample embeddings or small batches of non-sensitive data for PTM representation calculation; full datasets never need to be exchanged. Empirically, strong zero-shot forecasting can be realized with approximately 23 MB of total model storage, a fraction of the storage footprint for large, generalist zero-shot models. This suggests practical advantages in settings with data governance constraints or resource limitations.

By consolidating select predictions from multiple PTMs judiciously matched to the temporal characteristics of each target, SeqFusion demonstrates that distributed, specialized model aggregation in embedding space is an effective alternative to monolithic model pre-training for zero-shot time-series forecasting across domains (Huang et al., 4 Mar 2025).

References (1)

SeqFusion: Sequential Fusion of Pre-Trained Models for Zero-Shot Time-Series Forecasting (2025)