
Moirai: Universal Multivariate TSFM

Updated 14 November 2025
  • Moirai is an encoder-based Transformer for universal multivariate time series forecasting, featuring any-variate attention and patch-based input representations.
  • The model uses masked forecasting pretraining to directly predict future patches, achieving high accuracy across varied spatial–temporal regimes.
  • Extensive benchmarks show Moirai delivers lower MAE and robust cross-sensor dependency modeling compared to traditional time series forecasting models.

Moirai is an encoder-based Transformer architecture developed for universal multivariate time series forecasting. Distinguished by its any-variate attention mechanism, patch-based input representations, and flexible masked forecasting objective, Moirai enables parameter-efficient, zero-shot prediction across a wide spectrum of spatial–temporal regimes and multivariate target sizes. It was introduced and benchmarked in "Unified Training of Universal Time Series Forecasting Transformers" (Woo et al., 4 Feb 2024) and in the systematic comparison "No One-Model-Fits-All: Uncovering Spatio-Temporal Forecasting Trade-offs..." (Gupta et al., 7 Nov 2025), establishing itself as the canonical multivariate foundation model for time series.

1. Architectural Foundations: Patch Input and Any-Variate Attention

Moirai treats the historical context of a multivariate time series as a matrix $X \in \mathbb{R}^{K \times W}$, with $K$ channels (or sensors) and window length $W$. The model partitions $X$ into non-overlapping, fixed-length patches (each patch drawn from a single channel at a particular resolution, with patch size $p$) and linearly embeds each patch into a latent vector space, yielding a flattened token sequence $E \in \mathbb{R}^{T \times d}$.
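As a concrete sketch of this patchification step (assuming, for simplicity, a single shared linear projection; the released model uses one projection per patch size), the following PyTorch snippet maps a $(K, W)$ context window to the flattened $(T, d)$ token sequence:

```python
import torch

def patchify_and_embed(x: torch.Tensor, patch_size: int,
                       proj: torch.nn.Linear) -> torch.Tensor:
    """Split each channel of a (K, W) window into non-overlapping patches
    and embed each patch, returning a flattened (T, d) token sequence
    with T = K * (W // patch_size)."""
    K, W = x.shape
    assert W % patch_size == 0, "window must be a multiple of the patch size"
    patches = x.reshape(K, W // patch_size, patch_size)  # (K, N, p)
    tokens = proj(patches)                               # (K, N, d)
    return tokens.reshape(-1, tokens.shape[-1])          # flatten channels x patches

# 4 channels, 128-step window, patch size 16, model width 64 -> 32 tokens.
proj = torch.nn.Linear(16, 64)
E = patchify_and_embed(torch.randn(4, 128), 16, proj)
print(E.shape)  # torch.Size([32, 64])
```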

A core innovation is the "any-variate attention" scheme: every channel's patch sequence is concatenated into a single token sequence, and standard multi-head self-attention is applied. To preserve channel identity and allow cross-channel reasoning, a learned binary bias is injected into the attention scores:

E(k,c),(,d)=(Qk,c)[RoPE(k)K,d]+u(1)1c=d+u(2)1cdE_{(k,c),(\ell,d)} = (Q_{k,c})^\top [\operatorname{RoPE}(k-\ell)K_{\ell,d}] + u^{(1)}\mathbf{1}_{c=d} + u^{(2)}\mathbf{1}_{c\neq d}

where $(k,c)$ and $(\ell,d)$ index (token, channel) pairs, $\operatorname{RoPE}$ denotes rotary positional encodings, and $u^{(1)}, u^{(2)}$ are learned scalars. All channels are thus handled permutation-equivariantly, enabling Moirai to accommodate arbitrary sensor sets without channel-embedding tables or graph adjacency matrices.
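A minimal sketch of the bias term, with the RoPE rotation omitted for brevity and `u_same`/`u_diff` standing in for the learned scalars $u^{(1)}$ and $u^{(2)}$:

```python
import torch

def any_variate_scores(q, k, channel_ids, u_same, u_diff):
    """Single-head attention logits with the binary variate bias above.
    q, k: (T, d_k) queries/keys; channel_ids: (T,) channel index per token;
    u_same, u_diff: scalar tensors playing the role of u^(1), u^(2)."""
    logits = (q @ k.T) / q.shape[-1] ** 0.5              # scaled dot products
    same = channel_ids[:, None] == channel_ids[None, :]  # (T, T) mask: c == d
    return logits + torch.where(same, u_same, u_diff)

# Two channels with two patches each, flattened into one 4-token sequence.
cid = torch.tensor([0, 0, 1, 1])
q = k = torch.randn(4, 8)
scores = any_variate_scores(q, k, cid, torch.tensor(0.5), torch.tensor(-0.5))
```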

2. Masked Forecasting Pretraining and Objective Function

Moirai leverages a masked-encoder strategy. During pretraining, a random portion (15–50%) of the input patches is masked, i.e., their embeddings are replaced by a shared $\langle\mathrm{MASK}\rangle$ token, and the encoder is tasked with reconstructing the masked patches as continuous values. This aligns with the "masked forecasting" paradigm:

  • Mask positions corresponding to unseen future patches.
  • Predict parameters $\hat\theta$ of a learnable mixture distribution for each masked patch, minimizing the negative log-likelihood:

$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B} \log p\!\left(y^{(i)}_{l+1:l+h} \mid \hat\theta^{(i)}\right)$$

The model forecasts the masked (future) window directly, with the mixture head outputting Student's-$t$, log-normal, negative-binomial, and low-variance Gaussian components. This design captures both in-distribution uncertainty and rare or irregular pattern variations.
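A sketch of the per-timestep mixture NLL, assuming the output head has already produced mixture weights and per-component parameters (names and shapes here are illustrative; the toy example mixes only two of the four component families for brevity):

```python
import torch
import torch.distributions as D

def mixture_nll(y, weight_logits, components):
    """NLL of targets under a heterogeneous mixture, mixed per timestep.
    y: (B, h) target values; weight_logits: (B, n_comp) mixture logits;
    components: list of distributions, each with batch shape (B, h)."""
    log_w = torch.log_softmax(weight_logits, dim=-1)              # (B, n_comp)
    log_p = torch.stack([c.log_prob(y) for c in components], -1)  # (B, h, n_comp)
    per_step = torch.logsumexp(log_w.unsqueeze(1) + log_p, -1)    # (B, h)
    return -per_step.sum(-1).mean()                               # batch average

# Toy example: Student's-t + Gaussian mixture over an h = 4 forecast patch.
B, h = 2, 4
y = torch.randn(B, h)
components = [
    D.StudentT(df=torch.full((B, h), 3.0)),
    D.Normal(torch.zeros(B, h), torch.ones(B, h)),
]
loss = mixture_nll(y, torch.zeros(B, 2), components)
```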

3. Cross-Channel Dependency Learning

Unlike standard univariate TSFMs (e.g., TimesFM, Chronos), which either treat channels independently or aggregate their forecasts post hoc, Moirai's self-attention operator natively encodes cross-sensor (cross-channel) relationships. The attention matrix $A_h[u,v]$ of head $h$ measures how much patch $u$ (potentially from sensor $i$) attends to a past or simultaneous patch $v$ (potentially from sensor $j$), yielding the weights:

$$\alpha_{(i,t),(j,t')} = \operatorname{softmax}_{(j,t')}\!\left(\frac{q_{i,t} \cdot k_{j,t'}}{\sqrt{d_k}}\right)$$

This enables joint forecasting strategies, where each forecasted token is both temporally and spatially conditioned, obviating the need for handcrafted sensor graphs (as required by STGNNs) or frequency-level conditioning.
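Because the tokens are simply a flattened (sensor, patch) grid, the learned cross-sensor structure can be read directly off the attention weights. A small illustrative sketch (random tensors stand in for actual queries and keys):

```python
import torch

# K sensors, N patches each, flattened to T = K * N tokens (as in Moirai).
K, N, d_k = 3, 5, 8
q, k = torch.randn(K * N, d_k), torch.randn(K * N, d_k)
alpha = torch.softmax(q @ k.T / d_k ** 0.5, dim=-1)  # (T, T) attention weights

# Reshape back to a (sensor, patch, sensor, patch) grid; summing over key
# patches and averaging over query patches gives an implicit sensor-to-sensor
# dependency matrix, with each row summing to 1.
alpha_grid = alpha.reshape(K, N, K, N)            # alpha_grid[i, t, j, t']
sensor_dep = alpha_grid.sum(dim=3).mean(dim=1)    # (K, K): mass i places on j
```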

4. Pretraining Dataset, Optimization, and Scalability

Pretraining is conducted on the Large-scale Open Time Series Archive (LOTSA), containing over $27 \times 10^9$ scalar observations from 100+ public datasets spanning nine domains. Inputs are subsampled and randomly grouped to form multivariate "pseudo-series" with up to 128 channels. Patch sizes are dynamically selected to match series frequency, and context lengths routinely exceed 1,000 steps.

Relevant optimization details include:

  • Model depth $L \in \{6, 12, 24\}$, $d_{\mathrm{model}} \in \{384, 768, 1024\}$, $d_{\mathrm{ff}} = 4 d_{\mathrm{model}}$.
  • Attention heads $H \in \{6, 12, 16\}$, single shared input/output projections per model (no frequency tags).
  • AdamW with learning rate $10^{-3}$, strong sequence packing to eliminate padding waste.

This setup facilitates training universal models (14M–311M parameters) that support arbitrary context, horizon, and channel combinations at inference.
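A hypothetical configuration object mirroring the small/base/large settings above (field names are illustrative and not those of the released training code):

```python
from dataclasses import dataclass

@dataclass
class MoiraiConfig:
    n_layers: int            # L in {6, 12, 24}
    d_model: int             # in {384, 768, 1024}
    n_heads: int             # H in {6, 12, 16}
    max_channels: int = 128  # pseudo-series cap during pretraining
    lr: float = 1e-3         # AdamW peak learning rate

    @property
    def d_ff(self) -> int:
        return 4 * self.d_model  # feed-forward width = 4 * d_model

small, base, large = (
    MoiraiConfig(6, 384, 6),
    MoiraiConfig(12, 768, 12),
    MoiraiConfig(24, 1024, 16),
)
```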

5. Quantitative Performance and Comparative Evaluation

Empirical evaluations (Woo et al., 4 Feb 2024, Gupta et al., 7 Nov 2025) highlight Moirai’s strengths in multivariate, zero-shot forecasting:

  • On benchmark datasets (Electricity, Solar, Weather, Traffic, Turkey Power), Moirai consistently matches or outperforms state-of-the-art full-shot and alternative TSFMs.
  • Example: On Electricity (US), Moirai base attains CRPS 0.050 (best full-shot: 0.048); on Solar, CRPS 0.406 (better than full-shot baseline 0.420).
  • On the IoBT spatial sensor deployment (New Mexico testbed, 25 sensors, variable rates), Moirai achieves up to 70% lower MAE than TimesFM and significantly outperforms Chronos, particularly as the spatial context (number of sensors $K$) increases.
  • Performance improvements with larger $K$ demonstrate Moirai's native cross-sensor dependency modeling, with MAE dropping from 2.04 to 1.85 at 5 min sampling as $K$ increases from 8 to 25.

A summary table (values taken directly from Gupta et al., 7 Nov 2025):

Sampling rate | Sensors $K$ | Chronos (MAE / RMSE) | TimesFM (MAE / RMSE) | Moirai (MAE / RMSE)
5 min         | 8           | 3.84 / 4.42          | 6.39 / 8.66          | 2.04 / 2.96
5 min         | 25          | 3.44 / 4.00          | 6.96 / 9.44          | 1.85 / 2.31
60 min        | 8           | 4.04 / 4.51          | 22.38 / 25.82        | 0.93 / 1.20
60 min        | 25          | 3.92 / 4.41          | 21.59 / 25.15        | 0.88 / 1.17

6. Ablation Analysis and Model Specialization

Ablation studies suggest the following:

  • Removal of multi-patch-size projections increases normalized MAE from 0.655 to 1.156.
  • Excluding "any-variate" attention raises MAE to 0.904, confirming that cross-channel reasoning is central to performance.
  • Substituting the flexible mixture likelihood with a single Student's-$t$ head degrades MAE to 0.740.
  • Pretraining only on smaller archives (Monash + GluonTS) results in MAE 0.809, underlining the value of large, cross-domain diversity.

Token-level specialization (as explored in Moirai-MoE (Liu et al., 14 Oct 2024)) and related approaches offer further efficiency and accuracy improvements, but the canonical Moirai masked-encoder architecture with any-variate attention forms the functional backbone of universal multivariate TSFMs.

7. Practical Implications and Recommendations

Moirai’s architecture is robust across varying sensor counts, sampling intervals, and application domains. For new datasets, practitioners should:

  • Select an appropriate patch size $p$ for the series frequency.
  • Tune the context length to maximize information extraction.
  • Normalize each channel separately using window statistics (see the sketch after this list).
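A minimal sketch of the per-channel normalization step; the window statistics must be kept so the forecasts can be de-normalized afterwards:

```python
import torch

def instance_normalize(x: torch.Tensor, eps: float = 1e-5):
    """Per-channel instance normalization over the context window.
    x: (K, W) raw context. Returns the normalized series plus the
    (mean, std) statistics needed to de-normalize the forecasts."""
    mean = x.mean(dim=-1, keepdim=True)               # (K, 1) per-channel mean
    std = x.std(dim=-1, keepdim=True).clamp_min(eps)  # (K, 1) per-channel std
    return (x - mean) / std, (mean, std)

x_norm, (mu, sigma) = instance_normalize(torch.randn(4, 256))
# After forecasting in normalized space: y = y_norm * sigma + mu
```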

For spatial–temporal sensing (e.g., IoT, environmental monitoring), Moirai eliminates the need for hand-crafted spatial graphs or frequency embeddings; its cross-channel attention generalizes to unseen deployments without retraining. Resource requirements are governed primarily by the attention cost, which is quadratic in the token count per context window, with no overhead from per-channel embedding tables.

Graph-based models may continue to excel for sparse, moderately sampled sensor graphs with carefully tuned adjacency matrices; however, Moirai establishes itself as the preferred zero-shot foundation model for dense, heterogeneous, and spatio-temporal multivariate forecasting settings.
