Moirai: Time Series Forecasting Model
- Moirai is a family of Transformer-based time series foundation models designed for probabilistic forecasting via any-variate self-attention and mixture-distribution output heads.
- It leverages techniques like multi patch-size input projection and rich embedding layers to effectively manage varied data frequencies and multivariate inputs.
- Its extensible architecture and zero-shot forecasting capabilities, combined with parameter-efficient fine-tuning (e.g., LoRA/DoRA adapters), yield robust performance across energy, transport, and sensor benchmarks.
Moirai is a family of large-scale time series foundation models (TSFMs) built around the Transformer architecture, designed for probabilistic forecasting across arbitrary domains, frequencies, and multivariate input widths. The core innovations in Moirai—including “any-variate” self-attention, cross-frequency input patching, and learnable mixture-distribution forecasting heads—enable a single pre-trained masked-encoder (and, in later iterations, decoder-only) model to scale zero-shot and fine-tuned forecasting to diverse target settings, outperforming or matching domain-specific alternatives on several public benchmarks. Moirai’s extensible architecture, demonstrated performance on canonical probabilistic and spatio-temporal tasks, and emerging variants such as Moirai 2.0 have positioned it as a central reference for TSFM research.
1. Architectural Foundations and Model Variants
The canonical Moirai model is an encoder-only (masked-encoder) Transformer, released in three sizes: 14 million (“Small”), 91 million (“Base”), and 311 million (“Large”) parameters (Woo et al., 2024, Sartipi et al., 9 Jun 2025). Moirai applies a sequence of multi-head self-attention layers and pointwise MLPs over a unified token sequence derived from time series data. The distinctive components are:
- Any-variate attention mechanism: Time and variate (feature) axes are flattened and mixed, with binary, learnable attention biases marking “within-variate” versus “cross-variate” associations. This architecture enables native multivariate context, permutation equivariance, and robust cross-sensor forecasting (Woo et al., 2024, Gupta et al., 7 Nov 2025).
- Multi patch-size input projection: To accommodate arbitrary sampling frequencies, non-overlapping time patches are linearly embedded through projection layers indexed by patch size, with patch size chosen according to frequency. For example, hourly data use patch sizes of 32–64, while higher-frequency (minute- or second-level) data use larger patch sizes to keep token sequences short (Woo et al., 2024).
- Rich embedding layers: Positional encodings use rotary or sinusoidal embeddings; learnable “variate ID” vectors distinguish target versus covariate or market-feature tokens (Lettner et al., 16 Apr 2026, Woo et al., 2024).
- Mixture distribution output heads: Parameters for a mixture of Student-t, log-normal, negative-binomial, and low-variance (“sharp”) Gaussian distributions are emitted for each forecasted position, supporting robust tail behavior, skewed targets, and count data via an end-to-end negative log-likelihood objective (Woo et al., 2024, Lettner et al., 16 Apr 2026); a minimal sketch of such a head follows this list.
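To make the output head concrete, the following is a minimal PyTorch sketch of a mixture-likelihood head, not the reference implementation. To stay self-contained it mixes only Student-t and Gaussian components, sidestepping the support constraints of the log-normal and negative-binomial members; all layer names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class MixtureHead(nn.Module):
    """Illustrative two-component mixture output head (names/sizes assumed)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.logits = nn.Linear(d_model, 2)      # mixture weights
        self.loc = nn.Linear(d_model, 2)         # per-component location
        self.raw_scale = nn.Linear(d_model, 2)   # per-component scale, pre-softplus

    def nll(self, h: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        log_w = torch.log_softmax(self.logits(h), dim=-1)
        loc = self.loc(h)
        scale = nn.functional.softplus(self.raw_scale(h)) + 1e-4  # keep scales positive
        # Score the target under each component, then logsumexp-mix.
        lp_t = D.StudentT(df=3.0, loc=loc[..., 0], scale=scale[..., 0]).log_prob(y)
        lp_g = D.Normal(loc[..., 1], scale[..., 1]).log_prob(y)
        log_mix = torch.logsumexp(log_w + torch.stack([lp_t, lp_g], dim=-1), dim=-1)
        return -log_mix.mean()

# Usage: one embedding per forecasted position, one target value per position.
head = MixtureHead(d_model=64)
loss = head.nll(torch.randn(8, 64), torch.randn(8))
```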
Moirai 2.0 departs from this architecture with a causal (decoder-only) Transformer backbone, quantile loss (“pinball” loss), recursive multi-quantile decoding, multi-token predictions, and a simplified single-patch embedding strategy. This achieves improved inference speed, a smaller parameter footprint, and stronger accuracy on standard time series forecast benchmarks (Liu et al., 12 Nov 2025).
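For reference, the quantile (pinball) loss at level $\tau$ is $L_\tau(y, \hat{y}) = \max\big(\tau (y - \hat{y}),\, (\tau - 1)(y - \hat{y})\big)$. A minimal multi-quantile version might look as follows (tensor shapes are illustrative assumptions):

```python
import torch

def pinball_loss(y: torch.Tensor, yhat: torch.Tensor, taus: torch.Tensor) -> torch.Tensor:
    """Quantile (pinball) loss averaged over positions and quantile levels.

    y:    (batch,) observed targets
    yhat: (batch, Q) predicted quantiles, one column per level in `taus`
    taus: (Q,) quantile levels in (0, 1)
    """
    err = y.unsqueeze(-1) - yhat                          # (batch, Q)
    return torch.maximum(taus * err, (taus - 1.0) * err).mean()

# Usage sketch:
taus = torch.tensor([0.1, 0.5, 0.9])
loss = pinball_loss(torch.randn(32), torch.randn(32, 3), taus)
```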
2. Mathematical Formulation and Theoretical Guarantees
Moirai’s core operations are defined as follows:
- Tokenization: Each time series is split into non-overlapping patches of size $P$; each patch is linearly mapped to an embedding and summed with position and variate encodings.
- Self-attention layer:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right)V,$$
with $Q = XW^{Q}$, $K = XW^{K}$, and $V = XW^{V}$.
- Any-variate bias: the attention score between tokens $i$ and $j$ receives a binary offset
$$B_{ij} = u^{(1)}\,\mathbb{1}\{m_i = m_j\} + u^{(2)}\,\mathbb{1}\{m_i \neq m_j\},$$
where $m_i$ is the variate index of token $i$ and $u^{(1)}, u^{(2)}$ are learnable scalars, ensuring permutation equivariance and intra/inter-variate balancing in attention (Woo et al., 2024).
- Output mixture modeling:
$$p(y \mid \phi) = \sum_{k=1}^{K} w_k\, p_k(y; \theta_k), \qquad \sum_{k=1}^{K} w_k = 1,$$
where $k$ ranges over Student-t, log-normal, negative-binomial, and Gaussian components, with weights $w_k$ and parameters $\theta_k$ emitted by the output head.
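A minimal single-head sketch of this biased attention in PyTorch (rotary position encodings and multi-head structure are omitted for brevity; all names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AnyVariateAttention(nn.Module):
    """Single-head self-attention with a binary any-variate bias (sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.u_same = nn.Parameter(torch.zeros(()))  # within-variate bias u1
        self.u_diff = nn.Parameter(torch.zeros(()))  # cross-variate bias u2
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, variate_id: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); variate_id: (tokens,) integer variate index
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * self.scale          # (tokens, tokens)
        same = variate_id.unsqueeze(0) == variate_id.unsqueeze(1)
        scores = scores + torch.where(same, self.u_same, self.u_diff)
        return torch.softmax(scores, dim=-1) @ v

# Usage: 2 variates x 4 patches, flattened into a single 8-token sequence.
attn = AnyVariateAttention(d_model=32)
out = attn(torch.randn(8, 32), torch.tensor([0, 0, 0, 0, 1, 1, 1, 1]))
```

Because the bias depends only on whether two tokens share a variate index, relabeling variates leaves the attention pattern unchanged, which is the source of the permutation-equivariance property.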
Theoretical work has shown that the Moirai architecture implements in-context gradient-descent for multivariate autoregressive (AR) models, offering universality guarantees for AR sequences up to bounded order and dimensionality, and providing non-i.i.d. generalization bounds under Dobrushin’s condition (Wu et al., 5 Feb 2025).
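As a toy illustration of this in-context gradient-descent view (not the paper's construction), the following fits an AR(2) model by gradient steps on the in-context least-squares objective and then forecasts one step ahead:

```python
import numpy as np

rng = np.random.default_rng(0)
phi_true = np.array([0.6, 0.3])
y = np.zeros(200)
for t in range(2, 200):                       # simulate a stationary AR(2) series
    y[t] = phi_true @ y[t - 2:t][::-1] + 0.1 * rng.standard_normal()

X = np.stack([y[1:-1], y[:-2]], axis=1)       # lag-1 and lag-2 regressors
z = y[2:]                                     # next-step targets
phi = np.zeros(2)
lr = 0.5 / len(z)
for _ in range(100):                          # GD on 0.5 * ||z - X @ phi||^2
    phi -= lr * (X.T @ (X @ phi - z))
forecast = phi @ y[-2:][::-1]                 # one-step-ahead prediction
```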
3. Pretraining Regimen and Datasets
Moirai was pretrained on the Large-scale Open Time Series Archive (LOTSA), comprising over 27.6 billion observations from nine domains (energy, transport, cloud operations, climate, sales, economic/financial, healthcare, web, nature) and covering frequencies from yearly to second-level (Woo et al., 2024). Training used random crops, horizon masking, and curriculum sampling over domain/frequency/covariate configurations.
Larger patch sizes were used at higher frequencies so that token sequences stay within 512 tokens per example. Patch-specific embedding and decoding layers allow efficient handling of variable-resolution data.
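To make the frequency-to-patch mapping concrete, a minimal helper in the spirit of this scheme might look as follows; the mapping table is an illustrative assumption, so consult the paper or the uni2ts code for the authoritative values:

```python
def patch_sizes_for_frequency(freq: str) -> list[int]:
    """Candidate patch sizes per sampling frequency (illustrative mapping).

    Low-frequency series get small patches (few observations to spare);
    high-frequency series get large patches so that long contexts still
    fit within the ~512-token budget.
    """
    table = {
        "yearly": [8], "quarterly": [8], "monthly": [8, 16, 32],
        "weekly": [16, 32], "daily": [16, 32],
        "hourly": [32, 64], "minutely": [32, 64, 128], "secondly": [64, 128],
    }
    return table[freq]

def num_tokens(context_length: int, patch_size: int) -> int:
    """Tokens produced by non-overlapping patching (ceiling division)."""
    return -(-context_length // patch_size)

# A week of minute-level data (10_080 points) fits the budget at patch size 64:
assert num_tokens(10_080, 64) <= 512
```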
Moirai 2.0 is pretrained on a newly assembled corpus of 36 million univariate series (295 billion observations), incorporating real-world, synthetic, and mixed-up data sources from the Gift-Eval and Chronos-Mixup benchmarks (Liu et al., 12 Nov 2025).
4. Empirical Performance Across Domains
Forecasting competitions and benchmarks:
- On the Monash, Gift-Eval, and large-scale energy/electricity datasets, Moirai (small/base/large) matches or outperforms task-specific, full-shot deep learning and classical models, such as ARIMA, N-BEATS, DeepAR, and PatchTST, in both MAE and CRPS under zero-shot settings (Woo et al., 2024, Lettner et al., 16 Apr 2026, Sartipi et al., 9 Jun 2025, Liu et al., 12 Nov 2025).
- On origin-destination (OD) flow benchmarks for urban mobility, Moirai-Large achieves up to 33% lower RMSE, 39% lower MAE, and 49% higher common-part-of-commuters (CPC) versus deep learning and statistical baselines, outperforming specialized spatial and LSTM-based architectures in a strict zero-shot, spatially-unaware regime (Luca et al., 1 Jul 2025).
- For spatio-temporal sensor forecasting, Moirai’s joint any-variate attention yields the lowest MAE and RMSE across a broad range of spatial coverages and sampling intervals, beating both univariate TSFMs and graph neural net (STGNN) approaches (Gupta et al., 7 Nov 2025).
Electricity Price Forecasting:
- For day-ahead European market price forecasting, Moirai variants deliver accuracy competitive with traditional and ML-based methods. While zero-shot Moirai lags strong seasonal-trend decomposition models with multiple seasonalities (MSTL), it consistently outperforms closed-source TSFMs (e.g., TimeGPT) and larger foundation models that lack domain customization (Sartipi et al., 9 Jun 2025, Lettner et al., 16 Apr 2026). Fine-tuned Moirai Base yields a CRPS of 14.22 EUR/MWh, outperforming task-specific NHITS+QRA and normalizing-flow baselines and approaching ChronosX (Lettner et al., 16 Apr 2026).
Optimization-centric fine-tuning:
- Decision-focused fine-tuning (DFF) of Moirai substantially lowers operational costs in dispatchable feeder scheduling versus conventional prediction-focused MSE-tuned models, with up to 20% cost improvements under LoRA/DoRA adapter tuning (Beichter et al., 3 Mar 2025).
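For reference, the LoRA mechanism underlying such adapter tuning can be sketched in a few lines of generic PyTorch (an illustration, not the DFF or uni2ts integration; rank and scaling values are arbitrary):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (LoRA sketch).

    Only A and B (rank r) receive gradients, so fine-tuning touches a small
    fraction of the original parameter count.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Wrap e.g. an attention projection, then train only the adapter parameters
# under the decision-focused (or any task) loss.
layer = LoRALinear(nn.Linear(256, 256))
opt = torch.optim.AdamW([p for p in layer.parameters() if p.requires_grad], lr=1e-3)
```

DoRA follows the same adapter pattern but decomposes the weight update into magnitude and direction components.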
5. Limitations and Observed Weaknesses
While Moirai achieves broad generalizability, empirical work identifies several limitations:
- In electricity price forecasting, Moirai does not consistently outperform highly specialized statistical models (MSTL, ElasticNet), particularly in strongly seasonal domains. Its lack of explicit exogenous covariate integration and absence of domain-specific seasonal decomposition limits peak accuracy (Sartipi et al., 9 Jun 2025).
- In physiological signal analysis, Moirai embeddings exhibit substantial feature entanglement, temporal distortion, and scenario non-separability in medical simulation transfer, with downstream decoding accuracy dropping by >20% compared to raw signals. These effects are attributed to global masked attention and lack of physiological encoding regularizers (Christenson et al., 2024).
- The architecture’s scaling with parameter count plateaus quickly; increasing model size beyond ~11 million parameters gives no further gains (and may degrade results) (Liu et al., 12 Nov 2025).
- Performance on long-horizon predictions deteriorates steadily; this suggests architectural or training modifications are needed for stable long-range forecasting (Liu et al., 12 Nov 2025).
- Current implementations do not support multi-modal inputs (e.g., combined text or tabular context) or extreme covariate configurations such as very high-dimensional series (e.g., >128 variates) (Woo et al., 2024).
6. Practical Deployment, Adaptation, and Extensions
Moirai’s main strength lies in reusable, zero-shot forecasting capabilities across domains. The absence of required per-dataset training, combined with competitive or superior accuracy, enables deployment as a plug-and-play solution with minimal setup (Sartipi et al., 9 Jun 2025).
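For example, zero-shot inference with the open-source uni2ts release of Moirai follows roughly the pattern below. This is a sketch based on the library's published usage; class names, argument names, and the checkpoint ID should be verified against the installed version:

```python
from uni2ts.model.moirai import MoiraiForecast, MoiraiModule

# Zero-shot probabilistic forecasting with a pretrained checkpoint (values
# below are illustrative; no per-dataset training is required).
model = MoiraiForecast(
    module=MoiraiModule.from_pretrained("Salesforce/moirai-1.0-R-small"),
    prediction_length=96,        # forecast horizon
    context_length=512,          # history supplied to the encoder
    patch_size="auto",           # pick a frequency-appropriate patch size
    num_samples=100,             # samples drawn from the mixture head
    target_dim=1,
    feat_dynamic_real_dim=0,
    past_feat_dynamic_real_dim=0,
)
predictor = model.create_predictor(batch_size=32)  # GluonTS-style predictor
```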
Parameter-efficient adaptation with LoRA/DoRA adapters enables robust few-shot fine-tuning for instance-specific or decision-aware optimization (e.g., smart grid, scheduling) at a small fraction of the original parameter count and without full model retraining (Beichter et al., 3 Mar 2025).
Moirai 2.0 redefines the architecture for greater efficiency:
- Causal decoder-only backbone
- Quantile loss with recursive multi-quantile decoding (depth-2 expand-collapse inference); a generic decoding sketch appears below
- Multi-token prediction (up to 20× fewer autoregressive steps), combined with KV-caching
- Simpler, single-patch input, omitting mixture heads
Compared to Moirai 1.0-Large, Moirai 2.0 achieves higher accuracy (MASE = 0.728, CRPS = 0.516), is roughly 30× smaller, and is at least twice as fast (Liu et al., 12 Nov 2025); the gains are attributed to the decoder-only design, the quantile training objective, and the output-decoding innovations.
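The depth-2 expand-collapse procedure is specific to Moirai 2.0, but the general idea of recursive multi-quantile decoding can be illustrated with a stand-in model; the sketch below feeds the predicted median back as the next input, whereas Moirai 2.0's actual expansion and collapse of quantile paths differs:

```python
import torch

def recursive_quantile_forecast(model, context, horizon, taus):
    """Generic recursive multi-quantile decoding (illustration only).

    `model` is any callable mapping a 1-D context tensor to one prediction
    per quantile level; the median prediction is fed back autoregressively.
    """
    median_idx = len(taus) // 2
    quantile_paths = []
    ctx = context.clone()
    for _ in range(horizon):
        q = model(ctx)                                    # (Q,) one value per tau
        quantile_paths.append(q)
        ctx = torch.cat([ctx[1:], q[median_idx:median_idx + 1]])
    return torch.stack(quantile_paths)                    # (horizon, Q)

# Usage with a naive stand-in "model" that repeats the last context value:
taus = [0.1, 0.5, 0.9]
dummy = lambda ctx: ctx[-1].repeat(len(taus))
fc = recursive_quantile_forecast(dummy, torch.randn(48), horizon=12, taus=taus)
```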
Recommended deployment strategies favor Moirai for dense sensor networks (≥16 variates at high sampling rates), zero-shot domain transfer, and practical settings where rapid adaptation or minimal annotation is prioritized. Continued architectural development targets cross-variate/multimodal integration, improved long-horizon stability, and tailored regularization for scientific or medical transfer (Liu et al., 12 Nov 2025, Christenson et al., 2024).
7. Future Directions and Ongoing Research
Research avenues extend toward:
- Learned cross-frequency and multi-modal modules instead of hard-coded patch/frequency mappings (Woo et al., 2024)
- Improved regularization (temporal smoothness, feature disentanglement) and physics-informed embedding design for sensitive scientific and medical domains (Christenson et al., 2024)
- Scaling laws research for optimal data/model size tradeoffs (Liu et al., 12 Nov 2025)
- Efficient approximate or prioritized decoding for large quantile sets (Liu et al., 12 Nov 2025)
- Agentic and reasoning-augmented forecasting by combining Moirai with LLMs for root-cause analysis, warning, and decision-aware tasks (Liu et al., 12 Nov 2025)
Moirai’s generalization guarantees, architectural flexibility, and robust real-world benchmarks establish it as a core model in time series foundation model research, though domain-specific adaptations and architectural innovations continue to be required for state-of-the-art performance in specialized regimes.
References:
- (Woo et al., 2024)
- (Wu et al., 5 Feb 2025)
- (Beichter et al., 3 Mar 2025)
- (Gupta et al., 7 Nov 2025)
- (Sartipi et al., 9 Jun 2025)
- (Lettner et al., 16 Apr 2026)
- (Christenson et al., 2024)
- (Liu et al., 12 Nov 2025)
- (Luca et al., 1 Jul 2025)