Moirai 2.0: Forecasting & Decision Support
- Moirai 2.0 is a designation referring to both a decoder-only time-series forecasting model and an AI-mediated prospection platform using interactive avatars.
- In forecasting, the model leverages quantile loss, multi-token recursive decoding, and sparse Mixture-of-Experts to efficiently handle heterogeneous univariate series.
- In decision support, the platform integrates multimodal data and personalized avatars to augment human prospective cognition and guide life decisions.
Moirai 2.0 is an ambiguous designation in recent arXiv literature. In time-series forecasting, it primarily denotes a decoder-only time-series foundation model trained on a corpus of 36 million univariate series and organized around quantile forecasting, single-patch inputs, and multi-token prediction (Liu et al., 12 Nov 2025). Closely related work uses the label for a sparse Mixture-of-Experts extension, Moirai-MoE, which replaces human-defined frequency-level specialization with token-level specialization (Liu et al., 2024). A separate human-computer interaction paper uses the same name for an end-to-end platform that simulates interactive future-self avatars in order to study AI-mediated episodic prospection and decision change (Poonsiriwong et al., 5 Dec 2025). The shared nomenclature therefore refers not to a single research artifact, but to distinct systems occupying different methodological and application domains.
1. Terminological scope and disambiguation
The supplied literature uses “Moirai 2.0” in at least two distinct senses and also includes a closely related MoE-based forecasting variant. The ambiguity matters because the systems differ in objective, architecture, and evaluation protocol.
| Usage | Domain | Defining characterization |
|---|---|---|
| Moirai 2.0 | Time-series forecasting | Decoder-only TSFM with quantile forecasting and multi-token prediction |
| Moirai-MoE | Time-series forecasting | Sparse MoE TSFM with token-level specialization |
| Moirai 2.0 | AI-mediated prospection | End-to-end platform for simulating multiple future-self avatars |
In the forecasting literature, the main Moirai 2.0 model is presented as a simplification relative to Moirai 1.0: masked-encoder training, multi-patch inputs, and mixture-distribution outputs are replaced by a decoder-only architecture, a single patch, and quantile loss (Liu et al., 12 Nov 2025). In the adjacent MoE line, the central claim is that frequency is too coarse a specialization variable for heterogeneous and non-stationary time series, motivating sparse token-level routing instead (Liu et al., 2024). In the prospective-cognition line, the system is framed not as a forecaster but as a platform for augmenting human prospective cognition by simulating multiple “fates” as vivid, interactive future-self avatars (Poonsiriwong et al., 5 Dec 2025).
A common misconception is to treat these usages as different versions of one continuously evolving model family. The supplied material instead indicates two unrelated problem settings—time-series forecasting and AI-mediated life-decision support—and, within forecasting, two related but non-identical architectural directions.
2. Core forecasting architecture
In its primary forecasting sense, Moirai 2.0 adopts a pure decoder-only transformer rather than a masked encoder. Input tokens pass through an autoregressive stack in which each layer applies causal multi-head attention followed by a feed-forward update. This causal structure enables efficient KV-cache inference and allows the loss to be computed on all tokens or prediction patches (Liu et al., 12 Nov 2025).
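The link between causal attention and KV-cache inference can be illustrated with a minimal sketch. It assumes a single attention head with identity query, key, and value projections (a simplification, not the model's actual layer), and the names `attend`, `causal_attention_full`, and `KVCache` are hypothetical. The point is only that decoding one token at a time against cached keys and values reproduces the outputs of a full causal pass.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def causal_attention_full(tokens):
    """Full pass: position t attends to positions 0..t only (the causal mask)."""
    return [attend(tokens[t], tokens[:t + 1], tokens[:t + 1]) for t in range(len(tokens))]

class KVCache:
    """Incremental decoding: earlier keys/values are stored, never recomputed."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token):
        # Identity projections are an assumption made to keep the sketch short.
        self.keys.append(token)
        self.values.append(token)
        return attend(token, self.keys, self.values)

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [1.5, -1.0]]
full_outputs = causal_attention_full(tokens)

cache = KVCache()
cached_outputs = [cache.step(tok) for tok in tokens]
```

Because the full pass recomputes every prefix while the cached pass does constant extra work per new token, the saving grows with sequence length, which is consistent with the larger speedups reported at long horizons.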
The input representation is patch-based and univariate. The raw series is split into contiguous, non-overlapping patches of size $P$, and each patch $x_i \in \mathbb{R}^{P}$ is concatenated with a binary missing-mask indicator $m_i$. A residual “PatchEmbed” block projects the concatenated vector into a $d$-dimensional token:

$$z_i = \mathrm{PatchEmbed}\big([x_i ; m_i]\big) \in \mathbb{R}^{d}.$$

The resulting token sequence is then processed autoregressively (Liu et al., 12 Nov 2025).
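The patching and mask-concatenation step can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the function name `make_patches` is hypothetical, the mask convention (1.0 marks a missing value, which is then zero-filled) is assumed, and a ragged tail is simply dropped. The learned residual projection into the token dimension is not shown.

```python
import math

def make_patches(series, patch_size):
    """Split a univariate series into non-overlapping patches with missing masks.

    Missing observations may be given as None or NaN. Each returned row is the
    concatenation [values..., mask...], where mask is 1.0 for a missing entry
    and the missing value itself is replaced by 0.0 (assumed convention).
    """
    usable = len(series) - len(series) % patch_size  # assumption: drop ragged tail
    rows = []
    for start in range(0, usable, patch_size):
        patch = series[start:start + patch_size]
        missing = [v is None or (isinstance(v, float) and math.isnan(v)) for v in patch]
        values = [0.0 if m else float(v) for v, m in zip(patch, missing)]
        mask = [1.0 if m else 0.0 for m in missing]
        rows.append(values + mask)
    return rows

rows = make_patches([1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0], patch_size=3)
```

Each row has length twice the patch size, which is the vector a projection block would map to a single token.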
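The output head is quantile-based rather than density-based. The model predicts a set of quantiles at levels $q \in \mathcal{Q}$. For target $y$ and $q$-quantile prediction $\hat{y}^{(q)}$, the pinball loss is

$$\rho_q\big(y, \hat{y}^{(q)}\big) = \max\big\{\, q\,(y - \hat{y}^{(q)}),\; (q - 1)\,(y - \hat{y}^{(q)}) \,\big\},$$

and the total loss across horizon $H$ averages this term over horizon positions and quantile levels (Liu et al., 12 Nov 2025):

$$\mathcal{L} = \frac{1}{H\,|\mathcal{Q}|} \sum_{t=1}^{H} \sum_{q \in \mathcal{Q}} \rho_q\big(y_t, \hat{y}_t^{(q)}\big).$$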
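A small worked example may make the pinball loss and its averaging concrete. The sketch below uses the standard definition of the pinball loss; the function names and the three quantile levels are illustrative choices, not values taken from the paper.

```python
def pinball_loss(y, y_hat, q):
    """Pinball (quantile) loss for one target and one q-quantile prediction."""
    diff = y - y_hat
    return max(q * diff, (q - 1.0) * diff)

def mean_quantile_loss(targets, preds, levels):
    """Average the pinball loss over horizon positions and quantile levels.

    targets: list of H floats.
    preds:   list of H lists, each holding one prediction per level.
    """
    total = 0.0
    for y, row in zip(targets, preds):
        for q, y_hat in zip(levels, row):
            total += pinball_loss(y, y_hat, q)
    return total / (len(targets) * len(levels))

levels = [0.1, 0.5, 0.9]                    # illustrative levels only
targets = [10.0, 12.0]                      # horizon H = 2
preds = [[8.0, 10.0, 13.0], [11.0, 12.0, 12.5]]
loss = mean_quantile_loss(targets, preds, levels)
```

Under-prediction is penalized with weight `q` and over-prediction with weight `1 - q`, so a high quantile level tolerates forecasts above the target far more than forecasts below it.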
A further architectural feature is multi-token or recursive quantile decoding. From each patch token, the model forecasts $K$ future patches of size $P$ at $|\mathcal{Q}|$ quantile levels, where $\mathcal{Q}$ denotes the set of levels. Rather than greedily collapsing the predictive distribution to a single path, it expands the $|\mathcal{Q}|$ candidate continuations, forms a larger set of hypotheses, and then collapses by selecting the empirical $q$-quantile of these candidates for each quantile level $q \in \mathcal{Q}$. The associated ablations identify the decoder-only backbone together with recursive multi-quantile decoding as the main contributors to the gains over Moirai 1.0 (Liu et al., 12 Nov 2025).
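The expand-then-collapse idea can be sketched with a toy one-step forecaster. Several things here are assumptions made for illustration: the patch size is 1, each continuation yields one prediction per level (so the candidate pool has one entry per pair of levels), path *i* follows the *i*-th quantile trajectory, and the empirical quantile uses linear interpolation. `toy_model`, `empirical_quantile`, and `recursive_quantile_decode` are hypothetical names, and the paper's actual procedure may differ in these details.

```python
def empirical_quantile(values, q):
    """Empirical q-quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac

def toy_model(history, levels):
    """Stand-in forecaster: last value plus a level-dependent offset."""
    return [history[-1] + 2.5 * (q - 0.5) for q in levels]

def recursive_quantile_decode(model, context, levels, steps):
    """Decode `steps` values, keeping a full set of quantiles at every step."""
    forecasts = [model(list(context), levels)]  # first step: direct quantiles
    for _ in range(steps - 1):
        candidates = []
        for i in range(len(levels)):
            # Expand: path i follows the i-th quantile trajectory so far.
            path = list(context) + [step[i] for step in forecasts]
            candidates.extend(model(path, levels))
        # Collapse: one empirical quantile of the pooled candidates per level.
        forecasts.append([empirical_quantile(candidates, q) for q in levels])
    return forecasts

levels = [0.1, 0.5, 0.9]
forecasts = recursive_quantile_decode(toy_model, [10.0], levels, steps=2)
```

With three levels the second step pools nine candidates before collapsing back to three, so the number of tracked paths stays fixed while the spread of the forecast is still allowed to widen with the horizon.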
3. Pretraining regime, evaluation, and observed operating characteristics
Moirai 2.0 is trained on a new corpus of 36 million series, corresponding to approximately 295 billion observations. The corpus comprises GIFT-Eval Pretrain with 3.25 million series and 230 billion observations, GIFT-Eval TrainTest with 144 thousand series, Chronos-Mixup with 30 million series and 63 billion observations, KernelSynth with 1 million series and 1.0 billion observations, and Salesforce CloudOps with 2.15 million series and 1.48 billion observations (Liu et al., 12 Nov 2025).
Preprocessing is univariate only: multivariate data are decomposed into independent univariate series. The training pipeline uses instance-wise patching, missing-mask concatenation, and instance normalization computed from the first 30% of each instance only, together with random masking of 50% of input patches per sample to increase robustness to missing segments. The reported small variant has roughly 11 million parameters and a depth of approximately 6 layers, and is trained for 100k steps with AdamW, 10k warmup steps, cosine decay, batch size 256, and bf16 precision (Liu et al., 12 Nov 2025).
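The two preprocessing steps described above, normalization from the first 30% and random masking of half the input patches, can be sketched as follows. The function names, the use of the population standard deviation, the `eps` guard, and sampling an exact count of masked patches (rather than independent coin flips) are all assumptions for illustration.

```python
import random
import statistics

def normalize_with_prefix(series, prefix_fraction=0.3, eps=1e-8):
    """Standardize a series using statistics from its leading fraction only."""
    n_prefix = max(2, round(len(series) * prefix_fraction))
    prefix = series[:n_prefix]
    mean = statistics.fmean(prefix)
    std = statistics.pstdev(prefix)  # assumption: population standard deviation
    scaled = [(v - mean) / (std + eps) for v in series]
    return scaled, mean, std

def random_patch_mask(num_patches, mask_ratio=0.5, rng=None):
    """Return a boolean list where True marks a patch hidden from the model."""
    rng = rng or random.Random()
    n_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), n_masked))
    return [i in masked for i in range(num_patches)]

series = [float(v) for v in range(1, 11)]
scaled, mean, std = normalize_with_prefix(series)
mask = random_patch_mask(8, mask_ratio=0.5, rng=random.Random(0))
```

Computing the statistics from a prefix means later values do not influence the scale applied to the window, which is one plausible reading of why only the first 30% is used.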
Empirically, the model is evaluated on GIFT-Eval’s 97 tasks spanning 55 datasets and multiple horizons. The small Moirai 2.0 ranks 5th in MASE at 0.728 and 6th in CRPS at 0.516. It outperforms Moirai 1.0-Large despite being 30 times smaller and 2 times faster. Measured inference time on 12 tasks places the small and base variants among the fastest and smallest models while maintaining top-10 accuracy. The reported scaling experiments (small at 11 million parameters, base at 87 million, and large at 305 million) show no accuracy improvement with scale, and the small variant achieves the best MASE and CRPS. This suggests, in the paper’s terms, that the model is data-limited rather than parameter-limited at the current corpus scale (Liu et al., 12 Nov 2025).
The domain-level analysis indicates top-10 MASE performance across Cloud, Finance, Health, IoT, Retail, Telecom, and Transport, but not Nature, which is interpreted as under-representation of natural time series in pretraining. Performance is also horizon-sensitive: the model ranks 4th, 6th, and 8th in MASE for short, medium, and long horizons, respectively. The reported limitations are therefore explicit: degradation at longer horizons, absence of cross-variate modeling, reliance on univariate quantile estimation, and lack of covariates or multivariate dependencies (Liu et al., 12 Nov 2025).
4. Relation to Moirai 1.0 and to Moirai-MoE
The comparison with Moirai 1.0 is organized around three architectural substitutions: masked encoder to decoder-only autoregression, multi-patch sizes to a single patch size, and mixture-distribution negative log-likelihood to quantile loss. The ablation sequence reported on GIFT-Eval traces these changes directly. Starting from “Moirai 1.0 small” at MASE 0.946 and CRPS 0.650, the progression through a decoder-only backbone, a new corpus, quantile loss, recursive decode, random masking, and multi-token prediction ends with “Moirai 2.0” at MASE 0.728 and CRPS 0.516 (Liu et al., 12 Nov 2025).
The efficiency claims are also specific. Moirai 1.0’s masked encoder updates only about 15% of tokens per masked input, whereas the decoder in Moirai 2.0 updates all $N$ tokens of a length-$N$ series, yielding approximately 6 times faster data utilization. During inference, KV-cache support yields up to 4 times speedup at short horizons and up to 17 times at long horizons (Liu et al., 12 Nov 2025).
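As a rough check on the data-utilization figure, one can assume the gain is simply the ratio of tokens that receive a training signal per sequence. This is an interpretation of the claim, not a formula quoted from the paper. For a sequence of $N$ tokens:

```latex
\[
\frac{\text{tokens updated (decoder)}}{\text{tokens updated (masked encoder)}}
\approx \frac{N}{0.15\,N} = \frac{1}{0.15} \approx 6.7 ,
\]
```

which is of the same order as the reported figure of approximately 6 times.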
Moirai-MoE represents a separate design branch within the forecasting family. Its stated motivation is that original Moirai and TimesFM specialize by frequency, but frequency is not a reliable proxy for underlying temporal patterns and is too coarse to handle non-stationarity even within short context windows. Moirai-MoE therefore uses a single shared input/output projection layer while delegating pattern diversity to sparse MoE layers inside a decoder-only Transformer (Liu et al., 2024).
Its MoE module activates only a subset of its experts per token, and the model uses a single patch size and a causal normalizer together with a fixed masking ratio. Two gating mechanisms are described: linear gating learned from scratch and cluster-based gating built from k-means centroids on Moirai token embeddings. The load-balancing auxiliary loss encourages approximately uniform routing. On 29 Monash in-distribution datasets and 10 zero-shot datasets, the model reports a 17% MAE reduction for the small variant relative to small Moirai, an 8% improvement for the base model over base Moirai, and a 7% advantage over a larger 310M-parameter Moirai. In zero-shot evaluation, the base Moirai-MoE achieves average CRPS 0.478 and MASE 0.651, compared with 0.488 and 0.689 for TimesFM, 0.499 and 0.656 for Chronos2, and 0.520 and 0.729 for Moirai-B (Liu et al., 2024).
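Sparse token-level routing with a load-balancing term can be illustrated with a minimal sketch. It assumes gate logits are already available per token (for linear gating these would come from a learned projection of the token embedding) and uses a Switch-Transformer-style auxiliary loss, which is a common choice swapped in here for illustration and not necessarily the exact form used by Moirai-MoE. The expert counts and function names are hypothetical.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Return (expert_indices, renormalized_weights) for a single token."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / norm for i in chosen]

def load_balance_loss(all_logits, k):
    """Auxiliary loss: num_experts * sum_e(routed_fraction_e * mean_gate_prob_e).

    The value is 1.0 when routing and gate probabilities are uniform and grows
    as tokens concentrate on a few experts.
    """
    n_tokens = len(all_logits)
    n_experts = len(all_logits[0])
    fraction = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in all_logits:
        chosen, _ = top_k_route(logits, k)
        for e in chosen:
            fraction[e] += 1.0 / (n_tokens * k)
        for e, p in enumerate(softmax(logits)):
            mean_prob[e] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(fraction, mean_prob))

# Eight tokens, four experts: preferences rotate evenly vs. all favor expert 0.
balanced = [[3.0 if e == t % 4 else 0.0 for e in range(4)] for t in range(8)]
skewed = [[3.0, 0.0, 0.0, 0.0] for _ in range(8)]
balanced_loss = load_balance_loss(balanced, k=1)
skewed_loss = load_balance_loss(skewed, k=1)
experts, weights = top_k_route([2.0, 1.0, 0.0, -1.0], k=2)
```

Minimizing a term of this kind alongside the forecasting loss pushes the gate toward spreading tokens across experts, which is the behavior the text describes as approximately uniform routing.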
The MoE analyses are interpretively important. Token embeddings are reported to group series by underlying pattern rather than by frequency; shallow layers use diverse experts, whereas deeper layers converge toward similar expert distributions across frequencies; and, for Traffic Hourly, expert assignments vary periodically with time-of-day. This suggests that the MoE branch operationalizes “Moirai 2.0” not as architectural simplification, but as token-level specialization and adaptive routing under non-stationarity (Liu et al., 2024).
5. Financial return forecasting as a stress test
A 2026 benchmark evaluates pretrained time-series foundation models, including Moirai-2.0, on financial return forecasting, explicitly describing the task as difficult because of low signal-to-noise ratios, structural breaks, heavy tails, and weak persistence (Alonso et al., 25 Jun 2026). In this setting, Moirai-2.0 is treated as a pretrained decoder-only Transformer for long-horizon quantile forecasting of univariate time series, with instance normalization based on the first 30% of the window, patch embeddings via a residual block with SiLU activation, and a quantile head producing a fixed set of quantiles (Alonso et al., 25 Jun 2026).
The theoretical framing in that benchmark presents pretraining as a data-dependent inductive prior. The central claim is that large-scale pretraining reduces effective model complexity in small-sample local regimes, which helps explain why zero-shot TSFM inference can outperform train-from-scratch high-capacity baselines without implying strong economic predictability. The paper further emphasizes information-theoretic limits on the predictability of daily equity returns, together with an operator-theoretic view in which attention kernels and their spectral gap govern long-range information flow (Alonso et al., 25 Jun 2026).
The empirical protocol uses five liquid U.S. equities—AAPL, AMZN, GOOG, JPM, and META—with both linear and log returns, a horizon of 20 business days, 10 rolling-origin windows from August 2024 to February 2026, and an equalized lookback of 512 observations. Pretrained TSFMs are evaluated zero-shot, baselines such as NBEATS, NHITS, PatchTST, iTransformer, and KAN are fitted per ticker from scratch, and the primary metric is MAE together with a skill score relative to a random-walk benchmark (Alonso et al., 25 Jun 2026).
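The evaluation protocol can be sketched in a few lines. The sketch assumes that rolling origins step back by one full horizon so target windows do not overlap, that the skill score is one minus the ratio of model MAE to benchmark MAE, and that the random-walk benchmark for returns is a zero-return forecast. None of these conventions is confirmed by the text above, and the function names are hypothetical.

```python
def mae(actual, forecast):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def skill_score(actual, forecast, benchmark):
    """Positive when the forecast beats the benchmark, zero when they tie."""
    return 1.0 - mae(actual, forecast) / mae(actual, benchmark)

def rolling_origins(n_obs, lookback, horizon, n_windows):
    """Yield ((ctx_start, ctx_end), (tgt_start, tgt_end)) half-open index pairs."""
    for w in range(n_windows):
        origin = n_obs - horizon * (n_windows - w)
        if origin - lookback < 0:
            raise ValueError("not enough history for the requested windows")
        yield (origin - lookback, origin), (origin, origin + horizon)

# Settings from the protocol: lookback 512, horizon 20, 10 windows.
windows = list(rolling_origins(n_obs=712, lookback=512, horizon=20, n_windows=10))

# Toy returns, purely illustrative.
actual = [0.01, -0.02, 0.03]
forecast = [0.0, -0.01, 0.02]
random_walk = [0.0, 0.0, 0.0]  # assumed zero-return benchmark
skill = skill_score(actual, forecast, random_walk)
```

A skill score near zero means the model is barely distinguishable from the benchmark, which is the regime the benchmark's conservative conclusion describes.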
Under this protocol, Moirai-2.0 attains the best aggregate average rank of 2.9 across 10 tasks, with 3 task-level wins and 6 top-3 finishes. The best individual result is GOOG/log, the only Moirai-2.0 comparison in which a one-sided Diebold-Mariano test rejects equal or inferior predictive accuracy relative to the random walk. Yet the broader conclusion is deliberately conservative: gains over the random-walk benchmark are small and sparse, and Moirai-2.0 is characterized as a useful practical prior rather than a universal engine for statistically reliable alpha generation (Alonso et al., 25 Jun 2026).
6. Separate usage in AI-mediated episodic prospection
A separate paper uses “Moirai 2.0” for an end-to-end platform designed to augment human prospective cognition by simulating multiple “fates” as vivid, interactive future-self avatars (Poonsiriwong et al., 5 Dec 2025). The system ingests personal decision scenarios and autobiographical data, synthesizes multimodal digital twins representing divergent life paths, and measures how those simulations shift real-world choice probabilities.
Its architecture is divided into five modules: a Decision Elicitation Interface, a Life-Story Interface, Age-Progressed AI and Voice Cloning, Future Memory Generation, and a Conversational Avatar Interface. The decision interface collects a consequential binary choice and the participant's current leaning, defining a pre-intervention probability distribution over the two options. The life-story questionnaire elicits values, aspirations, and self-narrative across career, family, finances, lifestyle, and philosophy. The visual pipeline applies Google’s Nano Banana age-progression model followed by LivePortrait animation, and the voice pipeline uses ElevenLabs to create a neural voice clone. Future memories are generated by Anthropic Claude Sonnet 4.5 and are structured around evaluative, affective, and eudaimonic vividness (Poonsiriwong et al., 5 Dec 2025).
The intervention logic is organized by “fates”: control as guided imagination without avatars, a single-option condition, a balanced dual-option condition, and an expanded three-option condition including an algorithmically generated Option C. Participants converse with 0, 1, 2, or 3 avatars for 7 to 10 minutes, asking about satisfactions, regrets, relationships, and advice, with gentle prompting after 20 seconds of silence (Poonsiriwong et al., 5 Dec 2025).
The randomized controlled study includes young adults aged 18 to 28. The reported effects are asymmetric across the single-option A, single-option B, and balanced dual-option conditions, and the expanded three-option condition raises adoption of Option C to 20.0% versus 2.7% in control. A logistic regression for switching includes perceived persuasiveness, baseline agency, and condition indicators, and reports a coefficient and corresponding odds ratio for the two-sided condition relative to control. Vividness ratings are highest for evaluative and eudaimonic vividness, ahead of visual and affective vividness, with the differences assessed by a Friedman test and Kendall’s $W$ as the effect size (Poonsiriwong et al., 5 Dec 2025).
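For readers less familiar with logistic regression, the relation between a coefficient $\beta$ and its odds ratio, and a rough unadjusted odds ratio implied by the two reported Option C adoption rates, can be written out as follows. The second expression is a back-of-envelope calculation from the two percentages alone; it is not a figure from the paper and ignores the covariates in the regression.

```latex
\[
\mathrm{OR} = e^{\beta},
\qquad
\mathrm{OR}_{\text{Option C, unadjusted}}
= \frac{0.200 / (1 - 0.200)}{0.027 / (1 - 0.027)}
\approx \frac{0.250}{0.0277} \approx 9.0 .
\]
```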
The ethical framing centers on autonomy, persuasion, and contestability. The paper explicitly recommends balanced presentation as default, transparent framing that the simulation is hypothetical rather than predictive, mechanisms for users to flag or edit narratives, and agency-enhancing design that treats avatars as reflective partners rather than infallible oracles. In that literature, “Moirai 2.0” names not a forecasting model, but a multimodal decision-support system whose persuasive capacity itself becomes an object of study (Poonsiriwong et al., 5 Dec 2025).