
Time-MoE: Scalable Forecasting Model

Updated 30 June 2025
  • Time-MoE is a scalable foundation architecture that integrates sparse Mixture-of-Experts with autoregressive Transformers for efficient and flexible time series forecasting.
  • It decouples model capacity from computational cost by activating only top-K experts per token, achieving state-of-the-art performance with sublinear inference growth.
  • Pretrained on the billion-scale Time-300B dataset, Time-MoE delivers robust zero-shot and fine-tuned forecasting across diverse, multivariate time series domains.

Time-MoE is a scalable foundation architecture for time series forecasting that leverages sparse Mixture-of-Experts (MoE) within autoregressive Transformers to combine the accuracy and flexibility of large-scale pretraining with major gains in computational and inference efficiency. Designed to address the unique challenges of multivariate, multi-domain time series—such as large scale, distribution diversity, and variable temporal horizons—Time-MoE advances the state-of-the-art by providing both a billion-parameter universal model and a billion-scale pretraining dataset (Time-300B), while keeping computational cost sublinear in model size.

1. Model Architecture and MoE Design

The Time-MoE architecture is a decoder-only Transformer that replaces the standard dense feed-forward network (FFN) sublayer in each transformer block with a sparse MoE layer. The model structure includes the following key components:

  • Input Token Embedding: Raw time series values are tokenized pointwise and passed through a SwiGLU embedding.
  • Stacked MoE Transformer Blocks: Each block consists of causal self-attention (with RMSNorm) followed by a Mixture-of-Experts FFN sublayer. The MoE FFN output for layer $l$ and token $t$ is:

$$\operatorname{Mixture}(\bar{\mathbf{u}}_t^{l}) = g_{N+1,t}\,\operatorname{FFN}_{N+1}(\bar{\mathbf{u}}_t^{l}) + \sum_{i=1}^{N} g_{i,t}\,\operatorname{FFN}_i(\bar{\mathbf{u}}_t^{l})$$

where $N$ is the number of (non-shared) experts, $g_{i,t}$ are data-dependent gating weights (nonzero only for the top-$K$ experts), and $\operatorname{FFN}_{N+1}$ is a global "shared expert" gated via a sigmoid.

  • Multi-Resolution Forecasting Head: Supports multiple forecast horizons ($\{1, 8, 32, 64\}$, etc.) with $P$ output projections, allowing the model to flexibly generate forecasts for arbitrary time horizons in a single forward pass.

Token-to-expert assignment is determined by a top-$K$ sparse softmax gating network, and only the selected experts for each token are activated, making computation sparse and highly efficient.
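To make the routing concrete, here is a minimal PyTorch sketch of a sparse MoE feed-forward sublayer with top-$K$ gating and a sigmoid-gated shared expert, together with a multi-resolution output head. Class names, hidden sizes, the SwiGLU-style expert, and the renormalization of gates over the top-$K$ logits are illustrative assumptions rather than the released Time-MoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One feed-forward expert with a SwiGLU-style gated activation."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class SparseMoEFFN(nn.Module):
    """Sparse MoE FFN: top-K routed experts plus one sigmoid-gated shared expert."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(n_experts))
        self.shared_expert = SwiGLUExpert(d_model, d_ff)        # FFN_{N+1}
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.shared_gate = nn.Linear(d_model, 1, bias=False)    # produces g_{N+1,t} via sigmoid

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, seq_len, d_model); flatten tokens for routing.
        b, t, d = u.shape
        tokens = u.reshape(-1, d)

        logits = self.router(tokens)                            # (b*t, N)
        top_val, top_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(top_val, dim=-1)                      # g_{i,t}, nonzero only for top-K experts

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e                                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * expert(tokens[mask])

        # Shared expert is applied to every token, weighted by a sigmoid gate.
        out += torch.sigmoid(self.shared_gate(tokens)) * self.shared_expert(tokens)
        return out.reshape(b, t, d)

class MultiResolutionHead(nn.Module):
    """P parallel output projections, one per forecast horizon (e.g. 1, 8, 32, 64 steps)."""
    def __init__(self, d_model: int, horizons=(1, 8, 32, 64)):
        super().__init__()
        self.horizons = horizons
        self.heads = nn.ModuleList(nn.Linear(d_model, h) for h in horizons)

    def forward(self, h_last: torch.Tensor) -> dict:
        # h_last: (batch, d_model) hidden state of the last context token.
        return {hz: head(h_last) for hz, head in zip(self.horizons, self.heads)}
```

In this sketch the gating weights are renormalized over the selected top-$K$ logits; the released model may instead take a softmax over all experts and zero out the non-selected entries. A longer forecast can be assembled by composing the available projection lengths across autoregressive steps.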

2. Scalability and Efficiency

Time-MoE is explicitly designed to decouple model capacity from inference/training cost, addressing the efficiency constraint that has historically limited the scale of time series foundation models:

  • Sparse Activation: Only $K$ of $N$ experts per layer are activated for each input token, yielding sublinear computational growth. For example, the "ultra" version has 2.4B total parameters, but only 1.1B are activated per prediction.
  • Load Balancing: An auxiliary loss (as in Switch Transformer and GShard) is used to avoid "expert collapse" and encourage balanced token distribution across experts; a minimal sketch of this loss follows the list below.
  • Sublinear Increase of Inference Cost: Empirically, training cost is reduced by an average of 78%, and inference cost by 39%, versus dense models with equivalent capacity. Inference time grows sublinearly with the number of experts, as shown in ablation studies.
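The load-balancing term referenced above can be illustrated with a Switch Transformer-style auxiliary loss: the product of the fraction of tokens dispatched to each expert and the mean router probability per expert, summed over experts and scaled. The coefficient and exact formulation below are assumptions for the sketch, not the values used in Time-MoE.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int, alpha: float = 0.01) -> torch.Tensor:
    """Switch/GShard-style auxiliary loss encouraging uniform expert utilization.

    router_logits: (num_tokens, num_experts) raw gating scores from one MoE layer.
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                       # router probabilities per token
    top_idx = probs.topk(top_k, dim=-1).indices                    # experts actually selected
    dispatch = F.one_hot(top_idx, num_experts).sum(dim=1).float()  # (num_tokens, num_experts) 0/1 mask

    tokens_per_expert = dispatch.mean(dim=0)   # f_i: fraction of tokens routed to expert i
    prob_per_expert = probs.mean(dim=0)        # P_i: mean gate probability of expert i
    return alpha * num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```

Minimizing this term pushes both the routed-token fractions and the average gate probabilities toward a uniform $1/N$ distribution, which is what prevents a handful of experts from absorbing most of the traffic.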

Efficient scaling is verified by monotonic improvements in forecasting metrics with increasing model size and training data (scaling law), mirroring results seen in NLP and vision domains.

3. Pre-Training Pipeline and the Time-300B Dataset

Time-MoE models are pretrained on Time-300B, a dataset constructed to maximize generalization and cross-domain capability:

  • Scale and Coverage: Time-300B consists of over 309 billion time points from 48 million sequences, covering nine major domains (energy, finance, retail, healthcare, weather, transportation, web, synthetic, and others) and sampling frequencies ranging from seconds to yearly.
  • Balanced Sampling: The data pipeline handles missing values, removes invalid readings, and balances domains through batch sampling and strategic downsampling of dominant sources, ensuring the model is not biased toward any single domain (see the sampling sketch after this list).
  • Training Regimen: Pretraining is performed for 100k steps with batch size 1024 and maximum sequence length 4096, using the AdamW optimizer and bf16 precision, on 128 × A100-80G GPUs.
  • Scaling Law Validation: Empirical results confirm that both increasing model size and increasing data scale improve forecasting precision, validating the applicability of modern scaling laws to time series forecasting.
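As a rough illustration of the domain-balancing step, the sketch below caps each domain's sampling weight so that dominant sources are downsampled before batches are drawn. The cap value, domain names, and sequence counts are hypothetical; the published Time-300B pipeline may balance domains differently.

```python
import numpy as np

def domain_sampling_weights(domain_counts: dict, cap: float = 0.2) -> dict:
    """Proportional sampling weights with a per-domain cap that downsamples dominant sources."""
    total = sum(domain_counts.values())
    raw = {d: c / total for d, c in domain_counts.items()}
    capped = {d: min(w, cap) for d, w in raw.items()}   # strategic downsampling of large domains
    norm = sum(capped.values())
    return {d: w / norm for d, w in capped.items()}

def sample_batch_domains(weights: dict, batch_size: int, seed: int = 0) -> list:
    """Draw the source domain for each sequence in a pretraining batch."""
    rng = np.random.default_rng(seed)
    domains = list(weights)
    p = np.array([weights[d] for d in domains])
    return list(rng.choice(domains, size=batch_size, p=p))

# Hypothetical per-domain sequence counts; the real Time-300B mixture differs.
counts = {"energy": 20_000_000, "weather": 12_000_000, "finance": 6_000_000,
          "transport": 4_000_000, "web": 3_000_000, "healthcare": 2_000_000,
          "retail": 500_000, "synthetic": 400_000, "other": 100_000}
weights = domain_sampling_weights(counts)
batch_domains = sample_batch_domains(weights, batch_size=1024)  # domains for one batch of 1024 sequences
```

With these illustrative counts, the cap trims the two largest domains and redistributes their probability mass to the smaller ones, so no single source dominates a pretraining batch.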

4. Forecasting Performance and Empirical Results

Time-MoE achieves state-of-the-art accuracy on a range of standard and challenging forecasting tasks:

  • Zero-shot Forecasting: On datasets such as ETTh1/2, ETTm1/2, Weather, and Global Temp, Time-MoE "ultra" (2.4B) obtains an average MSE of 0.322, outperforming prior SOTA (Moirai_large 0.349, TimesFM 0.396, Chronos_large 0.428). This represents a relative error reduction of 20–30% compared to dense and MoE-based baselines.
  • In-distribution Fine-tuning: After 1 epoch of supervised fine-tuning, the model further reduces MSE by ~24% on average over dense SOTA baselines.
  • Comparison to Dense Models: Time-MoE with the same number of activated parameters (thus the same inference/training cost) outperforms dense models by a substantial margin.
  • Ablation Studies: Replacing MoE FFN with a dense FFN increases average MSE by 4%. Removing either the multi-resolution output head or load balancing degrades both performance and computational efficiency.
  • Inference Speed: Inference latency increases sublinearly as the number of experts grows and remains practical for high-throughput and real-time deployments.

5. Applications and Capabilities

Time-MoE is architected as a universal and robust forecasting backbone suitable for real-world deployment:

  • Universal Forecasting: Able to process univariate and multivariate series, any context length, and flexible output horizons.
  • Domains: Energy systems (demand/supply forecasting), climate/weather (short- and long-range), finance (asset prediction, risk), healthcare (epidemic monitoring), retail/sales/demand, and transportation.
  • Practical Advantages:
    • Scalability: Supports billion-scale model pretraining and efficient inference on commodity GPUs.
    • Generalization: Excels at zero-shot and few-shot tasks, robust to out-of-distribution scenarios.
    • Data-agnostic: Handles data of diverse sampling rates, missing values, and domain imbalances.
    • Open Source: Both the model family and the Time-300B dataset are publicly available.

6. Implications and Significance

Time-MoE demonstrates that mixture-of-experts scaling principles, previously successful in NLP and vision, transfer to the domain of time series forecasting, but require careful architectural adaptation (sparse MoE, flexible forecasting heads, domain-balanced pretraining).

The practical impact includes:

  • Industrial Readiness: Delivers highly accurate and efficient forecasting at an industrial scale, making billion-parameter models accessible for real-world planning and analytics applications.
  • Extensibility: Its modular, sparsely activated structure supports adaptation to missing data regimes, multi-horizon outputs, and the addition of new expert subnetworks.
  • Research Foundation: Establishes both an open model and a massive-scale, multi-domain dataset for further research in time series foundation models.

The approach directly addresses long-standing challenges in scaling, efficiency, and cross-domain robustness for time series modeling, setting a new benchmark for both methodological rigor and practical applicability in large-scale deep forecasting systems.