Time-MoE: Scalable Time Series Modeling

Updated 26 March 2026
  • Time-MoE is a scalable neural model that uses sparse Mixture-of-Experts layers within transformer backbones to dynamically route time series inputs to specialized expert subnetworks.
  • It improves efficiency by activating only a few experts per input, thereby reducing computational cost while effectively addressing non-stationarity and diverse signal modalities.
  • The architecture supports various routing strategies—including token-wise, segmental, and multi-modal approaches—enabling advanced applications from astronomical analysis to industrial forecasting.

Time-MoE denotes a family of scalable neural models for time series that leverage sparse Mixture-of-Experts (MoE) layers within transformer backbones. The approach allows per-instance, per-token, or per-segment dynamic routing to specialized expert subnetworks, yielding high model capacity without linear growth in computational cost. This conditional sparsity—activating only a subset of experts for each input—addresses both heterogeneity (non-stationarity, diverse periodicities, modality integration) and efficiency (scalability, parameter utilization) in time-series modeling across a range of domains, from astronomical light curves to billion-scale industrial forecasting. Time-MoE models now underpin several state-of-the-art time series foundation models and extend naturally to multi-modal and segmental settings (Shi et al., 2024, Liu et al., 2024, Cádiz-Leyton et al., 16 Jul 2025, Liu et al., 9 Jul 2025, Zhang et al., 29 Jan 2026, Ortigossa et al., 29 Jan 2026).

1. Architectural Principles and Variants

The core of Time-MoE is the integration of sparse MoE layers within a (typically transformer-based) forecasting model. Several representative instantiations have emerged:

  • Token-wise MoE: Each time step or input token passes through a gating network that selects $k$ out of $N$ experts (feed-forward subnets)—as in Time-MoE (Shi et al., 2024), Moirai-MoE (Liu et al., 2024), and MoFE-Time (Liu et al., 9 Jul 2025). The selection is based on token contents and, optionally, context.
  • Segment-wise (Segmental) MoE: The model groups contiguous tokens into segments; routing and processing happen at the segment level, letting experts directly model intra-segment dependencies (Seg-MoE (Ortigossa et al., 29 Jan 2026)).
  • Band- or Channel-specific MoE: Inputs with structural heterogeneity (e.g., multiband astronomical data) utilize expert gating by photometric band or measurement channel (Astro-MoE (Cádiz-Leyton et al., 16 Jul 2025)).
  • Multi-Modal MoE: Gating and expert outputs can be modulated by auxiliary modalities, such as textual context, to fuse multi-modal signals in a conditional manner (MoME (Zhang et al., 29 Jan 2026)).
  • Frequency-Time MoE: Some experts learn frequency-domain representations; others operate in the time domain, with gating across both (MoFE-Time (Liu et al., 9 Jul 2025)).

Key MoE features are included in the table below.

| Variant    | Routing Granularity | Modality Support         |
|------------|---------------------|--------------------------|
| Time-MoE   | Token               | Uni-modal (time series)  |
| Moirai-MoE | Token               | Uni-modal                |
| Seg-MoE    | Segment (window)    | Uni-modal                |
| Astro-MoE  | Band-token          | Multiband, time series   |
| MoME       | Token, multi-modal  | Time series + text       |
| MoFE-Time  | Token               | Time/frequency hybrid    |

2. Sparse MoE Layer Design

All Time-MoE architectures replace standard dense feed-forward sublayers in transformers with MoE layers. The general operational sequence:

  • Gating Network: For input $x \in \mathbb{R}^{d_{in}}$, compute scores $g(x) = W_g x + b_g \in \mathbb{R}^N$. Apply top-$k$ sparsity, zeroing all but the largest $k$ scores, then normalize the remaining scores with a softmax for probabilistic weighting.
  • Expert Computation: Only the selected $k$ experts perform their specific non-shared transformations, e.g., $E^{(e)}(x) = W_2^{(e)} \phi(W_1^{(e)} x + b_1^{(e)}) + b_2^{(e)}$ with $\phi$ typically GeLU.
  • MoE Output: Aggregate as $\mathrm{MoE}(x) = \sum_{e=1}^{N} G(x)_e E^{(e)}(x)$, where $G(x)_e$ are the nonzero gating weights corresponding to selected experts (Shi et al., 2024, Liu et al., 2024, Cádiz-Leyton et al., 16 Jul 2025).
  • Load-Balancing Auxiliary Loss: To prevent expert collapse, an auxiliary penalty such as $L_{\mathrm{aux}} = \lambda \sum_{e=1}^{N} p_e f_e$ is added, where $p_e$ is the average selection probability and $f_e$ is the usage fraction for expert $e$. A code sketch of this full sequence follows the list.
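
As a concrete illustration, the following is a minimal PyTorch-style sketch of the gate–select–aggregate sequence above. Class and parameter names are illustrative assumptions, not taken from any released Time-MoE codebase, and the auxiliary-loss scaling follows a common convention rather than a prescribed value:

```python
# Minimal sketch of a sparse top-k MoE layer; names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_in, n_experts)  # g(x) = W_g x + b_g
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_in)
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor):  # x: (batch, tokens, d_in)
        scores = self.gate(x)                                 # (B, T, N)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # top-k sparsity
        weights = F.softmax(topk_scores, dim=-1)              # renormalize kept scores
        out = torch.zeros_like(x)
        for slot in range(self.k):                            # weighted expert sum
            idx = topk_idx[..., slot]                         # (B, T)
            w = weights[..., slot].unsqueeze(-1)              # (B, T, 1)
            for e, expert in enumerate(self.experts):
                mask = idx == e                               # tokens routed to e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        # load balancing: p_e = mean router probability, f_e = usage fraction
        p = F.softmax(scores, dim=-1).mean(dim=(0, 1))
        f = F.one_hot(topk_idx, scores.size(-1)).float().mean(dim=(0, 1, 2))
        aux_loss = scores.size(-1) * (p * f).sum()            # ~ sum_e p_e f_e
        return out, aux_loss
```

With, e.g., n_experts=8 and k=2, each token activates only a quarter of the expert parameters, while total capacity scales with the full pool.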

Segment-wise variants (Seg-MoE) perform gating on flattened multi-token segments and often include an always-on shared expert per segment (Ortigossa et al., 29 Jan 2026); a hypothetical sketch of segment-level gating follows this paragraph. Multi-modal variants modulate both gating and expert outputs by auxiliary signals (e.g., text embeddings) (Zhang et al., 29 Jan 2026).
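
A hedged sketch of segment-level gating in this spirit, assuming fixed-length segments and a flattening router (names, shapes, and the flattening choice are assumptions, not Seg-MoE's actual code):

```python
# Hypothetical sketch of a segment-level router in the spirit of Seg-MoE.
import torch
import torch.nn as nn

class SegmentGate(nn.Module):
    def __init__(self, d_model: int, seg_len: int, n_experts: int):
        super().__init__()
        self.seg_len = seg_len
        # the router sees the whole flattened segment, not individual tokens
        self.gate = nn.Linear(d_model * seg_len, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, d_model)
        B, T, D = x.shape
        S = T // self.seg_len                             # whole segments only
        segs = x[:, : S * self.seg_len].reshape(B, S, self.seg_len * D)
        # one routing decision per segment; an always-on shared expert would
        # additionally process every segment and be summed with routed outputs
        return self.gate(segs)                            # (B, S, n_experts)
```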

3. Input Encoding and Gating Strategies

Time-MoE models are designed to adapt routing and expert computation to non-stationary, multi-scale properties of time series:

  • Band- and Channel-Specific Encoding: For multiband data, such as astronomical light curves, input tokens are formed as $(\text{flux}, \sigma)$ per band and concatenated (Cádiz-Leyton et al., 16 Jul 2025).
  • Segmentation: Long series are patched or windowed into fixed-size segments $[x_t, x_{t+1}, \ldots, x_{t+P-1}]$, with gating applied at the segment level (Ortigossa et al., 29 Jan 2026); together with causal normalization below, this step is sketched in code after the list.
  • Temporal/Positional Encoding: Strategies include fixed sinusoidal encodings and learnable band-specific time modulation, whereby a per-band Fourier series is adaptively combined with learnable MoE functions (Cádiz-Leyton et al., 16 Jul 2025).
  • Causal Normalization: For highly non-stationary streams, normalization is applied using sliding windows to ensure distributional stability for each patch (Liu et al., 2024).
  • Gating Specialization: Routing can be standard linear, clustering-based (e.g., using k-means on frozen token embeddings for more semantically coherent partitions (Liu et al., 2024)), or modulated by auxiliary modalities (e.g., text (Zhang et al., 29 Jan 2026)).
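
As referenced above, the segmentation and causal-normalization steps can be sketched in a few lines of NumPy; the window and patch sizes here are arbitrary illustrative values, not those used by any of the cited models:

```python
# Illustrative segmentation plus causal (sliding-window) normalization.
import numpy as np

def causal_normalize(series: np.ndarray, window: int = 64) -> np.ndarray:
    """Normalize each point using only the preceding `window` values."""
    out = np.empty_like(series, dtype=float)
    for t in range(len(series)):
        ctx = series[max(0, t - window): t + 1]   # past context, no future leakage
        out[t] = (series[t] - ctx.mean()) / (ctx.std() + 1e-8)
    return out

def patch(series: np.ndarray, patch_len: int = 16) -> np.ndarray:
    """Split a series into contiguous fixed-size segments [x_t, ..., x_{t+P-1}]."""
    n = (len(series) // patch_len) * patch_len    # drop the ragged tail
    return series[:n].reshape(-1, patch_len)

segments = patch(causal_normalize(np.sin(np.linspace(0, 20, 512))))  # (32, 16)
```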

A direct implication is that Time-MoE achieves fine-grained, context-adaptive specialization, crucial for domains with non-homogeneous or dynamically evolving statistical regimes.

4. Specialized Experts: Frequency, Time, and Modality

Recent architectures employ advanced expert types:

  • Frequency-Time Cells: MoFE-Time implements each expert as a dual-branch subnetwork with time-domain (dilated convolution) and frequency-domain (learned harmonics via complex exponentials) pathways; a hedged sketch follows this list. This hybrid captures both periodicity and local behavior, substantially improving predictive accuracy for signals with mixed periodic structures (Liu et al., 9 Jul 2025).
  • Expert Modulation by Auxiliary Signal: In MoME, each expert's computation and the router itself are modulated by contextual embeddings distilled from auxiliary modalities (e.g., language), enabling adaptive cross-modal control (Zhang et al., 29 Jan 2026).
  • Segmental Experts: In Seg-MoE, experts model whole segments, better matching the local continuity observed in real-world series and producing improved extrapolations (Ortigossa et al., 29 Jan 2026).
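
A minimal sketch of a dual-branch frequency-time expert in the spirit of MoFE-Time; the exact branch designs here (a dilated Conv1d and a learnable rFFT spectral filter) are assumptions for illustration, not the paper's implementation:

```python
# Hypothetical dual-branch expert: time-domain and frequency-domain pathways.
import torch
import torch.nn as nn

class FreqTimeExpert(nn.Module):
    def __init__(self, d_model: int, dilation: int = 2):
        super().__init__()
        # time-domain branch: dilated convolution along the token axis
        self.time_branch = nn.Conv1d(d_model, d_model, kernel_size=3,
                                     padding=dilation, dilation=dilation)
        # frequency-domain branch: learnable complex-valued spectral filter
        self.freq_weight = nn.Parameter(torch.randn(d_model, dtype=torch.cfloat))
        self.mix = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, d_model)
        t = self.time_branch(x.transpose(1, 2)).transpose(1, 2)
        spec = torch.fft.rfft(x, dim=1)                   # per-channel spectrum
        f = torch.fft.irfft(spec * self.freq_weight, n=x.size(1), dim=1)
        return self.mix(torch.cat([t, f], dim=-1))        # fuse both pathways
```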

This modularity facilitates knowledge transfer, as experts can encode reusable priors (e.g., periodic basis functions) across datasets during pretraining and be finetuned to downstream tasks.

5. Training Paradigms, Losses, and Efficiency

Time-MoE training generally proceeds in two stages:

  • Large-scale Pretraining: Autoregressive or masked reconstruction objectives are used on massive corpora such as Time-300B (≈300B points across 9 domains), with an auxiliary MoE load-balancing loss added to prevent collapse (Shi et al., 2024, Liu et al., 9 Jul 2025). Losses include Huber for regression and cross-entropy for classification/forecasting. Reconstruction-based self-supervision (e.g., mask-and-reconstruct for astronomical light curves) is also common (Cádiz-Leyton et al., 16 Jul 2025).
  • Task-specific Finetuning: The full backbone, including all experts and gating parameters, is finetuned on supervised targets, often with reduced learning rates. The same combined task-plus-auxiliary loss is retained (Liu et al., 9 Jul 2025); a minimal sketch of this combined objective follows.
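
As referenced above, the combined objective can be sketched as a single function, reusing the aux_loss returned by the layer sketch in Section 2; the weight lambda_aux = 0.02 is an assumed value, as the papers' exact coefficients may differ:

```python
# Hedged sketch: Huber forecasting loss plus the MoE load-balancing penalty.
import torch.nn.functional as F

def combined_loss(pred, target, aux_loss, lambda_aux: float = 0.02):
    return F.huber_loss(pred, target) + lambda_aux * aux_loss
```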

Computationally, only a fixed small number ($k \ll N$) of expert paths are activated per token, so per-token inference and training FLOPs scale with $k$, not $N$. Activated parameter counts are matched to baseline dense models, while total capacity grows linearly with the expert pool (Liu et al., 2024, Shi et al., 2024).
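
A back-of-envelope illustration of this decoupling, with made-up sizes (32 experts of 4M parameters each under top-2 routing; these numbers are purely illustrative):

```python
# Total capacity grows with N; per-token compute grows only with k.
N, k, p = 32, 2, 4_000_000
print(f"total expert capacity: {N * p:>12,d} params")  # 128,000,000
print(f"activated per token:   {k * p:>12,d} params")  #   8,000,000
```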

6. Empirical Performance and Scaling Laws

Time-MoE models demonstrate substantial empirical gains across standard benchmarks:

  • Benchmark Accuracy: Time-MoE (2.4B params) achieves $\mathrm{MSE} \approx 0.322$ on zero-shot benchmarks versus $0.359$ for the best dense baseline; one-epoch finetuning yields $\mathrm{MSE} \approx 0.301$ (Shi et al., 2024). MoFE-Time reduces MSE and MAE by 6.95% and 6.02% over baseline Time-MoE on public datasets (Liu et al., 9 Jul 2025). Seg-MoE yields state-of-the-art average MSE on 6/7 datasets, with gains pronounced at long horizons (Ortigossa et al., 29 Jan 2026).
  • Scaling Behavior: Both parameter and data scaling laws are observed—more data and/or more experts (with activated params and FLOPs fixed) yield consistently lower error (Shi et al., 2024, Liu et al., 2024). Moirai-MoE and Time-MoE outperform dense comparators by up to 17% in normalized MAE while activating up to 28× fewer parameters than larger dense models.
  • Specialization Analysis: Ablations show gains arise primarily from sparse MoE specialization and fine-grained routing; segmental and hybrid (frequency–time) experts further boost performance in non-stationary or multi-periodic regimes (Ortigossa et al., 29 Jan 2026, Liu et al., 9 Jul 2025).
  • Inference Efficiency: Computational cost per token is reduced by 39% relative to traditional dense FFNs. Peak GPU cost grows modestly with expert count and segment size (Shi et al., 2024, Ortigossa et al., 29 Jan 2026).

7. Extensions and Future Directions

Time-MoE is now a unifying architecture for scalable time series modeling, with documented successful applications in:

  • Astronomical discovery pipelines (Astro-MoE), financial/industrial foundation models (Time-MoE, Moirai-MoE), and multi-modal forecasting (MoME).
  • Segmental and multi-resolution modeling (Seg-MoE), exploiting the contiguity and hierarchical structure of real sequences.
  • Advanced hybrid experts (frequency-time cells) (Liu et al., 9 Jul 2025).

Ongoing research avenues include adaptive segment sizing (learned segment window $\omega_b$), dynamic expert sets (combining FFNs, convolutions, or recurrent cells), and large-scale multi-modal pretraining. The modularity and conditional sparsity of Time-MoE architectures provide a promising direction for future work in universal sequence modeling and cross-domain foundation models (Ortigossa et al., 29 Jan 2026, Shi et al., 2024, Zhang et al., 29 Jan 2026).
