MoFE-Time: Frequency & Time Experts

Updated 3 July 2026

MoFE-Time is a mixture-of-experts architecture that integrates specialized time and frequency-domain modules within a Transformer to capture both local temporal and global periodic patterns.
It employs a pretraining–finetuning paradigm combined with sparse expert routing to efficiently model complex temporal dependencies and harmonic structures across diverse datasets.
Empirical evaluations report state-of-the-art forecasting accuracy, with average MSE and MAE reductions of roughly 7% and 6% relative to Time-MoE on public time series benchmarks.

MoFE-Time refers to a family of Mixture-of-Experts (MoE) architectures that integrate specialized time-domain and frequency-domain expert modules within a Transformer framework to achieve state-of-the-art performance in time series forecasting. The MoFE-Time approach combines pretraining-finetuning paradigms with structured expert decompositions, enabling highly expressive yet computationally efficient models that are capable of modeling both local temporal dependencies and global periodic or harmonic structures. Initially introduced in "MoFE-Time: Mixture of Frequency Domain Experts for Time-Series Forecasting Models" (Liu et al., 9 Jul 2025), the concept has since influenced the design of a wide range of extreme-adaptive, frequency-aware, and sparse expert models for multivariate time series and related tasks.

1. Architectural Foundation: Frequency-Time Experts and Sparse Mixtures

MoFE-Time builds upon the Transformer encoder–decoder architecture augmented with sparse Mixture-of-Experts layers. After each multi-head attention block, MoFE-Time introduces a MoE module where each expert is a Frequency-Time Cell (FTC), consisting of two parallel submodules per expert:

Time-cell path: Implements point-wise transformations followed by Swish gating and dilated depthwise convolution. This branch is optimized for learning local temporal motifs in the sequence.
Frequency-cell path: Projects the input onto learned harmonic bases, synthesizes parametrically generated sinusoidal responses, and returns frequency-aware embeddings. Each frequency-cell expert specializes in a different frequency band and adapts via learnable weights.

Given a token-wise hidden state $x \in \mathbb{R}^h$ at a given layer, the MoE router computes an $N$-way sparse top-$k$ gating vector $g(x)$ via a linear projection followed by softmax normalization, assigning $x$ to a weighted combination of $k$ experts. The final output is aggregated as $\hat{y}(x) = \sum_{i=1}^N g_i(x) E_i(x)$, where $E_i(x)$ is the output of expert $i$ and $g_i(x) = 0$ for every expert outside the top $k$. The MoFE-Time stack thus builds sparse representations that are sensitive to both frequency and time structure at every depth (Liu et al., 9 Jul 2025).

2. Mathematical Structure of Expert Transformations and Routing

Time-cell Transformation:

Given hidden state $E_j$,

$Z_j^W = W E_j \quad;\quad Z_j^V = V E_j \quad;\quad \alpha = \mathrm{sigmoid}(\beta)$

$Y_j^{\text{gate}} = \alpha \cdot \mathrm{swish}(Z_j^W) + (1-\alpha) Z_j^V$

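Apply a dilated depthwise convolution (kernel size $\kappa$, dilation $d$), then a linear projection $W_o$:

$Y_j^{\text{conv}} = \mathrm{DWConv}_{\kappa,d}\left(Y_j^{\text{gate}}\right)$

$Y_j^{\text{time}} = W_o\, Y_j^{\text{conv}}$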
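As a rough illustration of the time-cell path (gating, then a dilated depthwise convolution, then a linear projection), the following pure-Python sketch may help. The helper names, the causal zero padding, and the toy weights are assumptions made for the example, not details taken from the paper.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(z):
    # swish(z) = z * sigmoid(z)
    return z * sigmoid(z)

def matvec(M, v):
    # Multiply matrix M (list of rows) by vector v.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gate_step(W, V, beta, e):
    # Y_gate = alpha * swish(W e) + (1 - alpha) * (V e), with alpha = sigmoid(beta).
    alpha = sigmoid(beta)
    zw, zv = matvec(W, e), matvec(V, e)
    return [alpha * swish(a) + (1.0 - alpha) * b for a, b in zip(zw, zv)]

def dilated_depthwise_conv(seq, kernels, dilation):
    # seq: list of T vectors with C channels; kernels[c] is the 1-D kernel of channel c.
    # Assumption: causal convolution with zero padding before the first time step.
    T, C = len(seq), len(seq[0])
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            for i, w in enumerate(kernels[c]):
                src = t - i * dilation
                if src >= 0:
                    out[t][c] += w * seq[src][c]
    return out

# Toy usage with hypothetical weights (2 channels, 4 time steps).
W = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.5, 0.0], [0.0, 0.5]]
W_out = [[1.0, 1.0]]
hidden = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [2.0, 2.0]]
gated = [gate_step(W, V, 0.0, e) for e in hidden]
conv = dilated_depthwise_conv(gated, kernels=[[1.0, 1.0], [1.0, 0.0]], dilation=2)
time_out = [matvec(W_out, y) for y in conv]
```

Because the convolution is depthwise, each channel is filtered independently; mixing across channels happens only in the final linear projection.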

Frequency-cell Transformation:

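Given input $E_j$, a projection $P$ onto the learned harmonic bases, and a per-expert frequency set $\{\omega_m\}_{m=1}^{M}$,

$C_{j,m} = \cos\left(\omega_m\, P E_j\right), \quad m = 1, \dots, M$

$S_{j,m} = \sin\left(\omega_m\, P E_j\right), \quad m = 1, \dots, M$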

A linear projection merges the [cos, sin] outputs into frequency-aware representations.
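The frequency-cell idea can be sketched in a few lines of plain Python. The text does not pin down the exact argument of the sinusoids, so this example assumes each frequency simply scales an already projected input element-wise; `frequency_features`, `linear`, and the toy merge matrix are hypothetical names and values.

```python
import math

def frequency_features(u, freqs):
    # u: projected input vector; freqs: this expert's frequency set.
    # Returns the concatenated [cos, sin] responses, length 2 * len(freqs) * len(u).
    cos_part = [math.cos(w * x) for w in freqs for x in u]
    sin_part = [math.sin(w * x) for w in freqs for x in u]
    return cos_part + sin_part

def linear(M, v):
    # Multiply matrix M (list of rows) by vector v.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Toy usage: two input dimensions, two frequencies, one output dimension.
u = [0.0, math.pi]
feats = frequency_features(u, freqs=[1.0, 2.0])
merge = [[0.125] * len(feats)]  # hypothetical merge weights
freq_out = linear(merge, feats)
```

In a trained model the frequencies and the merge weights would be learned; giving each expert a different frequency set is what lets experts cover different bands.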

Routing:

Given $x \in \mathbb{R}^h$ (token hidden state),

$s(x) = W_r\, x \in \mathbb{R}^N$

A sparse top-$k$ selection with softmax normalization yields the MoE routing vector $g(x)$, used to weight and aggregate the FTC outputs.
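A minimal sketch of the routing step and the gated aggregation $\hat{y}(x) = \sum_i g_i(x) E_i(x)$ follows. It assumes the softmax is taken over the $k$ selected logits only (the text does not say whether normalization happens before or after selection), and the router weights and lambda "experts" are toy stand-ins rather than Frequency-Time Cells.

```python
import math

def top_k_gates(logits, k):
    # Softmax over the k largest logits; all other gates are exactly zero.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def moe_output(x, router_weights, experts, k):
    # Router logits are a linear function of the token hidden state.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    gates = top_k_gates(logits, k)
    out = [0.0] * len(x)  # assumes every expert preserves the input dimension
    for g, expert in zip(gates, experts):
        if g > 0.0:  # only the selected experts are evaluated
            out = [o + g * y for o, y in zip(out, expert(x))]
    return out, gates

# Toy usage: three hypothetical experts, top-2 routing.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [-a for a in v],
    lambda v: [a + 1.0 for a in v],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
y, gates = moe_output([1.0, 0.0], router_weights, experts, k=2)
```

Skipping experts whose gate is zero is where the compute saving of sparse routing comes from: only $k$ of the $N$ experts run per token.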

3. Pretraining–Finetuning Paradigm and Data Regimes

MoFE-Time is trained under a two-stage pretraining–finetuning regime:

Pretraining: Conducted on Time-300B, a large-scale heterogeneous corpus comprising approximately 300 billion time points from domains such as energy, retail, finance, weather, synthetic, and transportation. The objective is masked auto-regressive forecasting over variable-length context and prediction splits, with an emphasis on learning both common and rare periodic structures across domains.
Finetuning: Transfer to downstream benchmarks and proprietary datasets is achieved via a single-epoch adaptation of the gating network and output heads while retaining the harmonic bases and temporal filters learned during pretraining (Liu et al., 9 Jul 2025).

This paradigm enables rapid adaptation to new domains with differing periodicity distributions and improves zero-shot and few-shot forecasting performance by leveraging large-scale periodic priors.

4. Sparse, Multidimensional Representation Capacity

MoE routing yields $k$-sparse mixtures per token, with each FTC specializing in a narrow subset of frequencies and time patterns. Across multiple Transformer layers, this produces a highly sparse, multidimensional representation tensor $G \in \mathbb{R}^{L \times T \times N}$ (layers $\times$ tokens $\times$ experts), with non-zero entries marking which frequency/time experts are active. This explicit specialization enables efficient modeling of complex periodic phenomena without dense parameter coupling.

Empirical analyses demonstrate that removing frequency experts or disabling this sparse specialization significantly degrades forecasting accuracy, particularly in datasets with rich harmonic structure or varying statistical shifts (Liu et al., 9 Jul 2025).

5. Training Objective, Optimization, and Implementation

The MoFE-Time model is optimized by minimizing the sum of:

Forecasting loss: Standard Huber loss (for stability on outliers) over predicted horizons,
MoE load-balance auxiliary loss: To avoid routing-collapse and encourage utilization of all experts.
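To make the two loss terms concrete, here is a small sketch. The Huber loss is standard; the load-balance term uses one common formulation (the Switch-Transformer-style product of routed fraction and mean router probability), which is an assumption, since the text does not specify MoFE-Time's auxiliary loss. The weight `aux_weight` is likewise hypothetical.

```python
def huber(pred, target, delta=1.0):
    # Quadratic for small residuals, linear for large ones (robust to outliers).
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)

def load_balance_loss(router_probs, selected):
    # router_probs: per-token softmax over all N experts.
    # selected: per-token list of the expert indices actually routed to.
    n_experts = len(router_probs[0])
    n_tokens = len(router_probs)
    n_assign = sum(len(s) for s in selected)
    frac = [sum(s.count(i) for s in selected) / n_assign for i in range(n_experts)]
    mean_prob = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    # Equals 1.0 for perfectly uniform routing and grows as routing collapses.
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

def total_loss(pred, target, router_probs, selected, aux_weight=0.01):
    return huber(pred, target) + aux_weight * load_balance_loss(router_probs, selected)

# Toy usage: balanced versus collapsed routing over two experts.
balanced = load_balance_loss([[0.5, 0.5], [0.5, 0.5]], [[0], [1]])
collapsed = load_balance_loss([[1.0, 0.0], [1.0, 0.0]], [[0], [0]])
```

With this formulation, the auxiliary term is minimized when tokens and router probability mass are spread evenly across experts, which is the behavior the load-balance loss is meant to encourage.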

Typical implementation details include:

Adam optimizer, with separate learning rates for pretraining and finetuning,
Sparse MoE gating with top-$k$ selection,
Reversible instance normalization (RevIN) for non-stationarity robustness,
Model sizes scale up to 2.4 billion parameters for large foundation variants; the gating mechanism activates only a small expert subset per token, which keeps inference cost in check.

The full process is defined in an explicit pseudocode loop over context normalization, embedding, repeated multi-head attention and MoE-FTC blocks, and output denormalization (Liu et al., 9 Jul 2025).
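The normalization and denormalization steps that bracket this loop can be sketched as below. The example keeps only the instance-wise standardization at the core of RevIN and omits its learnable affine parameters; the attention and MoE-FTC stack is replaced by a hypothetical stand-in `model`, so this is an outline of the data flow rather than the paper's pseudocode.

```python
import math

def revin_normalize(context, eps=1e-5):
    # Standardize one series instance using its own mean and standard deviation.
    mean = sum(context) / len(context)
    var = sum((v - mean) ** 2 for v in context) / len(context)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in context], (mean, std)

def revin_denormalize(pred, stats):
    # Map predictions back to the original scale of the instance.
    mean, std = stats
    return [p * std + mean for p in pred]

def forecast(context, model):
    # normalize -> model (embedding, attention, MoE blocks) -> denormalize
    normed, stats = revin_normalize(context)
    return revin_denormalize(model(normed), stats)

# Toy usage: a naive stand-in model that repeats the last normalized value.
naive_model = lambda z: [z[-1]] * 3
prediction = forecast([10.0, 12.0, 14.0, 16.0], naive_model)
```

Because the statistics are computed per instance, a shift in level or scale between training and test series is removed before the model sees the data and restored afterwards.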

6. Empirical Evaluation and Comparative Performance

Evaluation on public long-term forecasting benchmarks (ETTh1/2, ETTm1/2, Weather, Exchange Rate) and the NEV-sales dataset shows MoFE-Time achieving state-of-the-art results in most settings. Quantitative highlights include:

Model	Avg. MSE (public)	Avg. MAE (public)	NEV-sales MSE	NEV-sales MAE	Reduction vs. Time-MoE
MoFE-Time	0.2755	0.3226	0.1956	0.3284	6.95% MSE / 6.02% MAE (public); 18.6% MSE (NEV-sales)
Time-MoE	0.2961	0.3433	0.2405	0.3628	—

MoFE-Time ranks first on 39 of the 42 dataset–horizon combinations. Ablations show a 4.2% average MSE increase when the frequency specialists are removed, and a 2–12% degradation when pretraining or normalization is omitted (Liu et al., 9 Jul 2025).
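The relative reductions in the table follow from the reported averages. The short check below recomputes them; the helper name `relative_reduction` is introduced only for this example, and the results agree with the tabulated percentages up to rounding in the last digit.

```python
def relative_reduction(baseline, improved):
    # Percentage by which `improved` is lower than `baseline`.
    return 100.0 * (baseline - improved) / baseline

# Averages reported in the table (Time-MoE as baseline, MoFE-Time as improved).
public_mse = relative_reduction(0.2961, 0.2755)
public_mae = relative_reduction(0.3433, 0.3226)
nev_mse = relative_reduction(0.2405, 0.1956)

for name, value in [("public MSE", public_mse),
                    ("public MAE", public_mae),
                    ("NEV-sales MSE", nev_mse)]:
    print(f"{name}: {value:.2f}% reduction")
```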

7. Strengths, Limitations, and Application Scope

Strengths:

Jointly models time-local and frequency-periodic patterns end-to-end, learning harmonic bases and temporal filters without ad hoc preprocessing.
Sparse MoE reduces inference cost while allowing for parameter scaling.
Substantial robustness to domain shifts, non-stationarity, and variations in periodicity via pretraining and RevIN.
Empirical gains over established baselines such as Time-MoE, TimeMixer, PatchTST, and AutoFormer.

Limitations:

The balance between frequency and time capacity per expert and the choice of $k$ require careful hyperparameter tuning.
Mixture weights, while structured, may hinder interpretability when experts overlap.
For highly irregular or event-driven series, additional preprocessing (e.g., interpolation) may be required.

Application domains include energy load and demand forecasting, retail and sales with periodic cycles, financial time series with harmonic structure, sensor and IoT prediction, and commercial demand estimation as exemplified by NEV-sales (Liu et al., 9 Jul 2025).

References

"MoFE-Time: Mixture of Frequency Domain Experts for Time-Series Forecasting Models" (Liu et al., 9 Jul 2025)
"Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts" (Shi et al., 2024)
"M$^2$FMoE: Multi-Resolution Multi-View Frequency Mixture-of-Experts for Extreme-Adaptive Time Series Forecasting" (Huang et al., 13 Jan 2026)