N-BEATS-MOE: Adaptive Mixture-of-Experts Forecasting
- The paper introduces a Mixture-of-Experts extension that uses a gating network to dynamically weight stacked MLP blocks for enhanced forecasting.
- It employs adaptive selection of trend, seasonal, and identity components, improving interpretability and predictive performance.
- Empirical evaluations show improved SMAPE on heterogeneous datasets while highlighting a trade-off on homogeneous series.
N-BEATS-MOE is a Mixture-of-Experts extension of the N-BEATS deep learning architecture for time series forecasting, designed to enhance adaptability and interpretability across heterogeneous datasets. Building upon the backbone of stacked multilayer perceptron blocks and residual connections, N-BEATS-MOE introduces a dynamic expert weighting mechanism via a gating network, enabling the model to select and combine specialized temporal components (such as trend, seasonality, and identity) in a data-driven fashion. This approach addresses the challenge of varying time series dynamics and demonstrates improved performance, particularly for collections with diverse characteristics (Matos et al., 10 Aug 2025).
1. Architectural Principles and Model Formulation
N-BEATS-MOE retains the core structure of N-BEATS, in which the prediction is generated through multiple stacks of MLP blocks. Each block models specific signal components through backcast and forecast outputs. The standard N-BEATS aggregation forms the final prediction as an unweighted sum,

$$\hat{y} = \sum_{i=1}^{K} \hat{y}_i,$$

where $\hat{y}_i$ is the forecast output of block $i$. N-BEATS-MOE replaces this with a data-adaptive, softmax-weighted sum:

$$\hat{y} = \sum_{i=1}^{K} w_i\,\hat{y}_i, \qquad w = \mathrm{softmax}\big(g(\tilde{x})\big),$$

where $g$ is the gating network and $\tilde{x}$ is the normalized model input. The weights $w_i$ (output by the gating network) encode the relevance of each expert for the current series, summing to unity so that the output remains probability-normalized. This enables the model to adaptively focus on the experts most appropriate to the characteristics of each time series.
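The softmax-weighted aggregation can be sketched in a few lines of NumPy; the array shapes and function names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def moe_aggregate(block_forecasts, gate_logits):
    """Softmax-weighted combination of per-block forecasts.

    block_forecasts: shape (K, H) -- K expert blocks, forecast horizon H
    gate_logits: shape (K,) -- gating scores for the current input window
    """
    # Softmax over the expert dimension; weights sum to 1 by construction.
    w = np.exp(gate_logits - gate_logits.max())
    w /= w.sum()
    # Weighted sum replaces N-BEATS' unweighted sum of block outputs.
    return w @ block_forecasts, w
```

With uniform gating logits this reduces to the plain average of the block forecasts, recovering (up to scale) the unweighted N-BEATS aggregation.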
2. Gating Network Mechanism
The gating network is central to the Mixture-of-Experts formulation. It operates as an affine transformation (a linear layer) over the normalized input $\tilde{x}$, followed by a softmax over the expert dimension $K$. This yields block-specific weights for each forecast output. Normalizing the input via LayerNorm mitigates mode collapse and stabilizes the softmax assignment. The gating mechanism serves two primary purposes:
- Adaptive Expert Selection: It dynamically ranks and weights the contributions of each block based on the input’s profile (e.g., trend-dominated, seasonal, or stationary series).
- Interpretability: The gating outputs provide direct information on the expert selected for each series and can be used to visualize and analyze the temporal component decomposition.
Empirical illustrations in the literature observe the gating outputs matching the underlying STL-decomposition components (trend, seasonality, residual), substantiating the interpretive claim.
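The LayerNorm-then-linear-then-softmax pipeline described above can be sketched as follows; the parameter shapes are assumptions for illustration, and in the actual model `W` and `b` are learned jointly with the forecasting blocks:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the lookback window before gating (stabilizes the softmax).
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def gating_weights(x, W, b):
    """Affine map over the normalized input, then softmax over experts.

    x: (input_size,) lookback window
    W: (n_experts, input_size), b: (n_experts,) -- illustrative parameters
    Returns (n_experts,) weights summing to 1.
    """
    logits = W @ layer_norm(x) + b
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```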
3. Performance Across Heterogeneous and Homogeneous Datasets
N-BEATS-MOE has been benchmarked on 12 datasets (M1, M3, M4, Tourism) with various sampling frequencies and degrees of heterogeneity:
| Dataset | SMAPE (N-BEATS) | SMAPE (N-BEATS-MOE) | Key Finding |
|---|---|---|---|
| M1 (Yearly) | 10.87% | 9.76% | MoE improvement |
| M4 (Yearly) | 13.45% | 13.31% | MoE improvement |
| M3 (Monthly) | mixed | mixed | Some MoE advantage |
| Tourism | -- | no improvement | No MoE benefit |
The observed pattern is that the Mixture-of-Experts scheme displays a clear advantage on heterogeneous datasets, particularly those with mixed temporal patterns. On homogeneous, single-domain datasets (such as Tourism), the adaptive weighting provided by the gating network goes largely unused, occasionally leading to a small performance drop relative to the original N-BEATS.
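For reference, the SMAPE metric reported in the table can be computed as below; this follows the common M-competition convention (equivalently written as $\frac{200}{n}\sum \frac{|y-\hat{y}|}{|y|+|\hat{y}|}$), though exact conventions vary across papers:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent.

    smape = (100/n) * sum( 2*|y - yhat| / (|y| + |yhat|) )
    Assumes the denominator is nonzero (i.e., not both values zero).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    return 100.0 * np.mean(2.0 * np.abs(y_true - y_pred) / denom)
```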
4. Expert Specialization and Model Interpretability
A salient feature of N-BEATS-MOE is the capacity to relate each expert’s role to interpretable time series components:
- Expert Assignment: Softmax weights from the gating block indicate which expert (trend, seasonal, identity) was assigned highest relevance for a given series.
- STL-Validation: STL decomposition (Seasonal-Trend decomposition) experiments show correspondence between gating output and the actual signal decomposition. The model preferentially weights the expert mirroring the dominant time series component.
- Component Amplitude Adjustment: MoE-based models are able to attenuate irrelevant expert contributions, leading to more accurate and interpretable decompositions; e.g., mitigating overestimated trend amplitude that is present in vanilla N-BEATS.
Visualizations link the gating soft assignments to time series structure, facilitating post-hoc interpretability analysis with qualifiers such as “this series was weighted toward expert E.”
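A minimal sketch of this post-hoc reading of the gating output follows; the expert ordering is a hypothetical assumption for illustration, as the actual mapping depends on the stack configuration:

```python
import numpy as np

# Illustrative stack ordering; the real ordering depends on model configuration.
EXPERTS = ["trend", "seasonality", "identity"]

def dominant_expert(weights):
    """Map the largest gating weight for one series to its component label.

    weights: (n_experts,) softmax output of the gating network.
    Returns (component_name, weight), e.g. for statements like
    "this series was weighted toward expert E".
    """
    weights = np.asarray(weights, dtype=float)
    i = int(weights.argmax())
    return EXPERTS[i], float(weights[i])
```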
5. Comparison to Related Architectures and Alternative MoE Schemas
The paper contrasts N-BEATS-MOE with several variants:
- MoEBlock: Replaces the fully connected layer in each block with a Mixture-of-Experts, but does not outperform block-level gating.
- MoEShared: Adds a shared expert, offering more parameter sharing.
- MoEScaled: Assigns experts of varying parameter sizes to match series complexity.
In summary, block/stack-level gating of N-BEATS forecasts is favored over other MoE placements. Classical seasonal naive forecasts serve as a shallow baseline, while the original N-BEATS with unweighted summation acts as the foundational comparator.
| Model | Advantage | Disadvantage |
|---|---|---|
| N-BEATS-MOE | Best on heterogeneous datasets; interpretable gating | Mixed results on homogeneous datasets |
| MoEBlock | Finer-grained, within-block expert routing | May over-complicate block computation; no clear gain |
| MoEShared | More parameter sharing via a shared expert | Reduced specialization; limited advantage |
| MoEScaled | Parameter allocation matched to series complexity | Mixed effect |
A plausible implication is that the Mixture-of-Experts should be deployed where series diversity is high and component specialization is valuable.
6. Implications and Application Domains
N-BEATS-MOE is directly applicable to forecasting problems characterized by nonstationary dynamics, regime shifts, and heterogeneous time series domains. The interpretability and modular adaptivity of the gating network make the approach particularly relevant for financial, retail, climate, and energy forecasting tasks in which the dataset comprises mixed series types. The model's success on benchmark competition datasets and generic domains substantiates its generalizability and transparency.
For tasks with homogeneous series, standard N-BEATS or simpler block architectures remain more effective, and the additional computational overhead for Mixture-of-Experts may not be warranted.
7. Limitations and Further Directions
The main limitation of N-BEATS-MOE is the diminished gain on single-domain (homogeneous) datasets and the risk of over-parametrization. Careful design of expert architectures and judicious regularization of the gating mechanism are required to avoid expert collapse (degenerate weighting) and to maintain computational tractability. Future research directions may explore stacking MoE layers at different levels of granularity, hybrid MoE approaches for improved cross-series generalization, or integration with domain-adaptation techniques such as optimal transport or stack-aligned feature maps (Lee et al., 2023).
In summary, N-BEATS-MOE is a principled extension of N-BEATS for boosting forecast accuracy and interpretability in the context of heterogeneous time series, employing a dynamic gating function to weight expert contributions adaptively and providing insight into time series component specialization. Its empirical superiority on diverse benchmarks and modular design mark it as an important development in the field of deep time series forecasting (Matos et al., 10 Aug 2025).