
Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers

Published 29 Jan 2026 in cs.LG and cs.AI | (2601.21641v1)

Abstract: Transformer-based models have recently made significant advances in accurate time-series forecasting, but even these architectures struggle to scale efficiently while capturing long-term temporal dynamics. Mixture-of-Experts (MoE) layers are a proven solution to scaling problems in natural language processing. However, existing MoE approaches for time-series forecasting rely on token-wise routing mechanisms, which may fail to exploit the natural locality and continuity of temporal data. In this work, we introduce Seg-MoE, a sparse MoE design that routes and processes contiguous time-step segments rather than making independent expert decisions. Token segments allow each expert to model intra-segment interactions directly, naturally aligning with inherent temporal patterns. We integrate Seg-MoE layers into a time-series Transformer and evaluate it on multiple multivariate long-term forecasting benchmarks. Seg-MoE consistently achieves state-of-the-art forecasting accuracy across almost all prediction horizons, outperforming both dense Transformers and prior token-wise MoE models. Comprehensive ablation studies confirm that segment-level routing is the key factor driving these gains. Our results show that aligning the MoE routing granularity with the inherent structure of time series provides a powerful, yet previously underexplored, inductive bias, opening new avenues for conditionally sparse architectures in sequential data modeling.

Summary

  • The paper proposes a segment-wise routing mechanism that groups contiguous time steps, enhancing local temporal modeling in forecasting.
  • It leverages multi-resolution segment schedules and a shared expert to balance prediction accuracy with computational efficiency.
  • Empirical results on diverse benchmarks demonstrate superior performance over traditional token-wise MoE approaches.

Segment-wise Mixture-of-Experts for Transformers in Time Series Forecasting

Introduction

"Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers" (2601.21641) addresses the longstanding challenge of efficiently scaling Transformer architectures for long-range, multivariate time series forecasting. While sparse Mixture-of-Experts (MoE) layers have proven effective for scaling in NLP and vision applications, their extension to time series has been limited by token-wise routing granularity, failing to leverage the temporal continuity intrinsic to sequential data. This work proposes Seg-MoE, a conditional sparsity mechanism that routes contiguous segments—rather than individual tokens—through expert subnetworks, tightly aligning the routing paradigm with the locality structure of real-world time series.

Transformer-based models have achieved substantial advances in sequence modeling, owing primarily to their parallelizability and capacity for modeling long-range dependencies. However, for long or high-resolution time series, the quadratic computational and memory cost of attention impedes practical deployment. Various modifications—such as sparse attention, patching, and channel independence—have been introduced to mitigate these bottlenecks. Sparse MoE layers, which distribute computation across several expert FFNs activated by a learned router, have enabled parameter scaling in LLMs without proportional increases in inference cost [shazeer2017outrageously, fedus2022switch].

Despite this, standard MoE layers rely on independent token-wise gating, so experts seldom specialize in local temporal features or recurring subsequences. In time series, where patterns (e.g., oscillations, cycles, regime changes) manifest over contiguous intervals, this expert fragmentation can diminish model fidelity and hinder accurate extrapolation. Previous efforts such as Time-MoE [shi2024time] and Moirai [liu2024moirai] have introduced MoE for time series, but continue to apply token-level gating, leaving structured local dependencies underexploited.

Seg-MoE Architecture

Seg-MoE introduces a segment-wise routing and processing paradigm. Instead of routing each token (time step) independently, the input sequence is partitioned into contiguous, non-overlapping segments. Each segment is embedded, optionally flattened, and routed jointly to a subset of K out of N available expert networks according to a trainable gating mechanism. Each expert operates over the entire segment, capturing intra-segment dependencies and enabling the model to internalize local motifs, dynamic structure, and abrupt transitions.
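The segment-partition-and-route step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation; the names `W_gate`, `omega`, and `k` are placeholders rather than the paper's notation.

```python
import numpy as np

def segment_topk_route(x, W_gate, omega, k):
    """Route contiguous segments of omega time steps to their top-k experts.
    Illustrative sketch only; W_gate, omega, k are assumed names.

    x:      (L, d) token embeddings, with L divisible by omega.
    W_gate: (omega * d, N) gating weights for N experts.
    Returns expert indices (S, k) and renormalized gate weights (S, k),
    where S = L // omega is the number of segments.
    """
    L, d = x.shape
    assert L % omega == 0, "sequence length must be divisible by omega"
    segments = x.reshape(L // omega, omega * d)     # flatten each segment
    logits = segments @ W_gate                      # one score vector per segment
    top = np.argsort(logits, axis=-1)[:, -k:]       # top-k expert indices
    picked = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(picked - picked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over chosen experts
    return top, weights
```

Because the gate sees the flattened segment, one routing decision covers all omega time steps jointly, which is the key difference from token-wise MoE.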

Critical architectural features include:

  • Segment Construction: Letting segment size (resolution ω) be a layer-specific or global hyperparameter, with support for multi-resolution segment schedules across the Transformer depth, providing temporal hierarchy.
  • Global Shared Expert: Each segment always traverses a shared expert (with a trainable gating weight), ensuring stability and alleviating degenerate routing.
  • Auxiliary Routing-Balance Loss: Supplementing the primary forecast loss with a regularization term penalizing imbalanced expert usage, mitigating collapse where a minority of experts are overutilized.
  • Multi-Resolution Routing: Empirically, a schedule of coarse and fine segmentations across layers improves robustness to heterogeneous dynamics found across domains and datasets (Figure 1).

    Figure 1: Mixture-of-Experts (MoE) design. (a) Standard token-wise routing: each token independently activates experts. (b) Seg-MoE: entire time-step segments are routed as a unit, with both routed and shared experts.
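The auxiliary routing-balance loss from the feature list can be written in the style of the Switch Transformer load-balancing term [fedus2022switch], applied per segment rather than per token; the paper's exact formulation may differ, and the always-on shared expert would sit outside this term.

```python
import numpy as np

def routing_balance_loss(gate_probs, top_idx, num_experts):
    """Routing-balance penalty in the style of the Switch Transformer
    load-balancing loss; illustrative, not the paper's exact formula.

    gate_probs: (S, N) softmax router probabilities for S segments.
    top_idx:    (S, k) chosen expert indices per segment.
    Returns a scalar that is minimized (value 1.0) under uniform expert use.
    """
    # f[e]: fraction of routing assignments that went to expert e
    f = np.bincount(top_idx.ravel(), minlength=num_experts) / top_idx.size
    # p[e]: mean router probability mass placed on expert e
    p = gate_probs.mean(axis=0)
    return num_experts * float(f @ p)
```

When a few experts absorb most segments, both f and p concentrate on them and the product grows, penalizing the router toward balanced utilization.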

Empirical Evaluation

The authors evaluate Seg-MoE extensively across seven standardized multivariate time series forecasting datasets, including challenging benchmarks such as ETTh1/2, ETTm1/2, Weather, ECL, and Traffic. Metrics include mean squared error (MSE) and mean absolute error (MAE) over multiple forecast horizons (H ∈ {96, 192, 336, 720}).
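Both metrics follow the standard long-term forecasting convention of averaging over all horizon steps and variables; a straightforward reference implementation:

```python
import numpy as np

def mse_mae(y_true, y_pred):
    """MSE and MAE averaged over all horizon steps and variables,
    the usual convention in long-term forecasting benchmarks."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(err ** 2)), float(np.mean(np.abs(err)))
```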

Key observations:

  • Superior Predictive Performance: Seg-MoE achieves new state-of-the-art results, outperforming both dense Transformer models and token-wise MoE baselines by substantial margins. Gains are pronounced at long forecast horizons, confirming the value of segment-level expertise in temporal extrapolation.
  • Robust Expert Specialization: Segment-wise routing leads to meaningful expert specialization, fostering increased diversity in learned transformations and higher-fidelity local modeling.
  • Multi-Resolution Benefits: Ablations indicate that combining a range of segment resolutions across layers best accommodates datasets with varying degrees of periodicity and local/global structure (Figure 2).

    Figure 2: Memory footprint during training across segment resolutions. Higher segment resolution (ω > 1) only modestly increases memory over token-wise MoE (ω = 1), maintaining scalability.

Ablation and Efficiency

Comprehensive ablation studies demonstrate that:

  • Segment-wise routing consistently outperforms token-wise routing (i.e., setting ω = 1) for a range of N, K, and segment resolutions. Notably, no single segment size is optimal for all benchmarks—underscoring the need for multi-resolution architectures.
  • Training memory costs remain comparable between Seg-MoE and standard MoE for moderate segment sizes.
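To make the comparison with ω = 1 concrete: segment-wise routing cuts the number of per-layer routing decisions from L to L/ω, which is one source of the modest memory overhead reported above. A back-of-envelope helper, with a hypothetical coarse-to-fine schedule for illustration:

```python
def routing_decisions_per_layer(seq_len, schedule):
    """Routing decisions made at each layer under a multi-resolution
    segment-size schedule (sizes here are hypothetical examples).
    Token-wise MoE is the special case of segment size 1."""
    assert all(seq_len % w == 0 for w in schedule), "sizes must divide seq_len"
    return [seq_len // w for w in schedule]
```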

Practical and Theoretical Implications

Seg-MoE highlights the importance of aligning architectural sparsity mechanisms with the native structure of temporal data. The segment-wise inductive bias enables efficient capacity scaling while enhancing robustness and fidelity for both short- and long-range dynamics. From a practical perspective, the sparsity and expert modularization of Seg-MoE unlock the deployment of larger, deeper models in resource-constrained forecasting applications (e.g., energy, traffic, climate), with higher accuracy per compute.

Theoretically, this architectural principle—i.e., matching the granularity of sparsity/routing with salient compositional structure—may be generalized to other sequential domains, pointing toward domain-tailored conditional computation in sequence modeling.

Future Directions

The Seg-MoE paradigm suggests several avenues for further investigation:

  • Adaptive Segment Resolution: Learning segment sizes or adapting them dynamically during training/inference could further enhance representational flexibility.
  • Expert Diversity: Integrating heterogeneous expert architectures (e.g., convolutional, recurrent) may allow the model to capture a broader set of temporal phenomena.
  • Zero-Shot/Pre-trained Foundation Models: Scaling Seg-MoE with large-scale pre-training may enable universal forecasting models with strong domain transfer properties.
  • Application to Anomalous, Irregular, or Nonstationary Series: Extending to highly irregular or event-driven sequential data, such as medical signals, could validate the generality of the segment-wise sparsity principle.

Conclusion

Seg-MoE represents a significant advancement in the modeling of time series with Transformers by introducing segment-level conditional sparsity aligned with temporal locality. Through extensive empirical study, segment-wise routing demonstrates clear advantages over token-level MoE architectures for long-term, multivariate forecasting across domains. The inductive bias inherent in Seg-MoE opens a new direction for the design of efficient and specialized sequence models, with potential impact extending beyond time series forecasting to general sequential data processing.
