MambaTS: Scalable Time Series Modeling
- MambaTS is a lineage of hardware-oriented selective structured state space models that offer linear scaling and robust long-term multivariate forecasting.
- It incorporates innovations like variable scan techniques, bidirectional encoding, and selective dropout to enhance cross-channel dependency modeling and reduce overfitting.
- Empirical evaluations in finance, weather, and embedded applications confirm its efficiency and state-of-the-art performance with reduced computational cost.
MambaTS refers to a lineage of time series models based on Mamba, a hardware-oriented, linear-time selective structured state space model (SSM) architecture. MambaTS and its descendants achieve state-of-the-art long-term sequence forecasting accuracy with provably linear scaling in time and memory. Subsequent model families (including TSMamba, Bi-Mamba+, DTMamba, MAT, HIGSTM, and industrial-scale hybrids such as Hunyuan-TurboS) have incorporated enhancements for channel-mixing, bidirectional encoding, variable-order scanning, uncertainty quantification, and hierarchical or hybrid attention. MambaTS research establishes selective SSMs as robust, scalable alternatives to Transformer-based frameworks for multivariate time series, with broad application in scientific forecasting, finance, and embedded/edge settings.
1. Foundations: Mamba State Space Models for Time Series
At its core, Mamba is a discretized linear time-invariant (LTI) state space model parametrized by input-adaptive gates, enabling parallel, hardware-friendly scanning. The general continuous- and discrete-time forms are

$$
\frac{d h(t)}{dt} = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),
$$

$$
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t, \qquad
\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}\!\left(\exp(\Delta A) - I\right)\Delta B,
$$

where the discretization step $\Delta$ (together with $B$ and $C$) may be made input-token-dependent via learned selection MLPs. Compared to the quadratic cost of self-attention, the selective SSM recurrence can be implemented as an associative scan over the per-token transitions $(\bar{A}_t, \bar{B}_t x_t)$, with overall time and memory complexity $O(L \cdot D \cdot N)$, where $L$ is the sequence length, $D$ the feature dimension, and $N$ the (small) state dimension (Cai et al., 2024, Zou et al., 2024).
Mamba's recurrence and parameter selection are implemented in a GPU-parallel prefix-sum scan, and its inductive bias is a structured exponential kernel distinct from dot-product self-attention.
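For concreteness, the following is a minimal sequential reference of the discretized selective recurrence (a sketch only, not the optimized kernel): it assumes the per-token $\Delta_t$, $B_t$, $C_t$ have already been produced by the selection MLPs and that the state matrix $A$ is diagonal.

```python
import torch

def selective_ssm_scan(x, delta, A, B, C):
    """Reference (sequential) selective SSM scan.

    x:     (batch, L, D)   input features
    delta: (batch, L, D)   per-token discretization steps
    A:     (D, N)          diagonal state matrix, one row per channel
    B, C:  (batch, L, N)   input-dependent projections
    Returns y: (batch, L, D). The production kernel replaces this Python loop
    with a hardware-efficient parallel prefix-sum scan.
    """
    batch, L, D = x.shape
    N = A.shape[-1]
    h = torch.zeros(batch, D, N, device=x.device)
    ys = []
    for t in range(L):
        dt = delta[:, t].unsqueeze(-1)                  # (batch, D, 1)
        A_bar = torch.exp(dt * A)                       # ZOH discretization
        B_bar = dt * B[:, t].unsqueeze(1)               # simplified Euler term
        h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)   # state update
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))   # readout y_t = C h_t
    return torch.stack(ys, dim=1)
```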
2. Architectural Innovations in MambaTS
MambaTS introduces targeted improvements for multivariate long-term sequence forecasting (LTSF), designed to mitigate baseline Mamba’s limitations in extensive variable modeling and scan-order sensitivity:
- Variable Scan along Time (VST): Input channels are patch-embedded and then interleaved in the scan, presenting the model with variable-mixed context at each step. For $V$ channels and a look-back window split into $P$ patches per channel, the scan runs over $V \cdot P$ tokens ordered time-major and variable-minor, so that all variables' tokens for a given time patch are adjacent before the scan advances in time. This enables cross-channel dependency modeling within hardware-efficient linear scans (Cai et al., 2024); a minimal sketch of this ordering (together with VPT) appears after this list.
- Temporal Mamba Block (TMB): The causal convolution of the standard Mamba block is removed; instead, a feedforward MLP and gating mechanism precede the SSM scan.
This design reduces overfitting and enhances long-range information capture.
- Selective Parameter Dropout: Dropout is applied to the inputs that generate the selective parameters ($\Delta$, $B$, $C$) to mitigate overfitting; ablations indicate a small optimal dropout rate.
- Variable Permutation Training (VPT): On each training batch, the variable order is randomly permuted, and the inverse permutation is applied post-prediction to restore the original order, enforcing permutation invariance and robust variable-interaction learning.
- Variable-Aware Scan along Time (VAST): During training, a running adjacency matrix encodes scan-order edge “costs,” updated according to batch MSE, and an asymmetric traveling salesman (ATSP) solver (simulated annealing) is used at inference to select the lowest-cost variable ordering.
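The following PyTorch sketch illustrates the VST token ordering and the VPT permutation logic described above; the function names (`vst_token_order`, `vpt_permute`) and the backbone call are illustrative assumptions, not the authors' implementation.

```python
import torch

def vst_token_order(x_patches: torch.Tensor) -> torch.Tensor:
    """Arrange patch tokens for Variable Scan along Time (VST).

    x_patches: (batch, V, P, d_model) per-variable patch embeddings.
    Returns:   (batch, P * V, d_model) tokens ordered time-major, variable-minor,
               so all variables of a given time patch are adjacent in the scan.
    """
    b, V, P, d = x_patches.shape
    return x_patches.permute(0, 2, 1, 3).reshape(b, P * V, d)

def vpt_permute(x_patches: torch.Tensor):
    """Variable Permutation Training (VPT): randomly shuffle the variable axis
    and return the inverse permutation used to restore the order post-prediction."""
    V = x_patches.size(1)
    perm = torch.randperm(V, device=x_patches.device)
    return x_patches[:, perm], torch.argsort(perm)

# Usage sketch: permute variables, build the VST scan sequence, run a Mamba
# backbone (not shown), then undo the permutation on the per-variable outputs.
x = torch.randn(8, 7, 24, 128)        # batch=8, V=7 variables, P=24 patches
x_perm, inv = vpt_permute(x)
tokens = vst_token_order(x_perm)      # shape (8, 24 * 7, 128)
# preds = mamba_backbone(tokens)      # hypothetical backbone call
# preds = preds[:, inv]               # restore the original variable order
```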
Ablation studies confirm cumulative improvements per component, with final SOTA accuracy and throughput on benchmarks such as ETTm2, Weather, and Traffic (Cai et al., 2024).
3. Extensions: Channel Relations, Bidirectionality, and Hybrid Blocks
Several model families extend MambaTS’s expressiveness and efficiency:
- TSMamba (Ma et al., 2024): Employs dual (forward+backward) Mamba encoders, channel-wise normalization, and patch-wise representation. During pretraining and zero-shot deployment, channel independence is assumed. For fine-tuning on multivariate targets, a compressed channel-wise attention (CCA) module recovers cross-channel relations with negligible overhead relative to the backbone.
- Bi-Mamba+ (Liang et al., 2024): Adds a learnable forget gate to each Mamba block, blending the SSM output with the current features, and fuses a forward and a backward pass (see the sketch after this list). A series-relation-aware (SRA) decider computes empirical channel correlations to select between channel-independent and channel-mixing tokenizations per dataset.
- Dual Twin Mamba (DTMamba) (Wu et al., 2024): Arranges two “TwinMamba” blocks in series, each comprising parallel Mamba branches for low- and high-level features, interleaved residual connections, and channel independence enforced via reshaping. The architecture is robust under hyperparameter sweeps and achieves SOTA results on both low- and high-channel tasks.
- MAT (Mamba–Attention Transformer) (Zhang et al., 2024): Fuses the global temporal memory of Mamba with local self-attention, performing parallel SSM scan and (multi-head) Transformer attention and then concatenating/fusing the results, targeting scenarios where short-range semantic structure is critical.
- Hierarchical Information-Guided Spatio-Temporal Mamba (HIGSTM) (Yan et al., 14 Mar 2025): For large-scale financial data, HIGSTM applies index-guided frequency decomposition into commonality/specificity components, and cascades node-independent SSM, sparse temporal-neighbor aggregation, and global fully-connected SSM stages, with macro signals (from the market index) gated into each SSM's parameter generation. Ablation confirms that all components are needed for strong information-coefficient (IC) and Sharpe-ratio scores.
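As a concrete illustration of the Bi-Mamba+ forget gate and bidirectional fusion described above, the following PyTorch sketch assumes a generic `ssm` module standing in for a Mamba scan; it is a schematic under those assumptions, not the published code.

```python
import torch
import torch.nn as nn

class GatedBiSSMBlock(nn.Module):
    """Sketch of a Bi-Mamba+-style block: a learnable forget gate blends the
    SSM output with the current features, and forward/backward passes over the
    sequence are fused by summation."""

    def __init__(self, d_model: int, ssm: nn.Module):
        super().__init__()
        self.ssm = ssm                          # any (B, L, D) -> (B, L, D) scan
        self.forget_gate = nn.Linear(d_model, d_model)

    def _directional(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ssm(x)                         # selective SSM scan
        g = torch.sigmoid(self.forget_gate(x))  # learnable forget gate
        return g * h + (1.0 - g) * x            # blend scan output and features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fwd = self._directional(x)
        bwd = self._directional(x.flip(1)).flip(1)  # backward pass, re-aligned
        return fwd + bwd                        # fuse the two directions
```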
4. Probabilistic Forecasting and Uncertainty Quantification
Mamba-ProbTSF extends deterministic MambaTS by appending a feedforward network to predict aleatoric (Gaussian) variance for each forecasted point (Pessoa et al., 13 Mar 2025). Two networks operate in tandem: one for the point forecast ($\mu$) and one for the log-variance ($\log\sigma^2$), yielding a factorized Gaussian predictive distribution. The negative log-likelihood is minimized, and output coverage is empirically calibrated, achieving approximately 95% empirical coverage at the two-sigma level on electricity and traffic benchmarks. The method's stated limitation is that it does not model uncertainty accumulation for processes such as pure Brownian motion, since Mamba's state ODE does not propagate variance in a non-stationary ($\sqrt{t}$) manner. A minimal sketch of such a Gaussian-NLL head is given below.
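The sketch below shows a two-headed Gaussian output and its NLL objective; it is illustrative only, not Mamba-ProbTSF's released code, and the single-layer heads are an assumption.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Sketch of a probabilistic head in the Mamba-ProbTSF spirit: two small
    networks map the backbone representation to a point forecast mu and a
    log-variance, giving a factorized Gaussian predictive distribution."""

    def __init__(self, d_model: int, horizon: int):
        super().__init__()
        self.mean_net = nn.Linear(d_model, horizon)    # point forecast mu
        self.logvar_net = nn.Linear(d_model, horizon)  # log sigma^2

    def forward(self, h: torch.Tensor):
        return self.mean_net(h), self.logvar_net(h)

def gaussian_nll(mu, logvar, target):
    """Negative log-likelihood of a factorized Gaussian (constant term dropped)."""
    return 0.5 * (logvar + (target - mu) ** 2 / logvar.exp()).mean()

# Two-sigma interval for ~95% nominal coverage: mu +/- 2 * exp(0.5 * logvar)
```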
5. Computational Complexity and Scalability
All MambaTS variants preserve the linear-time, low-memory foundation of the original Mamba selective SSM: per-layer cost is $O\!\left(V \cdot \tfrac{L}{P} \cdot D\right)$ for $V$ variables, sequence length $L$, patch size $P$, and hidden width $D$. Unlike Transformer attention, which is quadratic in the number of tokens, and convolution-based baselines such as MICN, MambaTS scales efficiently even when the number of variables or the look-back window increases dramatically (Cai et al., 2024, Ma et al., 2024). Empirical measurements show 30–50% lower training time than PatchTST at equivalent batch sizes, and up to 4× faster inference in large-scale stock backtest settings (Yan et al., 14 Mar 2025, Zou et al., 2024).
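As a back-of-the-envelope illustration with hypothetical sizes (not taken from the cited papers), the gap between a token-linear scan and quadratic attention over the same flattened token sequence is already three orders of magnitude at moderate scale:

```latex
% Hypothetical sizes for illustration only: V = 100, L = 720, P = 48
\[
V \cdot \frac{L}{P} = 100 \times \frac{720}{48} = 1500 \quad \text{(scan steps, linear)},
\qquad
\left( V \cdot \frac{L}{P} \right)^{2} = 1500^{2} = 2.25 \times 10^{6} \quad \text{(attention pairs, quadratic)}.
\]
```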
Industrial-scale applications such as Hunyuan-TurboS (Team et al., 21 May 2025) deploy hybrid Mamba–Transformer mixtures with mixture-of-experts (MoE) gating, 256K context support, adaptive chain-of-thought switching, and Grouped-Query Attention (GQA) for substantial reductions in KV cache size and active parameter footprint.
6. Empirical Evaluation and Comparative Performance
Across a diverse range of benchmarks and domains (Weather, Traffic, Electricity, Solar, Exchange, the ETT family, and large-scale stock indices), MambaTS-based architectures consistently deliver state-of-the-art or competitive results; representative reported figures:
| Dataset | PatchTST MSE | iTransformer MSE | MambaTS MSE | DTMamba MSE | Bi-Mamba+ MSE | HIGSTM (IC) |
|---|---|---|---|---|---|---|
| ETTm2 | 0.316 | 0.327 | 0.283 | — | — | — |
| Weather | — | — | 0.296 | 0.254 | — | — |
| CSI500 | — | — | — | — | — | 0.0791 |
On high-channel tasks, MambaTS closes gaps of up to 30% MSE over variable-independent models (Cai et al., 2024). DTMamba achieves the lowest or next-best average errors across all ETT, Weather, Traffic, and Exchange tasks (Wu et al., 2024). HIGSTM outperforms SOTA spatio-temporal deep models in IC and Sharpe ratio for major stock indices (Yan et al., 14 Mar 2025). MAT achieves both improved accuracy and a ∼2× reduction in required GPU memory compared to full Transformer models (Zhang et al., 2024).
Ablations in (Cai et al., 2024) confirm that each architectural intervention—VST, TMB, VPT, VAST—delivers cumulative gains. Performance degrades under naive channel order, lack of dropout, or standard (causal) convolution. Channel-mixing and attention on compressed axes further enhance multivariate tasks (Ma et al., 2024).
7. Theoretical Perspective and Best Practices
MambaTS exploits the mapping between state-space scans and sequence kernels (a parallel to attention's kernelization) while preserving history-dependence and resource efficiency. The selection of the discretization schedule ($\Delta$) and careful initialization (e.g., S4D-Lin) are key to stable learning (Zou et al., 2024). Best practices include (a short initialization-and-dropout sketch follows the list):
- Use VST/TMB for multivariate series;
- Apply VPT for robust variable order invariance;
- Integrate variable-wise analysis (as with VAST or SRA) for high-dimensional settings;
- Employ channel-mixing attention modules in fine-tuning for improved cross-channel inference;
- Utilize dropout on selective parameters to avoid overfitting on long horizons;
- For stochastic or non-stationary uncertainty, supplement vanilla Gaussian modeling with explicit stochastic process terms (Pessoa et al., 13 Mar 2025).
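To make the initialization and dropout recommendations concrete, the sketch below shows S4D-Lin-style diagonal initialization and dropout on the selective-parameter inputs; the module layout, names, and default rate are illustrative assumptions, not a specific published implementation.

```python
import math
import torch
import torch.nn as nn

def s4d_lin_init(n_state: int) -> torch.Tensor:
    """S4D-Lin-style diagonal initialization: A_n = -1/2 + i * pi * n.
    (Illustrative; real-valued Mamba variants often use S4D-Real instead.)"""
    real = torch.full((n_state,), -0.5)
    imag = math.pi * torch.arange(n_state, dtype=torch.float32)
    return torch.complex(real, imag)

class SelectiveParams(nn.Module):
    """Sketch: produce input-dependent selective parameters (Delta, B, C) with
    dropout applied to their inputs, as recommended to curb overfitting."""

    def __init__(self, d_model: int, n_state: int, p_drop: float = 0.1):
        super().__init__()
        self.drop = nn.Dropout(p_drop)            # selective parameter dropout
        self.to_delta = nn.Linear(d_model, d_model)
        self.to_B = nn.Linear(d_model, n_state)
        self.to_C = nn.Linear(d_model, n_state)

    def forward(self, x: torch.Tensor):
        xd = self.drop(x)                         # dropout only on the selection path
        delta = nn.functional.softplus(self.to_delta(xd))   # positive step sizes
        return delta, self.to_B(xd), self.to_C(xd)
```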
In sum, MambaTS and its architectural descendants represent a foundational shift in long-term time series modeling, robustly bridging the expressive power of deep SSMs with the scalability required for modern scientific and industrial applications (Cai et al., 2024, Ma et al., 2024, Zou et al., 2024, Yan et al., 14 Mar 2025, Pessoa et al., 13 Mar 2025).