Mamba-based TimeMachine Model

Updated 23 January 2026
  • The paper introduces a scalable deep learning framework that leverages selective state-space modeling and parallel Mamba blocks for effective long-term time series forecasting.
  • It employs a quadruple-Mamba architecture with multi-scale embedding to capture both global inter-channel and local intra-channel dependencies efficiently.
  • Empirical results show that the model achieves linear computational complexity and superior forecasting accuracy across multiple benchmark datasets.

The Mamba-based TimeMachine model is a scalable deep learning framework for long-term time series forecasting that leverages recent advances in selective state-space modeling. By integrating multiple parallel Mamba blocks and multi-scale input representations, TimeMachine achieves linear complexity with respect to both input sequence length and channel dimension, while maintaining or exceeding the predictive accuracy of Transformer-based approaches. Its architecture is specifically engineered to unify channel-mixing and channel-independence, facilitating robust extraction of both global and local sequence patterns at multiple temporal resolutions. The model is empirically validated to deliver superior accuracy, scalability, and memory efficiency on multivariate forecasting benchmarks (Ahamed et al., 2024).

1. Theoretical Foundations: Selective State Space Models and Mamba Block

TimeMachine's backbone relies on the selective-scan continuous-time state-space model (SSM) introduced in Mamba (Gu & Dao, 2023). For a sequence input, the continuous SSM dynamics are $\frac{d\,h(t)}{dt} = A\,h(t) + B\,u(t)$, $v(t) = C\,h(t)$, with $u(t)\in\mathbb{R}^D$ the input token, $h(t)\in\mathbb{R}^N$ the hidden state, and $v(t)$ the output. Discretization at time step $k\Delta$ gives $h_k = \bar A\,h_{k-1} + \bar B\,u_k$, $v_k = C\,h_k$, where $\bar A = \exp(\Delta A)$ and $\bar B = (\Delta A)^{-1}[\exp(\Delta A) - I]\,\Delta B$. In Mamba, the coefficients $B$, $C$, and $\Delta$ are input-dependent (computed via projection layers and activation functions), yielding highly adaptive content selection at each step.
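
To make the recurrence concrete, the following is a minimal NumPy sketch of the discretized selective scan for a single channel, assuming a diagonal state matrix $A$ (as Mamba uses); the function name, shapes, and the random example are illustrative, not the reference implementation.

```python
import numpy as np

def selective_scan(u, A_diag, B, C, delta):
    """
    u:      (L,)   input sequence for a single channel
    A_diag: (N,)   diagonal of the continuous state matrix A (negative for stability)
    B, C:   (L, N) input-dependent coefficients, one row per time step
    delta:  (L,)   input-dependent step sizes
    Returns v of shape (L,) with v_k = C_k h_k.
    """
    L, N = B.shape
    h = np.zeros(N)
    v = np.empty(L)
    for k in range(L):
        dA = delta[k] * A_diag                 # Delta_k * A (diagonal)
        A_bar = np.exp(dA)                     # exp(Delta A)
        B_bar = (A_bar - 1.0) / A_diag * B[k]  # (Delta A)^{-1}[exp(Delta A) - I] Delta B
        h = A_bar * h + B_bar * u[k]           # h_k = A_bar h_{k-1} + B_bar u_k
        v[k] = C[k] @ h                        # v_k = C h_k
    return v

# Tiny example with random coefficients: length L=8, state size N=4.
rng = np.random.default_rng(0)
L, N = 8, 4
out = selective_scan(rng.standard_normal(L),
                     -np.abs(rng.standard_normal(N)),   # stable (negative) decay rates
                     rng.standard_normal((L, N)),
                     rng.standard_normal((L, N)),
                     0.1 * np.abs(rng.standard_normal(L)))
```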

Mamba's input-dependent parameterization enables linear-time recurrence while selectively integrating relevant contextual tokens, thus capturing both short- and long-range dependencies in sequential data. Each block comprises (i) a branch with a 1D convolution and SiLU activation followed by the SSM scan, and (ii) a parallel linear projection branch; their outputs are multiplied elementwise before the output projection and then summed with the input through a residual path.
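
A schematic PyTorch rendering of this two-branch block is sketched below; the `nn.Identity` module stands in for the selective scan above, and the layer names, default sizes, and convolution/gating details are assumptions for illustration rather than the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Two-branch block: conv + SiLU + SSM branch, gated by a parallel
    linear branch, with an output projection and a residual connection."""

    def __init__(self, d_model, d_conv=2, expand=1):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)   # produces both branches
        self.conv1d = nn.Conv1d(d_inner, d_inner, kernel_size=d_conv,
                                groups=d_inner, padding=d_conv - 1)
        self.ssm = nn.Identity()          # placeholder for the selective scan of Section 1
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                 # x: (batch, tokens, d_model)
        residual = x
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        u = self.conv1d(u.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        u = F.silu(u)                                     # conv + SiLU branch
        v = self.ssm(u)                                   # SSM scan branch
        y = v * F.silu(gate)                              # elementwise gating
        return self.out_proj(y) + residual                # residual path

block = MambaBlockSketch(d_model=64)
out = block(torch.randn(32, 7, 64))       # e.g. 7 channel-tokens of width 64
```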

2. Multi-Scale, Multi-Resolution Embedding and Quadruple-Mamba Design

TimeMachine's distinctive feature is its multi-scale, integrated quadruple-Mamba architecture. The input sequence $x\in\mathbb{R}^{M\times L}$ ($M$ = channel count, $L$ = look-back window) is successively embedded via two MLPs, $E_1: \mathbb{R}^{M\times L} \to \mathbb{R}^{M\times n_1}$ and $E_2: \mathbb{R}^{M\times n_1} \to \mathbb{R}^{M\times n_2}$, forming “high-resolution” and “low-resolution” representations.
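
A minimal sketch of this two-stage embedding is given below, assuming $E_1$ and $E_2$ are linear layers applied along the time axis of each channel; the sizes follow the hyperparameters reported later ($n_1 = 256$, $n_2 = 64$) and the example input is illustrative.

```python
import torch
import torch.nn as nn

class TwoStageEmbedding(nn.Module):
    def __init__(self, look_back=96, n1=256, n2=64):
        super().__init__()
        self.E1 = nn.Linear(look_back, n1)   # high-resolution embedding E1
        self.E2 = nn.Linear(n1, n2)          # low-resolution embedding E2

    def forward(self, x):                    # x: (B, M, L)
        x1 = self.E1(x)                      # (B, M, n1)
        x2 = self.E2(x1)                     # (B, M, n2)
        return x1, x2

emb = TwoStageEmbedding()
x1, x2 = emb(torch.randn(32, 7, 96))         # e.g. M = 7 channels, look-back L = 96
```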

At each embedding level, two parallel Mamba blocks operate:

  • Global context: Treats each channel as a token, modeling inter-channel dependencies (shape $B\times M\times n_i$).
  • Local context: Treats each channel independently, modeling local intra-channel content (shape $(BM)\times 1\times n_i$).

Thus, four Mamba blocks are deployed—two at each resolution, capturing global and local structures. Channel-mixing and channel-independence are handled by reshaping and fusing outputs according to the required intra- or inter-channel modeling scenario.

Multi-scale fusion is performed using elementwise sums and skip connections: $x^{(3)} = x^{(2)} \oplus v_L \oplus v_R$, $x^{(4)} = P_1(x^{(3)})$, $x^{(5)} = v_L^* \oplus v_R^*$, where $v_L, v_R$ and $v_L^*, v_R^*$ are the outputs of the inner (low-resolution) and outer (high-resolution) Mamba blocks, respectively. The final output is concatenated with a skip connection and projected to the forecast horizon: $x^{(6)} = [\,x^{(5)} \parallel (x^{(4)} \oplus x^{(1)})\,]$, $y = P_2(x^{(6)})$.
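
The fusion path can be traced shape-by-shape in the sketch below, which assumes $P_1$ maps width $n_2$ to $n_1$ and $P_2$ maps the concatenated width $2n_1$ to the horizon $T$; `nn.Identity` stands in for the four Mamba blocks, so this illustrates only the data flow, not the learned computation.

```python
import torch
import torch.nn as nn

B, M, L, T, n1, n2 = 32, 7, 96, 96, 256, 64
E1, E2 = nn.Linear(L, n1), nn.Linear(n1, n2)             # two-stage embedding
P1, P2 = nn.Linear(n2, n1), nn.Linear(2 * n1, T)         # projections (assumed widths)
# Stand-ins for the four Mamba blocks (two per resolution, global/local context):
mamba_low_L = mamba_low_R = mamba_high_L = mamba_high_R = nn.Identity()

x = torch.randn(B, M, L)                                 # raw look-back window
x1 = E1(x)                                               # x^(1): high resolution, width n1
x2 = E2(x1)                                              # x^(2): low resolution, width n2
x3 = x2 + mamba_low_L(x2) + mamba_low_R(x2)              # x^(3) = x^(2) + v_L + v_R
x4 = P1(x3)                                              # x^(4) = P1(x^(3)), back to width n1
x5 = mamba_high_L(x1) + mamba_high_R(x1)                 # x^(5) = v_L* + v_R*
x6 = torch.cat([x5, x4 + x1], dim=-1)                    # x^(6) = [x^(5) || (x^(4) + x^(1))]
y = P2(x6)                                               # forecast: (B, M, T)
```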

3. Handling Channel-Mixing and Channel-Independence

A central design choice in TimeMachine is the unified treatment of both channel-mixing and channel-independence. Channel-mixing keeps the shape $B\times M\times n_i$ (all channels as tokens), which is optimal for leveraging global inter-channel dependencies. Channel-independence reshapes inputs to $(BM)\times 1\times n_i$, isolating each channel, which is suitable for datasets where channels are only weakly correlated.
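
The two layouts differ only by a reshape, as the small sketch below illustrates (shapes are illustrative):

```python
import torch

B, M, n_i = 32, 7, 64
x_mixing = torch.randn(B, M, n_i)            # channel-mixing: channels as tokens
x_indep = x_mixing.reshape(B * M, 1, n_i)    # channel-independence: one channel per batch row
x_back = x_indep.reshape(B, M, n_i)          # fuse back for the next stage
assert torch.equal(x_mixing, x_back)
```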

This dual-mode processing allows TimeMachine to adapt dynamically: mixing is advantageous for datasets with global structure (e.g., electricity demand across regions), while independence is beneficial for more heterogeneous multi-channel datasets. Context extraction at both high and low resolutions enables the model to distinguish between fine-grained local phenomena and broad temporal or cross-channel patterns.

4. Computational Complexity and Scalability

The quadruple-Mamba network is engineered for linear complexity in both the sequence length $L$ and the channel count $M$:

  • Each SSM scan over a sequence of length $n$ with state size $N$ costs $O(nN)$ per channel.
  • Downstream affine maps (embeddings and projections) scale as $O(M n_i^2)$ or $O(M n_i T)$, independent of the input length.
  • Only the first embedding depends on $L$, while all subsequent blocks are invariant to the history length, guaranteeing linear scaling (see the shape check below).
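
The shape check below illustrates the last point: only the first embedding layer's size depends on the look-back length $L$ (the batch size and channel count are illustrative).

```python
import torch
import torch.nn as nn

n1 = 256
for L in (96, 336, 720):
    E1 = nn.Linear(L, n1)              # the only layer whose size depends on L
    x = torch.randn(8, 7, L)           # batch of 8, M = 7 channels
    print(L, tuple(E1(x).shape))       # always (8, 7, 256) after the first embedding
```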

Empirical profiling confirms that TimeMachine’s memory footprint is competitive with or lower than DLinear, even for large-scale datasets (e.g., Traffic, $M = 862$). In contrast, Transformer-based models scale quadratically with sequence length.

5. Empirical Performance and Practical Implementation

TimeMachine has been benchmarked on Weather, Traffic, Electricity, ETTh1/2, and ETTm1/2 using standard metrics (MSE, MAE) and horizons ($T = 96, 192, 336, 720$ with look-back $L = 96$). Against 11 state-of-the-art baselines (Autoformer, Informer, PatchTST, iTransformer, TiDE, DLinear, TimesNet, etc.), TimeMachine consistently ranks first or second, often improving MSE by 5–15%.

Longer look-back windows (e.g., $L = 336, 720$) further boost robustness, with accuracy degrading gracefully as the forecast horizon increases. Ablations highlight the critical role of residual connections (reducing MSE by 1–3%), RevIN normalization, and high dropout ($p\approx 0.7$ post-embedding) in generalization, especially under limited data.
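
For reference, a compact sketch of RevIN-style reversible instance normalization is shown below: statistics are computed per instance and per channel over the look-back window, and a matching de-normalization is applied to the forecast. The exact variant used in the paper may differ, so treat this as an assumption-laden illustration.

```python
import torch
import torch.nn as nn

class RevINSketch(nn.Module):
    """Per-instance, per-channel normalization with learnable affine terms."""

    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_channels, 1))
        self.bias = nn.Parameter(torch.zeros(num_channels, 1))

    def normalize(self, x):                        # x: (B, M, L) look-back window
        self.mean = x.mean(dim=-1, keepdim=True)   # statistics per instance and channel
        self.std = x.std(dim=-1, keepdim=True) + self.eps
        return (x - self.mean) / self.std * self.weight + self.bias

    def denormalize(self, y):                      # y: (B, M, T) model output
        return (y - self.bias) / self.weight * self.std + self.mean

revin = RevINSketch(num_channels=7)
x_norm = revin.normalize(torch.randn(32, 7, 96))   # feed x_norm to the forecaster,
                                                   # then denormalize its output
```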

6. Training, Inference, and Implementation Considerations

The loss function is the mean squared error $\mathcal{L} = \frac{1}{BMT} \sum_{b,i,t} (y_{b,i,t} - x_{b,i,L+t})^2$. The optimizer is Adam, typically run for up to 100 epochs. Hyperparameters include the token sizes ($n_1 = 256$, $n_2 = 64$), state size ($N = 256$), convolution width (2), and expansion factor ($E = 1$). Residual and skip connections are pervasive for architectural stability.
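
A minimal training-loop sketch under this setup is shown below; the learning rate and the data-loader interface are assumptions, and `model` can be any forecaster mapping a look-back window of shape (B, M, L) to a forecast of shape (B, M, T).

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=100, lr=1e-3):
    """Train a forecaster with MSE loss and Adam, as in the stated setup."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()                       # averages (y - ground truth)^2 over B, M, T
    for _ in range(epochs):
        for past, future in loader:          # past: (B, M, L), future: (B, M, T)
            opt.zero_grad()
            loss = mse(model(past), future)
            loss.backward()
            opt.step()
    return model
```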

At inference, the pipeline produces forecasts for all future steps in a single forward sweep, leveraging the linear-time property and avoiding the error accumulation characteristic of autoregressive decoding.

7. Limitations and Future Research Directions

The current implementation of TimeMachine focuses exclusively on supervised forecasting, with self-supervised or foundation model pretraining left as an open direction. Hyperparameter tuning across domains remains nontrivial, particularly for token size and dropout. A theoretical analysis of selective SSMs’ generalization in multivariate contexts is also an outstanding problem.

A plausible implication is that the general strategy of multi-scale, multi-context Mamba architectures can extend to other sequential modeling domains requiring both linear scalability and selective context extraction.


References:

  • "TimeMachine: A Time Series is Worth 4 Mambas for Long-term Forecasting" (Ahamed et al., 2024)
