Diffusion Mamba for Time Series (DiM-TS)
- DiM-TS is a generative modeling framework that combines diffusion models with adaptive state-space Mamba modules to capture long-range dependencies and inter-variable structure.
- It introduces bidirectional attention and channel-aware blocks to efficiently model temporal and cross-channel interactions with linear complexity.
- Empirical evaluations show that DiM-TS achieves state-of-the-art performance in time series imputation and generation on diverse real-world benchmarks.
Diffusion Mamba for Time Series (DiM-TS) encompasses a set of generative modeling frameworks for multivariate time series that combine the probabilistic denoising diffusion paradigm with efficient, expressive state-space sequence modules—primarily the input-adaptive, linear-scan Mamba architecture. This integration addresses longstanding challenges in time series imputation and generation: capturing long-range dependencies, modeling inter-variable (channel) structure, supporting bidirectional (past-future) reasoning, and achieving linear complexity in both sequence length and channel count. DiM-TS systems have attained state-of-the-art empirical performance in both probabilistic imputation and synthetic time series generation, as established on diverse real-world benchmarks (Gao et al., 17 Oct 2024, Yao et al., 23 Nov 2025, Solís-García et al., 8 Oct 2024).
1. Foundations: Diffusion Models and Mamba State-Space Modules
Diffusion models for time series follow the Denoising Diffusion Probabilistic Model (DDPM) framework, comprising a forward (noising) process that incrementally corrupts time series data $x_0$ over $T$ steps,

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big),$$

and a reverse (denoising) process in which a parameterized neural model $\epsilon_\theta$ learns to iteratively reconstruct cleaner versions from noise, optimizing a denoising score-matching loss

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\,t,\,\epsilon}\big[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\big]$$

(Gao et al., 17 Oct 2024, Solís-García et al., 8 Oct 2024).
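A minimal sketch of this training objective, assuming a generic `denoiser` network and a linear noise schedule (both are illustrative choices, not the papers' exact configurations):

```python
import torch

def ddpm_training_loss(denoiser, x0, T=1000):
    """One DDPM training step: sample a timestep, noise x0, predict the noise.

    x0: clean series of shape (batch, length, channels).
    denoiser: any network mapping (x_t, t) -> predicted noise of the same shape.
    """
    betas = torch.linspace(1e-4, 0.02, T, device=x0.device)   # linear noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)             # cumulative \bar{alpha}_t

    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)           # random diffusion step per sample
    a = alpha_bar[t].view(b, 1, 1)

    eps = torch.randn_like(x0)                                # forward-process noise
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps              # sample from q(x_t | x_0)

    eps_hat = denoiser(x_t, t)                                # reverse-process noise prediction
    return torch.mean((eps - eps_hat) ** 2)                   # denoising score-matching (MSE) loss
```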
Mamba is a family of input-dependent state-space models (SSMs) characterized by efficient, linear-scan selective-scan algorithms and position- or content-dependent transition/output matrices. Its core update is

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,$$

where $\bar{A}_t$, $\bar{B}_t$, and $C_t$ are adaptive functions of the input sequence. This input adaptivity and non-causal construction enable both efficient sequence processing ($O(L)$ in length $L$) and global, bidirectional, or cross-channel dependencies (Gao et al., 17 Oct 2024, Yao et al., 23 Nov 2025).
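The update can be written as a naive sequential scan. In the sketch below, the linear projections that produce the input-dependent parameters are illustrative stand-ins for Mamba's selective parameterization, which in practice is fused into a hardware-efficient parallel scan kernel:

```python
import torch
import torch.nn as nn

class SelectiveSSM(nn.Module):
    """Naive diagonal input-dependent SSM: h_t = A_t * h_{t-1} + B_t * u_t, y_t = C_t * h_t."""

    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_state)
        self.to_A = nn.Linear(d_model, d_state)    # content-dependent transition
        self.to_B = nn.Linear(d_model, d_state)    # content-dependent input gate
        self.to_C = nn.Linear(d_model, d_state)    # content-dependent readout
        self.out_proj = nn.Linear(d_state, d_model)

    def forward(self, x):                          # x: (batch, length, d_model)
        u = self.in_proj(x)
        A = torch.sigmoid(self.to_A(x))            # decay in (0, 1), varies per position
        B, C = self.to_B(x), self.to_C(x)
        h = torch.zeros_like(u[:, 0])
        ys = []
        for t in range(x.shape[1]):                # O(L) sequential scan over time
            h = A[:, t] * h + B[:, t] * u[:, t]
            ys.append(C[:, t] * h)
        return self.out_proj(torch.stack(ys, dim=1))
```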
2. Architectural Innovations: Bidirectional and Channel-aware Modeling
DiM-TS introduces two primary module types to address temporal and inter-channel dependencies:
- Bidirectional Attention Mamba (BAM): Stacks of BAM blocks process temporal representations in both forward (left-to-right) and backward (right-to-left) directions. Each block learns attention weights to modulate the influence of each time lag, combining bidirectional outputs and supporting dependencies at arbitrary locations in the sequence. This mechanism is essential for imputation, where information on both sides of a missing value is available (Gao et al., 17 Oct 2024, Solís-García et al., 8 Oct 2024).
- Channel Mamba Block (CMB): To efficiently encode cross-variable interactions, the CMB transposes the input, treats channels as the scan axis, applies a unidirectional Mamba SSM, and uses a lightweight attention mechanism to capture inter-series correlations. By operating in $O(K \cdot L)$ for $K$ channels and length $L$, DiM-TS achieves effective, scalable modeling of multivariate structure (Gao et al., 17 Oct 2024), as sketched after this list.
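A minimal, self-contained sketch of the channel-axis scan described in the CMB bullet, using a simplified diagonal input-dependent SSM in place of a full Mamba layer; module and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class ChannelMambaBlock(nn.Module):
    """Channels become the scan axis; a unidirectional input-dependent SSM runs across
    them, and attention-style scores re-weight the result (illustrative sketch)."""

    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_state)
        self.to_A = nn.Linear(d_model, d_state)
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.out_proj = nn.Linear(d_state, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, x):                          # x: (batch, length, channels, d_model)
        b, L, K, d = x.shape
        z = x.reshape(b * L, K, d)                 # scan over the K channels, not over time
        u = self.in_proj(z)
        A = torch.sigmoid(self.to_A(z))            # input-dependent transition per channel step
        B, C = self.to_B(z), self.to_C(z)
        h = torch.zeros_like(u[:, 0])
        ys = []
        for k in range(K):                         # unidirectional scan: O(K) steps
            h = A[:, k] * h + B[:, k] * u[:, k]
            ys.append(C[:, k] * h)
        y = self.out_proj(torch.stack(ys, dim=1))
        w = torch.softmax(self.score(y), dim=1)    # lightweight inter-channel attention
        return (w * y).reshape(b, L, K, d)
```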
Further, architectural variants such as Lag Fusion Mamba and Permutation Scanning Mamba have been introduced to explicitly encode periodic or seasonal temporal lags and to reorder channels by similarity, respectively. Lag Fusion Mamba injects inductive biases for user-defined or learned lags by incorporating hidden states at predetermined offsets, implemented as convolutions with dilations. Permutation Scanning uses graph Laplacian embeddings of channel similarity matrices to construct scan orders where correlated variables become neighbors, improving cross-channel generative fidelity (Yao et al., 23 Nov 2025).
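Under stated assumptions, the two variants can be sketched as follows: lag fusion as dilated 1-D convolutions over a user-specified lag set, and the permutation scan order as a spectral (Fiedler-vector) ordering derived from a channel-similarity graph Laplacian. All names, lag values, and the similarity measure are illustrative:

```python
import torch
import torch.nn as nn

class LagFusion(nn.Module):
    """Inject lagged hidden states via dilated 1-D convolutions (one branch per lag)."""

    def __init__(self, d_model, lags=(1, 24, 168)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(d_model, d_model, kernel_size=2, dilation=lag, padding=lag)
            for lag in lags
        )

    def forward(self, h):                     # h: (batch, length, d_model)
        h = h.transpose(1, 2)                 # Conv1d expects (batch, d_model, length)
        fused = sum(branch(h)[..., : h.shape[-1]] for branch in self.branches)
        return fused.transpose(1, 2)


def permutation_scan_order(series):           # series: (length, channels)
    """Order channels so that correlated variables become scan neighbors."""
    corr = torch.corrcoef(series.T).abs()     # channel similarity matrix
    deg = torch.diag(corr.sum(dim=1))
    lap = deg - corr                          # graph Laplacian of the similarity graph
    _, vecs = torch.linalg.eigh(lap)
    fiedler = vecs[:, 1]                      # second-smallest eigenvector (1-D spectral embedding)
    return torch.argsort(fiedler)             # scan order placing similar channels adjacently
```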
3. Integration with Diffusion Frameworks
The backbone modules are embedded in U-Net-like or stacked encoder-decoder structures for parameterizing the denoiser or mean predictor in the reverse diffusion process. Dual-pathway Mamba models process both temporal and channel axes in parallel, using specialized encoders (e.g., Diffusion Fusion Mamba for time, Diffusion Permutation Mamba for channel). Loss functions augment the standard denoising MSE with auxiliary objectives—such as FFT-based frequency-domain preservation and maximum mean discrepancy (MMD) regularization of inter-channel correlation structure—to ensure generated series maintain both short- and long-term patterns and realistic covariance (Yao et al., 23 Nov 2025).
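A hedged sketch of such a composite objective: the denoising MSE augmented with an FFT-magnitude term and an RBF-kernel MMD penalty on inter-channel correlation structure. The weights, kernel, and exact feature choices here are assumptions, not the paper's settings:

```python
import torch

def frequency_loss(x0_hat, x0):
    """Match FFT magnitudes along the time axis; inputs are (batch, length, channels)."""
    return torch.mean((torch.fft.rfft(x0_hat, dim=1).abs()
                       - torch.fft.rfft(x0, dim=1).abs()) ** 2)

def mmd_loss(a, b, sigma=1.0):
    """Biased RBF-kernel MMD between two sets of feature vectors."""
    def k(u, v):
        return torch.exp(-torch.cdist(u, v) ** 2 / (2 * sigma ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

def channel_corr(x):
    """Per-sample inter-channel correlation matrices, flattened to vectors."""
    return torch.stack([torch.corrcoef(s.T).flatten() for s in x])

def total_loss(eps, eps_hat, x0_hat, x0, w_freq=0.1, w_mmd=0.1):
    # x0_hat is the model's reconstruction of the clean series from the noisy input.
    denoise = torch.mean((eps - eps_hat) ** 2)
    return (denoise
            + w_freq * frequency_loss(x0_hat, x0)
            + w_mmd * mmd_loss(channel_corr(x0_hat), channel_corr(x0)))
```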
The imputation variant conditions on observed values and masks, and bespoke training/inference protocols restrict loss computation to unobserved (target) locations in the data. Mask-aware architectures and mask injection as input features ensure robustness to various missing data patterns (block, point, historical) (Solís-García et al., 8 Oct 2024).
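A minimal sketch of mask-aware training, assuming the observed values and mask are concatenated as conditioning features and the denoising loss is scored only at unobserved positions (the `denoiser` signature is hypothetical):

```python
import torch

def imputation_loss(denoiser, x0, obs_mask, t, alpha_bar):
    """Diffusion loss restricted to unobserved positions.

    x0:       full series, (batch, length, channels)
    obs_mask: 1 where observed, 0 where missing (the imputation targets)
    """
    a = alpha_bar[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps      # noise the full series

    # Observed entries and the mask itself are injected as conditioning features.
    cond = torch.cat([x0 * obs_mask, obs_mask], dim=-1)
    eps_hat = denoiser(x_t, t, cond)

    target = 1.0 - obs_mask                           # score only the missing locations
    return ((eps - eps_hat) ** 2 * target).sum() / target.sum().clamp(min=1.0)
```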
4. Complexity and Scalability
DiM-TS achieves linear complexity in both sequence length and number of channels, in contrast with self-attention-based Transformers that require $O(L^2 \cdot d)$ computation for hidden size $d$. The precise complexity is $O(L \cdot K \cdot d \cdot N)$ time, with $O(d \cdot N)$ memory for SSM state size $N$. Notably, these architectures remain efficient even under extensions to bidirectional, multi-lag, or multi-channel scans due to the underlying scan and convolutional algorithms (Gao et al., 17 Oct 2024, Yao et al., 23 Nov 2025).
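For intuition, a back-of-the-envelope operation count under illustrative (assumed) sizes:

```python
# Rough per-layer operation counts; the sizes below are illustrative assumptions.
L, d, N = 1024, 128, 16            # sequence length, hidden size, SSM state size

attention_ops = L * L * d          # self-attention: O(L^2 * d)
selective_scan_ops = L * d * N     # Mamba-style scan: O(L * d * N)

print(f"attention ~ {attention_ops:,} ops, scan ~ {selective_scan_ops:,} ops, "
      f"ratio ~ {attention_ops / selective_scan_ops:.0f}x")
# attention ~ 134,217,728 ops, scan ~ 2,097,152 ops, ratio ~ 64x
```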
A comparative summary is provided below:
| Model Type | Time Complexity | Bidirectionality | Inter-Channel Modeling |
|---|---|---|---|
| Transformer | $O(L^2 \cdot d)$ | Optional | Global, high cost |
| RNN/CNN | $O(L \cdot d)$ | Limited | Limited, per block |
| DiM-TS (Mamba+SSM) | $O(L \cdot d \cdot N)$ | Native | Linear, explicit |
5. Empirical Evaluation
DiM-TS and related implementations have attained leading results on benchmark datasets in both imputation and generation tasks:
- Imputation: On MuJoCo random-missing settings (70%, 80%, 90%), DiM-TS attains substantially lower MSE than the previous best methods. On Electricity, best or near-best MAE and RMSE are observed (e.g., at 30% missing, MAE = 0.348 vs. prior 0.407). Probabilistic imputation as measured by CRPS-sum yields a 21.4% gain over the next-best method (Gao et al., 17 Oct 2024).
- Generation: On Google stocks, ETTh, Energy, KDD-Cup AQI, DiM-TS improves Context-FID by over 60% and correlational scores by more than 35% compared to diffusion and GAN-based alternatives. Predictive and discriminative metrics confirm improved downstream task utility and realism (Yao et al., 23 Nov 2025).
- Robustness: Performance remains stable even for long sequences, unlike rival methods whose quality degrades with length. Ablation studies highlight dominant contributions from the lag fusion and bidirectional modules (Yao et al., 23 Nov 2025).
The TIMBA system, which adapts Diffusion Mamba with both bidirectional (temporal) and spatial (graph-based, node-oriented transformer) modules, demonstrates consistent improvements in MAE/MSE imputation metrics on AQI-36, METR-LA, and PEMS-BAY under varied missingness patterns. Downstream forecasting models benefit from preprocessing with DiM-TS imputations (Solís-García et al., 8 Oct 2024).
6. Limitations, Open Directions, and Context
Current DiM-TS instantiations fix lag sets and channel permutations offline; adaptive or data-driven discovery of these structures may enhance performance. For high-dimensional channel spaces, the computational cost associated with graph Laplacian embedding could be reduced by approximate algorithms or hierarchical clustering. Conditioning on exogenous variables and handling irregularly sampled or asynchronous time series represent further research frontiers. Accelerating sampling through step distillation and exploring alternative noise schedules are proposed as practical extensions (Yao et al., 23 Nov 2025, Gao et al., 17 Oct 2024).
From a methodological viewpoint, DiM-TS unifies SSMs and diffusion models within a single algebraic paradigm (state update $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$ with output $y_t = C_t h_t$), differing mainly in the structure imposed on $\bar{A}_t$, $\bar{B}_t$, and $C_t$ by bidirectionality, lag fusion, or channel permutation. This suggests that future research may benefit from generalized matrix parameterizations and information flow control across both axes.
7. Overview of Related Work
Prior art in time series diffusion models often relied on Transformer backbones for the denoiser, which incurred quadratic ($O(L^2)$) computational cost and limited scalability for long or high-dimensional data. Attempts to model bidirectional or cross-channel dependencies with Transformers or RNNs required complex masking or incurred quadratic overhead. The use of Mamba-based SSMs moves beyond strict RNN/attention paradigms and offers both theoretical and practical advancement, as demonstrated in head-to-head empirical comparisons (Gao et al., 17 Oct 2024, Yao et al., 23 Nov 2025, Solís-García et al., 8 Oct 2024). These advances position Diffusion Mamba for Time Series as a central architectural paradigm for scalable, probabilistic time series learning.