
Mamba-DALA Module in Delay-Aware Attention

Updated 16 January 2026
  • Mamba-DALA is a delay-aware linear attention mechanism designed for explicit modeling of cross-variate interactions and time-lag dependencies in multivariate time series analysis.
  • It employs globally informed delay priors and token-level relative delays via rotary embeddings to precisely align temporal relationships among variables.
  • Empirical evaluations within the DeMa architecture show that integrating Mamba-DALA significantly enhances forecasting accuracy while maintaining linear computational efficiency.

The Mamba-DALA module is a delay-aware linear attention mechanism designed for efficient and explicit modeling of cross-variate interactions in multivariate time series (MTS) analysis. As a core innovation within the DeMa (Dual-Path Delay-Aware Mamba) architecture, Mamba-DALA achieves linear complexity in both sequence length and the number of variates while capturing fine-grained, time-lagged dependencies between variables—a challenge not addressed by previous univariate state-space models or quadratic-cost Transformer-style attention. Mamba-DALA introduces globally informed delay priors and token-level relative delays into the attention computation, resulting in state-of-the-art predictive performance and substantial computational efficiency for MTS forecasting, imputation, classification, and anomaly detection tasks (An et al., 9 Jan 2026).

1. Role and Design in Dual-Path DeMa Architecture

DeMa decomposes MTS modeling into two parallel computational paths:

  • Cross-Time Path (Mamba-SSD): Independently processes the temporal dynamics within each variate.
  • Cross-Variate Path (Mamba-DALA): Explicitly models interactions among all variates at each token (patch) step with delay-sensitivity.

Mamba-DALA is situated in the "Cross-Variate" path within each DuoMNet block of DeMa. Its primary function is to learn pairwise, delay-aware dependencies between variates in linear time. This module combines a global delay prior—obtained by maximizing cross-correlation at the original temporal resolution and converting it to patch-level shifts—with fine-grained, token-level relative delays modulated through rotary position embeddings (RoPE). This structure enables Mamba-DALA to decide, at each token step, when and to what extent a particular variate should attend to another, accounting for both content and temporal alignment (An et al., 9 Jan 2026).

2. Mathematical Formulation and Core Mechanisms

Let $\hat X \in \mathbb{R}^{L \times N \times D}$ denote the Cross-Variate scanned input (where $L$ is the token length, $N$ the number of variates, and $D$ the embedding dimension). Mamba-DALA operates as follows:

  • Branching: $\hat X$ is projected into a content branch $X \in \mathbb{R}^{L \times N \times D_u}$ and a gate branch $X_{\text{gate}} \in \mathbb{R}^{L \times N \times D_u}$.
  • Projection: The content branch yields queries, keys, and values via linear maps $W_Q, W_K, W_V \in \mathbb{R}^{D_u \times D_u}$.
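
The branching and projection steps can be sketched in NumPy as follows. This is a minimal illustration of the shapes involved, not the reference implementation: the tensor sizes, the weight names `W_content` and `W_gate`, the random initialization, and the absence of bias terms are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes: tokens, variates, embedding dim, inner dim.
L, N, D, D_u = 6, 4, 16, 8

X_hat = rng.standard_normal((L, N, D))  # cross-variate scanned input

# Hypothetical projection weights (names and scaling are assumptions).
W_content = rng.standard_normal((D, D_u)) / np.sqrt(D)
W_gate = rng.standard_normal((D, D_u)) / np.sqrt(D)
W_Q, W_K, W_V = (rng.standard_normal((D_u, D_u)) / np.sqrt(D_u) for _ in range(3))

# Branching: one content stream and one gate stream, both (L, N, D_u).
X = X_hat @ W_content
X_gate = X_hat @ W_gate

# Projection: queries, keys, and values all come from the content branch.
X_q, X_k, X_v = X @ W_Q, X @ W_K, X @ W_V
```

Only the content branch feeds the attention computation; the gate branch is held back until the final gating step.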

2.1 Global Correlation Delay

For each variate pair $(a, b)$, with $X_{a,:}$ here denoting the series of variate $a$ at the original temporal resolution:

  • Compute the shift $\tau_{ab}$ maximizing Pearson correlation:

$$\tau_{ab} = \arg\max_t \operatorname{corr}\left(X_{a,:} \overset{t}{\to} X_{b,:}\right)$$

and the resulting maximum correlation strength $\rho_{ab}$.

  • Convert to patch-level delay:

$$\Delta_{ab} = \operatorname{round}\left(\tau_{ab} / P\right), \quad P = \text{points per token}$$
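
A minimal sketch of the global delay estimate for one variate pair is given below. It assumes that only non-negative lags are searched (series $b$ leading series $a$), that the search is capped at a hypothetical `max_lag`, and that Python's built-in `round` is an acceptable stand-in for the rounding rule; none of these details are specified in the text above.

```python
import numpy as np

def global_delay(x_a, x_b, max_lag, points_per_token):
    """Return (tau, rho, delta) for the lag at which x_b best matches x_a.

    Assumption: only lags t >= 0 are searched, i.e. x_b leads x_a by t points.
    """
    T = len(x_a)
    best_tau, best_rho = 0, -np.inf
    for t in range(max_lag + 1):
        # Align x_a[s + t] with x_b[s] and measure Pearson correlation.
        rho = np.corrcoef(x_a[t:], x_b[:T - t])[0, 1]
        if rho > best_rho:
            best_tau, best_rho = t, rho
    # Convert the point-level lag to a patch-level (token-level) delay.
    delta = int(round(best_tau / points_per_token))
    return best_tau, best_rho, delta

# Synthetic pair: x_a repeats x_b with a lag of 8 points.
rng = np.random.default_rng(0)
base = rng.standard_normal(208)
x_b = base[8:]
x_a = base[:200]
tau, rho, delta = global_delay(x_a, x_b, max_lag=24, points_per_token=4)
print(tau, delta)  # 8 2
```

With $P = 4$ points per token, the 8-point lag maps to a patch-level delay of $\Delta_{ab} = 2$.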

2.2 Token-Level Delay and Rotary Embedding

  • Use RoPE matrices $R_l^\Theta \in \mathbb{R}^{D_u \times D_u}$.
  • Effective token-pair delay between $(a, l)$ (query, series $a$ at token $l$) and $(b, j)$ (key, series $b$ at token $j$):

$$\delta_{l,j}^{a \leftarrow b} = (l - j) - \Delta_{ab}$$

  • In practice, queries and keys are rotated by $R_l^\Theta$ and $R_{j+\Delta_{ab}}^\Theta$, respectively.
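
The following sketch checks numerically that rotating the query at position $l$ and the key at position $j + \Delta_{ab}$ makes their inner product depend only on $(l - j) - \Delta_{ab}$. It uses the standard RoPE construction (pairwise 2-D rotations with geometric frequencies); the frequency base of 10000 and the helper names `rope_rotate` and `score` are assumptions for illustration.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2-D pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

def score(l, j, delta):
    # Query rotated at l, key rotated at the delay-shifted position j + delta.
    return rope_rotate(q, l) @ rope_rotate(k, j + delta)

# All three cases share the same effective delay (l - j) - delta = 2.
s1 = score(7, 3, 2)
s2 = score(12, 8, 2)
s3 = score(7, 5, 0)
```

The three scores agree up to floating-point error, so shifting the key position by the global delay is equivalent to subtracting $\Delta_{ab}$ from the relative offset.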

2.3 Linear-Attention Output with Delay-Aware Aggregation

  • Employ the Flatten kernel: $\phi(x) = f(\text{ReLU}(x))$.
  • For each query $(a, l)$, the output is:

$$y_{a,l} = R_l^\Theta \phi(X^q_{a,l}) \cdot \frac{ \sum_{b=1}^N \rho_{ab} \sum_{j:\, j+\Delta_{ab} \le l} \left( R_{j+\Delta_{ab}}^\Theta \phi(X^k_{b,j}) \right)^\top X^v_{b,j} }{ \phi(X^q_{a,l}) \cdot \sum_{b=1}^N \rho_{ab} \sum_{j:\, j+\Delta_{ab} \le l} \phi(X^k_{b,j})^\top }$$

  • Gated output: $Y_{\text{variate}, a, l} = \text{Linear}(y_{a, l} \odot \sigma(X^{\text{gate}}_{a, l}))$, with $\sigma$ as the sigmoid activation.
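
The output formula can be evaluated directly with nested loops, as in the reference sketch below. This is a naive evaluation intended for checking shapes and the lag constraint $j + \Delta_{ab} \le l$, not the linear-time streaming computation. Several details are assumptions: the form of $f$ (an element-wise power rescaled to preserve the norm), the small `eps` added to the denominator, non-negative $\rho_{ab}$, a bias-free output projection `W_out`, and the function names themselves.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Standard pairwise rotary embedding (frequency base is an assumption).
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out

def flatten_kernel(x, p=3, eps=1e-6):
    # phi(x) = f(ReLU(x)); f is assumed to be a norm-preserving power map.
    r = np.maximum(x, 0.0)
    rp = r ** p
    scale = np.linalg.norm(r, axis=-1, keepdims=True)
    return rp * scale / (np.linalg.norm(rp, axis=-1, keepdims=True) + eps)

def dala_reference(Xq, Xk, Xv, Xgate, rho, delta, W_out, eps=1e-6):
    """Direct evaluation of the delay-aware output; inputs are (L, N, D_u)."""
    L, N, Du = Xq.shape
    phi_q, phi_k = flatten_kernel(Xq), flatten_kernel(Xk)
    Y = np.zeros((L, N, W_out.shape[1]))
    for a in range(N):
        for l in range(L):
            S = np.zeros((Du, Du))  # numerator accumulator
            z = np.zeros(Du)        # denominator accumulator
            for b in range(N):
                for j in range(L):
                    if j + delta[a, b] <= l:  # lag-aligned causal constraint
                        k_rot = rope(phi_k[j, b], j + delta[a, b])
                        S += rho[a, b] * np.outer(k_rot, Xv[j, b])
                        z += rho[a, b] * phi_k[j, b]
            y = (rope(phi_q[l, a], l) @ S) / (phi_q[l, a] @ z + eps)
            gate = 1.0 / (1.0 + np.exp(-Xgate[l, a]))  # sigmoid gate
            Y[l, a] = (y * gate) @ W_out
    return Y

rng = np.random.default_rng(0)
L, N, Du = 5, 3, 4
Xq, Xk, Xv, Xgate = (rng.standard_normal((L, N, Du)) for _ in range(4))
rho = rng.uniform(0.1, 1.0, size=(N, N))  # assumed non-negative strengths
delta = np.ones((N, N), dtype=int)        # every pair delayed by one token
W_out = rng.standard_normal((Du, Du))
Y = dala_reference(Xq, Xk, Xv, Xgate, rho, delta, W_out)
```

With every pair delayed by one token, no key satisfies the constraint at $l = 0$, so the first output row is zero. An efficient version would maintain `S` and `z` as running states instead of recomputing them per query.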

3. Implementation Workflow and Computational Efficiency

The computation in Mamba-DALA proceeds by streaming over valid lag-aligned key-value pairs for each query, accumulating numerator and denominator terms modulated by the global delay prior and rotary embeddings without explicitly constructing the full attention matrix. The forward pass operates as follows:

  1. Branch $\hat X$ into content and gating streams.
  2. Generate projected queries, keys, and values.
  3. Apply the Flatten kernel to queries and keys.
  4. For each variate and token, iterate across other variates and their referenced, delay-shifted tokens, accumulating weighted sums.
  5. Gate the resulting content using the gate branch and output projection.

This yields an overall complexity of $\mathcal{O}(L \cdot N \cdot D_u^2)$ FLOPs per block and $\mathcal{O}(L \cdot N \cdot D_u)$ memory, compared with $\mathcal{O}((L \cdot N)^2 \cdot D)$ for quadratic self-attention, while adding the explicit cross-variate delay modeling that univariate state-space models lack (An et al., 9 Jan 2026).

| Component | Complexity per block (FLOPs) | Notes |
|---|---|---|
| Mamba-DALA | $\mathcal{O}(L \cdot N \cdot D_u^2)$ | Cross-variate, delay-aware linear attention |
| Mamba-SSD | $\mathcal{O}(L \cdot N \cdot D_h^2)$ | Univariate temporal modeling |
| Transformer SA | $\mathcal{O}((L \cdot N)^2 \cdot D)$ | Quadratic scaling with total token length |
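
To give a sense of scale, the leading terms from the table can be compared for a concrete configuration. The sizes below are hypothetical and constant factors are ignored, so the numbers indicate orders of magnitude only.

```python
def dala_flops(L, N, D_u):
    # Leading term for Mamba-DALA: linear in L and N.
    return L * N * D_u ** 2

def self_attention_flops(L, N, D):
    # Leading term for self-attention over all L * N tokens.
    return (L * N) ** 2 * D

# Hypothetical configuration: 64 tokens, 256 variates, width 128.
L, N, D = 64, 256, 128
linear_cost = dala_flops(L, N, D)
quadratic_cost = self_attention_flops(L, N, D)
ratio = quadratic_cost // linear_cost
print(linear_cost, quadratic_cost, ratio)  # 268435456 34359738368 128
```

When $D = D_u$, the ratio simplifies to $L \cdot N / D_u$, so the gap widens as either the token length or the number of variates grows.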

4. Architectural Integration and Post-Processing

Mamba-DALA is integrated after patch embedding and the cross-variate scan. Two-branch projections output delay-aware content and a gating signal. No normalization is applied inside Mamba-DALA; instead, its output $Y_\text{variate}$ is layer-normalized prior to downstream fusion. The output is blended with the temporal path's $Y_\text{time}$ via a weighted sum $\alpha \cdot \mathrm{LN}(Y_\text{time}) + \beta \cdot \mathrm{LN}(Y_\text{variate})$, followed by a feedforward network (FFN) and additional residual layers (An et al., 9 Jan 2026).
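
The fusion step can be sketched as below. The layer normalization here has no learnable scale or shift, and $\alpha$ and $\beta$ are treated as plain scalars set to 0.5; whether they are learned, and how they are parameterized, is not stated above, so these choices are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis; no learnable affine parameters.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def fuse(Y_time, Y_variate, alpha=0.5, beta=0.5):
    # Weighted sum of the normalized temporal and cross-variate outputs.
    return alpha * layer_norm(Y_time) + beta * layer_norm(Y_variate)

rng = np.random.default_rng(0)
Y_time = rng.standard_normal((6, 4, 8))
Y_variate = 10.0 * rng.standard_normal((6, 4, 8))  # deliberately larger scale
Y_fused = fuse(Y_time, Y_variate)
```

Normalizing each path before the sum keeps one branch from dominating the other purely because of its output scale.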

This dual-path structure, with parallel Mamba-SSD and Mamba-DALA modules, enables DeMa to simultaneously capture long-range intra-series dependencies and cross-variate delay-sensitive interactions with strict linear scalability.

5. Empirical Evaluation and Ablation Analysis

In ablation studies, the removal of the Mamba-DALA module ("w/o variate path") results in marked degradation of forecasting accuracy. For example, in long-term traffic forecasting, MSE increases from $0.382$ to $0.539$ (+41.1%), and in PEMS04 short-term forecasting, from $0.061$ to $0.142$. Replacing Mamba-DALA with a repeat of the Mamba-SSD block in the variate path also leads to performance deterioration. This demonstrates that a dedicated delay-aware linear attention mechanism is critical for capturing granular cross-variate dependencies and cannot be effectively replaced by generic state-space modeling (An et al., 9 Jan 2026).
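
The relative degradations follow from the reported MSE values by simple arithmetic. The helper name `relative_increase` is introduced here for illustration only.

```python
def relative_increase(before, after):
    # Percentage increase from `before` to `after`, rounded to one decimal.
    return round((after - before) / before * 100, 1)

traffic = relative_increase(0.382, 0.539)
pems04 = relative_increase(0.061, 0.142)
print(traffic, pems04)  # 41.1 132.8
```

On these figures, removing the variate path more than doubles the PEMS04 error, a much larger relative change than in the long-term traffic setting.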

6. Context, Significance, and Limitations

Mamba-DALA addresses long-standing deficits in scalable MTS modeling: explicit cross-variate interaction, disentanglement of temporal and variate correlations, and precise modeling of latent time-lag effects, all with linear computational overhead. Its design enables DeMa to outperform Transformer-based and prior Mamba-based baselines on forecasting, imputation, anomaly detection, and classification tasks. A plausible implication is that Mamba-DALA may serve as a blueprint for other expressive, resource-efficient architectures in signal modeling where delay-sensitive, variable-wise dependencies are fundamental (An et al., 9 Jan 2026). Limitations, such as reliance on accurate pre-computed lag priors or challenges extending beyond pairwise interactions, merit further investigation.
