Mamba-DALA Module in Delay-Aware Attention
- Mamba-DALA is a delay-aware linear attention mechanism designed for explicit modeling of cross-variate interactions and time-lag dependencies in multivariate time series analysis.
- It employs globally informed delay priors and token-level relative delays via rotary embeddings to precisely align temporal relationships among variables.
- Empirical evaluations within the DeMa architecture show that integrating Mamba-DALA significantly enhances forecasting accuracy while maintaining linear computational efficiency.
The Mamba-DALA module is a delay-aware linear attention mechanism designed for efficient and explicit modeling of cross-variate interactions in multivariate time series (MTS) analysis. As a core component of the DeMa (Dual-Path Delay-Aware Mamba) architecture, Mamba-DALA achieves linear complexity in both sequence length and the number of variates while capturing fine-grained, time-lagged dependencies between variables, a challenge addressed neither by previous univariate state-space models nor by quadratic-cost Transformer-style attention. Mamba-DALA introduces globally informed delay priors and token-level relative delays into the attention computation; the authors report state-of-the-art predictive performance at linear cost for MTS forecasting, imputation, classification, and anomaly detection tasks (An et al., 9 Jan 2026).
1. Role and Design in Dual-Path DeMa Architecture
DeMa decomposes MTS modeling into two parallel computational paths:
- Cross-Time Path (Mamba-SSD): Independently processes the temporal dynamics within each variate.
- Cross-Variate Path (Mamba-DALA): Explicitly models interactions among all variates at each token (patch) step with delay-sensitivity.
Mamba-DALA is situated in the "Cross-Variate" path within each DuoMNet block of DeMa. Its primary function is to learn pairwise, delay-aware dependencies between variates in linear time. This module combines a global delay prior—obtained by maximizing cross-correlation at the original temporal resolution and converting it to patch-level shifts—with fine-grained, token-level relative delays modulated through rotary position embeddings (RoPE). This structure enables Mamba-DALA to decide, at each token step, when and to what extent a particular variate should attend to another, accounting for both content and temporal alignment (An et al., 9 Jan 2026).
2. Mathematical Formulation and Core Mechanisms
The input to Mamba-DALA is the cross-variate scanned tensor, indexed by token (patch) position, variate, and embedding channel. Mamba-DALA operates as follows:
- Branching: The input is projected into a content branch and a gate branch.
- Projection: The content branch yields queries, keys, and values ($Q$, $K$, $V$) via three separate linear maps.
2.1 Global Correlation Delay
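For each variate pair $(i, j)$:
- Compute the time shift that maximizes the Pearson correlation between series $i$ and the shifted series $j$, and record the resulting maximum correlation strength.
- Convert this shift from the original temporal resolution to a patch-level delay.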
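The two steps above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the helper names `best_lag` and `to_patch_delay`, the search window `max_lag`, and the rule of dividing by the patch stride and rounding are all assumptions made for the example.

```python
import numpy as np

def best_lag(x, y, max_lag):
    """Return (lag, rho): the shift of y relative to x with the highest Pearson correlation.

    A positive lag means y follows x, i.e. x[t] is compared with y[t + lag].
    """
    best = (0, -np.inf)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[: len(y) + lag]
        rho = np.corrcoef(a, b)[0, 1]
        if rho > best[1]:
            best = (lag, rho)
    return best

def to_patch_delay(lag, stride):
    # Assumption: rescale by the patch stride and round to the nearest token.
    return int(np.round(lag / stride))

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
y = np.roll(x, 12) + 0.05 * rng.standard_normal(512)  # y lags x by 12 steps

lag, rho = best_lag(x, y, max_lag=32)
patch_delay = to_patch_delay(lag, stride=4)
print(lag, patch_delay)
```

In this synthetic case the recovered shift is 12 time steps, which maps to a delay of 3 tokens at a stride of 4. The brute-force scan over lags is chosen for readability; an FFT-based cross-correlation would be the usual choice for long series.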
2.2 Token-Level Delay and Rotary Embedding
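- Use RoPE rotation matrices indexed by token position.
- The effective token-pair delay between a query (one variate at one token) and a key (another variate at another token) combines their token-level offset with the patch-level global delay of that variate pair.
- In practice, queries and keys are each rotated by their own RoPE matrix, so their inner product depends only on the relative delay between them.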
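The relative-position property that makes this work can be checked numerically. The sketch below uses the standard RoPE construction (pairwise rotations with geometrically spaced frequencies). How the global delay enters the position, here by adding a hypothetical `delay` to the key position, is an assumption for illustration, including its sign.

```python
import numpy as np

def rope(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles pos * theta_k."""
    d = vec.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per feature pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    even, odd = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = even * cos - odd * sin
    out[1::2] = even * sin + odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Query at token 10, key at token 7, with a hypothetical patch-level delay of 2
# added to the key position (the sign convention is an assumption).
delay = 2
score_a = rope(q, 10) @ rope(k, 7 + delay)

# Shifting both positions by the same amount leaves the score unchanged.
score_b = rope(q, 110) @ rope(k, 107 + delay)

# Changing the delay changes the relative offset and therefore the score.
score_c = rope(q, 10) @ rope(k, 7)
```

Because only the difference between the two rotated positions matters, a per-pair delay can be folded into the positions before rotation without ever forming a pairwise delay matrix.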
2.3 Linear-Attention Output with Delay-Aware Aggregation
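- Employ the Flatten kernel as the feature map applied to queries and keys.
- For each query, the output is a normalized sum over lag-aligned keys: the numerator accumulates kernel similarity times value, modulated by the global delay prior and the rotary embeddings, and the denominator accumulates the same weights without the values.
- Gated output: the content output is multiplied elementwise by $\sigma(\cdot)$ of the gate branch and passed through the output projection, with $\sigma$ as the sigmoid activation.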
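A minimal sketch of the numerator/denominator structure and the gate follows. It deliberately omits the delay prior and rotary terms. The feature map `phi` is a simple positive stand-in rather than the Flatten kernel, and all function names are illustrative.

```python
import numpy as np

def phi(x):
    # Stand-in positive feature map (ReLU plus a small constant); the paper's
    # Flatten kernel is not reproduced here.
    return np.maximum(x, 0.0) + 1e-3

def linear_attention(Q, K, V):
    """Normalized kernel attention computed without an (n, n) weight matrix."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V        # (d, d_v) summary of all key-value pairs (numerator)
    z = Kf.sum(axis=0)   # (d,) summary of all keys (denominator)
    return (Qf @ kv) / (Qf @ z)[:, None]

def explicit_attention(Q, K, V):
    """Same quantity, but materializing the (n, n) weight matrix."""
    W = phi(Q) @ phi(K).T
    return (W / W.sum(axis=1, keepdims=True)) @ V

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V, G = (rng.standard_normal((n, d)) for _ in range(4))

content = linear_attention(Q, K, V)
gated = sigmoid(G) * content  # elementwise gate from the gate branch
```

The two attention functions agree to floating-point precision; the linear form gets there by reordering the matrix products, which is what removes the quadratic dependence on the number of tokens.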
3. Implementation Workflow and Computational Efficiency
The computation in Mamba-DALA proceeds by streaming over valid lag-aligned key-value pairs for each query, accumulating numerator and denominator terms modulated by the global delay prior and rotary embeddings without explicitly constructing the full attention matrix. The forward pass operates as follows:
- Branch into content and gating streams.
- Generate projected queries, keys, and values.
- Apply the Flatten kernel to queries and keys.
- For each variate and token, iterate across other variates and their referenced, delay-shifted tokens, accumulating weighted sums.
- Gate the resulting content using the gate branch and output projection.
Because the full attention matrix is never materialized, the per-block cost scales linearly with token length and number of variates, in contrast to the quadratic scaling of self-attention. It also provides explicit cross-variate delay modeling, which univariate state-space models lack (An et al., 9 Jan 2026).
| Component | Scaling per block (FLOPs) | Notes |
|---|---|---|
| Mamba-DALA | Linear | Cross-variate, delay-aware linear attention |
| Mamba-SSD | Linear | Univariate temporal modeling |
| Transformer SA | Quadratic | Scales with total token length |
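To give a feel for the gap between the two scaling regimes, the sketch below compares textbook leading-order multiply-add counts for generic quadratic and kernelized linear attention. These are not the paper's FLOP expressions, and the patch count of 64 per variate and width of 128 are hypothetical.

```python
def quadratic_attention_cost(tokens, dim):
    # Leading-order multiply-adds: score matrix plus weighted sum of values.
    return 2 * tokens * tokens * dim

def linear_attention_cost(tokens, dim):
    # Leading-order multiply-adds: key-value summary plus per-query readout.
    return 2 * tokens * dim * dim

for n_variates in (10, 100, 1000):
    tokens = 64 * n_variates  # hypothetical 64 patches per variate
    ratio = quadratic_attention_cost(tokens, 128) / linear_attention_cost(tokens, 128)
    print(n_variates, tokens, ratio)
```

Under these counts the ratio is simply `tokens / dim`, so the advantage of the linear form grows in proportion to the number of variates.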
4. Architectural Integration and Post-Processing
Mamba-DALA is integrated after patch embedding and the cross-variate scan. Its two-branch projections produce delay-aware content and a gating signal. No normalization is applied inside Mamba-DALA; instead, its output is layer-normalized before downstream fusion. The normalized output is blended with the temporal path's output via a weighted sum, followed by a feedforward network (FFN) and additional residual layers (An et al., 9 Jan 2026).
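The post-processing can be sketched as below. The exact form of the weighted sum is not recoverable from the text, so the convex combination with a scalar `alpha` is an assumption, as is the omission of LayerNorm's learnable scale and shift; `fuse` is an illustrative name.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the embedding axis; learnable scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def fuse(temporal_out, variate_out, alpha):
    """Blend the two paths; `alpha` is a hypothetical scalar mixing weight."""
    return alpha * temporal_out + (1.0 - alpha) * layer_norm(variate_out)

rng = np.random.default_rng(0)
temporal_out = rng.standard_normal((3, 8))  # stand-in for the Mamba-SSD output
variate_out = rng.standard_normal((3, 8))   # stand-in for the Mamba-DALA output
fused = fuse(temporal_out, variate_out, alpha=0.5)
```

In a full block, `fused` would then pass through the FFN and residual connections described above.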
This dual-path structure, with parallel Mamba-SSD and Mamba-DALA modules, enables DeMa to simultaneously capture long-range intra-series dependencies and cross-variate delay-sensitive interactions with strict linear scalability.
5. Empirical Evaluation and Ablation Analysis
In ablation studies, removing the Mamba-DALA module ("w/o variate path") markedly degrades forecasting accuracy. For example, in long-term traffic forecasting, MSE increases from $0.382$ to $0.539$ (+41.1%), and in PEMS04 short-term forecasting, from $0.061$ to $0.142$ (+132.8%). Replacing Mamba-DALA with a second Mamba-SSD block in the variate path also degrades performance. These results indicate that a dedicated delay-aware linear attention mechanism is important for capturing fine-grained cross-variate dependencies and is not readily replaced by generic state-space modeling (An et al., 9 Jan 2026).
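The relative degradations follow directly from the reported error values. A short check (the helper name `relative_increase` is illustrative):

```python
def relative_increase(before, after):
    """Percentage increase from `before` to `after`."""
    return 100.0 * (after - before) / before

traffic = relative_increase(0.382, 0.539)
pems04 = relative_increase(0.061, 0.142)
print(f"{traffic:.1f}% {pems04:.1f}%")  # prints "41.1% 132.8%"
```

The PEMS04 error more than doubles without the variate path, a considerably larger relative degradation than in the long-term traffic setting.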
6. Context, Significance, and Limitations
Mamba-DALA addresses long-standing deficits in scalable MTS modeling: explicit cross-variate interaction, disentanglement of temporal and variate correlations, and precise modeling of latent time-lag effects, all with linear computational overhead. Its design enables DeMa to outperform Transformer-based and prior Mamba-based baselines on forecasting, imputation, anomaly detection, and classification tasks. A plausible implication is that Mamba-DALA may serve as a blueprint for other expressive, resource-efficient architectures in signal modeling where delay-sensitive, variable-wise dependencies are fundamental (An et al., 9 Jan 2026). Limitations, such as reliance on accurate pre-computed lag priors and the difficulty of extending beyond pairwise interactions, merit further investigation.