
Multi-Head Differential Mamba

Updated 30 November 2025
  • The paper introduces a novel neural module that uses dual Mamba2 blocks with learnable scaling to differentially refine EEG representations.
  • MDM operates as a critical bottleneck within a U-Net SAMBA architecture, efficiently modeling long-range temporal dependencies in EEG signals.
  • Ablation studies confirm that MDM improves accuracy and AUROC by robustly suppressing redundant signal components compared to conventional methods.

Multi-Head Differential Mamba (MDM) is a neural network module designed to enhance the modeling of long-range temporal dependencies in highly redundant sequential signals, specifically in the context of electroencephalography (EEG) data. Introduced as part of the SAMBA framework for EEG foundation modeling, MDM operates as a bottleneck module in a U-shaped encoder–decoder built from state space model (SSM) layers, leveraging dual parallel Mamba2 blocks per attention head to selectively suppress redundant signal components and emphasize salient, task-relevant temporal structures. The design achieves O(T·D) time and memory complexity, offering a scalable alternative to quadratic multi-head self-attention in long-context settings (Hong et al., 23 Nov 2025).

1. Architectural Role in U-Shaped Mamba Encoder–Decoder

The MDM module is situated at the bottleneck (or "waist") of the SAMBA U-Net architecture, receiving input from the deepest encoder stage and producing a refined representation that is then upsampled by the decoder. The surrounding architecture comprises:

  • 3D Spatial-Adaptive Input Embedding (SAIE)
  • Convolutional temporal-receptive front-end
  • Mamba2-based SSM blocks in encoder and decoder
  • Bottleneck MDM module
  • Symmetric upsampling and additional Mamba2 blocks in the decoder

The principal motivation for MDM emerges from the nature of EEG signals, which feature substantial temporal redundancy and a mixture of slow-varying and event-driven neural dynamics. While a single Mamba2 SSM layer can model long-range dependencies, it does not explicitly suppress redundant structures; the differential design of MDM directly addresses this need by contrasting parallel state-space pathways per feature head (Hong et al., 23 Nov 2025).

2. Mathematical Formulation

Let $X \in \mathbb{R}^{B \times T \times D}$ denote the input at the bottleneck, where $B$ is the batch size, $T$ the sequence length, and $D$ the feature dimension. Decompose $D$ into $H$ heads with per-head width $d = D/H$:

  1. Head Splitting:

$$X = [X_1,\, X_2,\, \ldots,\, X_H], \quad X_h \in \mathbb{R}^{B \times T \times d}$$

  2. Dual Mamba2 Processing Per Head: For each head $h$,

$$Y_h^{(1)} = \mathrm{Mamba}_h^{(1)}(X_h), \quad Y_h^{(2)} = \mathrm{Mamba}_h^{(2)}(X_h)$$

  3. Differential Operation with Learnable Scaling: Introduce $\lambda_h \in \mathbb{R}^d$,

$$\Delta_h = Y_h^{(1)} - \left(\lambda_h \odot Y_h^{(2)}\right)$$

  4. Head-wise Normalization:

$$\widetilde{\Delta}_h = \mathrm{GroupNorm}_h(\Delta_h)$$

  5. Feature Re-aggregation and Residual:

$$Z = \mathrm{Concat}(\widetilde{\Delta}_1, \ldots, \widetilde{\Delta}_H)$$

$$U = \mathrm{Linear}(Z), \qquad Y = X + U$$

This construction maintains $O(T \cdot D)$ computational and memory complexity, scaling to long input sequences.
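The five steps above can be sketched end to end in NumPy. This is an illustrative stand-in, not the paper's implementation: the Mamba2 pathways are replaced by toy causal exponential-moving-average filters, GroupNorm by a simple per-head normalization with $G=1$, and the output Linear by an identity map, so only the head splitting, differential, normalization, re-aggregation, and residual structure is demonstrated:

```python
import numpy as np

def toy_ssm(x, a):
    """Stand-in for a Mamba2 pathway: a causal EMA along the time axis."""
    y = np.zeros_like(x)
    for t in range(x.shape[1]):
        prev = y[:, t - 1] if t > 0 else 0.0
        y[:, t] = a * prev + (1 - a) * x[:, t]
    return y

def mdm_forward(X, H, lam, a1, a2):
    B, T, D = X.shape
    d = D // H
    heads = []
    for h in range(H):
        X_h = X[..., h * d:(h + 1) * d]          # 1. slice head h
        Y1 = toy_ssm(X_h, a1[h])                 # 2. pathway 1
        Y2 = toy_ssm(X_h, a2[h])                 #    pathway 2
        delta = Y1 - lam[h] * Y2                 # 3. differential with scale
        mu = delta.mean(axis=(1, 2), keepdims=True)
        sd = delta.std(axis=(1, 2), keepdims=True)
        heads.append((delta - mu) / (sd + 1e-5)) # 4. per-head norm (G=1 proxy)
    Z = np.concatenate(heads, axis=-1)           # 5. re-aggregate heads
    U = Z                                        #    identity in place of Linear
    return X + U                                 #    residual connection

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50, 8))              # (B, T, D)
Y = mdm_forward(X, H=2, lam=np.ones((2, 4)), a1=[0.9, 0.5], a2=[0.99, 0.8])
print(Y.shape)  # (2, 50, 8) -- the bottleneck preserves the input shape
```

The shape-preserving residual output is what allows MDM to slot into the U-Net waist without changing the decoder's expected dimensions.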

3. Intuition: Redundancy Suppression and Salience Enhancement

EEG time-series inherently contain both slow, globally correlated trends (e.g., baseline drift) and brief, event-related bursts (e.g., P300, oscillatory rhythms). The differential mechanism in MDM, comparing two independently parameterized Mamba2 paths per head and subtracting a learnable fraction of one from the other, is designed to:

  • Cancel out shared, static, or slow-varying signal components by subtraction.
  • Retain and emphasize rapidly changing, divergent, or salient activity.
  • Adaptively select which temporal dynamics to suppress or highlight, via the learnable vector $\lambda_h$ per head.
  • Stabilize dynamics across batches, heads, and time windows using GroupNorm, mitigating over-dominance by any single head.

The result is a refined bottleneck representation with reduced temporal redundancy and amplified transient patterns relevant to downstream tasks (Hong et al., 23 Nov 2025).
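The cancellation intuition can be checked numerically. In this toy example (not the paper's trained pathways), two views of a signal share a slow drift while one also carries a brief event-related burst; subtracting with a scale of 1 removes the shared component and leaves the transient:

```python
import numpy as np

t = np.linspace(0, 1, 200)
drift = 0.5 * t                               # shared slow trend (redundant)
burst = np.exp(-((t - 0.5) ** 2) / 2e-4)      # transient "event" of interest

y1 = drift + burst    # pathway 1: drift plus salient burst
y2 = drift            # pathway 2: mostly the redundant component

lam = 1.0             # scale fixed at its typical initialization
delta = y1 - lam * y2

# Subtraction cancels the shared drift and retains the burst.
print(np.abs(delta - burst).max())  # ~0
```

In the actual module the two pathways are learned, so what gets cancelled versus retained is determined by training rather than constructed by hand as here.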

4. Computation and Pseudocode

The MDM forward pass is structured as follows:

def MDM_forward(X, mamba1, mamba2, lam, norm, proj, H, d):
    # X: (B, T, D) bottleneck features; mamba1/mamba2: per-head Mamba2 blocks;
    # lam: per-head learnable scales; norm: per-head GroupNorm; proj: output Linear.
    head_outputs = []
    for h in range(H):
        X_h = X[..., h*d:(h+1)*d]              # slice head h
        Y1_h = mamba1[h](X_h)                  # first pathway
        Y2_h = mamba2[h](X_h)                  # second pathway
        delta_h = Y1_h - lam[h] * Y2_h         # differential with learnable scale
        head_outputs.append(norm[h](delta_h))  # head-wise GroupNorm

    Z = concatenate(head_outputs, axis=-1)     # re-aggregate heads
    U = proj(Z)                                # output projection
    return X + U                               # residual output

No explicit positional encoding is used within MDM; context is provided by prior Mamba2 layers and the U-shaped architecture.

5. Complexity and Comparison to Attention Mechanisms

| Module | Time Complexity | Memory Complexity | Positional Encoding |
|---|---|---|---|
| MDM | $O(T \cdot D)$ | $O(T \cdot D)$ | Not required |
| MHSA (softmax) | $O(T^2 \cdot D)$ | $O(T^2)$ | Required |

MDM achieves linear scaling with sequence length TT, enabling tractable modeling of extended temporal contexts in EEG and other long-sequence modalities. In contrast, standard multi-head self-attention (MHSA) incurs quadratic overhead, limiting applicability to very long signals (Hong et al., 23 Nov 2025).
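A rough operation count makes the gap concrete. The window length and feature width below are hypothetical EEG-scale numbers chosen for illustration, not measurements from the paper, and constant factors are ignored:

```python
# Proportional operation counts: linear-time MDM scales as T * D,
# while the MHSA attention map scales as T^2 * D.
T, D = 10_000, 256        # hypothetical long EEG window and feature width

mdm_ops = T * D           # O(T * D)
mhsa_ops = T * T * D      # O(T^2 * D)

print(mhsa_ops // mdm_ops)  # 10000: MHSA costs roughly T times more
```

At this sequence length the quadratic term dominates by four orders of magnitude, which is why linear-scan SSM bottlenecks remain tractable where full attention does not.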

6. Implementation Notes

  • $\lambda_h$ vectors are typically initialized to $1.0$, so that early in training the two Mamba2 outputs substantially cancel, facilitating gradient flow and encouraging the network to identify informative divergences.
  • Head-wise GroupNorm (typically with $G=1$ or $G=d$ groups) ensures stable per-head dynamics.
  • MDM operates on outputs from the masked encoder; no further masking or positional induction is performed at this stage.
  • The residual connection is empirically necessary to maintain stability and preserve slow, global EEG trends.
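The initialization note can be illustrated in the limiting case. Here the two pathways are toy linear maps (not real Mamba2 blocks), and as an assumption for the demonstration they share the same initial weights; with $\lambda_h = 1$ the differential is then exactly zero and the residual makes the module an identity map, so early training is dominated by the skip path:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))      # shared init for both toy pathways
X = rng.standard_normal((4, 16, 8))  # (batch, time, head width)

Y1 = X @ W           # pathway 1 at initialization
Y2 = X @ W           # pathway 2 at initialization (same weights here)
lam = np.ones(8)     # lambda_h initialized to 1.0

delta = Y1 - lam * Y2
print(np.abs(delta).max())  # 0.0 -> the output reduces to the residual X
```

In practice the pathways are initialized independently, so cancellation is approximate rather than exact, but the same mechanism keeps the early output close to the residual input.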

7. Empirical Findings and Ablation Studies

Ablation experiments on the TUAB-100s EEG dataset substantiated the effectiveness of MDM:

| Configuration | Balanced Accuracy (%) | AUROC | Δ Accuracy | Δ AUROC |
|---|---|---|---|---|
| Baseline SAMBA-100s (with MDM) | 82.64 | 0.9054 | | |
| MDM replaced by single Mamba2 | 80.01 | 0.8687 | –2.63 | –0.0367 |
| Remove residual skip in MDM | 81.41 | 0.8867 | –1.23 | –0.0187 |

Further ablations altering masking, losses, and replacing MDM with full convolution or full attention all resulted in performance losses, confirming the critical role of MDM for both stability and peak accuracy (Hong et al., 23 Nov 2025). In summary, the Multi-Head Differential Mamba module facilitates scalable, robust long-context modeling by directly suppressing temporal redundancy and recovering dynamic neural features essential for representation learning in EEG and similar domains.
