Local State-Evolution Mamba (LSEMba)
- LSEMba is a neural modeling architecture that augments Mamba SSMs with local context modules, enhancing spatial, temporal, and neighborhood dependency capture.
- It integrates state-space recurrences with mechanisms such as convolutional windowing and motion extraction, retaining linear complexity while improving task fidelity.
- Empirical benchmarks in graph, video, and image domains demonstrate the efficacy of LSEMba's selective gating and adaptive multiscale fusion.
Local State-Evolution Mamba (LSEMba) refers to a general architectural pattern for neural sequence and structured data modeling, in which the selective state-space modeling formalism popularized by Mamba is systematically augmented to capture essential local (spatial, temporal, or neighborhood) dependencies that are otherwise neglected by purely global or one-dimensional SSM recurrences. First introduced in the context of deep graph learning (He et al., 10 Nov 2025), LSEMba and related local/“enhanced” Mamba blocks have since appeared in diverse applications such as micro-gesture recognition (Li et al., 12 Oct 2025) and multispectral image fusion (Cao et al., 14 Apr 2024). In all instantiations, LSEMba combines the computational efficiency and expressive selectivity of the vanilla Mamba SSM with mechanisms for local context modeling, yielding models with strictly linear complexity in sequence (or graph) size and improved empirical fidelity on tasks requiring both long- and short-range context integration.
1. Mathematical Foundations: From Mamba SSM to Local State Evolution
The core of LSEMba is rooted in the continuous-time state-space formalism introduced in Mamba (Gu et al., 2023), where for observable $x(t) \in \mathbb{R}$, hidden state $h(t) \in \mathbb{R}^{N}$, and output $y(t) \in \mathbb{R}$,

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t).$$

Discretization with step size $\Delta$ (via zero-order hold) yields

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B,$$

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$
Input conditioning is achieved by parameterizing $B$, $C$, and $\Delta$ as functions of the input $x_t$, yielding a selective, content-aware state evolution at each timestep. This mechanism equips the model with dynamic gating capabilities, whereby the recurrence can adaptively remember or forget past states based on the current input.
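As a concrete illustration, the following minimal NumPy sketch implements this selective scan for a single channel with a diagonal state matrix (as in Mamba); the parameter maps `to_B`, `to_C`, and `to_delta` are hypothetical stand-ins for Mamba's learned linear projections:

```python
import numpy as np

def selective_ssm_scan(x, A, to_B, to_C, to_delta):
    """Selective SSM scan for one channel with a diagonal state matrix.

    x: (L,) input sequence; A: (N,) diagonal of the state matrix
    (negative reals); to_B, to_C, to_delta: callables producing the
    input-conditioned per-step parameters B_t, C_t, Delta_t.
    """
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        delta = to_delta(x_t)              # input-dependent step size
        B, C = to_B(x_t), to_C(x_t)        # input-dependent (N,) vectors
        A_bar = np.exp(delta * A)          # ZOH discretization (diagonal case)
        B_bar = (A_bar - 1.0) / A * B      # (Delta A)^{-1}(exp(Delta A) - I) Delta B
        h = A_bar * h + B_bar * x_t        # selective recurrence
        ys.append(C @ h)                   # linear readout
    return np.array(ys)

# Toy usage with hypothetical parameter maps:
rng = np.random.default_rng(0)
A = -np.linspace(1.0, 4.0, 8)
Wb, Wc = rng.normal(size=8), rng.normal(size=8)
y = selective_ssm_scan(
    rng.normal(size=32), A,
    to_B=lambda u: Wb * u,
    to_C=lambda u: Wc,
    to_delta=lambda u: 1.0 / (1.0 + np.exp(-u)),  # positive step size
)
```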
However, the fundamental SSM recurrence is one-dimensional and unidirectional: modeling dependencies along only a single sequence or layer axis, and inherently limited in explicitly capturing two-dimensional spatial locality (for images) or local multi-hop topology (for graphs). The LSEMba concept addresses this limitation by integrating local context aggregation (via convolutions, windowing, or neighborhood-specific recurrences) with the SSM backbone.
2. LSEMba Mechanisms Across Data Modalities
2.1. Graph Structured Data (GNNs)
In (He et al., 10 Nov 2025), LSEMba is introduced as a dual-stage module for deep graph neural networks (a code sketch of the first stage follows the list):
- Multi-hop neighborhood aggregation via standard GCN propagation, generating a sequence of node representations $X^{(0)}, X^{(1)}, \dots, X^{(L)}$, one per hop depth.
- Node-specific selective state evolution: for each node $v_i$, assemble the depthwise sequence $S_i = \bigl(x_i^{(0)}, x_i^{(1)}, \dots, x_i^{(L)}\bigr)$ of its per-hop representations and apply a Mamba-style SSM scan, where the parameters of the local SSM, namely $(Q_i, R_i, \Delta_i)$, are produced per node through small neural networks on $S_i$. The resulting recurrence

$$h_i^{(t)} = \bar{P}_i^{(t)}\,h_i^{(t-1)} + \bar{Q}_i^{(t)}\,x_i^{(t)}, \qquad y_i^{(t)} = R_i^{(t)}\,h_i^{(t)},$$

selectively propagates, forgets, and integrates features across graph layers, effectively mitigating over-smoothing by preserving node-specific representation dynamics.
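For the first stage, a minimal sketch of the hop-sequence construction, assuming a precomputed symmetrically normalized adjacency `A_hat` (per-hop weight matrices and nonlinearities are omitted for brevity):

```python
import numpy as np

def hop_sequence(A_hat, X0, L):
    """Stack L rounds of GCN-style propagation.

    A_hat: (N, N) symmetrically normalized adjacency with self-loops;
    X0: (N, d) input node features. Returns [X^(0), ..., X^(L)], whose
    i-th rows form node i's depthwise sequence S_i.
    """
    X_layers = [X0]
    for _ in range(L):
        X_layers.append(A_hat @ X_layers[-1])  # one hop of aggregation
    return X_layers
```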
2.2. Video and Spatiotemporal Data
In the micro-gesture recognition context (Li et al., 12 Oct 2025), LSEMba (instantiated as MSF-Mamba) extends the SSM backbone with a local, motion-aware spatiotemporal fusion module (see the sketch after this list):
- Local 3D State Aggregation: After bidirectional SSM encoding, the 1D token sequence is reshaped into a video-shaped tensor $F \in \mathbb{R}^{C \times T \times H \times W}$.
- Motion Extraction: Central Frame Difference (CFD) computes a local motion signal as a temporal central difference, $M_t = F_{t+1} - F_{t-1}$.
- Local Fusion: For each window scale of radius $r$, a 3D aggregation kernel $K_r$ is applied to both $F$ and $M$, yielding fused outputs $Z_r = g \odot K_r(F) + (1 - g) \odot K_r(M)$, with $g$ a learned gate.
- Multiscale Adaptivity (MSF-Mamba): Outputs from different receptive field widths are stacked and weighted via an adaptive scale weighting module (ASWM), producing a final feature map that emphasizes the most relevant local context per spatial–temporal location.
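A hedged PyTorch sketch of this fusion pattern follows; the average-pooling aggregation kernel, per-channel gate, and softmax scale weighting are illustrative assumptions rather than the exact MSF-Mamba design:

```python
import torch
import torch.nn.functional as tnf

def local_motion_fusion(F, radius, gate):
    """One scale of CFD-based gated local fusion.

    F: (B, C, T, H, W) video-shaped features; gate: (C,) learned logits.
    Average pooling stands in for the 3D aggregation kernel K_r.
    """
    # Central frame difference along time, zero-padded at the boundaries.
    M = torch.zeros_like(F)
    M[:, :, 1:-1] = F[:, :, 2:] - F[:, :, :-2]
    k = 2 * radius + 1
    Fr = tnf.avg_pool3d(F, kernel_size=k, stride=1, padding=radius)
    Mr = tnf.avg_pool3d(M, kernel_size=k, stride=1, padding=radius)
    g = torch.sigmoid(gate).view(1, -1, 1, 1, 1)
    return g * Fr + (1 - g) * Mr           # gated appearance/motion mix

def multiscale_fusion(F, radii, gates, scale_logits):
    """ASWM-like adaptive weighting over several window scales."""
    Z = torch.stack([local_motion_fusion(F, r, g) for r, g in zip(radii, gates)])
    w = torch.softmax(scale_logits, dim=0).view(-1, 1, 1, 1, 1, 1)
    return (w * Z).sum(dim=0)
```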
2.3. Vision: Image Fusion and Dense Prediction
LE-Mamba (Cao et al., 14 Apr 2024) implements LSEMba for vision by interleaving local and global SSM scans. Each "Local-Enhanced Vision Mamba (LEVM)" block partitions the feature map into windows, applies an SSM block (SS2D Mamba) to each window, then merges the windows and globally scans the entire map. This integrates fine spatial detail without sacrificing global sequence modeling.
An additional “state sharing” mechanism propagates both local and global SSM states between adjacent and skip-connected layers, and includes a spectral–spatial learning (S2L) step that explicitly ties hidden SSM states to the feature tensor, enhancing representational richness and preventing degradation over depth.
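The window-partition/merge logic of an LEVM block can be sketched as follows; `scan_fn` is a hypothetical stand-in for the SS2D Mamba scan, and the state-sharing and S2L mechanisms are omitted:

```python
import torch

def levm_block(X, window, scan_fn):
    """Local-then-global scan of an LEVM block.

    X: (B, C, H, W) with H and W divisible by `window`; `scan_fn` is a
    stand-in for the SS2D Mamba scan (any (N, C, h, w) -> same-shape map).
    """
    B, C, H, W = X.shape
    w = window
    # Partition into non-overlapping w x w windows: (B*nH*nW, C, w, w).
    Xw = (X.reshape(B, C, H // w, w, W // w, w)
           .permute(0, 2, 4, 1, 3, 5)
           .reshape(-1, C, w, w))
    Xw = scan_fn(Xw)                       # local scan inside each window
    # Merge the windows back into the full-resolution map.
    Xm = (Xw.reshape(B, H // w, W // w, C, w, w)
            .permute(0, 3, 1, 4, 2, 5)
            .reshape(B, C, H, W))
    return scan_fn(Xm)                     # global scan over the whole map

# e.g. levm_block(torch.randn(2, 8, 16, 16), window=4, scan_fn=lambda t: t)
```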
3. Implementation Pattern and Pseudocode
The canonical LSEMba implementation for a sequence or graph-structured pipeline comprises:
- Context sequence extraction: For graphs, via standard GNN propagation; for video/images, via patch embedding and SSM scan.
- Sequence-wise selective state evolution: Per context unit (node or pixel/patch), apply a discretized, parameter-conditioned SSM scan (vanilla Mamba recurrence).
- Local context fusion: Aggregate neighboring context via 3D convolution, windowed state scan, or explicit local SSMs.
- Gating and adaptivity: Apply learned gates or attention across local contexts or scales.
- Final projection and prediction: Head network for downstream tasks.
A high-level algorithmic sketch for the GNN application (He et al., 10 Nov 2025):
```python
# Per-node selective state evolution over the hop-depth sequence.
# X_layers[t]: (N, d) node features after t hops of GCN propagation;
# P_i: the node's base state matrix; MLP_1..MLP_3 emit the per-depth
# SSM parameters (Q_i, R_i, Delta_i).
for i in range(N):
    S_i = [X[i, :] for X in X_layers]             # (L+1, d) depthwise sequence
    Q_i, R_i, Delta_i = MLP_1(S_i), MLP_2(S_i), MLP_3(S_i)
    P_bar = exp(Delta_i * P_i)                    # discretized transition
    Q_bar = inv(Delta_i * P_i) @ (exp(Delta_i * P_i) - I) @ (Delta_i * Q_i)
    h = 0
    for t in range(1, L + 1):
        h = P_bar[t] @ h + Q_bar[t] @ S_i[t]      # selective scan across depth
        y_i = R_i[t] @ h                          # per-depth readout
    Y[i, :] = y_i                                 # keep the final-depth output
```
4. Architectural Variants and Design Trade-offs
LSEMba architectures adapt to the modality and task through the design of their local modules and scale hierarchies:
| Variant | Local Mechanism | Multiscale/ASWM | Domain | Parameters |
|---|---|---|---|---|
| MSF-Mamba | 3D conv + motion (CFD) | Optional | Video | 8.9–91.6M |
| MSF-Mamba | 3D conv + ASWM | Yes | Video | 24.9–235.3M |
| LEVM/LE-Mamba | Windowed SSM + global | Yes (U-Net) | Image fusion | sub-1M |
| GNN LSEMba | Depthwise SSM | N/A | Node representation | N/A |
Patch sizes, SSM depth, and hidden dimensions are chosen following empirical trade-offs between computational cost, local context size, and desired parameter count. For example, MSF-Mamba uses a fixed patch size, up to 32 SSM layers, and combines multiple cube scales for local fusion.
5. Theoretical Properties and Expressivity
- The selective SSM recurrence underlying LSEMba is mathematically equivalent to a depthwise convolution with content- and depth-conditioned kernels (made explicit below), with provable stability and rank-preserving guarantees.
- Input-dependent state evolution prevents uniform smoothing and promotes discriminability across deep stacks, specifically countering the over-smoothing phenomenon in deep GNNs (He et al., 10 Nov 2025).
- For sequence and vision tasks, gating and dynamic fusion provide adaptivity beyond static kernels, yielding expressivity on par with attention in $O(L)$ time, not $O(L^2)$.
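Unrolling the discrete recurrence from Section 1 (with input-conditioned $\bar{A}_t$, $\bar{B}_t$, $C_t$) makes the convolutional reading in the first bullet explicit:

$$y_t = C_t \sum_{s=1}^{t} \Bigl(\prod_{r=s+1}^{t} \bar{A}_r\Bigr) \bar{B}_s\, x_s = \sum_{s=1}^{t} k_{t,s}\, x_s, \qquad k_{t,s} = C_t \Bigl(\prod_{r=s+1}^{t} \bar{A}_r\Bigr) \bar{B}_s,$$

i.e., a causal depthwise convolution whose kernel $k_{t,s}$ is conditioned on content and depth, evaluated in $O(L)$ by the scan rather than in $O(L^2)$ by materializing the kernel.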
6. Empirical Performance and Benchmarking
6.1. Micro-Gesture Recognition (Li et al., 12 Oct 2025)
On the SMG and iMiGUE datasets (Top-1 accuracy, %):
| Model | SMG Top-1 | iMiGUE Top-1 |
|---|---|---|
| VideoMamba (SSM) | 53.3 | 58.1 |
| LSEMba-Tiny | 54.7 | 60.3 |
| LSEMba-Small | 56.2 | 61.2 |
MSF-Mamba variants outperform CNNs and Transformers while retaining efficiency; the lightweight LSEMba-Tiny model achieves higher accuracy with a per-clip runtime difference of under 0.01 s.
6.2. Dense Prediction (Image Fusion) (Cao et al., 14 Apr 2024)
On WV3 (pansharpening), LE-Mamba attains SAM=2.76, ERGAS=2.02, surpassing MTF-GLP-FS and PanMamba. On CAVE, LE-Mamba achieves PSNR≈49.86, SAM≈2.31, ERGAS≈0.70, SSIM≈0.997—consistent state-of-the-art across fusion tasks with minimal parameter overhead.
6.3. Node Representation Learning (GNNs) (He et al., 10 Nov 2025)
Ablation studies show that LSEMba maintains node classification accuracy above 94% on Cora, Pubmed, and Photo even for stack depths of 32 layers, unlike GCNs whose performance deteriorates sharply (to 24% on Photo at 32 layers).
7. Limitations, Applications, and Prospective Research
LSEMba modules are strictly linear in time and memory but incur higher FLOPs compared to pure convolution due to per-token/patch parameterizations. Local modules are task-adaptive (e.g., 3D conv for video, windowed SSM for images, depthwise SSM for GNNs) and amenable to further scaling or integration with global attention; more aggressive local mixing (e.g., window shifting) remains an open direction. Empirically, LSEMba is well-suited for any structured prediction task requiring simultaneous long-range propagation and fine local detail: micro-gesture recognition, hyperspectral fusion, and deep node encoding in graphs. Widespread adoption has been spurred by the balance of cost, selectivity, and state-of-the-art performance as documented across published results.
A plausible implication is that LSEMba’s principles—content-aware selectivity plus adaptive local context fusion—will find further generalization and cross-modal integration in future efficient AI backbone architectures.