Local State-Evolution Mamba (LSEMba)
- LSEMba is a neural modeling architecture that augments Mamba SSMs with local context modules, enhancing spatial, temporal, and neighborhood dependency capture.
- It integrates state-space recurrences with mechanisms such as convolutional windowing and motion extraction, retaining linear complexity while improving task fidelity.
- Empirical benchmarks in graph, video, and image domains demonstrate the efficacy of LSEMba's selective gating and adaptive multiscale fusion.
Local State-Evolution Mamba (LSEMba) refers to a general architectural pattern for neural sequence and structured data modeling, in which the selective state-space modeling formalism popularized by Mamba is systematically augmented to capture essential local (spatial, temporal, or neighborhood) dependencies that are otherwise neglected by purely global or one-dimensional SSM recurrences. First introduced in the context of deep graph learning (He et al., 10 Nov 2025), LSEMba and related local/“enhanced” Mamba blocks have since appeared in diverse applications such as micro-gesture recognition (Li et al., 12 Oct 2025) and multispectral image fusion (Cao et al., 14 Apr 2024). In all instantiations, LSEMba combines the computational efficiency and expressive selectivity of the vanilla Mamba SSM with mechanisms for local context modeling, yielding models with strictly linear complexity in sequence (or graph) size and improved empirical fidelity on tasks requiring both long- and short-range context integration.
1. Mathematical Foundations: From Mamba SSM to Local State Evolution
The core of LSEMba is rooted in the continuous-time state-space formalism introduced in Mamba (Gu et al., 2023), where for observable $x(t) \in \mathbb{R}$, hidden state $h(t) \in \mathbb{R}^{N}$, and output $y(t) \in \mathbb{R}$,

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t).$$

Discretization with step size $\Delta$ (via zero-order hold) yields

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B,$$

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$
Input conditioning is achieved by parameterizing $B$, $C$, and $\Delta$ as functions of the input $x_t$, yielding a selective, content-aware state evolution at each timestep. This mechanism equips the model with dynamic gating capabilities, whereby the recurrence can adaptively remember or forget past states based on the current input.
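As a concrete illustration, the following minimal NumPy sketch implements this selective scan for a single channel with a diagonal state matrix (as in Mamba); the parameter maps `to_B`, `to_C`, and `to_delta` are hypothetical stand-ins for Mamba's learned linear projections:

```python
import numpy as np

def selective_ssm_scan(x, A, to_B, to_C, to_delta):
    """Selective SSM scan for one channel with a diagonal state matrix.

    x: (L,) input sequence; A: (N,) diagonal of the state matrix
    (negative reals); to_B, to_C, to_delta: callables producing the
    input-conditioned per-step parameters B_t, C_t, Delta_t.
    """
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        delta = to_delta(x_t)              # input-dependent step size
        B, C = to_B(x_t), to_C(x_t)        # input-dependent (N,) vectors
        A_bar = np.exp(delta * A)          # ZOH discretization (diagonal case)
        B_bar = (A_bar - 1.0) / A * B      # (Delta A)^{-1}(exp(Delta A) - I) Delta B
        h = A_bar * h + B_bar * x_t        # selective recurrence
        ys.append(C @ h)                   # linear readout
    return np.array(ys)

# Toy usage with hypothetical parameter maps:
rng = np.random.default_rng(0)
A = -np.linspace(1.0, 4.0, 8)
Wb, Wc = rng.normal(size=8), rng.normal(size=8)
y = selective_ssm_scan(
    rng.normal(size=32), A,
    to_B=lambda u: Wb * u,
    to_C=lambda u: Wc,
    to_delta=lambda u: 1.0 / (1.0 + np.exp(-u)),  # positive step size
)
```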
However, the fundamental SSM recurrence is one-dimensional and unidirectional: modeling dependencies along only a single sequence or layer axis, and inherently limited in explicitly capturing two-dimensional spatial locality (for images) or local multi-hop topology (for graphs). The LSEMba concept addresses this limitation by integrating local context aggregation (via convolutions, windowing, or neighborhood-specific recurrences) with the SSM backbone.
2. LSEMba Mechanisms Across Data Modalities
2.1. Graph Structured Data (GNNs)
In (He et al., 10 Nov 2025), LSEMba is introduced as a dual-stage module for deep graph neural networks (a code sketch of the first stage follows the list):
- Multi-hop neighborhood aggregation via standard GCN propagation, generating a sequence of node representations $X^{(0)}, X^{(1)}, \dots, X^{(L)}$, one per hop depth.
- Node-specific selective state evolution: for each node $v_i$, assemble the depthwise sequence $S_i = \bigl(x_i^{(0)}, x_i^{(1)}, \dots, x_i^{(L)}\bigr)$ of its per-hop representations and apply a Mamba-style SSM scan, where the parameters of the local SSM, namely $(Q_i, R_i, \Delta_i)$, are produced per node through small neural networks on $S_i$. The resulting recurrence

$$h_i^{(t)} = \bar{P}_i^{(t)}\,h_i^{(t-1)} + \bar{Q}_i^{(t)}\,x_i^{(t)}, \qquad y_i^{(t)} = R_i^{(t)}\,h_i^{(t)},$$

selectively propagates, forgets, and integrates features across graph layers, effectively mitigating over-smoothing by preserving node-specific representation dynamics.
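For the first stage, a minimal sketch of the hop-sequence construction, assuming a precomputed symmetrically normalized adjacency `A_hat` (per-hop weight matrices and nonlinearities are omitted for brevity):

```python
import numpy as np

def hop_sequence(A_hat, X0, L):
    """Stack L rounds of GCN-style propagation.

    A_hat: (N, N) symmetrically normalized adjacency with self-loops;
    X0: (N, d) input node features. Returns [X^(0), ..., X^(L)], whose
    i-th rows form node i's depthwise sequence S_i.
    """
    X_layers = [X0]
    for _ in range(L):
        X_layers.append(A_hat @ X_layers[-1])  # one hop of aggregation
    return X_layers
```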
2.2. Video and Spatiotemporal Data
In the micro-gesture recognition context (Li et al., 12 Oct 2025), LSEMba (instantiated as MSF-Mamba) extends the SSM backbone with a local, motion-aware spatiotemporal fusion module (see the sketch after this list):
- Local 3D State Aggregation: After bidirectional SSM encoding, the 1D token sequence is reshaped into a video-shaped tensor $F \in \mathbb{R}^{C \times T \times H \times W}$.
- Motion Extraction: Central Frame Difference (CFD) computes a local motion signal as a temporal central difference, $M_t = F_{t+1} - F_{t-1}$.
- Local Fusion: For each window scale of radius $r$, a 3D aggregation kernel $K_r$ is applied to both $F$ and $M$, yielding fused outputs $Z_r = g \odot K_r(F) + (1 - g) \odot K_r(M)$, with $g$ a learned gate.
- Multiscale Adaptivity (MSF-Mamba): Outputs from different receptive field widths are stacked and weighted via an adaptive scale weighting module (ASWM), producing a final feature map that emphasizes the most relevant local context per spatial–temporal location.
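A hedged PyTorch sketch of this fusion pattern follows; the average-pooling aggregation kernel, per-channel gate, and softmax scale weighting are illustrative assumptions rather than the exact MSF-Mamba design:

```python
import torch
import torch.nn.functional as tnf

def local_motion_fusion(F, radius, gate):
    """One scale of CFD-based gated local fusion.

    F: (B, C, T, H, W) video-shaped features; gate: (C,) learned logits.
    Average pooling stands in for the 3D aggregation kernel K_r.
    """
    # Central frame difference along time, zero-padded at the boundaries.
    M = torch.zeros_like(F)
    M[:, :, 1:-1] = F[:, :, 2:] - F[:, :, :-2]
    k = 2 * radius + 1
    Fr = tnf.avg_pool3d(F, kernel_size=k, stride=1, padding=radius)
    Mr = tnf.avg_pool3d(M, kernel_size=k, stride=1, padding=radius)
    g = torch.sigmoid(gate).view(1, -1, 1, 1, 1)
    return g * Fr + (1 - g) * Mr           # gated appearance/motion mix

def multiscale_fusion(F, radii, gates, scale_logits):
    """ASWM-like adaptive weighting over several window scales."""
    Z = torch.stack([local_motion_fusion(F, r, g) for r, g in zip(radii, gates)])
    w = torch.softmax(scale_logits, dim=0).view(-1, 1, 1, 1, 1, 1)
    return (w * Z).sum(dim=0)
```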
2.3. Vision: Image Fusion and Dense Prediction
LE-Mamba (Cao et al., 14 Apr 2024) implements LSEMba for vision by interleaving local and global SSM scans. Each "Local-Enhanced Vision Mamba (LEVM)" block partitions the feature map into windows, applies an SSM block (SS2D Mamba) to each window, then merges the windows and globally scans the entire map. This integrates fine spatial detail without sacrificing global sequence modeling.
An additional “state sharing” mechanism propagates both local and global SSM states between adjacent and skip-connected layers, and includes a spectral–spatial learning (S2L) step that explicitly ties hidden SSM states to the feature tensor, enhancing representational richness and preventing degradation over depth.
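The window-partition/merge logic of an LEVM block can be sketched as follows; `scan_fn` is a hypothetical stand-in for the SS2D Mamba scan, and the state-sharing and S2L mechanisms are omitted:

```python
import torch

def levm_block(X, window, scan_fn):
    """Local-then-global scan of an LEVM block.

    X: (B, C, H, W) with H and W divisible by `window`; `scan_fn` is a
    stand-in for the SS2D Mamba scan (any (N, C, h, w) -> same-shape map).
    """
    B, C, H, W = X.shape
    w = window
    # Partition into non-overlapping w x w windows: (B*nH*nW, C, w, w).
    Xw = (X.reshape(B, C, H // w, w, W // w, w)
           .permute(0, 2, 4, 1, 3, 5)
           .reshape(-1, C, w, w))
    Xw = scan_fn(Xw)                       # local scan inside each window
    # Merge the windows back into the full-resolution map.
    Xm = (Xw.reshape(B, H // w, W // w, C, w, w)
            .permute(0, 3, 1, 4, 2, 5)
            .reshape(B, C, H, W))
    return scan_fn(Xm)                     # global scan over the whole map

# e.g. levm_block(torch.randn(2, 8, 16, 16), window=4, scan_fn=lambda t: t)
```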
3. Implementation Pattern and Pseudocode
The canonical LSEMba implementation for a sequence or graph-structured pipeline comprises:
- Context sequence extraction: For graphs, via standard GNN propagation; for video/images, via patch embedding and SSM scan.
- Sequence-wise selective state evolution: Per context unit (node or pixel/patch), apply a discretized, parameter-conditioned SSM scan (vanilla Mamba recurrence).
- Local context fusion: Aggregate neighboring context via 3D convolution, windowed state scan, or explicit local SSMs.
- Gating and adaptivity: Apply learned gates or attention across local contexts or scales.
- Final projection and prediction: Head network for downstream tasks.
A high-level algorithmic sketch for the GNN application (He et al., 10 Nov 2025):
```python
# Per-node selective state evolution over the hop-depth sequence.
# X_layers[t]: (N, d) node features after t hops of GCN propagation;
# P_i: the node's base state matrix; MLP_1..MLP_3 emit the per-depth
# SSM parameters (Q_i, R_i, Delta_i).
for i in range(N):
    S_i = [X[i, :] for X in X_layers]             # (L+1, d) depthwise sequence
    Q_i, R_i, Delta_i = MLP_1(S_i), MLP_2(S_i), MLP_3(S_i)
    P_bar = exp(Delta_i * P_i)                    # discretized transition
    Q_bar = inv(Delta_i * P_i) @ (exp(Delta_i * P_i) - I) @ (Delta_i * Q_i)
    h = 0
    for t in range(1, L + 1):
        h = P_bar[t] @ h + Q_bar[t] @ S_i[t]      # selective scan across depth
        y_i = R_i[t] @ h                          # per-depth readout
    Y[i, :] = y_i                                 # keep the final-depth output
```
4. Architectural Variants and Design Trade-offs
LSEMba architectures adapt to the modality and task through the design of their local modules and scale hierarchies:
| Variant | Local Mechanism | Multiscale/ASWM | Domain | Parameters |
|---|---|---|---|---|
| MSF-Mamba | 3D conv + motion (CFD) | Optional | Video | 8.9–91.6M |
| MSF-Mamba | 3D conv + ASWM | Yes | Video | 24.9–235.3M |
| LEVM/LE-Mamba | Windowed SSM + global | Yes (U-Net) | Image fusion | sub-1M |
| GNN LSEMba | Depthwise SSM | N/A | Node representation | N/A |
Patch sizes, SSM depth, and hidden dimensions are chosen following empirical trade-offs between computational cost, local context size, and desired parameter count. For example, MSF-Mamba uses a fixed patch size, up to 32 SSM layers, and combines multiple cube scales for local fusion.
5. Theoretical Properties and Expressivity
- The selective SSM recurrence underlying LSEMba is mathematically equivalent to a depthwise convolution with content- and depth-conditioned kernels (made explicit below), with provable stability and rank-preserving guarantees.
- Input-dependent state evolution prevents uniform smoothing and promotes discriminability across deep stacks, specifically countering the over-smoothing phenomenon in deep GNNs (He et al., 10 Nov 2025).
- For sequence and vision tasks, gating and dynamic fusion provide adaptivity beyond static kernels, yielding expressivity on par with attention in $O(L)$ time, not $O(L^2)$.
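Unrolling the discrete recurrence from Section 1 (with input-conditioned $\bar{A}_t$, $\bar{B}_t$, $C_t$) makes the convolutional reading in the first bullet explicit:

$$y_t = C_t \sum_{s=1}^{t} \Bigl(\prod_{r=s+1}^{t} \bar{A}_r\Bigr) \bar{B}_s\, x_s = \sum_{s=1}^{t} k_{t,s}\, x_s, \qquad k_{t,s} = C_t \Bigl(\prod_{r=s+1}^{t} \bar{A}_r\Bigr) \bar{B}_s,$$

i.e., a causal depthwise convolution whose kernel $k_{t,s}$ is conditioned on content and depth, evaluated in $O(L)$ by the scan rather than in $O(L^2)$ by materializing the kernel.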
6. Empirical Performance and Benchmarking
6.1. Micro-Gesture Recognition (Li et al., 12 Oct 2025)
On the SMG and iMiGUE datasets (Top-1 accuracy, %):
| Model | SMG Top-1 | iMiGUE Top-1 |
|---|---|---|
| VideoMamba (SSM) | 53.3 | 58.1 |
| LSEMba-Tiny | 54.7 | 60.3 |
| LSEMba-Small | 56.2 | 61.2 |
MSF-Mamba variants outperform CNNs and Transformers while retaining efficiency; the lightweight LSEMba-Tiny model achieves higher accuracy with a per-clip runtime difference of under 0.01 s.
6.2. Dense Prediction (Image Fusion) (Cao et al., 14 Apr 2024)
On WV3 (pansharpening), LE-Mamba attains SAM=2.76, ERGAS=2.02, surpassing MTF-GLP-FS and PanMamba. On CAVE, LE-Mamba achieves PSNR≈49.86, SAM≈2.31, ERGAS≈0.70, SSIM≈0.997—consistent state-of-the-art across fusion tasks with minimal parameter overhead.
6.3. Node Representation Learning (GNNs) (He et al., 10 Nov 2025)
Ablation studies show that LSEMba maintains node classification accuracy above 94% on Cora, Pubmed, and Photo even for stack depths of 32 layers, unlike GCNs whose performance deteriorates sharply (to 24% on Photo at 32 layers).
7. Limitations, Applications, and Prospective Research
LSEMba modules are strictly linear in time and memory but incur higher FLOPs compared to pure convolution due to per-token/patch parameterizations. Local modules are task-adaptive (e.g., 3D conv for video, windowed SSM for images, depthwise SSM for GNNs) and amenable to further scaling or integration with global attention; more aggressive local mixing (e.g., window shifting) remains an open direction. Empirically, LSEMba is well-suited for any structured prediction task requiring simultaneous long-range propagation and fine local detail: micro-gesture recognition, hyperspectral fusion, and deep node encoding in graphs. Widespread adoption has been spurred by the balance of cost, selectivity, and state-of-the-art performance as documented across published results.
A plausible implication is that LSEMba’s principles—content-aware selectivity plus adaptive local context fusion—will find further generalization and cross-modal integration in future efficient AI backbone architectures.