
Local State-Evolution Mamba (LSEMba)

Updated 17 November 2025
  • LSEMba is a neural modeling architecture that augments Mamba SSMs with local context modules, enhancing spatial, temporal, and neighborhood dependency capture.
  • It integrates state-space recurrences with mechanisms like convolutional windowing and motion extraction, ensuring linear complexity and improved task fidelity.
  • Empirical benchmarks in graph, video, and image domains demonstrate LSEMba’s efficacy through selective gating and adaptive multiscale fusion.

Local State-Evolution Mamba (LSEMba) refers to a general architectural pattern for neural sequence and structured data modeling, in which the selective state-space modeling formalism popularized by Mamba is systematically augmented to capture essential local (spatial, temporal, or neighborhood) dependencies that are otherwise neglected by purely global or one-dimensional SSM recurrences. First introduced in the context of deep graph learning (He et al., 10 Nov 2025), LSEMba and related local/“enhanced” Mamba blocks have since appeared in diverse applications such as micro-gesture recognition (Li et al., 12 Oct 2025) and multispectral image fusion (Cao et al., 14 Apr 2024). In all instantiations, LSEMba combines the computational efficiency and expressive selectivity of the vanilla Mamba SSM with mechanisms for local context modeling, yielding models with strictly linear complexity in sequence (or graph) size and improved empirical fidelity on tasks requiring both long- and short-range context integration.

1. Mathematical Foundations: From Mamba SSM to Local State Evolution

The core of LSEMba is rooted in the continuous-time state-space formalism introduced in Mamba (Gu et al., 2023), where for observable x(t), hidden state h(t), and output y(t),

\frac{d}{dt}\mathbf{h}(t) = \mathbf{A}\,\mathbf{h}(t) + \mathbf{B}\,\mathbf{x}(t), \qquad \mathbf{y}(t) = \mathbf{C}\,\mathbf{h}(t).

Discretization with step size Δ (zero-order hold) yields

\mathbf{h}_t = \exp(\Delta \mathbf{A})\,\mathbf{h}_{t-1} + (\Delta \mathbf{A})^{-1}\bigl(\exp(\Delta \mathbf{A}) - \mathbf{I}\bigr)\,\Delta \mathbf{B}\,\mathbf{x}_t, \qquad \mathbf{y}_t = \mathbf{C}\,\mathbf{h}_t.

Input conditioning is achieved by parameterizing B_t, C_t, and Δ_t as functions of x_t, yielding a selective, content-aware state evolution at each timestep. This mechanism equips the model with dynamic gating: the recurrence can adaptively remember or forget past states based on the current input.
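
For concreteness, the following is a minimal NumPy sketch of this selective scan. It assumes a diagonal A (as in practical Mamba implementations), a single step size Δ_t shared across channels, and hypothetical parameter names (W_B, W_C, w_delta); it illustrates the recurrence, not the reference implementation:

import numpy as np

def selective_ssm_scan(x, A_diag, W_B, W_C, w_delta):
    """Minimal selective SSM scan: Delta_t, B_t, C_t are functions of x_t.

    Sketch only: diagonal A, shared scalar Delta_t, no parallel scan.
    x:       (T, d) input sequence
    A_diag:  (n,)   diagonal of A (negative entries for a stable recurrence)
    W_B:     (d, n) projection producing B_t from x_t
    W_C:     (d, n) projection producing C_t from x_t
    w_delta: (d,)   weights producing the step size Delta_t
    """
    T, d = x.shape
    h = np.zeros((d, A_diag.shape[0]))             # one state vector per channel
    y = np.zeros_like(x)
    for t in range(T):
        delta = np.logaddexp(0.0, x[t] @ w_delta)  # Delta_t > 0 via softplus
        A_bar = np.exp(delta * A_diag)             # exp(Delta A) for diagonal A
        B_t, C_t = x[t] @ W_B, x[t] @ W_C          # content-dependent B_t, C_t
        B_bar = (A_bar - 1.0) / A_diag * B_t       # ZOH-discretized input matrix
        h = A_bar * h + np.outer(x[t], B_bar)      # selective state update
        y[t] = h @ C_t                             # readout y_t = C_t h_t
    return y

The per-step dependence of Δ_t, B_t, and C_t on x_t is what distinguishes this scan from a time-invariant SSM, whose kernel could otherwise be precomputed as a fixed convolution.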

However, the fundamental SSM recurrence is one-dimensional and unidirectional: it models dependencies along only a single sequence or layer axis, and is inherently limited in explicitly capturing two-dimensional spatial locality (for images) or local multi-hop topology (for graphs). The LSEMba concept addresses this limitation by integrating local context aggregation (via convolutions, windowing, or neighborhood-specific recurrences) with the SSM backbone.

2. LSEMba Mechanisms Across Data Modalities

2.1. Graph Structured Data (GNNs)

In (He et al., 10 Nov 2025), LSEMba is introduced as a dual-stage module for deep graph neural networks:

  • Multi-hop neighborhood aggregation via standard GCN propagation,

\mathbf{X}^{(l)} = \tilde{\mathbf{D}}^{-1/2}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-1/2}\,\mathbf{X}^{(l-1)},

generating a sequence of node representations for each hop depth.

  • Node-specific selective state evolution: for each node i, assemble the depthwise sequence S_i = [x_i^{(0)}, …, x_i^{(L)}] and apply a Mamba-style SSM scan, where the parameters of the local SSM, namely (P^F, Q^F, R^F, Δ^F), are produced per node by small neural networks on S_i. The resulting recurrence

\mathbf{h}_t^F = \bar{\mathbf{P}}_t^F\,\mathbf{h}_{t-1}^F + \bar{\mathbf{Q}}_t^F\,\mathbf{x}_i^{(t)}, \qquad \mathbf{y}_{i,t}^F = \mathbf{R}_t^F\,\mathbf{h}_t^F

selectively propagates, forgets, and integrates features across graph layers, effectively mitigating over-smoothing by preserving node-specific representation dynamics.
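
The first stage above (multi-hop GCN propagation) reduces to a simple matrix recursion. Below is a compact NumPy sketch, assuming a dense adjacency matrix for clarity; the function name gcn_hop_sequences is illustrative:

import numpy as np

def gcn_hop_sequences(A, X0, L):
    """Stage 1 of the GNN LSEMba: multi-hop GCN propagation (dense sketch).

    A:  (N, N) adjacency matrix
    X0: (N, d) initial node features
    L:  number of propagation hops
    Returns an (N, L + 1, d) tensor whose i-th row is the depthwise sequence S_i.
    """
    N = A.shape[0]
    A_tilde = A + np.eye(N)                                   # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    P = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    layers = [X0]
    for _ in range(L):
        layers.append(P @ layers[-1])                         # X^{(l)} = P X^{(l-1)}
    return np.stack(layers, axis=1)

Each row of the returned tensor is the depthwise sequence S_i consumed by the per-node selective scan sketched in Section 3.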

2.2. Video and Spatiotemporal Data

In the micro-gesture recognition context (Li et al., 12 Oct 2025), LSEMba (branded as MSF-Mamba) extends the SSM backbone with a local, motion-aware spatiotemporal fusion module:

  • Local 3D State Aggregation: After bidirectional SSM encoding, the 1D token sequence is reshaped into a video-shaped tensor F ∈ ℝ^{d×T×H′×W′}.
  • Motion Extraction: Central Frame Difference (CFD) computes a local motion signal,

D_t = F_t - \tfrac{1}{2}\left(F_{t-1} + F_{t+1}\right), \qquad t = 2, \ldots, T-1.

  • Local Fusion: For each window scale k of radius r_k, a 3D aggregation kernel W_k is applied to both F and D:

\mathcal{S}_k(X)_{t,h,w} = \sum_{(\delta\tau,\,\delta h,\,\delta w)\,\in\,\mathcal{N}_k} W_k(\delta\tau, \delta h, \delta w)\, X_{t+\delta\tau,\,h+\delta h,\,w+\delta w},

yielding fused outputs F^{(k)} = S_k(F) + θ_k S_k(D), with θ_k a learned gate (see the sketch after this list).

  • Multiscale Adaptivity (MSF-Mamba+): Outputs from different receptive-field widths are stacked and weighted via an adaptive scale weighting module (ASWM), producing a final feature map that emphasizes the most relevant local context per spatial–temporal location.
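
The following sketch illustrates the CFD and single-scale gated fusion steps, assuming F is the reshaped (d, T, H′, W′) tensor; the fixed averaging kernel and the constant gate θ_k stand in for their learned counterparts:

import numpy as np
from scipy.ndimage import convolve

def cfd_local_fusion(F, theta_k=0.5):
    """CFD motion extraction plus one-scale gated local fusion (sketch).

    F: (d, T, H, W) video-shaped feature tensor after the SSM scan.
    theta_k: learned gate in the model; a fixed constant in this sketch.
    """
    # Central Frame Difference: D_t = F_t - (F_{t-1} + F_{t+1}) / 2, interior t only.
    D = np.zeros_like(F)
    D[:, 1:-1] = F[:, 1:-1] - 0.5 * (F[:, :-2] + F[:, 2:])
    # A uniform 3x3x3 averaging cube stands in for the learned kernel W_k.
    W_k = np.ones((1, 3, 3, 3)) / 27.0
    S_F = convolve(F, W_k, mode="nearest")   # S_k(F): local 3D aggregation
    S_D = convolve(D, W_k, mode="nearest")   # S_k(D): aggregated motion signal
    return S_F + theta_k * S_D               # F^(k) = S_k(F) + theta_k * S_k(D)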

2.3. Vision: Image Fusion and Dense Prediction

LE-Mamba (Cao et al., 14 Apr 2024) implements LSEMba for vision by interleaving local and global SSM scans. Each "Local-Enhanced Vision Mamba" (LEVM) block partitions the feature map into windows, applies an SSM block (SS2D Mamba) to each window, then merges the windows and scans the entire map globally. This integrates fine spatial detail without sacrificing global sequence modeling.

An additional “state sharing” mechanism propagates both local and global SSM states between adjacent and skip-connected layers, and includes a spectral–spatial learning (S2L) step that explicitly ties hidden SSM states to the feature tensor, enhancing representational richness and preventing degradation over depth.
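
A minimal sketch of the window-partition / merge logic around which an LEVM block alternates local and global scans follows; ssm_scan is a placeholder for any SS2D-style sequence scan, and the row-major flattening is a simplification of the actual 2D scan paths:

import numpy as np

def levm_block(feat, win, ssm_scan):
    """Local-then-global scan pattern of an LEVM block (simplified).

    feat:     (H, W, d) feature map, with H and W divisible by win
    win:      window side length
    ssm_scan: callable mapping an (L, d) sequence to an (L, d) sequence
    """
    H, W, d = feat.shape
    # Partition into non-overlapping windows and scan each locally.
    x = feat.reshape(H // win, win, W // win, win, d)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, d)
    x = np.stack([ssm_scan(w) for w in x])
    # Merge windows back, then run one global scan over the whole map.
    x = x.reshape(H // win, W // win, win, win, d)
    x = x.transpose(0, 2, 1, 3, 4).reshape(H, W, d)
    return ssm_scan(x.reshape(H * W, d)).reshape(H, W, d)

In LE-Mamba this pattern sits inside a U-Net hierarchy, with the state-sharing mechanism described above passing SSM states between such blocks.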

3. Implementation Pattern and Pseudocode

The canonical LSEMba implementation for a sequence or graph-structured pipeline comprises:

  1. Context sequence extraction: For graphs, via standard GNN propagation; for video/images, via patch embedding and SSM scan.
  2. Sequence-wise selective state evolution: Per context unit (node or pixel/patch), apply a discretized, parameter-conditioned SSM scan (vanilla Mamba recurrence).
  3. Local context fusion: Aggregate neighboring context via 3D convolution, windowed state scan, or explicit local SSMs.
  4. Gating and adaptivity: Apply learned gates or attention across local contexts or scales.
  5. Final projection and prediction: Head network for downstream tasks.

A high-level algorithmic sketch for the GNN application (He et al., 10 Nov 2025):

import numpy as np
from scipy.linalg import expm

# Sketch only: X_layers is a list of L+1 arrays of shape (N, d) from the GCN
# stage; P is a learned (m, m) state matrix (shared across nodes here); MLP_1,
# MLP_2, MLP_3 are the small per-node networks producing the SSM parameters.
for i in range(N):
    S_i = np.stack([X[i, :] for X in X_layers])              # (L+1, d) depthwise sequence
    Q_i, R_i, Delta_i = MLP_1(S_i), MLP_2(S_i), MLP_3(S_i)   # per-step SSM parameters
    h = np.zeros(m)
    for t in range(1, L + 1):
        A_t = Delta_i[t] * P                                 # step-scaled state matrix
        P_bar = expm(A_t)                                    # exp(Delta_t P)
        Q_bar = np.linalg.inv(A_t) @ (P_bar - np.eye(m)) @ (Delta_i[t] * Q_i[t])
        h = P_bar @ h + Q_bar @ S_i[t]                       # selective update over depth
        y_i = R_i[t] @ h                                     # readout at hop t
    Y[i, :] = y_i                                            # final node representation

Complexity for all instantiations is O(n), where n is the sequence length or number of graph hops.

4. Architectural Variants and Design Trade-offs

LSEMba architectures adapt to the modality and task through the design of their local modules and scale hierarchies:

| Variant | Local Mechanism | Multiscale/ASWM | Domain | Parameters |
|---|---|---|---|---|
| MSF-Mamba | 3D conv + motion (CFD) | Optional | Video | 8.9–91.6M |
| MSF-Mamba+ | 3D conv + ASWM | Yes | Video | 24.9–235.3M |
| LEVM/LE-Mamba | Windowed SSM + global scan | Yes (U-Net) | Image fusion | sub-1M |
| GNN LSEMba | Depthwise SSM | N/A | Node representation | N/A |

Patch sizes, SSM depth, and hidden dimensions are chosen following empirical trade-offs between computational cost, local context size, and desired parameter count. For example, MSF-Mamba+ uses a 16×16 patch size, up to 32 SSM layers, and combines 3×3×3 to 7×7×7 cubes for local fusion.
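
Read as a configuration, these choices might look as follows (field names are hypothetical, not a released configuration file):

# Illustrative MSF-Mamba+-style hyperparameters; names are hypothetical.
msf_mamba_plus_config = {
    "patch_size": (16, 16),         # spatial patch embedding
    "num_ssm_layers": 32,           # maximum reported SSM depth
    "fusion_kernels": [(3, 3, 3), (5, 5, 5), (7, 7, 7)],  # 3x3x3 up to 7x7x7 cubes (intermediate scale assumed)
    "use_aswm": True,               # adaptive scale weighting module
}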

5. Theoretical Properties and Expressivity

  • The selective SSM recurrence underlying LSEMba is mathematically equivalent to a depthwise convolution with content- and depth-conditioned kernels (see the unrolled form after this list), with provable stability and rank-preserving guarantees.
  • Input-dependent state evolution prevents uniform smoothing and promotes discriminability across deep stacks, specifically countering the over-smoothing phenomenon in deep GNNs (He et al., 10 Nov 2025).
  • For sequence and vision tasks, gating and dynamic fusion provide adaptivity beyond static kernels, yielding expressivity on par with attention in O(n) time rather than O(n²).
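
To make the first point explicit: unrolling the discretized recurrence h_t = Ā_t h_{t-1} + B̄_t x_t from h_0 = 0 gives

\mathbf{y}_t = \mathbf{C}_t\, \mathbf{h}_t = \mathbf{C}_t \sum_{s=1}^{t} \Bigl( \prod_{r=s+1}^{t} \bar{\mathbf{A}}_r \Bigr) \bar{\mathbf{B}}_s\, \mathbf{x}_s,

a causal depthwise convolution whose kernel is generated from the inputs through Ā_r, B̄_s, and C_t rather than stored as fixed weights.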

6. Empirical Performance and Benchmarking

On the SMG and iMiGUE datasets:

| Model | SMG Top-1 | iMiGUE Top-1 |
|---|---|---|
| VideoMamba (SSM) | 53.3 | 58.1 |
| LSEMba-Small | 54.7 | 60.3 |
| LSEMba+-Small | 56.2 | 61.2 |

MSF-Mamba variants outperform CNN and Transformer baselines while retaining O(n) efficiency; the lightweight LSEMba-Tiny model improves per-clip runtime by under 0.01 s while also improving accuracy.

On WV3 (pansharpening), LE-Mamba attains SAM = 2.76 and ERGAS = 2.02, surpassing MTF-GLP-FS and PanMamba. On CAVE, LE-Mamba achieves PSNR ≈ 49.86, SAM ≈ 2.31, ERGAS ≈ 0.70, and SSIM ≈ 0.997, consistent state-of-the-art results across fusion tasks with minimal parameter overhead.

Ablation studies show that LSEMba maintains node classification accuracy above 94% on Cora, Pubmed, and Photo even for stack depths of 32 layers, unlike GCNs whose performance deteriorates sharply (to 24% on Photo at 32 layers).

7. Limitations, Applications, and Prospective Research

LSEMba modules are strictly linear in time and memory but incur higher FLOPs compared to pure convolution due to per-token/patch parameterizations. Local modules are task-adaptive (e.g., 3D conv for video, windowed SSM for images, depthwise SSM for GNNs) and amenable to further scaling or integration with global attention; more aggressive local mixing (e.g., window shifting) remains an open direction. Empirically, LSEMba is well-suited for any structured prediction task requiring simultaneous long-range propagation and fine local detail: micro-gesture recognition, hyperspectral fusion, and deep node encoding in graphs. Widespread adoption has been spurred by the balance of cost, selectivity, and state-of-the-art performance as documented across published results.

A plausible implication is that LSEMba’s principles—content-aware selectivity plus adaptive local context fusion—will find further generalization and cross-modal integration in future efficient AI backbone architectures.
