
MSF-Mamba: Motion-Aware State Fusion

Updated 15 October 2025
  • Motion-Aware State Fusion Mamba (MSF-Mamba) is a neural architecture that models both long-range and local spatiotemporal dependencies for micro-gesture recognition.
  • It integrates local motion-aware state fusion with adaptive multiscale weighting to effectively capture subtle motion cues and enhance classification accuracy.
  • Experimental results on SMG and iMiGUE benchmarks demonstrate improved Top-1 accuracy and efficient, linear-time performance.

Motion-Aware State Fusion Mamba (MSF-Mamba) is a neural architecture designed for efficient and accurate micro-gesture recognition (MGR) via linear-time modeling of both long-range and local spatiotemporal dependencies. Building on the state space model (SSM) framework exemplified by Mamba, MSF-Mamba introduces an explicit motion-aware state fusion mechanism that leverages local contextual state aggregation and dynamic motion representation to overcome the limitations of prior CNN, transformer, and vanilla SSM approaches to subtle, fine-grained motion pattern recognition.

1. Background: Challenges in Micro-Gesture Recognition

Micro-gesture recognition requires modeling subtle, localized motion cues, demanding sensitivity along both spatial and temporal dimensions. Prior MGR architectures deploy CNNs, which are proficient at capturing localized spatiotemporal patterns but limited by fixed receptive fields and an inability to propagate global dependencies. Transformer-based solutions, while capable of modeling long-range context via self-attention, suffer from quadratic complexity that becomes intractable for long video sequences. Vanilla SSM approaches such as the original Mamba provide linear-time scalability and can propagate global structure through state recurrences, but lack local spatiotemporal modeling: each update depends solely on the previous state and ignores neighboring contextual states.

MSF-Mamba is designed explicitly to resolve these constraints by fusing local motion-aware state representations, thus providing a compact, efficient, and accurate solution for MGR.

2. State Space Modeling and Sequential Dynamics

At their foundation, SSMs encode temporal dynamics through a continuous-time ODE for the hidden state vector $h(t)$ and observation $y(t)$:

$$\frac{d}{dt} h(t) = A h(t) + B x(t), \qquad y(t) = C h(t)$$

Here, $A$, $B$, $C$ are learnable matrices: $A$ controls memory decay, $B$ the input projection, and $C$ the output mapping. For neural deployment, the continuous system is discretized over sampled time steps using the zero-order hold (ZOH):

$$A_d = \exp(\Delta A), \qquad B_d = (\Delta A)^{-1} (A_d - I)\, \Delta B$$

yielding the recurrent update:

$$h_{t} = A_d h_{t-1} + B_d x_t, \qquad y_t = C h_t$$

In MSF-Mamba, bidirectional processing—forward and backward passes with averaging—ensures both past and future context in hidden state evolution.
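The discretization and recurrence above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a diagonal $A$ (common in Mamba-style SSMs, so $\exp(\Delta A)$ and $A^{-1}$ are elementwise), a scalar input sequence, and the function names are hypothetical.

```python
import numpy as np

def ssm_scan(x, a, b, c, delta):
    """Discretized SSM recurrence h_t = Ad h_{t-1} + Bd x_t, y_t = c.h_t.
    a, b, c: per-state parameters of a diagonal SSM (a < 0 for stable decay)."""
    Ad = np.exp(delta * a)            # Ad = exp(delta * A), elementwise for diagonal A
    Bd = (Ad - 1.0) / a * b           # exact ZOH: A^{-1}(Ad - I)B for diagonal A
    h = np.zeros_like(a)
    ys = []
    for xt in x:
        h = Ad * h + Bd * xt          # state recurrence
        ys.append(float(c @ h))       # observation y_t = C h_t
    return np.array(ys)

def bidirectional_ssm(x, a, b, c, delta):
    """Average of a forward and a time-reversed pass, mirroring the
    bidirectional processing described above."""
    fwd = ssm_scan(x, a, b, c, delta)
    bwd = ssm_scan(x[::-1], a, b, c, delta)[::-1]
    return 0.5 * (fwd + bwd)
```

Averaging the two scans gives every output position access to both past and future context while keeping each scan linear in sequence length.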

3. Motion-Aware State Fusion: Design and Mathematical Formalization

A core innovation in MSF-Mamba is the multiscale central frame difference fusion module (MCFM), facilitating motion-aware local aggregation.

  • Reshaping and Contextual Structuring: Latent SSM outputs $H \in \mathbb{R}^{n\times d}$ are reshaped into $F \in \mathbb{R}^{d\times T\times H'\times W'}$ to recover spatiotemporal structure.
  • Central Frame Difference (CFD): For motion modeling, the CFD is computed at each time index $t$:

$$D_t = F_t - \frac{1}{2} (F_{t-1} + F_{t+1})$$

This operation reveals dynamic activations indicative of micro-motion between adjacent frames.

  • Local State Fusion: For each fusion scale $k$ (e.g., a $3\times3\times3$ spatiotemporal window), define the operator $\mathcal{S}_k(\cdot)$:

$$\mathcal{S}_k(X)_{t,h,w} = \sum_{(\Delta\tau, \Delta h, \Delta w)\in\mathcal{N}_k} W_k(\Delta\tau, \Delta h, \Delta w)\, X_{t+\Delta\tau,\, h+\Delta h,\, w+\Delta w}$$

where $W_k$ are learned weights and $\mathcal{N}_k$ is the local cube neighborhood.

  • Motion-Aware Output Fusion: For each scale $k$,

$$F^{(k)} = \mathcal{S}_k(F) + \theta_k\, \mathcal{S}_k(D)$$

with $\theta_k$ (initialized at 0.5) a learnable scalar blending static and dynamic cues.
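The three steps above can be sketched together. This is a simplified illustration under stated assumptions: edge padding at boundaries, a single kernel $W_k$ shared across channels (the learned weights would be richer in practice), and hypothetical function names.

```python
import numpy as np

def central_frame_difference(F):
    """D_t = F_t - (F_{t-1} + F_{t+1}) / 2 along the time axis of a
    (C, T, H, W) tensor; boundary frames are edge-replicated."""
    Fp = np.pad(F, ((0, 0), (1, 1), (0, 0), (0, 0)), mode="edge")
    return F - 0.5 * (Fp[:, :-2] + Fp[:, 2:])

def local_state_fusion(X, W):
    """S_k(X): weighted sum over a local (kt, kh, kw) spatiotemporal
    neighborhood, with kernel W shared across channels (a simplification)."""
    kt, kh, kw = W.shape
    Xp = np.pad(X, ((0, 0), (kt // 2, kt // 2), (kh // 2, kh // 2),
                    (kw // 2, kw // 2)), mode="edge")
    C, T, H, Wd = X.shape
    out = np.zeros_like(X)
    for dt in range(kt):
        for dh in range(kh):
            for dw in range(kw):
                out += W[dt, dh, dw] * Xp[:, dt:dt + T, dh:dh + H, dw:dw + Wd]
    return out

def mcfm(F, W, theta=0.5):
    """Motion-aware fusion at one scale: F_k = S_k(F) + theta * S_k(D)."""
    D = central_frame_difference(F)
    return local_state_fusion(F, W) + theta * local_state_fusion(D, W)
```

For a static clip the difference map $D$ vanishes, so the output reduces to the purely spatial aggregation $\mathcal{S}_k(F)$; any motion contributes through the $\theta_k$-scaled dynamic term.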

4. Multiscale Extension and Adaptive Scale Weighting

MSF-Mamba introduces a multiscale architecture, MSF-Mamba$^{+}$, incorporating several MCFM branches covering different fusion windows (e.g., $3\times3\times3$, $5\times5\times5$, $7\times7\times7$). Scale-specific outputs are concatenated and adaptively weighted via a two-layer 3D convolutional network producing attention logits $A$ and softmax weights $\alpha$:

$$\alpha_{k, t, h, w} = \frac{\exp(A_{k, t, h, w})}{\sum_{k'} \exp(A_{k', t, h, w})}$$

The final aggregated feature:

$$F_\text{final}(c, t, h, w) = \sum_k \alpha_{k, t, h, w}\, F_\text{stack}(k, c, t, h, w)$$

Thus, adaptive emphasis is placed on different spatial-temporal extents according to gesture dynamics.
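The softmax weighting and fusion step can be sketched as follows. Assumptions: the logits $A$ are taken as given (the paper derives them from a small Conv3D network, omitted here), and the function name is hypothetical.

```python
import numpy as np

def adaptive_scale_fusion(F_stack, A):
    """Fuse per-scale features with spatially varying softmax weights.
    F_stack: (K, C, T, H, W) stacked scale outputs F^(k).
    A:       (K, T, H, W) attention logits."""
    A = A - A.max(axis=0, keepdims=True)                       # stabilize softmax
    alpha = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)   # softmax over scales
    # broadcast alpha over the channel axis and sum over the scale axis
    return (alpha[:, None] * F_stack).sum(axis=0)
```

With uniform logits this reduces to a plain average over scales; learned logits instead let each location emphasize the window size best matched to the local gesture dynamics.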

5. Computational Complexity and Efficiency

MSF-Mamba maintains linear $\mathcal{O}(n)$ temporal complexity by inheriting the SSM recurrence, in contrast to the quadratic cost of transformer attention. Motion-aware state fusion and adaptive multi-window weighting are efficiently integrated as local convolution operations and element-wise summations. The architecture is lightweight, comparable to prior Mamba variants, supporting fast inference and compact deployment.

6. Experimental Results

On public micro-gesture recognition benchmarks such as SMG and iMiGUE:

  • MSF-Mamba (single-scale) improves Top-1 accuracy by +2.2% over VideoMamba (SSM baseline) on SMG.
  • MSF-Mamba$^{+}$ (multiscale) increases Top-1 accuracy by +2.9% (SMG) and +3.0% (iMiGUE) over the best prior results.
  • Ablation studies confirm contributions from both central frame difference fusion and adaptive multi-scale weighting.
  • Efficiency metrics validate linear time complexity, supporting both high accuracy and low computational cost on long sequences.

7. Mathematical Algorithms and Loss Function

Critical equations include SSM temporal update, central frame difference, local state fusion, scale weighting, and cross-entropy classification loss:

$$\mathcal{L} = -\sum_{c=1}^C q_c \log\left(\frac{\exp(\hat{y}_c)}{\sum_{j=1}^C \exp(\hat{y}_j)}\right)$$

where $q_c$ is the ground-truth one-hot label and $\hat{y}_c$ the predicted logit for class $c$.
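This is the standard softmax cross-entropy; a minimal numerically stable sketch (hypothetical function name):

```python
import numpy as np

def cross_entropy(logits, q):
    """L = -sum_c q_c * log softmax(logits)_c, stabilized via log-sum-exp."""
    z = logits - logits.max()                    # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())      # log softmax
    return -float((q * log_probs).sum())
```

For a one-hot label $q$ the sum collapses to the negative log-probability of the true class, i.e. $\log\sum_j \exp(\hat{y}_j) - \hat{y}_c$.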

Algorithmic pseudo-steps for the Adaptive Scale Weighting Module:

  1. Concatenate multiscale features $F^{(k)}$ along the channel dimension.
  2. Compute attention weights $\alpha$ via Conv3D and softmax.
  3. Stack all scales.
  4. Fuse via $\alpha$-weighted summation.

8. Significance and Implications

MSF-Mamba demonstrates that extending SSM-based architectures with motion-aware local state fusion surpasses conventional CNN, transformer, and vanilla SSM methods for MGR. The central innovations—local contextual aggregation, motion-difference modeling, adaptive multiscale fusion—provide robust and efficient encoding of subtle micro-gestures. While developed for MGR, this architecture conceptually generalizes to other domains requiring efficient modeling of both global and local dynamic dependencies.

A plausible implication is that further development of motion-aware state fusion modules may advantageously address challenges in dynamic scene understanding, video action recognition, and related sequential vision tasks requiring both efficiency and rich spatiotemporal semantics.
