MSF-Mamba: Motion-Aware State Fusion
- Motion-Aware State Fusion Mamba (MSF-Mamba) is a neural architecture that models both long-range and local spatiotemporal dependencies for micro-gesture recognition.
- It integrates local motion-aware state fusion with adaptive multiscale weighting to effectively capture subtle motion cues and enhance classification accuracy.
- Experimental results on SMG and iMiGUE benchmarks demonstrate improved Top-1 accuracy and efficient, linear-time performance.
Motion-Aware State Fusion Mamba (MSF-Mamba) is a neural architecture designed for efficient and accurate micro-gesture recognition (MGR) via linear-time modeling of both long-range and local spatiotemporal dependencies. Building on the state space model (SSM) framework exemplified by Mamba, MSF-Mamba introduces an explicit motion-aware state fusion mechanism that leverages local contextual state aggregation and dynamic motion representation to overcome the limitations of prior CNN, transformer, and vanilla SSM approaches to subtle, fine-grained motion pattern recognition.
1. Background: Challenges in Micro-Gesture Recognition
Micro-gesture recognition requires modeling subtle and local motion cues, demanding sensitivity along both spatial and temporal dimensions. Prior architectures for MGR deploy CNNs, which are proficient at capturing localized spatiotemporal patterns but are limited by fixed receptive fields and an inability to propagate global dependencies. Transformer-based solutions, while capable of modeling long-range context via self-attention, suffer from quadratic complexity and become computationally intractable for long video sequences. Vanilla SSM approaches such as the original Mamba provide linear-time scalability and can propagate global structure through state recurrences, but they lack local spatiotemporal modeling: each update depends solely on the previous state and ignores neighboring contextual states.
MSF-Mamba is designed explicitly to resolve these constraints by fusing local motion-aware state representations, thus providing a compact, efficient, and accurate solution for MGR.
2. State Space Modeling and Sequential Dynamics
At their foundation, SSMs encode temporal dynamics through continuous-time ODEs for the hidden state vector $h(t)$ and observation $y(t)$:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

Here, $A$, $B$, and $C$ are learnable matrices: $A$ controls memory decay, $B$ the input projection, and $C$ the output mapping. For neural deployment, the continuous system is discretized for sampled time steps using zero-order hold (ZOH) with step size $\Delta$:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$

yielding the recurrent update:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$
In MSF-Mamba, bidirectional processing (forward and backward passes whose outputs are averaged) ensures that both past and future context inform the hidden state evolution.
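As a concrete reading of the recurrence, the following is a minimal sketch of the discretized scan with bidirectional averaging; the diagonal per-channel parameterization and all names are illustrative assumptions, not the released code:

```python
import torch

def ssm_scan(x, A_bar, B_bar, C):
    """Discretized SSM recurrence: h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C * h_t.
    x: (T, D) input sequence; A_bar, B_bar, C: (D, N) per-channel diagonal parameters."""
    T, D = x.shape
    N = A_bar.shape[1]
    h = torch.zeros(D, N)
    ys = []
    for t in range(T):
        h = A_bar * h + B_bar * x[t].unsqueeze(-1)  # update hidden state, (D, N)
        ys.append((h * C).sum(-1))                  # readout y_t, (D,)
    return torch.stack(ys)                          # (T, D)

def bidirectional_ssm(x, params_fwd, params_bwd):
    """Average a forward scan with a time-reversed backward scan so that
    each position receives both past and future context."""
    y_fwd = ssm_scan(x, *params_fwd)
    y_bwd = ssm_scan(x.flip(0), *params_bwd).flip(0)
    return 0.5 * (y_fwd + y_bwd)
```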
3. Motion-Aware State Fusion: Design and Mathematical Formalization
A core innovation in MSF-Mamba is the multiscale central frame difference fusion module (MCFM), facilitating motion-aware local aggregation.
- Reshaping and Contextual Structuring: Latent SSM outputs are reshaped into a tensor $F \in \mathbb{R}^{C \times T \times H \times W}$ to recover spatiotemporal structure.
- Central Frame Difference (CFD): For motion modeling, the CFD is computed at each time index $t$:

$$D_t = F_t - \tfrac{1}{2}\left(F_{t-1} + F_{t+1}\right)$$

This operation reveals dynamic activations indicative of micro-motion between adjacent frames.
- Local State Fusion: For each fusion scale $s$ (e.g., a $k_s \times k_s \times k_s$ spatiotemporal window), define the fusion operator $\mathcal{F}_s$:

$$\mathcal{F}_s(F)_{t,h,w} = \sum_{(\delta_t,\,\delta_h,\,\delta_w) \in \mathcal{N}_s} w^{(s)}_{\delta_t,\delta_h,\delta_w}\, F_{t+\delta_t,\; h+\delta_h,\; w+\delta_w}$$

where $w^{(s)}$ are learned weights and $\mathcal{N}_s$ is the local cube neighborhood.
- Motion-Aware Output Fusion: For each scale $s$,

$$O_s = \mathcal{F}_s(F) + \alpha_s\, \mathcal{F}_s(D)$$

with $\alpha_s$ (initialized at 0.5) a learnable scalar, blending static and dynamic cues (see the sketch after this list).
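A minimal PyTorch sketch of one MCFM branch under the formulation above; the depthwise-convolution realization of $\mathcal{F}_s$, the replicate-style temporal padding, and all names are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class MCFMBranch(nn.Module):
    """One motion-aware fusion branch: central frame difference (CFD)
    plus learned local state fusion over a cubic neighborhood."""
    def __init__(self, channels: int, window: int = 3):
        super().__init__()
        # Local state fusion F_s as a depthwise 3D convolution over the cube.
        self.fuse = nn.Conv3d(channels, channels, kernel_size=window,
                              padding=window // 2, groups=channels, bias=False)
        # Learnable blend of static and dynamic cues, initialized at 0.5.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def central_frame_difference(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, T, H, W); repeat the boundary frames so D keeps length T.
        prev = torch.cat([f[:, :, :1], f[:, :, :-1]], dim=2)
        nxt = torch.cat([f[:, :, 1:], f[:, :, -1:]], dim=2)
        return f - 0.5 * (prev + nxt)  # D_t = F_t - (F_{t-1} + F_{t+1}) / 2

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        d = self.central_frame_difference(f)
        return self.fuse(f) + self.alpha * self.fuse(d)  # O_s
```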
4. Multiscale Extension and Adaptive Scale Weighting
MSF-Mamba further introduces a multiscale variant incorporating several MCFM branches that cover fusion windows of different spatiotemporal extents. Scale-specific outputs $O_s$ are concatenated and adaptively weighted via a two-layer 3D convolutional network producing attention logits $z_s$ and softmax weights $\beta_s$:

$$\beta_s = \frac{\exp(z_s)}{\sum_{s'} \exp(z_{s'})}$$

The final aggregated feature is

$$O = \sum_{s} \beta_s\, O_s$$

Thus, adaptive emphasis is placed on different spatiotemporal extents according to gesture dynamics.
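A sketch of the adaptive scale weighting consistent with the equations above, assuming the two-layer Conv3D attention head described in the text; the hidden channel width and the choice of spatially varying (rather than globally pooled) weights are assumptions:

```python
import torch
import torch.nn as nn

class AdaptiveScaleWeighting(nn.Module):
    """Produce per-scale softmax weights beta_s from concatenated
    multiscale features and fuse them via weighted summation."""
    def __init__(self, channels: int, num_scales: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv3d(channels * num_scales, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, num_scales, kernel_size=1),  # logits z_s
        )

    def forward(self, scale_outputs: list[torch.Tensor]) -> torch.Tensor:
        # scale_outputs: list of S tensors, each (B, C, T, H, W)
        concat = torch.cat(scale_outputs, dim=1)         # (B, S*C, T, H, W)
        beta = torch.softmax(self.attn(concat), dim=1)   # (B, S, T, H, W)
        stacked = torch.stack(scale_outputs, dim=1)      # (B, S, C, T, H, W)
        return (beta.unsqueeze(2) * stacked).sum(dim=1)  # O = sum_s beta_s * O_s
```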
5. Computational Complexity and Efficiency
MSF-Mamba maintains linear temporal complexity by inheriting the SSM recurrence, in contrast to the quadratic cost of transformer attention. Motion-aware state fusion and adaptive multi-window weighting integrate efficiently as local convolution operations and element-wise summations. The architecture remains lightweight, comparable to prior Mamba variants, supporting fast inference and compact deployment.
6. Experimental Results
On public micro-gesture recognition benchmarks such as SMG and iMiGUE:
- MSF-Mamba (single-scale) improves Top-1 accuracy by +2.2% over VideoMamba (SSM baseline) on SMG.
- MSF-Mamba (multiscale) increases Top-1 accuracy by +2.9% (SMG) and +3.0% (iMiGUE) over the best prior results.
- Ablation studies confirm contributions from both central frame difference fusion and adaptive multi-scale weighting.
- Efficiency metrics validate linear time complexity, supporting both high accuracy and low computational cost on long sequences.
7. Mathematical Algorithms and Loss Function
Critical equations include the SSM temporal update, central frame difference, local state fusion, scale weighting, and the cross-entropy classification loss:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

where $y_c$ is the ground-truth one-hot label and $\hat{y}_c$ the predicted class probability (the softmax of the logit) for class $c$.
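As a sanity check of the loss formula, here is a minimal sketch computing it both from the one-hot form above and with PyTorch's built-in, which takes raw logits and integer labels:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)          # (batch, num_classes), raw scores
labels = torch.randint(0, 10, (4,))  # integer class indices

# One-hot form: L = -sum_c y_c * log(y_hat_c), with y_hat = softmax(logits)
log_probs = F.log_softmax(logits, dim=-1)
one_hot = F.one_hot(labels, num_classes=10).float()
loss_manual = -(one_hot * log_probs).sum(dim=-1).mean()

loss_builtin = F.cross_entropy(logits, labels)  # identical up to float error
assert torch.allclose(loss_manual, loss_builtin)
```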
Algorithmic pseudo-steps for the Adaptive Scale Weighting Module:
- Concatenate multiscale features along channels.
- Compute attention weights via Conv3D and softmax.
- Stack all scales.
- Fuse via $\beta$-weighted summation (an end-to-end usage sketch follows this list).
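Putting the pieces together, a hypothetical end-to-end usage of the sketches above, reusing the MCFMBranch and AdaptiveScaleWeighting classes; the window sizes and tensor shapes are illustrative only:

```python
import torch

# Three MCFM branches with increasing cubic windows, then adaptive fusion.
branches = [MCFMBranch(channels=64, window=k) for k in (3, 5, 7)]
fusion = AdaptiveScaleWeighting(channels=64, num_scales=3)

x = torch.randn(2, 64, 16, 14, 14)            # (B, C, T, H, W) dummy features
outputs = [branch(x) for branch in branches]  # per-scale O_s
fused = fusion(outputs)                       # beta-weighted summation
print(fused.shape)                            # torch.Size([2, 64, 16, 14, 14])
```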
8. Significance and Implications
MSF-Mamba demonstrates that extending SSM-based architectures with motion-aware local state fusion surpasses conventional CNN, transformer, and vanilla SSM methods for MGR. The central innovations (local contextual aggregation, motion-difference modeling, and adaptive multiscale fusion) provide robust and efficient encoding of subtle micro-gestures. While developed for MGR, this architecture conceptually generalizes to other domains requiring efficient modeling of both global and local dynamic dependencies.
A plausible implication is that further development of motion-aware state fusion modules may advantageously address challenges in dynamic scene understanding, video action recognition, and related sequential vision tasks requiring both efficiency and rich spatiotemporal semantics.