
SSM-Based Fusion: Multi-Modal Integration

Updated 30 January 2026
  • SSM-Based Fusion is a paradigm that integrates multi-modal data using state-space formulations to capture global context and long-range dependencies.
  • It employs mechanisms such as dual-path, shared-parameter, and multi-scale fusion to dynamically mix outputs and ensure interpretable state evolution.
  • Applications span infrared-visible image fusion, remote sensing, and collaborative perception, offering linear computational complexity and enhanced hardware efficiency.

State Space Model (SSM)-Based Fusion refers to a paradigm in multi-modal, multi-sensor, and multi-scale data integration where fused representations are constructed using structured state space models—often leveraging neural parameterizations such as Mamba or other selective state evolution mechanisms. In contrast to conventional fusion based on CNNs, Transformers, or purely attention-based frameworks, SSM-based fusion exploits the global context-capturing ability of continuous/discrete state recurrences, enabling efficient modeling of long-range dependencies, hardware-aligned parallelism, and interpretable fusion kernels. SSM-based fusion architectures have demonstrated significant performance improvements across diverse application domains, including infrared–visible and multispectral image fusion, collaborative perception in autonomous systems, remote sensing, medical forecasting, and parameter estimation in sensor networks.

1. Core Principles and Mathematical Foundation

At the heart of SSM-based fusion is the linear (often time-varying) state-space formulation. In continuous time, the system is described by

$$\frac{d}{dt} h(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t),$$

where $x(t)$ encodes the input features (from one or more modalities), $h(t)$ is the evolving hidden state, $y(t)$ is the fused output, and $A$, $B$, $C$, $D$ are trainable matrices or input-dependent functions. For neural and parallel execution, the system is discretized via zero-order hold (ZOH) with per-step timescale $\Delta$:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t + D\,x_t,$$

where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}\left(e^{\Delta A} - I\right)\Delta B$.
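The ZOH discretization and recurrence above can be sketched in a few lines of NumPy, assuming a diagonal $A$ (as in S4/Mamba-style diagonal parameterizations); the function names are illustrative, not from any cited model:

```python
import numpy as np

def zoh_discretize(a, b, delta):
    """ZOH discretization of a diagonal SSM: given the diagonal a of A,
    input vector b, and timescale delta, return a_bar and b_bar."""
    da = delta * a
    a_bar = np.exp(da)                        # exp(delta * A), elementwise
    b_bar = (a_bar - 1.0) / da * (delta * b)  # (dA)^-1 (e^{dA} - I) dB, elementwise
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, c, x):
    """Run h_t = a_bar * h_{t-1} + b_bar * x_t and read out y_t = <c, h_t>."""
    h = np.zeros_like(a_bar)
    ys = []
    for x_t in x:                             # scalar input stream
        h = a_bar * h + b_bar * x_t
        ys.append(float(c @ h))
    return np.array(ys)
```

A quick sanity check on the discretization: with $a = -1$, $b = c = 1$, and a constant unit input, the output settles at the continuous-time steady state $y = 1$.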

Fusion is achieved by coupling state evolution across modalities, by sharing or exchanging latent kernels, or by dynamically mixing outputs from parallel SSMs across resolutions, spatial locations, or sensor streams (Ma et al., 2024, Gao et al., 2024, Shen et al., 19 Jul 2025, Sun et al., 9 Jan 2026).

2. Fusion Architectures and Mechanisms

SSM-based fusion architectures frequently employ one or more of the following mechanisms:

  • Dual-path parametric interaction: Separate SSM chains are constructed for each modality with cross-parameterization (e.g., exchange of output projection matrices, or shared parameter heads). A cross parameter branch decodes hidden states of one modality using the other’s parameters—effectively a form of cross-attention realized by a linear recurrent scan (Shen et al., 19 Jul 2025).
  • Shared-parameter interaction: Modalities are aligned via a joint embedding that produces a common set of SSM parameters, enforcing semantic similarity and global context (Shen et al., 19 Jul 2025).
  • Multi-scale/state fusion: Parallel SSMs at different temporal or spatial resolutions process the sequence, with dynamic fusion via trainable scale-mixers (linear or softmax gating). This enables simultaneous modeling of fine-grained and coarse dependencies (Karami et al., 29 Dec 2025, Gao et al., 2024).
  • Spatial-state fusion: In visual tasks, structure-aware state fusion involves learned dilated convolutions that mix latent states across neighboring spatial locations, connecting SSM recurrences with local context (Xiao et al., 2024).
  • Adaptive gating and difference-driven fusion: Modality-specific feature discrepancy maps direct the flow of information, weighting state evolution by the degree of inter-modal difference, thus focusing fusion on salient regions (Sun et al., 9 Jan 2026).
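The dual-path cross-parameterization mechanism can be illustrated with a toy diagonal SSM in NumPy. This is a sketch of the idea only: the names (`dual_path_fusion`, the fixed `gate`) are hypothetical, and the learned gating of the cited works is replaced by a scalar mixing weight.

```python
import numpy as np

def scan_hidden(a_bar, b_bar, x):
    """Hidden-state trajectory of h_t = a_bar*h_{t-1} + b_bar*x_t (diagonal SSM)."""
    h = np.zeros_like(a_bar)
    hs = []
    for x_t in x:
        h = a_bar * h + b_bar * x_t
        hs.append(h.copy())
    return np.stack(hs)                       # (L, N)

def dual_path_fusion(x_ir, x_vis, params_ir, params_vis, gate=0.5):
    """Cross-parameter fusion: each modality's hidden states are also decoded
    with the *other* modality's output projection, then mixed with the self
    branch -- a linear-scan analogue of cross-attention."""
    a1, b1, c1 = params_ir
    a2, b2, c2 = params_vis
    h1 = scan_hidden(a1, b1, x_ir)
    h2 = scan_hidden(a2, b2, x_vis)
    y_self = h1 @ c1 + h2 @ c2                # each path decoded by its own C
    y_cross = h1 @ c2 + h2 @ c1               # cross branch: swapped projections
    return gate * y_self + (1.0 - gate) * y_cross
```

Setting `gate=1.0` recovers the purely modality-specific branches, which makes the contribution of the cross branch easy to isolate.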

3. Application Domains

SSM-based fusion has seen adoption in numerous advanced applications:

  • Multi-modal image fusion: Infrared–visible, RGB-depth, multispectral, and hyperspectral pansharpening leverage SSM blocks for improved spatial–spectral information integration, yielding state-of-the-art results in metrics such as MI, VIF, AG, SSIM, and mAP (Ma et al., 2024, Cao et al., 2024, Wu et al., 23 Sep 2025, Shen et al., 19 Jul 2025).
  • Remote sensing classification: Multi-scale spatial and spectral Mamba blocks are used to extract and fuse features from hyperspectral, LiDAR, and SAR data in joint classification and segmentation tasks (Gao et al., 2024, Peng et al., 2024).
  • Biomedical sequential forecasting: Neural SSM fusion integrates continuous glucose monitoring and wearable activity data, enabling short-term forecasting, interpretable variable selection, lag-importance attribution, and counterfactual reasoning (Isaac et al., 5 Oct 2025).
  • Collaborative perception and multi-agent systems: Spatial and temporal SSM blocks are instantiated for cross-agent feature sharing, with history-aware boosting and agent-to-agent fusion realized by linear scan complexity (Li et al., 2024).
  • Error detection in surgery videos: Selective SSMs with fine-to-coarse temporal fusion and bottlenecked recurrences capture surgical events for automated error localization (Xu et al., 2024).
  • Sensor network self-calibration: Separable SSM likelihoods enable local filtering and scalable belief propagation for latent parameter estimation, supporting efficient multi-sensor fusion (Uney et al., 2017).

4. Computational Complexity and Accelerated Fusion

SSM-based fusion is distinguished by its linear computational complexity, $O(NL)$ for sequence length $L$ and state size $N$, in contrast to Transformers, which scale quadratically in $L$. Multi-scale SSM fusion (e.g., (Karami et al., 29 Dec 2025)) adds a multiplicative factor equal to the number of scales, while spatial fusion via dilated convolutions adds only a constant overhead per location (Xiao et al., 2024).
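The recurrence behind this linear complexity is also associative, which is what enables hardware-aligned parallelism: a Hillis-Steele style scan computes the same prefix states in logarithmic depth. A minimal NumPy sketch (illustrative, not any paper's kernel), composing pairs $(a, b)$ as $(a_r a_l,\ a_r b_l + b_r)$:

```python
import numpy as np

def sequential_scan(a, b):
    """h_t = a_t * h_{t-1} + b_t with h_0 = 0 -- O(L) work, O(L) depth."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def parallel_scan(a, b):
    """Same recurrence via an associative (Hillis-Steele) inclusive scan:
    O(L log L) work, O(log L) depth when the rounds run in parallel."""
    a, b = a.astype(float).copy(), b.astype(float).copy()
    shift = 1
    while shift < len(a):
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a_new = a * a_prev                    # compose transition coefficients
        b_new = a * b_prev + b                # carry the left partial state in
        a, b = a_new, b_new
        shift *= 2
    return b                                  # with h_0 = 0, prefix b is h_t
```

Both functions produce identical state trajectories; the scan form is the one that maps onto parallel hardware.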

Dedicated SSM accelerators benefit from fine-grained operator fusion, which reduces on-chip SRAM and off-chip memory traffic, turning a memory-bound workload into a compute-bound one. Memory-aware scheduling and streaming data locality yield up to a $4.8\times$ speedup over unfused execution and a $1.78\times$ improvement over the prior MARCA accelerator at fixed area allocation (Geens et al., 24 Apr 2025).

5. Interpretability, Generalization, and Evaluation

SSM-based fusion affords high interpretability due to explicit state evolution, variable selection (e.g., VSN-based fusion weights), lag-importance attributions (unrolled causal convolution kernels), and separable likelihood representations in sensor networks.
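The VSN-style fusion-weight idea can be sketched as a softmax gate over modality features whose weights are directly inspectable. This is a hypothetical minimal example, not the cited models' implementation (which learns the scores from data):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def variable_selection(features, scores):
    """VSN-style gate: softmax over per-modality scores yields fusion weights
    that sum to one and can be read off directly for interpretation."""
    w = softmax(scores)                       # (M,) interpretable weights
    fused = sum(w_m * f for w_m, f in zip(w, features))
    return fused, w
```

Inspecting `w` after training tells the analyst which modality dominated each fusion decision, which is the interpretability property referred to above.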

Generalization is observed across modalities, scales, and fusion tasks, with plug-and-play SSM blocks readily extendable to new domains (e.g., audio spectrogram fusion, graph/point-cloud mixing, multi-agent dialog) (Xiao et al., 2024, Ma et al., 2024, Shen et al., 19 Jul 2025).

Experimental results validate significant task performance improvements, superior memory efficiency, and robust long-range modeling across the representative models surveyed in Section 6.

6. Representative Models and Key Technical Variants

| Model/Method | Modalities | Fusion Paradigm | Notable Features | Reference |
| --- | --- | --- | --- | --- |
| S4Fusion | IR/visible | Selective SSM + cross-modal spatial fusion | CMSA, ResNet-guided saliency loss | (Ma et al., 2024) |
| MS2Fusion | RGB/thermal | Dual-path SSM (cross/shared parameters) | Joint optimization, bidirectional FF-SSM | (Shen et al., 19 Jul 2025) |
| MS-SSM | General | Multi-scale SSM, input-dependent mixers | Dynamic resolution fusion | (Karami et al., 29 Dec 2025) |
| Spatial-Mamba | Visual | Dilated SASF in state space | Structure-aware fusion, attention unification | (Xiao et al., 2024) |
| DIFF-MF | IR/visible | Difference-driven SSM fusion | Channel/spatial exchange, adaptive gating | (Sun et al., 9 Jan 2026) |
| CollaMamba | Multi-agent | Spatial-temporal SSM fusion | History-aware boosting, cross-agent scan | (Li et al., 2024) |

7. Limitations and Opportunities

The term “SSM fusion bottlenecks” (an editorial coinage) denotes application-specific challenges such as:

  • Limited state capacity for modeling highly heterogeneous data (addressed via local enhancement and state sharing (Cao et al., 2024)).
  • Hardware-aware parallelism constraints, which require novel expedite transition schemes for multi-modal coupling (Li et al., 2024).
  • Fine-grained cross-modal token alignment, which remains less explored compared to macro-level vector fusion (Shou et al., 2024).

A plausible implication is that future research may focus on hybridizing SSM blocks with attention/kernels for context-sensitive token-wise fusion and exploring probabilistic uncertainty propagation within neural SSMs for robust fusion under noise and missing data.


