SSM-Based Fusion: Multi-Modal Integration
- SSM-Based Fusion is a paradigm that integrates multi-modal data using state-space formulations to capture global context and long-range dependencies.
- It employs mechanisms such as dual-path, shared-parameter, and multi-scale fusion to dynamically mix outputs and ensure interpretable state evolution.
- Applications span infrared-visible image fusion, remote sensing, and collaborative perception, offering linear computational complexity and enhanced hardware efficiency.
State Space Model (SSM)-Based Fusion refers to a paradigm in multi-modal, multi-sensor, and multi-scale data integration where fused representations are constructed using structured state space models—often leveraging neural parameterizations such as Mamba or other selective state evolution mechanisms. In contrast to conventional fusion based on CNNs, Transformers, or purely attention-based frameworks, SSM-based fusion exploits the global context-capturing ability of continuous/discrete state recurrences, enabling efficient modeling of long-range dependencies, hardware-aligned parallelism, and interpretable fusion kernels. SSM-based fusion architectures have demonstrated significant performance improvements across diverse application domains, including infrared–visible and multispectral image fusion, collaborative perception in autonomous systems, remote sensing, medical forecasting, and parameter estimation in sensor networks.
1. Core Principles and Mathematical Foundation
At the heart of SSM-based fusion is the linear (often time-varying) state-space formulation. In continuous time, the system is described by

$$\dot{h}(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t),$$

where $x(t)$ encodes the input features (from one or more modalities), $h(t)$ is the evolving hidden state, $y(t)$ are the fused outputs, and $A$, $B$, $C$, $D$ are trainable matrices or input-dependent functions. For neural and parallel execution, the system is discretized via zero-order hold (ZOH) with per-step timescale $\Delta$:

$$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k,$$

where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B$.
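A minimal numpy/scipy sketch of this discretization and the resulting recurrent scan is shown below for a single input stream; a practical Mamba-style layer would additionally make $\Delta$, $B$, and $C$ input-dependent and replace the Python loop with a hardware-friendly parallel scan.

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """ZOH: A_bar = exp(dA), B_bar = (dA)^{-1} (exp(dA) - I) dB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Recurrence h_k = A_bar h_{k-1} + B_bar x_k, readout y_k = C h_k."""
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x_k in x:                       # x: length-L sequence of scalars
        h = A_bar @ h + B_bar * x_k
        ys.append((C @ h).item())
    return np.array(ys)

# Toy single-modality run with a stable state matrix.
A = -np.eye(2); B = np.ones((2, 1)); C = np.ones((1, 2))
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, np.sin(np.linspace(0.0, 3.0, 50)))
```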
Fusion is achieved by coupling state evolution across modalities, by sharing or exchanging latent kernels, or by dynamically mixing outputs from parallel SSMs across resolutions, spatial locations, or sensor streams (Ma et al., 2024, Gao et al., 2024, Shen et al., 19 Jul 2025, Sun et al., 9 Jan 2026).
2. Fusion Architectures and Mechanisms
SSM-based fusion architectures frequently employ one or more of the following mechanisms:
- Dual-path parametric interaction: Separate SSM chains are constructed for each modality with cross-parameterization (e.g., exchange of output projection matrices, or shared parameter heads). A cross-parameter branch decodes hidden states of one modality using the other's parameters, effectively a form of cross-attention realized by a linear recurrent scan (Shen et al., 19 Jul 2025); a toy sketch follows this list.
- Shared-parameter interaction: Modalities are aligned via a joint embedding that produces a common set of SSM parameters, enforcing semantic similarity and global context (Shen et al., 19 Jul 2025).
- Multi-scale/state fusion: Parallel SSMs at different temporal or spatial resolutions process the sequence, with dynamic fusion via trainable scale-mixers (linear or softmax gating). This enables simultaneous modeling of fine-grained and coarse dependencies (Karami et al., 29 Dec 2025, Gao et al., 2024).
- Spatial-state fusion: In visual tasks, structure-aware state fusion involves learned dilated convolutions that mix latent states across neighboring spatial locations, connecting SSM recurrences with local context (Xiao et al., 2024).
- Adaptive gating and difference-driven fusion: Modality-specific feature discrepancy maps direct the flow of information, weighting state evolution by the degree of inter-modal difference, thus focusing fusion on salient regions (Sun et al., 9 Jan 2026).
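As a concrete illustration of the dual-path mechanism, the sketch below (hypothetical code, not the MS2Fusion implementation; all names are illustrative) runs one discretized SSM per modality and adds a cross-parameter branch that reads each hidden state with the other modality's output projection, with a fixed scalar gate standing in for the learned fusion weights:

```python
import numpy as np

def dual_path_fusion(xa, xb, Aa, Ba, Ca, Ab, Bb, Cb, gate=0.5):
    """Toy dual-path SSM fusion with exchanged output projections.

    Both modalities are assumed to share the state size N, so that
    Ca/Cb (shape (1, N)) can read either hidden state.
    """
    ha = np.zeros((Aa.shape[0], 1))
    hb = np.zeros((Ab.shape[0], 1))
    fused = []
    for xka, xkb in zip(xa, xb):
        ha = Aa @ ha + Ba * xka                   # modality-a recurrence
        hb = Ab @ hb + Bb * xkb                   # modality-b recurrence
        self_read = (Ca @ ha + Cb @ hb).item()    # own-parameter readout
        cross_read = (Cb @ ha + Ca @ hb).item()   # cross-parameter readout
        fused.append(gate * self_read + (1.0 - gate) * cross_read)
    return np.array(fused)
```

Exchanging the $C$ projections is the linear-scan analogue of cross-attention noted above; in the shared-parameter variant, $A$, $B$, $C$ would instead be generated from a joint embedding of both modalities.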
3. Application Domains
SSM-based fusion has seen adoption in numerous advanced applications:
- Multi-modal image fusion: Infrared–visible, RGB-depth, multispectral, and hyperspectral pansharpening leverage SSM blocks for improved spatial–spectral information integration, yielding state-of-the-art results in metrics such as MI, VIF, AG, SSIM, and mAP (Ma et al., 2024, Cao et al., 2024, Wu et al., 23 Sep 2025, Shen et al., 19 Jul 2025).
- Remote sensing classification: Multi-scale spatial and spectral Mamba blocks are used to extract and fuse features from hyperspectral, LiDAR, and SAR data in joint classification and segmentation tasks (Gao et al., 2024, Peng et al., 2024).
- Biomedical sequential forecasting: Neural SSM fusion integrates continuous glucose monitoring and wearable activity data, enabling short-term forecasting, interpretable variable selection, lag-importance attribution, and counterfactual reasoning (Isaac et al., 5 Oct 2025).
- Collaborative perception and multi-agent systems: Spatial and temporal SSM blocks are instantiated for cross-agent feature sharing, with history-aware boosting and agent-to-agent fusion realized by linear scan complexity (Li et al., 2024).
- Error detection in surgery videos: Selective SSMs with fine-to-coarse temporal fusion and bottlenecked recurrences capture surgical events for automated error localization (Xu et al., 2024).
- Sensor network self-calibration: Separable SSM likelihoods enable local filtering and scalable belief propagation for latent parameter estimation, supporting efficient multi-sensor fusion (Uney et al., 2017).
4. Computational Complexity and Accelerated Fusion
SSM-based fusion is distinguished by its linear computational complexity ($O(LN)$ for sequence length $L$ and state size $N$), in contrast to Transformers, which scale quadratically in $L$. Multi-scale SSM fusion (e.g., (Karami et al., 29 Dec 2025)) adds a multiplicative factor corresponding to the number of scales, while spatial fusion via dilated convolutions remains a constant overhead per location (Xiao et al., 2024).
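A back-of-the-envelope comparison (illustrative sizes, constant factors omitted) makes the scaling gap explicit:

```python
# Self-attention scales as O(L^2 * d); an SSM scan as O(L * N).
L, d, N, scales = 16_384, 256, 16, 3
attn_cost = L * L * d                  # quadratic in sequence length
ssm_cost = L * N                       # linear in sequence length
multi_scale_cost = scales * ssm_cost   # multiplicative factor per scale
print(attn_cost / ssm_cost)            # ~2.6e5x fewer operations here
```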
Dedicated SSM accelerators benefit from fine-grained operator fusion, which reduces on-chip SRAM and off-chip memory requirements, transforming a memory-bound workload into compute-bound throughput. Memory-aware scheduling and streaming data locality yield substantial speedups over unfused execution and improvements over the prior MARCA accelerator at fixed area allocation (Geens et al., 24 Apr 2025).
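The effect of operator fusion can be approximated with a toy memory-traffic model (illustrative only; actual figures depend on the accelerator's dataflow and are reported in (Geens et al., 24 Apr 2025)):

```python
# Unfused execution spills every intermediate tensor to memory;
# a fused kernel keeps intermediates in registers/on-chip SRAM.
L, N, n_ops, bytes_per_elem = 16_384, 16, 4, 2
tensor_bytes = L * N * bytes_per_elem
unfused = 2 * n_ops * tensor_bytes   # write + read at each operator boundary
fused = 2 * tensor_bytes             # read inputs once, write outputs once
print(unfused / fused)               # n_ops-fold less traffic in this model
```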
5. Interpretability, Generalization, and Evaluation
SSM-based fusion affords high interpretability due to explicit state evolution, variable selection (e.g., VSN-based fusion weights), lag-importance attributions (unrolled causal convolution kernels), and separable likelihood representations in sensor networks.
Generalization is observed across modalities, scales, and fusion tasks, with plug-and-play SSM blocks readily extendable to new domains (e.g., audio spectrogram fusion, graph/point-cloud mixing, multi-agent dialog) (Xiao et al., 2024, Ma et al., 2024, Shen et al., 19 Jul 2025).
Experimental results validate significant task performance improvements, superior memory efficiency, and robust long-range modeling:
- Enhanced scores in multi-modal object detection, image classification, and segmentation (Gao et al., 2024, Shen et al., 19 Jul 2025).
- Linear scaling enabling practical deployment on edge hardware, collaborative networks, and high-resolution imagery.
- Effective downstream transfer in error detection, emotion recognition, and counterfactual forecasting (Xu et al., 2024, Shou et al., 2024, Isaac et al., 5 Oct 2025).
6. Representative Models and Key Technical Variants
| Model/Method | Modalities | Fusion Paradigm | Notable Features | Reference |
|---|---|---|---|---|
| S4Fusion | IR/Visible | Selective SSM + cross-modal spatial fusion | CMSA, ResNet-guided saliency loss | (Ma et al., 2024) |
| MS2Fusion | RGB/Thermal | Dual-path SSM (cross/shared parameters) | Joint optimization, bidirectional FF-SSM | (Shen et al., 19 Jul 2025) |
| MS-SSM | General | Multi-scale SSM, input-dependent mixers | Dynamic resolution fusion | (Karami et al., 29 Dec 2025) |
| Spatial-Mamba | Visual | Dilated SASF in state space | Structure-aware fusion, attention unification | (Xiao et al., 2024) |
| DIFF-MF | IR/Visible | Difference-driven SSM fusion | Channel/spatial exchange, adaptive gating | (Sun et al., 9 Jan 2026) |
| CollaMamba | Multi-agent | Spatial-temporal SSM fusion | History-aware boosting, cross-agent scan | (Li et al., 2024) |
7. Limitations and Opportunities
Application-specific challenges, collected here under the editor’s term “SSM fusion bottlenecks”, include:
- Limited state capacity for modeling highly heterogeneous data (addressed via local enhancement and state sharing (Cao et al., 2024)).
- Hardware-aware parallelism constraints, which call for novel schemes that expedite state transitions under multi-modal coupling (Li et al., 2024).
- Fine-grained cross-modal token alignment, which remains less explored compared to macro-level vector fusion (Shou et al., 2024).
A plausible implication is that future research may focus on hybridizing SSM blocks with attention/kernels for context-sensitive token-wise fusion and exploring probabilistic uncertainty propagation within neural SSMs for robust fusion under noise and missing data.
References:
- (Ma et al., 2024): S4Fusion: Saliency-aware Selective State Space Model for Infrared Visible Image Fusion
- (Xiao et al., 2024): Spatial-Mamba: Effective Visual State Space Models via Structure-aware State Fusion
- (Shen et al., 19 Jul 2025): Multispectral State-Space Feature Fusion: Bridging Shared and Cross-Parametric Interactions for Object Detection
- (Karami et al., 29 Dec 2025): MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling
- (Sun et al., 9 Jan 2026): DIFF-MF: A Difference-Driven Channel-Spatial State Space Model for Multi-Modal Image Fusion
- (Cao et al., 2024): A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion
- (Li et al., 2024): CollaMamba: Efficient Collaborative Perception with Cross-Agent Spatial-Temporal State Space Model
- (Xu et al., 2024): SEDMamba: Enhancing Selective State Space Modelling with Bottleneck Mechanism and Fine-to-Coarse Temporal Fusion for Efficient Error Detection in Robot-Assisted Surgery
- (Gao et al., 2024): MSFMamba: Multi-Scale Feature Fusion State Space Model for Multi-Source Remote Sensing Image Classification
- (Shou et al., 2024): Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion
- (Geens et al., 24 Apr 2025): Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration
- (Uney et al., 2017): Latent Parameter Estimation in Fusion Networks Using Separable Likelihoods