AlignMamba: Efficient Multimodal Fusion
- AlignMamba is an efficient multimodal fusion architecture that injects explicit cross-modal alignment signals at both token and distributional levels.
- It utilizes Optimal Transport for local token alignment and Maximum Mean Discrepancy for global distribution matching, enhancing robustness in multimodal tasks.
- The method achieves state-of-the-art sentiment analysis accuracy with linear-time complexity and reduced memory footprint compared to Transformer-based models.
AlignMamba is an efficient multimodal fusion architecture designed to overcome the computational and representational bottlenecks of both Transformer-based and purely sequential State Space Model (SSM)-based approaches. The core innovation of AlignMamba is the explicit injection of cross-modal alignment signals—both at the token (local) and distributional (global) level—before leveraging a Mamba backbone for linear-time multimodal representation fusion. This dual alignment architecture allows the model to capture temporally misaligned or semantically disparate cross-modal signals while maintaining strict efficiency guarantees, achieving state-of-the-art accuracy and robustness on benchmarks for sentiment analysis and related multimodal tasks (Li et al., 2024).
1. Motivation and Problem Formulation
Multimodal representation fusion tasks, such as sentiment analysis on datasets like CMU-MOSI and CMU-MOSEI, require integration of highly heterogeneous and often temporally misaligned modalities (e.g., video, audio, text). Transformer-based cross-attention approaches offer rich cross-modal interaction but suffer from quadratic $O(L^2)$ complexity, making them impractical for long or large-scale sequences. Mamba—based on SSMs with a time-priority selective scan—addresses computational cost (scaling as $O(L)$ per layer), but its strictly sequential processing fails to align and bind signals that are offset in time across modalities (e.g., a facial cue lagging or leading its linguistic referent).
AlignMamba targets these two challenges by introducing explicit cross-modal alignment modules that realize token-level matchings and global distribution matching prior to Mamba-based fusion. This ensures that corresponding events or features across modalities can be directly connected and that distributional discrepancies are mitigated, enabling robust, scalable fusion (Li et al., 2024).
2. Local Cross-Modal Alignment via Optimal Transport
AlignMamba realizes fine-grained, token-level cross-modal alignment using an Optimal Transport (OT) framework. Let $X^v = \{x_i^v\}_{i=1}^{m}$ and $X^l = \{x_j^l\}_{j=1}^{n}$ denote video and language feature sequences, modeled as empirical distributions with uniform marginals $\mu = \frac{1}{m}\mathbf{1}_m$ and $\nu = \frac{1}{n}\mathbf{1}_n$. The pairwise cost matrix is defined as

$$C_{ij} = 1 - \frac{\langle x_i^v, x_j^l \rangle}{\|x_i^v\|\,\|x_j^l\|}.$$
The standard entropic-regularized OT problem is

$$\min_{T \ge 0} \; \sum_{i,j} T_{ij} C_{ij} - \epsilon H(T),$$

subject to marginals $T\mathbf{1}_n = \mu$ and $T^\top\mathbf{1}_m = \nu$. Due to the overhead of iterative Sinkhorn solvers, AlignMamba simplifies this to a hard assignment per source token, effectively matching each video (or audio) token to its minimum-cost language token. Mathematically,

$$T_{ij} = \begin{cases} 1 & \text{if } j = \arg\min_{j'} C_{ij'}, \\ 0 & \text{otherwise.} \end{cases}$$
Aligned video and audio features on the language timeline are, respectively,

$$\tilde{X}^v = (T^{v \to l})^\top X^v, \qquad \tilde{X}^a = (T^{a \to l})^\top X^a,$$

where $T^{v \to l}$ and $T^{a \to l}$ are the assignment matrices for video→language and audio→language matchings (Li et al., 2024).
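The hard-assignment alignment above can be sketched in a few lines of NumPy. The cosine-distance cost and the function name are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def hard_ot_align(src, tgt):
    """Align each source-modality token to its minimum-cost target (language) token.

    src: (m, d) source features (e.g., video or audio); tgt: (n, d) language features.
    Returns the (m, n) binary assignment matrix T and the aligned features
    T^T @ src, which live on the language timeline.
    """
    # Cosine-distance cost matrix C_ij = 1 - cos(src_i, tgt_j)
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    C = 1.0 - src_n @ tgt_n.T                      # (m, n)

    # Hard assignment: a one-hot row per source token at its argmin column
    T = np.zeros_like(C)
    T[np.arange(C.shape[0]), C.argmin(axis=1)] = 1.0

    aligned = T.T @ src                            # (n, d) features on language timeline
    return T, aligned

rng = np.random.default_rng(0)
video, text = rng.normal(size=(8, 16)), rng.normal(size=(5, 16))
T, video_on_text = hard_ot_align(video, text)
```

Note that the hard assignment makes each source token pick exactly one language token, so several video tokens may collapse onto the same language position; a Sinkhorn-style soft plan would relax this.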
3. Global Cross-Modal Distribution Alignment via MMD
Local alignment may leave residual distributional shifts between modalities. AlignMamba introduces a global alignment loss via Maximum Mean Discrepancy (MMD) in a reproducing kernel Hilbert space (RKHS), typically using a Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$. The squared MMD between two feature sets $X = \{x_i\}_{i=1}^{m}$ and $Y = \{y_j\}_{j=1}^{n}$ is

$$\mathrm{MMD}^2(X, Y) = \frac{1}{m^2} \sum_{i,i'} k(x_i, x_{i'}) + \frac{1}{n^2} \sum_{j,j'} k(y_j, y_{j'}) - \frac{2}{mn} \sum_{i,j} k(x_i, y_j).$$

The global alignment loss $\mathcal{L}_{\mathrm{MMD}}$ combined with the task loss (e.g., cross-entropy) gives the training objective

$$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda\,\mathcal{L}_{\mathrm{MMD}}.$$

This regularization enforces closer global feature distributions after local OT-based alignment, stabilizing multimodal representation fusion (Li et al., 2024).
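The Gaussian-kernel squared MMD above can be computed directly; in this minimal NumPy sketch, the bandwidth `sigma` and the dense $O(mn)$ kernel evaluation are illustrative choices:

```python
import numpy as np

def gaussian_mmd2(X, Y, sigma=1.0):
    """Squared MMD between samples X (m, d) and Y (n, d) under a Gaussian kernel."""
    def k(A, B):
        # Pairwise squared distances, then exp(-d2 / (2 sigma^2))
        d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    m, n = len(X), len(Y)
    return k(X, X).sum() / m**2 + k(Y, Y).sum() / n**2 - 2 * k(X, Y).sum() / (m * n)
```

Identical sets yield an MMD of zero, while a distributional shift between modalities produces a positive penalty that the training objective drives down.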
4. Integration with the Mamba State Space Backbone
After local and global alignment, modalities are interleaved in a prescribed order to form the input sequence to the Mamba backbone:

$$Z = [\tilde{x}_1^v, \tilde{x}_1^a, x_1^l, \; \tilde{x}_2^v, \tilde{x}_2^a, x_2^l, \; \dots, \; \tilde{x}_n^v, \tilde{x}_n^a, x_n^l],$$

where $n$ is the length of the language sequence. This “time-priority” structure is crucial, as the SSM scan at each step processes features from each modality in sequential order, capturing intra-modal recurrence and the explicit cross-modal cues introduced by the OT alignment. A stack of Mamba layers then produces the multimodal fused representation, preserving the $O(L)$ per-layer complexity that is central to Mamba's efficiency (Li et al., 2024).
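The interleaving step reduces to a stack-and-reshape, sketched below assuming a per-step video → audio → language order on the shared language timeline (the paper's prescribed ordering may differ):

```python
import numpy as np

def interleave_time_priority(v, a, l):
    """Interleave aligned video, audio, and language features per time step.

    v, a, l: (n, d) sequences already on the shared language timeline.
    Returns a (3n, d) sequence ordered [v_1, a_1, l_1, v_2, a_2, l_2, ...].
    """
    n, d = l.shape
    return np.stack([v, a, l], axis=1).reshape(3 * n, d)
```

The resulting sequence keeps corresponding tokens from all three modalities adjacent, so the SSM scan sees cross-modal evidence for each time step before moving to the next.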
The entire workflow can be sketched as:
- Encode unimodal features $X^v$, $X^a$, $X^l$
- Compute OT cost matrices and hard-assignment alignments
- Form the aligned sequences $\tilde{X}^v$, $\tilde{X}^a$ via the assignment matrices
- Interleave aligned sequences
- Process with stacked Mamba layers
- Read out task outputs from final sequence states
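The steps above can be condensed into a single self-contained forward sketch. Here the Mamba stack is replaced by a causal cumulative-mean scan purely to keep the example runnable; it is not the actual SSM, and the squared-distance cost is likewise an illustrative stand-in:

```python
import numpy as np

def alignmamba_forward_sketch(Xv, Xa, Xl):
    """End-to-end sketch of the AlignMamba workflow (placeholder fusion).

    Xv, Xa: video/audio features (mv, d), (ma, d); Xl: language features (n, d).
    """
    def align_to(src, tgt):
        # Hard OT-style assignment: each source token -> nearest language token
        d2 = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)   # (m, n) cost
        T = np.zeros_like(d2)
        T[np.arange(len(src)), d2.argmin(axis=1)] = 1.0
        return T.T @ src                                          # (n, d)

    Va, Aa = align_to(Xv, Xl), align_to(Xa, Xl)                   # local alignment
    n, d = Xl.shape
    Z = np.stack([Va, Aa, Xl], axis=1).reshape(3 * n, d)          # interleave
    # Placeholder "fusion": causal cumulative mean over the sequence
    # (stands in for the stacked Mamba layers)
    fused = np.cumsum(Z, axis=0) / np.arange(1, 3 * n + 1)[:, None]
    return fused[-1]                                              # final-state readout

rng = np.random.default_rng(0)
out = alignmamba_forward_sketch(rng.normal(size=(7, 8)),
                                rng.normal(size=(9, 8)),
                                rng.normal(size=(5, 8)))
```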
5. Experimental Results and Ablations
AlignMamba is evaluated on CMU-MOSI (2,199 utterances) and CMU-MOSEI (23,453 utterances) for binary sentiment classification. Under complete modality access, it achieves 86.9% accuracy / 86.9% F1 on MOSI and 86.6% / 86.5% on MOSEI, outperforming the prior best (MTMD) by 0.9 and 0.5 points, respectively. In incomplete fusion (10%–70% randomly masked modalities at test), AlignMamba averages 79.9% accuracy (MOSI), degrading only 11.9% across the missing rates, better than both IMDer and MMIN.
Ablation studies reveal:
- Removing OT local alignment costs 2.3% accuracy (MOSI) and 1.1% (MOSEI)
- Removing MMD global alignment costs 1.1%
- Discarding both, i.e., vanilla Mamba fusion with no explicit alignment, loses 4.6% (MOSI) and 3.2% (MOSEI)
- Language is most critical; dropping it degrades performance more than other modalities
Efficiency is also substantially improved:
- At sequence length 6.4k, AlignMamba uses 8.53 GB of GPU memory vs. 10.7 GB (single-stream Transformer) and 20.3 GB (multi-stream)
- Inference time is 6.05 s vs. 36.13 s and 48.61 s (Transformers), reflecting linear vs. quadratic scaling (Li et al., 2024)
6. Relation to Extended Architectures and Transfer Methods
AlignMamba underpins broader techniques for SSM-based multimodal modeling:
- "TransMamba" applies cross-architecture distillation, transferring knowledge from large-scale pre-trained Transformers to Mamba using aligned latent projections and adaptive bidirectional distillation (WSAB), showing improved sample efficiency and transferability across tasks including vision, VQA, and retrieval (Chen et al., 21 Feb 2025).
- EMMA extends the alignment principle to large multimodal LLMs by enforcing both pixel-wise (“structural”) and multi-scale hierarchical alignment, notably improving sensitivity to visual detail and reducing hallucinations at inference (Xing et al., 2024).
- Variance-aligned rotation methods (sometimes termed “AlignMamba” in quantization contexts) optimize channel-wise statistics for post-training quantization of Mamba networks, using KLT-enhanced and smooth-fused rotations to mitigate outlier amplification by SSMs (Xu et al., 23 Jan 2025).
7. Limitations, Extensions, and Future Directions
The current implementation employs hard token-to-token assignment in local OT; entropic regularization and Sinkhorn solvers would permit finer, many-to-many correspondences. Alternative kernels to the Gaussian in MMD (e.g., polynomial or Laplacian) could improve performance depending on modality characteristics and downstream tasks. Dynamic weighting of local vs. global alignment could adaptively tune regularization to domain-specific correlation structures (e.g., medical imaging with weak semantic overlap).
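As a sketch of the Sinkhorn-based extension mentioned above, the following computes a soft, many-to-many transport plan under uniform marginals; `epsilon` and the iteration count are illustrative choices:

```python
import numpy as np

def sinkhorn(C, epsilon=0.1, iters=200):
    """Entropic-regularized OT plan via Sinkhorn iterations (uniform marginals).

    C: (m, n) cost matrix. Returns a soft (m, n) transport plan T that could
    replace the hard argmin assignment with many-to-many correspondences.
    """
    m, n = C.shape
    mu, nu = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    K = np.exp(-C / epsilon)          # Gibbs kernel
    u = np.ones(m)
    for _ in range(iters):
        v = nu / (K.T @ u)            # scale columns to match nu
        u = mu / (K @ v)              # scale rows to match mu
    return u[:, None] * K * v[None, :]
```

Each source token's row in the plan is then a soft distribution over language tokens, so aligned features become expectation-weighted mixtures rather than single nearest matches.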
Extensions toward longer-horizon video understanding, real-time robotics, or multi-modal registration with larger spatial deformations (as in MambaReg (Wen et al., 2024)) are natural directions. For related alignment and fusion challenges (e.g., conversation-level multimodal emotion recognition (Li et al., 2024) or attitude tuning in beamline automation (Li et al., 2024)), the explicit local/global alignment strategy of AlignMamba provides a reproducible and extensible paradigm.
In summary, AlignMamba exemplifies an efficient paradigm for multimodal fusion that unifies explicit cross-modal correspondence, distributional regularization, and scalable SSM modeling, enabling both state-of-the-art performance and practical deployment across a range of sequence lengths and multimodal settings (Li et al., 2024).