CMI-MTL: Cross-Mamba Interaction in MTL
- CMI-MTL is a state-space modeling framework that replaces quadratic cross-attention with linear, gating-based cross-mamba interactions for multi-task scenarios.
- It employs task-specific towers and shared cross-mamba modules to fuse features efficiently through input-adaptive gating and residual connections.
- Applied in Med-VQA, CMI-MTL integrates contrastive, classification, and generative losses to achieve state-of-the-art performance in cross-modal tasks.
Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) denotes a class of architectures leveraging state-space modeling—specifically the Mamba paradigm—for efficient, adaptive, long-range, and cross-domain communication in multi-task learning. Initially advanced for dense vision tasks through Cross-Task Mamba (CTM) blocks, this methodology has recently been extended to multimodal settings, as evidenced by CMI-MTL's application to medical visual question answering (Med-VQA). The central principle is the replacement of quadratic-complexity cross-attention with linear state-space models combined with input-dependent gating, enabling flexible sharing of parameters and context while scaling to large input sizes.
1. Foundations of Cross-Mamba Interaction in Multi-Task Learning
Cross-Mamba Interaction is built on state-space sequence modeling that supports both intra-stream (e.g., within-task or unimodal) and cross-stream (e.g., multi-task or multimodal) feature propagation. In this paradigm, traditional multi-head attention, which incurs quadratic computational cost in sequence length or spatial map size, is supplanted by two-dimensionally extended state-space models (SSMs) with linear run time. The interaction is selectively modulated by learned, input-adaptive gating at the feature or token level.
The archetypal pipeline involves two interleaved stages (a structural sketch follows the list):
- Task-specific towers: Each task processes its features through stacks of "self-task" Mamba blocks (vertical separation).
- Shared cross-mamba (CTM/CIFR) modules: Periodically, all task or modality features are jointly aggregated, selectively fused, and re-injected, ensuring both long-range context and cross-task/intermodal transferability.
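The following sketch illustrates this alternation in generic PyTorch, with simple linear blocks standing in for real Mamba/SSM layers; module names such as `CrossInteractionStage` and `TaskTowers` are illustrative rather than taken from the papers.

```python
# Structural sketch only: per-task "self" towers interleaved with a shared
# cross-interaction stage that gates each stream by a global context and
# re-injects it residually. Linear/GELU blocks stand in for Mamba layers.
import torch
import torch.nn as nn

class CrossInteractionStage(nn.Module):
    """Aggregate all task streams into a global context, gate each stream
    with it element-wise, and add the result back residually."""
    def __init__(self, dim: int, num_tasks: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.shared_proj = nn.Linear(num_tasks * dim, dim)   # concatenated -> global feature
        self.fuse = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_tasks)])

    def forward(self, streams):                               # list of (B, N, dim) tensors
        shared = self.shared_proj(torch.cat([self.norm(s) for s in streams], dim=-1))
        return [s + fuse(self.norm(s) * shared)               # input-adaptive gating + residual
                for s, fuse in zip(streams, self.fuse)]

class TaskTowers(nn.Module):
    """Per-task blocks with a shared cross-interaction stage every few layers."""
    def __init__(self, dim: int, num_tasks: int, depth: int = 4, cross_every: int = 2):
        super().__init__()
        self.towers = nn.ModuleList([
            nn.ModuleList([nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
                           for _ in range(depth)])
            for _ in range(num_tasks)
        ])
        self.cross = nn.ModuleList([CrossInteractionStage(dim, num_tasks)
                                    for _ in range(depth // cross_every)])
        self.cross_every = cross_every

    def forward(self, streams):
        for d in range(len(self.towers[0])):
            streams = [tower[d](s) + s for tower, s in zip(self.towers, streams)]
            if (d + 1) % self.cross_every == 0:                # periodic shared fusion
                streams = self.cross[d // self.cross_every](streams)
        return streams
```

In a real instantiation, both the per-task blocks and the shared stage would be built from selective SSM layers; the points being illustrated are the alternation pattern and the gated, residual re-injection of shared context.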
2. Module Dissection: CMI-MTL for Medical Visual Question Answering
Recent work proposes the CMI-MTL framework (Jin et al., 3 Nov 2025) for Med-VQA, adopting cross-mamba principles for handling cross-modal vision-language tasks. CMI-MTL is architected in three principal modules:
2.1 Fine-Grained Visual-Text Feature Alignment (FVTA)
- Visual features are extracted via a ViT backbone; textual features from RoBERTa.
- A question-aware Q-former (QQ-Former) generates learnable queries.
- A two-step self-attention (over queries) and cross-attention (queries attending to image patches) pipeline aligns question context to visual embeddings.
- Training is driven by a cross-modal contrastive loss (CMCL), an InfoNCE-style objective in which the image–text similarity is the maximal cosine alignment of the learned queries against the pooled text embedding; matched pairs are pulled together and in-batch mismatches are pushed apart (a hedged sketch follows).
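A minimal sketch of such a loss is below. Only the max-cosine query-versus-pooled-text similarity is taken from the description above; the temperature value, symmetry across directions, and function name are assumptions.

```python
# Hedged sketch of an image-text contrastive objective in the spirit of CMCL.
import torch
import torch.nn.functional as F

def cmcl_loss(queries: torch.Tensor, text_pooled: torch.Tensor, tau: float = 0.07):
    """queries: (B, K, D) learned query embeddings; text_pooled: (B, D)."""
    q = F.normalize(queries, dim=-1)
    t = F.normalize(text_pooled, dim=-1)
    # (B_img, B_txt): take the best-aligned query for each image-text pair
    sim = torch.einsum("ikd,jd->ijk", q, t).max(dim=-1).values / tau
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_i2t = F.cross_entropy(sim, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(sim.t(), targets)  # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)
```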
2.2 Cross-Modal Interleaved Feature Representation (CIFR)
- FVTA outputs query embeddings $Z$ and text tokens $T$; both undergo LayerNorm.
- The cross-mamba block constructs two intertwined streams, $Z \leftarrow \mathrm{Fus}(\mathrm{Mamba}(Z,\, Z \odot T)) + Z$ and $T \leftarrow \mathrm{Fus}(\mathrm{Mamba}(T,\, T \odot Z)) + T$, with $\odot$ denoting element-wise multiplication (cross-modal gating) and $\mathrm{Fus}(\cdot)$ a 1×1 projection.
- Two blocks are stacked, and the final fused representation is $x_f = \mathrm{Concat}(Z, T)$.
- Relative to cross-attention, this method eschews dot-product query–key matching in favor of linear state propagation along the token dimension, preserving sequential information and favoring input-dependent, element-wise modulation; a minimal implementation sketch follows.
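The sketch below mirrors the interleaved update in PyTorch, assuming both streams carry the same number of tokens and substituting a causal depthwise convolution for the actual selective SSM scan; the class and method names are illustrative, not the released code.

```python
# Hedged sketch of the cross-mamba update: each stream is gated element-wise by
# the other, mixed along the token axis, projected by a 1x1 layer (Fus), and
# added back residually.
import torch
import torch.nn as nn

class CrossMambaBlockSketch(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.norm_z = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        # depthwise causal convs stand in for the selective SSM scan
        self.mix_z = nn.Conv1d(dim, dim, kernel_size, groups=dim, padding=kernel_size - 1)
        self.mix_t = nn.Conv1d(dim, dim, kernel_size, groups=dim, padding=kernel_size - 1)
        self.fus_z = nn.Linear(dim, dim)   # the "1x1" Fus projection
        self.fus_t = nn.Linear(dim, dim)

    def _mamba_standin(self, mix, x, gate):
        # causal token mixing of x, modulated element-wise by the cross-modal gate
        y = mix(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return y * torch.sigmoid(gate)

    def forward(self, z, t):               # z: (B, N, D) queries, t: (B, N, D) text tokens
        zn, tn = self.norm_z(z), self.norm_t(t)
        z = self.fus_z(self._mamba_standin(self.mix_z, zn, zn * tn)) + z
        t = self.fus_t(self._mamba_standin(self.mix_t, tn, tn * self.norm_z(z))) + t
        return z, t
```

Stacking two such blocks and concatenating the two streams along the feature axis reproduces the flow shown in the algorithm skeleton of Section 3.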
2.3 Free-Form Answer-Enhanced Multi-Task Learning (FFAE)
- The fused representation $x_f$ is pooled and classified as open/closed (classification loss).
- For open-ended questions, $x_f$ is input to a T5 decoder for answer generation, trained under a cross-entropy objective with masking.
- The total loss is a weighted sum, $\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \alpha\,\mathcal{L}_{\mathrm{vtc}} + \beta\,\mathcal{L}_{\mathrm{aux}}$, where $\alpha$ and $\beta$ weight the contrastive and answer-generation terms.
3. Mathematical Underpinnings and Algorithmic Workflow
The Cross-Mamba block generalizes to any two token streams $X$ and $Y$ as follows:
- Gating: form the input-adaptive gate $G = X \odot Y$ by element-wise multiplication.
- State-Space Propagation: $H = \mathrm{Mamba}(X, G)$, a linear selective scan along the token dimension.
- Fusion: $\hat{X} = \mathrm{Fus}(H)$, a 1×1 projection.
- Residual: Add $\hat{X}$ back to $X$ (or $Y$), depending on iteration.
During training, batches from the datasets (e.g., SLAKE, VQA-RAD, OVQA) are processed through FVTA, then CIFR (with two cross-mamba blocks), followed by the FFAE multi-task heads. The joint loss is backpropagated with the Adam optimizer over 50 epochs.
Algorithm Skeleton:
```python
for epoch in range(50):
    for minibatch in dataloader:
        # FVTA: encode image and question, align learnable queries to visual patches
        V = ViT(images)
        T = RoBERTa(questions)
        Z = QQFormer(Z0, V)                     # question-aware Q-former ("QQ-Former")
        L_vtc = contrastive_loss(Z, pool(T))    # cross-modal contrastive loss (CMCL)

        # CIFR: two stacked cross-mamba blocks with element-wise cross-modal gating
        Z, T = LayerNorm(Z), LayerNorm(T)
        for _ in range(2):
            Z = Fus(Mamba(Z, Z * T)) + Z
            T = Fus(Mamba(T, T * Z)) + T
        x_f = concat(Z, T)

        # FFAE: open/closed classification head plus T5 answer generation
        y_pred = Classifier(Pool(x_f))
        L_cls = classification_loss(y_pred, y_true)
        if open_ended:
            a_pred = T5(x_f, mask)
            L_aux = generation_loss(a_pred, gt)

        L = L_cls + alpha * L_vtc + beta * L_aux
        L.backward()
        optimizer.step()
```
4. Comparative Analysis with Prior Cross-Mamba and CTM Paradigms
The Cross-Mamba strategies in CMI-MTL generalize prior CTM patterns observed in dense scene multi-task settings such as MTMamba (Lin et al., 2 Jul 2024) and MTMamba++ (Lin et al., 27 Aug 2024). CTM modules in MTMamba process task-specific feature maps as:
- LayerNorm is applied to each task-specific feature map.
- A concatenated global feature is built and likewise LayerNorm'ed.
- Both the per-task and the joint (global) features are passed through the same SSM-based extractor.
- Adaptive gating modulates task-specific and shared features.
- Final features are updated by residual addition of linearly fused outputs (a hedged sketch of these steps follows the list).
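The sketch below walks through the five steps with illustrative names and an `nn.GRU` standing in for the 2D SSM extractor; it is an assumption-laden reconstruction for orientation, not MTMamba's released code.

```python
# Hedged sketch of a CTM-style update: per-task LayerNorm, a LayerNorm'ed global
# feature, a shared sequence extractor applied to both streams, adaptive gating,
# and residual addition of a linearly fused output.
import torch
import torch.nn as nn

class CTMSketch(nn.Module):
    def __init__(self, dim: int, num_tasks: int):
        super().__init__()
        self.task_norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_tasks)])
        self.global_norm = nn.LayerNorm(dim)
        self.global_proj = nn.Linear(num_tasks * dim, dim)
        self.extractor = nn.GRU(dim, dim, batch_first=True)   # stand-in for the shared SSM extractor
        self.gate = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(num_tasks)])
        self.fuse = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(num_tasks)])

    def forward(self, feats):                       # list of (B, N, dim) task features
        normed = [norm(f) for norm, f in zip(self.task_norm, feats)]
        g = self.global_norm(self.global_proj(torch.cat(normed, dim=-1)))   # concatenated global feature
        g_out, _ = self.extractor(g)                # shared extractor on the global stream
        out = []
        for f, n, gate, fuse in zip(feats, normed, self.gate, self.fuse):
            t_out, _ = self.extractor(n)            # same extractor on the task stream
            a = torch.sigmoid(gate(torch.cat([t_out, g_out], dim=-1)))      # adaptive gating
            mixed = a * t_out + (1 - a) * g_out
            out.append(f + fuse(torch.cat([mixed, g_out], dim=-1)))         # residual, linearly fused
        return out
```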
Empirically, on NYUDv2 and PASCAL-Context:
- MTMamba achieves mIoU 55.82 (+0.98), RMSE 0.5066 (best), the best mErr, and odsF 78.70 (+0.50), with consistent improvements on three PASCAL-Context tasks over the best prior methods (Lin et al., 2 Jul 2024).
- Ablations confirm CTM's additive effect over STM-only baselines.
In MTMamba++, CTM is refined into Feature-level (F-CTM) and Semantic-aware (S-CTM) forms, employing dynamic gating and cross state-space layers, respectively. S-CTM, which intertwines task and shared semantic streams through cross-SSM layers with multi-directional 2D scanning, yields the largest ablation improvement on NYUDv2 (Lin et al., 27 Aug 2024).
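The snippet below illustrates multi-directional 2D scanning in isolation: a feature map is unrolled along four scan orders so a 1D state-space layer can propagate context in each direction, and the direction-wise outputs are folded back and averaged. Function names and the averaging merge are assumptions for illustration, not the papers' exact scheme.

```python
# Illustrative four-direction scan/unscan for a (B, C, H, W) feature map; a 1D
# SSM would process each returned sequence before merge_scans is applied.
import torch

def four_direction_scans(x: torch.Tensor):
    """x: (B, C, H, W) -> four (B, C, H*W) sequences in different scan orders."""
    rowwise = x.flatten(2)                              # left-to-right, top-to-bottom
    colwise = x.transpose(2, 3).flatten(2)              # top-to-bottom, left-to-right
    return [rowwise, rowwise.flip(-1), colwise, colwise.flip(-1)]

def merge_scans(outs, h: int, w: int):
    """Undo the four scan orders and average the direction-wise outputs."""
    rowwise = (outs[0] + outs[1].flip(-1)) / 2
    colwise = (outs[2] + outs[3].flip(-1)) / 2
    b, c, _ = rowwise.shape
    col_back = colwise.view(b, c, w, h).transpose(2, 3).flatten(2)
    return ((rowwise + col_back) / 2).view(b, c, h, w)
```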
5. Advantages, Differences from Attention, and Computational Considerations
Key distinctions between Cross-Mamba interaction and prevailing approaches (a back-of-the-envelope scaling comparison follows the list):
- Complexity: Mamba-style SSMs operate in $O(n)$ (with $n$ tokens or spatial positions) rather than $O(n^2)$.
- Fusion Mechanism: Gating is input-dependent and element-wise, rather than static or softmax-weighted as in attention.
- Information Flow: Alternating intra-stream and cross-stream blocks ensures controlled, stage-wise integration of private and shared contexts.
- Parameter Efficiency: Gating and 1x1 projections mitigate model size, especially in high-dimensional multitask/multimodal setups.
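The following generic arithmetic (an assumed state size, not measurements from either paper) contrasts the dominant interaction terms of cross-attention and an SSM scan as the token count grows.

```python
# Back-of-the-envelope scaling comparison: attention's token-interaction term
# grows quadratically in sequence length n, an SSM scan grows linearly.
def interaction_terms(n: int, d: int, d_state: int = 16):
    attention = n * n * d          # query-key dot products dominate for large n
    ssm_scan = n * d * d_state     # one recurrent state update per token and channel
    return attention, ssm_scan

for n in (256, 1024, 4096, 16384):
    attn, ssm = interaction_terms(n, d=256)
    print(f"n={n:>6}: attention ~{attn:.2e}, ssm ~{ssm:.2e}, ratio {attn / ssm:.1f}x")
```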
A plausible implication is that for data with large sequence/spatial dimensions or with many tasks/modalities, cross-mamba approaches scale more gracefully than transformers, while capturing richer, input-adaptive cross-context.
6. Empirical Outcomes and Interpretability
In CMI-MTL (Jin et al., 3 Nov 2025), experiments on Med-VQA benchmarks demonstrate consistent state-of-the-art performance. Interpretability analysis via Grad-CAM indicates that:
- In closed-ended (classification) queries, model focus aligns closely with disease-relevant image regions.
- For open-ended (generative) queries, the model localizes subtle pathologies, especially when answer supervision is provided.
- Compared to prior models, cross-mamba guided features produce more focused and clinically plausible heatmaps.
Such results, consistently observed across vision-language and dense vision tasks, support the effectiveness of cross-mamba as a unifying principle for multi-task and multimodal representation learning.
7. Broader Implications and Prospective Directions
Cross-Mamba interaction—instantiated as task-task CTM in vision and cross-modal blocks in VQA—offers a template for cross-domain information flow wherever sequence length, spatial extent, or the number of tasks preclude traditional attention mechanisms. As SSM advances continue and as datasets grow in both modality and task diversity, the modularity of cross-mamba fusion, input-adaptive gating, and residual/stateful blending provides a robust and scalable multi-tasking backbone. Future work may explore its extension to non-visual modalities, hierarchical task structures, and online/streaming settings requiring linear-time inference.