Hilbert-Mamba Cross-Attention (HMCA)
- HMCA is a cross-modal fusion mechanism that integrates multimodal 3D imaging data, preserving spatial locality through Hilbert curve-based flattening.
- It combines Hilbert ordering, the Mamba State Space Model, and dot-product cross-attention to enable robust multimodal feature interaction for precise diagnostics.
- Empirical results on benchmarks, such as BraTS2021, demonstrate significant improvements in segmentation quality and diagnostic accuracy.
Hilbert-Mamba Cross-Attention (HMCA) is a cross-modal fusion mechanism introduced for robust integration of multimodal 3D medical data in vision-language architectures. Designed as a core innovation within the HilbertMed-SAM component of the Hilbert-VLM framework, HMCA addresses the challenge of preserving both fine-grained spatial detail and global context when processing volumetric data, especially in medical diagnostic tasks where spatial locality is paramount. By combining Hilbert curve-based flattening with the Mamba State Space Model (SSM) and standard dot-product cross-attention, HMCA achieves efficient, locality-aware feature interaction at scale, with demonstrated empirical benefits in segmentation and diagnosis accuracy (Wu et al., 30 Dec 2025).
1. Formal Definition and Mathematical Structure
HMCA is structurally a two-stage cross-attention block operating on pairs of 3D feature maps $X_A, X_B \in \mathbb{R}^{D \times H \times W \times C}$, where each can represent a different imaging modality (e.g., T1 and FLAIR MRI). Its workflow is characterized by:
- Hilbert Flattening/Unflattening: Each 3D position $(x, y, z)$ is mapped to a 1D index $t = \mathcal{H}(x, y, z)$ along a Hilbert space-filling curve, flattening the volume into a sequence that preserves spatial locality. The inverse unscan $\mathcal{H}^{-1}$ places processed tokens back into their original 3D coordinates.
- Mamba SSM Contextualization: The flattened sequence is processed via the Mamba SSM for local and global sequence modeling, utilizing the recurrence $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t$.
- Dot-Product Cross-Attention: Standard queries ($Q$), keys ($K$), and values ($V$) are computed and fused using scaled dot products.
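The locality-preserving property of the Hilbert scan can be checked directly. The sketch below uses the classic iterative 2D coordinate-to-index algorithm for illustration (HMCA applies the 3D analogue): consecutive indices along the curve always land on neighbouring grid cells, which is exactly what keeps spatially adjacent voxels adjacent in the flattened sequence.

```python
def xy2d(n, x, y):
    """Map a 2D grid coordinate to its Hilbert-curve index.

    n is the grid side length (a power of two). The same construction
    extends to 3D, which is what HMCA uses to flatten volumes.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the sub-curve has canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Flatten an 8x8 grid and record the scan order.
n = 8
order = {xy2d(n, x, y): (x, y) for x in range(n) for y in range(n)}

# Locality check: consecutive Hilbert indices are always grid neighbours.
adjacent = all(
    abs(order[t][0] - order[t + 1][0]) + abs(order[t][1] - order[t + 1][1]) == 1
    for t in range(n * n - 1)
)
print(adjacent)  # True: the scan never jumps, unlike raster order
```

A raster (row-major) scan fails this check at every row boundary, which is why Hilbert ordering is preferred when downstream sequence models rely on nearby tokens being spatially related.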
The stepwise computation is as follows:
- Linear projections:
$$Q = X_A W_Q, \quad K = X_B W_K, \quad V = X_B W_V,$$
with $W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$.
- Hilbert-Mamba Contextualization:
$$\tilde{Q} = \mathcal{H}^{-1}\big(\mathrm{Mamba}(\mathcal{H}(Q))\big), \qquad \tilde{K} = \mathcal{H}^{-1}\big(\mathrm{Mamba}(\mathcal{H}(K))\big).$$
- Cross-Attention:
$$Z = \mathrm{softmax}\!\left(\frac{\tilde{Q}\tilde{K}^{\top}}{\sqrt{d}}\right) V.$$
- Refinement and Output: the attended features are projected and combined residually with the input, $X_{\mathrm{out}} = X_A + W_O Z$.
The full update can be summarized in compact notation as:
$$X_{\mathrm{out}} = X_A + W_O\,\mathrm{Attn}\big(\mathrm{HM}(X_A W_Q),\; \mathrm{HM}(X_B W_K),\; X_B W_V\big),$$
where $\mathrm{HM}(\cdot) = \mathcal{H}^{-1}(\mathrm{Mamba}(\mathcal{H}(\cdot)))$ denotes Hilbert-Mamba contextualization.
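The stepwise computation above can be sketched end-to-end in numpy. This is a minimal illustration under simplifying assumptions, not the paper's implementation: the Hilbert scan is stood in for by a fixed permutation `perm`, Mamba's selective (input-dependent) scan is replaced by a time-invariant recurrence, and attention is single-head.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 64, 16          # tokens per volume (after flattening), channel dim

def ssm_scan(x, a=0.9):
    """Simplified state-space recurrence h_t = a*h_{t-1} + x_t, y_t = h_t.
    Mamba makes the parameters input-dependent; a fixed scalar decay is
    enough here to show how context propagates along the scan order."""
    h = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + x[t]
        out[t] = h
    return out

def cross_attention(q, k, v):
    """Standard scaled dot-product cross-attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Two modalities, as flattened token sequences; `perm` stands in for the
# Hilbert scan order over the 3D volume.
x_a, x_b = rng.normal(size=(N, d)), rng.normal(size=(N, d))
perm = rng.permutation(N)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# 1) project, 2) contextualize along the scan, 3) fuse, 4) unscan.
q = ssm_scan((x_a @ Wq)[perm])
k = ssm_scan((x_b @ Wk)[perm])
v = (x_b @ Wv)[perm]
fused = cross_attention(q, k, v)
out = np.empty_like(fused)
out[perm] = fused                    # inverse (unscan) step
print(out.shape)  # (64, 16)
```

The residual connection and output projection of the refinement step are omitted for brevity; they would wrap `out` exactly as in the compact update above.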
2. Integration in Multimodal 3D Medical Architectures
HMCA is deployed at multiple stages within the HilbertMed-SAM encoder, where it fuses feature representations from modality-specific SAM encoder outputs. At each encoder stage—corresponding to different spatial scales due to progressive down/upsampling—one HMCA block integrates the paired feature cubes. This multi-scale application ensures that fusion benefits both low-level fine structures and high-level semantic context.
After memory-infused modules, the dual-path decoders use related Hilbert-Mamba Blocks for self-fusion, but HMCA itself is reserved for cross-modal fusion only. In the VLM pipeline’s prompt enhancement module, HMCA merges visual and textual tokens, producing an enhanced prompt for downstream disease classification.
3. Role in Prompt Enhancement and VLM Fusion
The prompt enhancement module unifies segmentation mask encodings and associated textual labels using HMCA. Specifically:
- Visual tokens $F_v \in \mathbb{R}^{N_v \times d}$, derived from the segmentation mask encodings,
- Textual tokens $F_t \in \mathbb{R}^{N_t \times d}$, derived from the associated labels.
HMCA uses text-token queries and visual-token keys/values, yielding a cross-modal embedding
$$E = \mathrm{softmax}\!\left(\frac{(F_t W_Q)(F_v W_K)^{\top}}{\sqrt{d}}\right) F_v W_V.$$
This fused vector is prepended to the VLM’s text input during inference and training, with the composite loss combining cross-entropy and a consistency constraint on prompt representations.
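A minimal sketch of this fusion direction, assuming single-head attention and identity projections (a real module would learn the projection matrices): text tokens act as queries over the visual keys/values, and the fused result is prepended to the text sequence.

```python
import numpy as np

rng = np.random.default_rng(1)
n_txt, n_vis, d = 8, 32, 16   # token counts and embedding dim (illustrative)

F_t = rng.normal(size=(n_txt, d))   # textual label tokens
F_v = rng.normal(size=(n_vis, d))   # segmentation-mask (visual) tokens

def cross_attend(q, k, v):
    """Single-head scaled dot-product attention (projections omitted)."""
    w = np.exp(q @ k.T / np.sqrt(d))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Text queries attend over visual keys/values to form fused prompt tokens.
prompt = cross_attend(F_t, F_v, F_v)

# The fused embedding is prepended to the VLM's text input sequence.
enhanced_input = np.concatenate([prompt, F_t], axis=0)
print(enhanced_input.shape)  # (16, 16)
```

Note the asymmetry: because queries come from the text side, the output has one fused token per text token, which is what allows it to be concatenated directly onto the prompt.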
A plausible implication is that fusing visual and textual modalities at this level provides improved semantic alignment for VLM-based diagnoses, leveraging both spatial and contextual signals (Wu et al., 30 Dec 2025).
4. Computational Complexity and Scalability Considerations
Standard transformer cross-attention on 3D volumes is computationally prohibitive, scaling as $O(N^2)$ in the number of voxels $N = D \times H \times W$. HMCA addresses this by:
- Employing Hilbert flattening/unflattening with $O(N)$ complexity, ensuring spatially local tokens are grouped in the sequence,
- Executing Mamba SSM steps at $O(N)$,
- Using block-local dot-product attention in the cross-fusion step, enabled by Hilbert ordering, leading to an empirical scaling near $O(N \cdot B)$, where $B \ll N$ is the block size.
For full-resolution 3D images, this approach allows tractable, locality-aware attention on modern hardware.
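The block-local pattern can be illustrated as follows. The blocking into contiguous $B$-token windows along the ordered sequence is an assumption about how "block-local" is realized; the point is the cost structure: each block pays $O(B^2)$, giving $O(N \cdot B)$ overall instead of $O(N^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, B, d = 64, 8, 16   # sequence length, block size, channels (illustrative)

x_q = rng.normal(size=(N, d))
x_kv = rng.normal(size=(N, d))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hilbert ordering groups spatial neighbours, so attention can be restricted
# to contiguous B-token blocks without cutting across anatomical structures.
q = x_q.reshape(N // B, B, d)
k = x_kv.reshape(N // B, B, d)
v = x_kv.reshape(N // B, B, d)
scores = np.einsum('nbd,ncd->nbc', q, k) / np.sqrt(d)   # (N/B, B, B) per-block
out = np.einsum('nbc,ncd->nbd', softmax(scores), v).reshape(N, d)
print(out.shape)  # (64, 16)
```

Under a raster ordering the same blocking would mix tokens from distant slices; it is the Hilbert scan that makes contiguous blocks spatially meaningful.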
5. Empirical Impact and Contributions
In experimental evaluation on the BraTS2021 benchmark, replacing standard SAM2 with HilbertMed-SAM—including HMCA and scale-aware decoding—improved Dice score from approximately 78% to 82.35%. For VLM-based diagnostic classification, inserting HMCA in the prompt module boosted accuracy from Qwen-VL’s 67.31% to 78.85%. Qualitative results demonstrate that HMCA-based fusion preserves fine anatomical boundaries and small lesions more effectively and reduces false positives in challenging cases (Wu et al., 30 Dec 2025).
Notably, there is no standalone ablation removing HMCA alone, but these end-to-end performance gains suggest it plays a key role in enhancing both the perception and cognition stages.
6. Summary Table: Key Operations in HMCA
| Stage | Operation | Complexity |
|---|---|---|
| Hilbert flatten/unflatten | 3D ↔ 1D scan/unscan (locality-preserving) | $O(N)$ |
| Mamba SSM | 1D sequence modeling | $O(N)$ |
| Dot-product attention | Block-local or global fusion | $O(N \cdot B)$ or $O(N^2)$ |
7. Significance and Context
HMCA introduces principled sequence modeling into volumetric attention by combining Hilbert ordering for spatial locality, Mamba SSM for efficient context propagation, and classic cross-attention for modality fusion. The mechanism is particularly suited for large 3D biomedical datasets, where both anatomical precision and computational efficiency are prioritized. The use of Hilbert curves is instrumental in translating 3D spatial relations into the 1D processing domain with minimal loss of neighborhood information. This approach bridges the gap between efficient global context modeling and preservation of intricate structural details in vision-language diagnostics (Wu et al., 30 Dec 2025).