
Surgical-MambaLLM: Multimodal LLM for Surgery

Updated 27 September 2025
  • Surgical-MambaLLM is a multimodal system that fuses the Mamba2 state-space model with InternLM-7B for enhanced visual and textual integration in surgical contexts.
  • It employs a novel Cross-modal Bidirectional Mamba2 Integration (CBMI) and a tailored Surgical Instrument Perception (SIP) scanning mode for precise spatial feature alignment.
  • Experimental results on EndoVis datasets demonstrate improved accuracy, F-score, and spatial localization, benefiting surgical education and robotic assistance.

Surgical-MambaLLM refers to a Mamba2-enhanced multimodal LLM architecture tailored for Visual Question Localized-Answering (VQLA) in robotic surgery, representing the first integration of Mamba2 with an LLM for surgical scene understanding. Its design specifically addresses the challenges of establishing complex cross-modal dependencies between rich surgical imagery and procedural language, as well as robustly perceiving and localizing spatial information in highly structured, instrument-centric surgical scenes (Hao et al., 20 Sep 2025).

1. Motivation and Conceptual Foundations

Existing Surgical-VQLA approaches leverage LLMs and vision models but struggle to form precise cross-modal associations between a question (e.g., “Which tool is near the gallbladder?”) and the spatially localized answer within cluttered surgical scenes. Limitations arise from inadequate modeling of spatial detail and weak fusion of textual-visual cues. Surgical-MambaLLM is explicitly designed to overcome these issues by:

  • Combining the Mamba2 state space sequence model (SSM), known for efficient long-range, linear-time modeling and spatial feature preservation, with an LLM (InternLM-7B).
  • Introducing a Cross-modal Bidirectional Mamba2 Integration (CBMI) module for deeply entangled multimodal fusion.
  • Developing a Surgical Instrument Perception (SIP) scanning mode that matches the geometric properties of laparoscopic surgical scenes: instruments move radially from the image periphery toward the central target, necessitating radial rather than raster scanning for fine-grained spatial alignment.

These innovations enable Surgical-MambaLLM to generate accurate, interpretable answers with precise spatial localization, an essential requirement for surgical education and real-time robotic assistance.

2. Model Architecture and Key Components

The overall architecture consists of a two-stage vision-language fusion pipeline feeding an LLM, integrating structured spatial modeling via Mamba2:

  • Visual Encoder: A pretrained CLIP-ViT-B/32 extracts visual embeddings from input surgical images.
  • Textual Encoder: Tokenizes and projects the input clinical or descriptive question into feature space.
  • Projection Layers: Linear transformations map raw features to a shared, intermediate representation,

F_t = l_t(t), \quad F_v = l_v(v)

where l_t and l_v are the text and visual projectors, and t and v are the tokenized question and image, respectively.

  • CBMI Module (Cross-modal Bidirectional Mamba2 Integration):

    • Employs two distinct scanning streams:
      • Textual stream: 1D scan along the question sequence.
      • Visual stream: SIP scan (see §4).
    • Each stream is processed bidirectionally (forward and backward) using Mamba2 blocks.
    • The outputs S_\text{forward} and S_\text{backward} are merged:

    S = S_\text{forward} \cdot \sigma(F_v) + S_\text{backward} \cdot \sigma(F_v)

    with \sigma as a nonlinearity.
    • The final output is normalized and linearly projected before input to the LLM:

    S_\text{output} = \text{Linear}(\text{LN}(S))

  • LLM Backbone: InternLM-7B receives the fused encoding, producing both answer text and a spatial (typically bounding box-based) localization.

This structure allows precise synchronization between text focus and spatial image features, overcoming the bottlenecks in earlier concatenation/sum-fusion paradigms.
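
The following is a minimal PyTorch-style sketch of the projection-and-fusion path described above. It is an illustrative reading of the summary, not the authors' released code: the Mamba2 blocks are replaced by a stand-in sequence module, the concatenation of text and SIP-ordered visual tokens into a single scanned sequence is an assumption, and the pooling used to broadcast \sigma(F_v) over the sequence is likewise assumed.

```python
# Minimal sketch of the CBMI fusion path (illustrative; not the authors' code).
# Mamba2 blocks are replaced by a stand-in module; names and shapes are assumptions.
import torch
import torch.nn as nn


class Mamba2BlockStandIn(nn.Module):
    """Placeholder for a real Mamba2 SSM block (sequence-in, sequence-out, same shape)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.GRU(dim, dim, batch_first=True)  # stand-in sequence model

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, L, D) -> (B, L, D)
        out, _ = self.mix(x)
        return out


class CBMIFusion(nn.Module):
    """F_t = l_t(t), F_v = l_v(v); bidirectional scans; gated merge; LN + Linear."""
    def __init__(self, text_dim: int, vis_dim: int, dim: int):
        super().__init__()
        self.l_t = nn.Linear(text_dim, dim)   # text projector l_t
        self.l_v = nn.Linear(vis_dim, dim)    # visual projector l_v
        self.fwd = Mamba2BlockStandIn(dim)    # forward scan
        self.bwd = Mamba2BlockStandIn(dim)    # backward scan
        self.norm = nn.LayerNorm(dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, t_feats: torch.Tensor, v_feats: torch.Tensor) -> torch.Tensor:
        # Project both modalities into the shared space.
        F_t = self.l_t(t_feats)                       # (B, L_t, D)
        F_v = self.l_v(v_feats)                       # (B, L_v, D), SIP-ordered tokens
        # Assumption: text tokens and visual tokens form one scanned sequence.
        seq = torch.cat([F_t, F_v], dim=1)            # (B, L_t + L_v, D)
        # Forward and backward (flipped) passes through the Mamba2 stand-ins.
        S_fwd = self.fwd(seq)
        S_bwd = self.bwd(seq.flip(dims=[1])).flip(dims=[1])
        # Gate both directions with sigma(F_v); pooled over visual tokens to broadcast.
        gate = torch.sigmoid(F_v.mean(dim=1, keepdim=True))   # (B, 1, D)
        S = S_fwd * gate + S_bwd * gate
        # S_output = Linear(LN(S)), fed to the LLM backbone.
        return self.out(self.norm(S))
```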

3. Cross-modal Bidirectional Mamba2 Integration (CBMI)

CBMI constitutes a fundamental mechanism for multimodal fusion in Surgical-MambaLLM:

  • Bidirectional Scanning: Both text tokens and spatially scanned image features are sequentially encoded using Mamba2 SSMs in forward and backward directions, capturing comprehensive dependencies.
  • Fusion Strategy: Unlike naive concatenation, forward and backward information flows are selectively merged and modulated using the projected visual embedding, reinforcing the specific linkage between queried tool names/concepts and their spatial manifestation.
  • Advantage: This selective, structured fusion leverages Mamba2’s inherent ability to maintain local feature coherence across long sequences and enhances the context fed into the LLM for subsequent reasoning and answer localization.

The adoption of CBMI directly improves the grounding of answer spans and instrument locations in the visual field compared to previous LLM-based VQLA systems.

4. Surgical Instrument Perception (SIP) Scanning Mode

SIP scanning is tailored to the domain-specific spatial geometry of surgical scenes:

  • Radial Scanning Pattern: Unlike conventional raster or row-wise scans, SIP initiates radial scans from the image center, proceeding outward in four directions to mirror real instrument trajectories during laparoscopic and robotic procedures.
  • Scanning Recurrence: The trajectory for scan-point (x_n, y_n) (for a quadrant) is recursively computed as:

(x_{n+1}, y_{n+1}) = \begin{cases} (0,\ y_n - k_n) & \text{if } y_n = N,\ x_n \neq N \\ (x_n - k_n,\ 0) & \text{if } x_n = N \\ (x_n + 1,\ y_n + 1) & \text{otherwise} \end{cases}

with

k_n = \begin{cases} x_n + 1 & y_n > x_n \\ y_n - 1 & y_n \leq x_n \end{cases}

where N is the maximal image index.

  • Spatial Awareness: By matching surgical instrument trajectories, SIP scanning allows the Mamba2 encoder to better preserve instrument continuity, greatly enhancing the spatial alignment between visual context and the query.

This domain-specific scan directly improves localization precision in answer generation for surgical VQLA.
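
To convey the idea, the sketch below produces a center-outward ("radial") scan ordering for a square token grid and flattens a feature map accordingly. It is an illustration of the general principle only: the quadrant handling, tie-breaking, and the exact recurrence used by Surgical-MambaLLM's SIP mode may differ.

```python
# Illustrative center-outward scan ordering for an N x N token grid.
# Not the paper's exact SIP recurrence; ordering details are assumptions.
import numpy as np


def radial_scan_order(n: int) -> list[tuple[int, int]]:
    """Return grid coordinates sorted from the image center outward.

    Ties (equal distance to center) are broken by angle so each ring is
    traversed consistently.
    """
    cy = cx = (n - 1) / 2.0
    coords = [(y, x) for y in range(n) for x in range(n)]

    def key(p):
        y, x = p
        dist = (y - cy) ** 2 + (x - cx) ** 2
        angle = np.arctan2(y - cy, x - cx)
        return (dist, angle)

    return sorted(coords, key=key)


def apply_scan(features: np.ndarray, order: list[tuple[int, int]]) -> np.ndarray:
    """Flatten an (N, N, D) feature map into an (N*N, D) sequence in scan order."""
    return np.stack([features[y, x] for y, x in order], axis=0)


if __name__ == "__main__":
    order = radial_scan_order(4)
    print(order[:6])                 # coordinates nearest the center come first
    feats = np.random.randn(4, 4, 8)
    print(apply_scan(feats, order).shape)   # (16, 8)
```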

5. Experimental Results and Comparative Performance

Evaluation on EndoVis18-VQLA and EndoVis17-VQLA datasets establishes performance gains over conventional LLM-VQA/VQLA baselines:

| Dataset        | Accuracy | F-Score | mIoU   | Notes                                             |
|----------------|----------|---------|--------|---------------------------------------------------|
| EndoVis18-VQLA | 0.6964   | 0.4110  | 0.8027 | Outperformed state-of-the-art on all metrics      |
| EndoVis17-VQLA | 0.5191   | 0.4406  | 0.7648 | Highest accuracy and F-Score, marginally lower mIoU |

Ablations demonstrate:

  • Both CBMI fusion and SIP scanning independently improve accuracy and F-Score.
  • The effect of SIP is most pronounced in scenes with complex, overlapping instruments.
  • Minor reduction in mIoU is attributed to challenging dataset-specific label granularity.

These results indicate robust visual reasoning, superior answer localization, and a generalizable improvement across representative robotic surgery datasets (Hao et al., 20 Sep 2025).
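
For reference, the mIoU column measures the mean intersection-over-union between predicted and ground-truth answer bounding boxes. The snippet below is a minimal sketch of the standard box-IoU computation underlying that metric; it is the textbook definition, not code from the paper.

```python
# Standard bounding-box IoU, the basis of the mIoU column above (not paper code).
# Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
def box_iou(a: tuple[float, float, float, float],
            b: tuple[float, float, float, float]) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts) -> float:
    """mIoU over a dataset: average IoU of each predicted box against its ground truth."""
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```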

6. Applications, Implications, and Future Directions

Surgical-MambaLLM has immediate applications and broader implications:

  • Educational Assistive Tool: Provides interpretable, spatially localized answers for trainees and junior doctors, elucidating surgical context and instrument usage in complex procedures.
  • Robotic Surgery Support: Enhanced spatial awareness and robust cross-modal reasoning facilitate safer autonomous or semi-autonomous intervention, instrument tracking, and decision support.
  • Generalizability: The architecture—and especially SIP mode—can plausibly be adapted to other domains with strong, structured spatial priors in vision–language tasks.

Future work includes:

  • Refinement of localization accuracy, particularly on external and more heterogeneous datasets.
  • Broadening training data to encompass additional surgical scenarios and procedural variabilities.
  • Further enhancing the fusion and scanning algorithms, possibly with adaptive, content-aware scanning or by incorporating domain-specific knowledge graphs for improved language grounding.
  • Exploring richer forms of grounding in LLMs using reinforcement or contrastive learning in multimodal surgical contexts.

7. Context within Surgical Vision-Language Modeling

Surgical-MambaLLM exemplifies the next stage in multimodal modeling for surgery, building on foundational insights from other Mamba-based approaches in segmentation, surgical workflow recognition, and error detection (Xu et al., 22 Jun 2024, Cao et al., 11 Jul 2024, Zhang et al., 18 Sep 2024). Its CBMI-SIP design addresses the need for contextually grounded, spatially precise reasoning that scales efficiently to the data-intensive, structured environment of computer-assisted and robotic surgery.

Implementation of such architectures suggests a plausible research trend toward the integration of domain-specific scanning, efficient state-space modeling, and LLM reasoning as the technical backbone for future surgical intelligence systems.
