SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Published 31 Mar 2026 in cs.CV | (2603.29962v2)

Abstract: Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) hold promise for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy (Perception, Assessment, and Reasoning) spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.


Authors (9)

Summary

The paper introduces a novel multimodal LLM framework with a text-guided memory pyramid to capture both spatial details and long-range temporal context in surgical video QA.
It uses curriculum-style Surgical Competency Progression training and achieves superior performance across perception, assessment, and reasoning tasks in cholecystectomy.
Quantitative and ablation analyses demonstrate that leveraging temporal memory and cross-modal attention significantly enhances clinical assessment and safety verification.

SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-Guided Visual Memory

Motivation and Challenges in Surgical Video QA

The complexity and risk of laparoscopic cholecystectomy demand expert-level intraoperative interpretation and assessment. Surgical video QA systems have the potential to satisfy both educational and real-time support needs, yet most prior approaches are constrained to static image-based VQA, with limited temporal modeling. This neglects temporal semantics that are essential for procedure-specific assessments, such as Critical View of Safety (CVS) achievement, adverse event detection, and skill evaluation. Additionally, the domain exhibits low visual contrast and hierarchical task dependencies, and it requires variable spatiotemporal granularity. These challenges call for multimodal LLM-based approaches capable of dynamic temporal understanding and domain-specific reasoning.

Architecture of SurgTEMP and Text-Guided Memory Pyramid

SurgTEMP introduces a multimodal LLM framework integrating a domain-adaptive Text-Guided Memory Pyramid (TEMP) module. The core pipeline uses SigLIP-based visual encoding, a multimodal projector, spatial pooling, and Qwen2-7B as the language backend. The TEMP module uses cross-modal attention to guide hierarchical memory formation: based on the textual query, it selects spatially relevant patches and temporally salient frames, from which it builds the spatial and temporal memory banks.

Figure 1: SurgTEMP architecture highlighting multimodal feature extraction and hierarchical memory bank integration for video QA.

TEMP computation includes multi-level text-visual attention maps, Gumbel-Softmax-based differentiable frame selection, patch-level reweighting, and insertion of learnable separator tokens to encode structural boundaries. This design enables explicit modeling of both fine-grained operative details and long-range procedural context, crucial for clinical assessments across variable-length videos.
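As a rough illustration of the Gumbel-Softmax selection and separator tokens mentioned above, the sketch below shows only the forward pass in plain Python. The temperature value, the top-k rule, the `<sep>` token string, and the helper names are assumptions; in a real training setup the relaxation would run inside an autodiff framework (typically with a straight-through estimator) so that gradients reach the frame scores.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Softmax over Gumbel-perturbed logits; lower tau gives sharper outputs."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp u into the open interval (0, 1) so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def pick_frames(logits, k, tau=1.0, rng=None):
    """Keep the k frames with the highest relaxed weights, in temporal order."""
    probs = gumbel_softmax(logits, tau=tau, rng=rng)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return sorted(ranked)


def build_temporal_memory(frame_tokens, chosen, sep="<sep>"):
    """Concatenate tokens of the chosen frames, closing each frame with a separator."""
    sequence = []
    for i in chosen:
        sequence.extend(frame_tokens[i])
        sequence.append(sep)  # hypothetical stand-in for a learnable separator embedding
    return sequence


# Toy example: four frames with text-relevance logits; keep two of them.
rng = random.Random(0)
chosen = pick_frames([2.0, 0.5, 0.1, 1.5], k=2, tau=0.5, rng=rng)
memory = build_temporal_memory([["a1", "a2"], ["b1"], ["c1"], ["d1"]], chosen)
```

Because the softmax is monotonic, taking the top-k relaxed weights is equivalent to sampling k frames without replacement via the Gumbel top-k trick, so high-scoring frames are favoured but not chosen deterministically.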

Figure 2: TEMP module processing steps: cross-modal attention, spatial memory bank construction, temporal memory bank formation.

CholeVidQA-32K Dataset and Curation

SurgTEMP is trained and evaluated on the CholeVidQA-32K dataset, which comprises 32K open-ended QA pairs and 3,855 video segments (~128 hours) drawn from the CholecT50, Endoscapes, and CholeScore sources. The dataset is designed to capture the hierarchical progression of surgical competencies across three levels: Perception (tool, action, anatomy), Assessment (CVS, difficulty, adverse events, skills), and Reasoning (scene description, rationale, planning).

Figure 3: CholeVidQA-32K hierarchy showing 11 tasks mapped to Perception, Assessment, and Reasoning cognitive levels.

Figure 4: Dataset curation pipeline integrating expert annotation, prompt engineering, and stratified review for quality assurance.

Figure 5: CholeVidQA-32K composition and task distribution across hierarchy and temporal duration.

Surgical Competency Progression Training and Baseline Comparison

The Surgical Competency Progression (SCP) scheme implements curriculum-style, stage-wise training by sequentially exposing the model to perception, assessment, and reasoning tasks. Progressive sampling ensures foundational skills are maintained in later stages. Baseline comparisons include both zero-shot (mPLUG-Owl3, InternVideo2.5, LongVA, LLaVA-Video, VideoGPT+) and fine-tuned models (LLaVA-Video-ft, VideoGPT+-ft).
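One way to picture stage-wise training with progressive sampling is a data sampler that, at each stage, draws mostly from the current level while replaying examples from earlier levels. The sketch below is an assumption-laden illustration: the 30% replay fraction, the even split across earlier levels, and the names `stage_mixture` and `sample_batch` do not come from the paper.

```python
import random

STAGES = ["perception", "assessment", "reasoning"]


def stage_mixture(stage_idx, replay_fraction=0.3):
    """Sampling probabilities over competency levels for a given training stage."""
    if stage_idx == 0:
        return {STAGES[0]: 1.0}
    earlier = STAGES[:stage_idx]
    # Assumption: the replay budget is split evenly across earlier levels.
    mix = {name: replay_fraction / len(earlier) for name in earlier}
    mix[STAGES[stage_idx]] = 1.0 - replay_fraction
    return mix


def sample_batch(pools, stage_idx, batch_size, rng, replay_fraction=0.3):
    """Draw a batch of QA examples according to the stage mixture."""
    mix = stage_mixture(stage_idx, replay_fraction)
    levels = list(mix)
    weights = [mix[level] for level in levels]
    picked_levels = rng.choices(levels, weights=weights, k=batch_size)
    return [rng.choice(pools[level]) for level in picked_levels]


# Toy pools of QA identifiers per level (hypothetical placeholders).
pools = {
    "perception": ["p1", "p2", "p3"],
    "assessment": ["a1", "a2"],
    "reasoning": ["r1", "r2"],
}
rng = random.Random(0)
for stage_idx, stage in enumerate(STAGES):
    batch = sample_batch(pools, stage_idx, batch_size=8, rng=rng)
    print(stage, batch)
```

The replay term is what keeps perception examples in the mix during the assessment and reasoning stages, which is the intent behind maintaining foundational skills in later stages.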

Figure 6: Evaluation pipeline incorporates categorical metrics, overlap metrics, and LLM-based multidimensional scores.

Quantitative Results and Ablation Analysis

SurgTEMP consistently outperforms both zero-shot and fine-tuned baselines across the reported metrics: balanced accuracy, F1, BLEU, METEOR, ROUGE-L, CIDEr, and GPT-judge scores for correctness, relevance, and linguistic quality. The gains are most pronounced on assessment and reasoning tasks, which require long-range context and domain-specific understanding.
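For the categorical metrics, a minimal reference computation may help clarify why balanced accuracy is reported alongside F1. The sketch uses the standard definitions (mean per-class recall, and per-class F1 averaged over classes); whether the paper uses macro averaging, and how free-text answers are mapped to class labels, are not stated here and are treated as assumptions.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed over classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Imbalanced toy example: a model that always answers "yes".
y_true = ["yes", "yes", "yes", "yes", "no"]
y_pred = ["yes", "yes", "yes", "yes", "yes"]
print(balanced_accuracy(y_true, y_pred))  # 0.5, although plain accuracy is 0.8
print(macro_f1(y_true, y_pred))
```

On imbalanced labels, such as rare adverse events, a trivial majority-class predictor scores well on plain accuracy but not on balanced accuracy, which is why the latter is the more informative figure for this kind of task.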

Figure 7: Radar chart of model performance on task hierarchies, demonstrating balanced excellence and high scores across Perception, Assessment, and Reasoning.

Ablation studies show substantial drops in correctness, relevance, and linguistic quality when the temporal memory bank, text-guided attention selection, learnable separators, or SCP is disabled, underscoring the need for hierarchical, text-driven visual memory and progressive training.

Figure 8: Frame sampling sensitivity: performance saturates around 64 frames, with further increases causing degradation due to token selection saturation.
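To make the frame budget concrete, the following sketch picks a fixed number of frame indices from a variable-length segment. Uniform, bin-centred sampling is an assumption made for illustration (the summary does not state how candidate frames are drawn), and `uniform_frame_indices` is a hypothetical helper.

```python
def uniform_frame_indices(num_frames, budget=64):
    """Return up to `budget` evenly spaced frame indices for a video segment."""
    if num_frames <= 0 or budget <= 0:
        raise ValueError("num_frames and budget must be positive")
    if num_frames <= budget:
        # Short segments keep every frame.
        return list(range(num_frames))
    # Take the centre of each of `budget` equal-width bins.
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


print(uniform_frame_indices(1000, budget=4))  # [125, 375, 625, 875]
print(len(uniform_frame_indices(3000)))       # 64
```

With a fixed budget, longer segments are sampled more sparsely, so the budget sets how much temporal detail reaches the query-guided selection stage.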

Frame Selection Visualization and Qualitative Analysis

SurgTEMP's frame selection mechanism consistently identifies clinically informative segments, concentrating attention on safety-relevant events while avoiding irrelevant out-of-body frames. Visualizations show that the selected frames align with those identified by experts, supporting precise clinical assessments.

Figure 9: TEMP module frame selection. Informative frames are highlighted, aligning with clinical expert choices and facilitating precise reasoning.

Qualitative comparison of model outputs shows that SurgTEMP generates more contextually and clinically appropriate responses than baseline architectures across perception, assessment (CVS, difficulty, adverse events, skills), and reasoning (scene description, rationale, planning).

Figure 10: SurgTEMP achieves clinically grounded answer generation for perception-level tasks compared to baseline models.

Figure 11: SurgTEMP outperforms baselines on assessment-level tasks, giving accurate, context-aware clinical responses.

Figure 12: SurgTEMP demonstrates comprehensive, detailed reasoning on surgical scenarios with temporal coherence.

Practical and Theoretical Implications

SurgTEMP addresses domain-specific visual interpretation, temporal granularity, and hierarchical dependencies, providing a robust foundation for intraoperative QA and educational applications. The text-guided memory architecture enables adaptive reasoning about rare, subtle, or visually ambiguous events. The scalability of the TEMP module facilitates integration with longer and more complex surgical procedures, suggesting potential for broad cross-procedural generalization and AI-driven decision support.

Abstention and uncertainty estimation are not yet implemented; advancing these directions is essential for clinical deployment. Extension to more diverse surgical domains will further validate generalization capabilities.

Conclusion

SurgTEMP delivers a state-of-the-art solution for temporal-aware surgical video question answering, leveraging text-guided hierarchical memory and progressive training on a clinically rich dataset. Empirical results show superior performance on assessment and reasoning tasks, emphasizing the necessity for specialized architecture in safety-critical surgical QA. Future work should prioritize cross-procedural adaptation, uncertainty quantification, and real-time deployment in clinical practice.
