- The paper introduces Vivim, a framework that leverages state space models to capture long-range temporal dependencies for improved medical video segmentation.
- The paper’s Temporal Mamba Block replaces traditional self-attention with an efficient spatiotemporal selective scan, yielding enhanced segmentation accuracy.
- The boundary-aware constraint optimizes edge detection, leading to superior Dice and Jaccard scores in thyroid and polyp segmentation tasks.
Vivim: A State Space Model Framework for Medical Video Object Segmentation
The paper under consideration presents an innovative approach to the challenges in medical video object segmentation by introducing Vivim, a Video Vision Mamba-based framework. This framework leverages state space models (SSMs), specifically the Mamba model, to efficiently address the complexities associated with long-range temporal dependencies inherent in video analysis tasks. The focus is on improving segmentation in medical videos, where factors such as ambiguous lesion boundaries and dynamic tissue changes over time create significant challenges.
Methodology and Architecture
Vivim uses an SSM-inspired architecture to model long sequences efficiently. Traditional convolutional neural networks (CNNs) are constrained by their local receptive fields, while transformer-based networks incur quadratic computational cost in sequence length, making long video sequences expensive to process. Conversely, state-of-the-art Mamba models have demonstrated promising results in long-sequence modeling. Vivim capitalizes on these advancements by integrating Mamba modules into a hierarchical transformer framework designed explicitly for video segmentation tasks. This approach aims to surmount the inefficiencies existing architectures face in capturing spatiotemporal cues.
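To make the linear-cost claim concrete, here is a minimal sketch of a discretized linear state-space scan, the recurrence underlying SSMs. It is illustrative only: selective SSMs such as Mamba make the discretized parameters input-dependent and use hardware-aware parallel scans, both of which this toy version omits, and all parameter values below are arbitrary.

```python
import numpy as np

def ssm_scan(x, A, B, C, dt):
    """Minimal discretized linear state-space scan:
        h_t = A_bar @ h_{t-1} + B_bar * x_t,   y_t = C @ h_t
    Cost grows linearly with sequence length, unlike the quadratic
    cost of self-attention."""
    n = A.shape[0]
    # First-order (Euler) discretization of the continuous-time parameters.
    A_bar = np.eye(n) + dt * A
    B_bar = dt * B
    h = np.zeros(n)
    ys = []
    for x_t in x:                 # one state update per token
        h = A_bar @ h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

# Toy example: scalar input sequence, 4-dimensional hidden state.
rng = np.random.default_rng(0)
A = -np.eye(4)                    # stable decaying dynamics
B = rng.standard_normal(4)
C = rng.standard_normal(4)
y = ssm_scan(rng.standard_normal(16), A, B, C, dt=0.1)
print(y.shape)                    # (16,)
```

The single pass over the sequence is what lets SSM-based blocks scale to the long token sequences produced by videos.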
- Temporal Mamba Block: A key component of the Vivim architecture, this block replaces the conventional self-attention module with a state space sequence model that captures both spatial and temporal dependencies at a computational cost that scales linearly with sequence length. This is achieved through a spatiotemporal selective scan (ST-Mamba), which enables joint exploration of intra- and inter-frame relationships.
- Boundary-aware Constraint: To further enhance segmentation accuracy, particularly around ambiguous lesion boundaries, Vivim integrates a boundary-aware constraint. This constraint applies an affine-transformation-based objective to sharpen edge predictions, promoting structural coherence in the segmentation output.
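The core idea behind a spatiotemporal scan is to treat all positions across all frames as one long token sequence, so a 1-D scan can mix information both within and between frames. The sketch below shows only this flattening step under an assumed (T, H, W, C) feature layout; it is a hypothetical illustration, not the paper's implementation, which additionally involves selective scans over the flattened sequence.

```python
import numpy as np

def spatiotemporal_flatten(feats):
    """Flatten video features of shape (T, H, W, C) into a single
    token sequence of length T*H*W, so a 1-D sequence model can
    traverse every position of every frame in one scan."""
    T, H, W, C = feats.shape
    return feats.reshape(T * H * W, C)

def spatiotemporal_unflatten(tokens, shape):
    """Restore the (T, H, W, C) layout after scanning."""
    T, H, W, C = shape
    return tokens.reshape(T, H, W, C)

video = np.random.randn(8, 16, 16, 32)   # 8 frames of 16x16 feature maps
tokens = spatiotemporal_flatten(video)
print(tokens.shape)                       # (2048, 32)
restored = spatiotemporal_unflatten(tokens, video.shape)
assert np.array_equal(video, restored)    # round-trip is lossless
```

Because frames are concatenated along the scan axis, a token late in the sequence can aggregate state from earlier frames, which is how temporal dependencies enter the model.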
Results
Extensive experimentation was conducted on two tasks: thyroid segmentation in ultrasound videos and polyp segmentation in colonoscopy videos. Vivim outperformed state-of-the-art models, consistently achieving higher Dice coefficient and Jaccard index scores and demonstrating its effectiveness in capturing long-range dependencies without the computational burdens associated with traditional transformer models.
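For reference, the two reported metrics have standard definitions on binary masks: Dice is twice the intersection over the sum of mask sizes, and Jaccard is intersection over union. A minimal implementation (the epsilon to guard empty masks is our convention, not the paper's):

```python
import numpy as np

def dice_jaccard(pred, gt, eps=1e-7):
    """Dice coefficient and Jaccard index for binary segmentation masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    jaccard = (inter + eps) / (union + eps)
    return dice, jaccard

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt   = np.array([[1, 0, 0],
                 [0, 1, 1]])
d, j = dice_jaccard(pred, gt)
# intersection = 2, |pred| = 3, |gt| = 3  ->  Dice = 4/6 ~= 0.667
# union = 4                              ->  Jaccard = 2/4 = 0.5
```

The two metrics are monotonically related on a single mask pair, but both are conventionally reported in the segmentation literature.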
In the experiments, Vivim not only produced more accurate segmentations but also improved processing speed, making it a practical choice for clinical applications. The boundary-aware constraint particularly contributed to enhanced discrimination of complex tissue structures.
Implications and Future Work
The implications of this research are significant, particularly in computer-aided diagnosis and treatment planning, where precise object segmentation is crucial. By addressing computational efficiency and segmentation precision, Vivim provides a robust framework that could be generalized to other video-based medical analysis tasks.
The introduction of SSMs into video segmentation is noteworthy, suggesting future exploration into other areas of AI where long-term dependency modeling is critical. Future developments could focus on further optimizing the ST-Mamba blocks for scalability and incorporating additional domain-specific constraints to strengthen segmentation outputs. Moreover, as Mamba modules and similar SSMs evolve, their integration with advanced neural architectures could redefine methodologies in video-based AI applications beyond medical imaging.