- The paper details how MedSAM2 improves segmentation in 3D medical data through supervised fine-tuning on over 455K image-mask pairs and 76K video frames.
- It adapts the SAM2 architecture to process volumetric and video data via prompt-based segmentation, likely through mechanisms such as 3D convolutions and temporal fusion.
- The human-in-the-loop pipeline demonstrated over 85% cost savings in annotation, accelerating both research dataset creation and clinical integration.
MedSAM2, presented in the paper "MedSAM2: Segment Anything in 3D Medical Images and Videos" (arXiv:2504.03600), is a foundation model designed for promptable segmentation in 3D medical volumetric data and video sequences. It builds upon the Segment Anything Model 2 (SAM2) architecture, adapting it specifically for the complexities of medical imaging modalities and tasks.
Methodology and Adaptation
MedSAM2 leverages the architecture of SAM2 but is specifically fine-tuned for the medical domain. The core methodology involves supervised fine-tuning on a substantial medical dataset comprising over 455,000 3D image-mask pairs and 76,000 video frames. This dataset encompasses a diverse range of anatomical structures (organs, lesions) and imaging modalities, enabling the model to generalize across various medical segmentation tasks.
The adaptation for 3D/video likely involves modifications to handle volumetric input or temporal sequences effectively. While the exact architectural changes from SAM2 are not detailed in the abstract, common approaches for extending 2D models to 3D/video include (see the sketch after this list):
- 3D Convolutions/Attention: Replacing or augmenting 2D operations with their 3D counterparts within the vision encoder to process volumetric context directly.
- Slice-based Processing with Temporal Fusion: Processing individual 2D slices or frames and subsequently fusing the information across the third dimension or time, potentially using recurrent layers, temporal convolutions, or attention mechanisms.
- Prompt Engineering for 3D: Adapting the prompt encoder to accept 3D coordinates (points, boxes) or masks specified within the volume or across video frames.
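As an illustration of the second pattern, here is a minimal PyTorch sketch of slice-based encoding followed by attention fusion across the slice/time axis. All module names, tensor shapes, and the choice of multi-head attention are illustrative assumptions, not details from the MedSAM2 paper:

```python
import torch
import torch.nn as nn

class SliceFusionEncoder(nn.Module):
    """Illustrative only: encode each 2D slice/frame, then fuse along depth/time."""

    def __init__(self, slice_encoder: nn.Module, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.slice_encoder = slice_encoder  # any 2D backbone mapping (N, C, H, W) -> (N, embed_dim)
        self.temporal_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        # volume: (B, D, C, H, W), where D is the slice/frame axis.
        B, D, C, H, W = volume.shape
        slices = volume.reshape(B * D, C, H, W)
        feats = self.slice_encoder(slices).reshape(B, D, -1)  # (B, D, embed_dim)
        fused, _ = self.temporal_attn(feats, feats, feats)    # share context across slices/frames
        return self.norm(feats + fused)                       # (B, D, embed_dim)
```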
The fine-tuning process optimizes the model parameters to accurately segment structures indicated by user prompts (e.g., points, bounding boxes, masks) within the context of 3D medical images or video frames.
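To make the fine-tuning objective concrete, below is a hedged sketch of a single supervised training step for prompted segmentation. The Dice-plus-BCE loss mix and the `model(images, prompts)` interface are common choices assumed here, not the paper's documented recipe:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss over a batch of masks; target is a float mask in {0, 1}."""
    probs = torch.sigmoid(pred_logits)
    inter = (probs * target).sum(dim=(-2, -1))
    denom = probs.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

def finetune_step(model, optimizer, images, prompts, gt_masks):
    """One prompted-segmentation fine-tuning step (interfaces are illustrative)."""
    optimizer.zero_grad()
    pred_logits = model(images, prompts)  # assumed: logits shaped like gt_masks
    loss = dice_loss(pred_logits, gt_masks) \
        + F.binary_cross_entropy_with_logits(pred_logits, gt_masks)
    loss.backward()
    optimizer.step()
    return loss.item()
```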
Performance and Evaluation
MedSAM2 demonstrates superior performance compared to previous models across a spectrum of segmentation targets, including various organs and lesions, evaluated on multiple imaging modalities (e.g., CT, MRI, echocardiography). The large-scale fine-tuning dataset is crucial for achieving this broad applicability and improved accuracy.
A significant aspect highlighted is the human-in-the-loop (HITL) pipeline developed alongside MedSAM2. This pipeline facilitates large-scale dataset creation and annotation. An extensive user study was conducted using this pipeline, involving the annotation of challenging targets: 5,000 CT lesions, 3,984 liver MRI lesions, and 251,550 echocardiogram video frames. The study quantifies the practical utility of MedSAM2, demonstrating a substantial reduction in manual annotation effort, with cost savings of over 85%. This result underscores the model's potential to significantly accelerate medical data curation and research workflows.
Implementation and Deployment
Practical implementation is a key focus of MedSAM2. The model is designed for integration into real-world workflows through user-friendly interfaces and deployment options.
Human-in-the-Loop (HITL) Pipeline: The HITL pipeline, used for the large-scale user study, exemplifies a practical application pattern. It likely operates as follows:
1. Initial Annotation: A user provides sparse prompts (e.g., clicks, a bounding box) on a 3D volume or video frame within an annotation tool.
2. MedSAM2 Prediction: The tool sends the image data and prompts to the MedSAM2 model.
3. Segmentation Output: MedSAM2 returns the predicted segmentation mask.
4. User Refinement: The user reviews the mask and provides corrective prompts (adding positive/negative points, refining box boundaries) if necessary.
5. Iterative Refinement: Steps 2-4 are repeated until the user is satisfied with the segmentation quality.
6. Data Augmentation (Optional): The refined segmentation can be added back to a training dataset for continuous model improvement or further fine-tuning.
Pseudocode for a basic HITL interaction:
```python
def hitl_segmentation(image_3d, initial_prompts, medsam2_model, annotation_tool):
    """
    Performs interactive segmentation using MedSAM2 within an annotation tool.

    Args:
        image_3d: The 3D medical image volume or video sequence.
        initial_prompts: User-provided initial points, boxes, or rough mask.
        medsam2_model: The loaded MedSAM2 model instance.
        annotation_tool: The interface for user interaction and display.

    Returns:
        final_mask: The user-approved segmentation mask, or None if cancelled.
    """
    current_prompts = initial_prompts
    final_mask = None

    while True:
        # 1. Get a prediction from MedSAM2 for the current prompts.
        predicted_mask, quality_score = medsam2_model.predict(image_3d, current_prompts)

        # 2. Display the prediction and collect user feedback, e.g.
        #    {'status': 'refine', 'new_prompts': {...}} or {'status': 'accept'}.
        annotation_tool.display(image_3d, predicted_mask)
        user_feedback = annotation_tool.get_user_feedback()

        # 3. Update based on feedback.
        if user_feedback['status'] == 'accept':
            final_mask = predicted_mask
            break
        elif user_feedback['status'] == 'refine':
            current_prompts = update_prompts(current_prompts, user_feedback['new_prompts'])
        else:  # e.g., 'reject' or 'cancel'
            break

    return final_mask


def update_prompts(existing_prompts, new_prompts):
    """Merge corrective prompts (positive/negative points, box updates) into the set."""
    combined_prompts = {**existing_prompts, **new_prompts}
    return combined_prompts
```
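A hypothetical end-to-end invocation of the loop above; the loader, adapter, prompt format, and I/O helpers are placeholders rather than the released MedSAM2 API:

```python
# All names below are illustrative placeholders, not the released MedSAM2 API.
model = load_medsam2_checkpoint("medsam2_latest.pt")       # assumed loader
tool = AnnotationToolAdapter(display_backend="3d_slicer")  # assumed adapter
volume = load_nifti("case_001_ct.nii.gz")                  # assumed I/O helper

prompts = {"points": [(120, 88, 34)], "labels": [1]}       # one positive click at (x, y, z)
mask = hitl_segmentation(volume, prompts, model, tool)
if mask is not None:
    save_nifti(mask, "case_001_lesion_mask.nii.gz")        # assumed I/O helper
```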
Deployment: MedSAM2 is integrated into widely used platforms, facilitating both local and cloud deployment. This suggests the availability of packaged models (e.g., ONNX, TensorFlow Lite, PyTorch checkpoints) and potentially containerized solutions (e.g., Docker) or cloud-based APIs. Compatibility with platforms like 3D Slicer, MONAI Label, or specific vendor PACS systems would significantly enhance its adoption in clinical research and potentially in diagnostic workflows. Deployment considerations include the following (a minimal export sketch follows the list):
- Computational Resources: While fine-tuned from SAM2, foundation models are typically large. Deployment requires sufficient GPU memory and compute power, especially for processing large 3D volumes or video sequences interactively. Inference speed is critical for HITL responsiveness.
- Latency: Low latency is essential for interactive segmentation tasks within the HITL pipeline. Cloud deployment necessitates efficient data transfer and compute allocation.
- Integration: APIs or plugins are needed for seamless integration into existing medical imaging software and platforms.
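As one concrete option for the local-deployment path, the image encoder from a PyTorch checkpoint could be exported to ONNX. The wrapper below uses the standard `torch.onnx.export` API, but the input shape and module boundary are assumptions rather than MedSAM2 specifics:

```python
import torch

def export_to_onnx(model: torch.nn.Module, onnx_path: str = "medsam2_encoder.onnx"):
    """Export an image-encoder-style module to ONNX (shapes are placeholders)."""
    model.eval()
    dummy_input = torch.randn(1, 3, 1024, 1024)   # assumed encoder input size
    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        input_names=["image"],
        output_names=["embedding"],
        dynamic_axes={"image": {0: "batch"}},     # allow variable batch size
        opset_version=17,
    )
```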
Applications in Research and Healthcare
MedSAM2 serves as a versatile tool for segmentation tasks in both medical research and healthcare environments.
- Research: Accelerates the creation of large, accurately annotated datasets for training task-specific models, conducting quantitative image analysis studies, and exploring anatomical/pathological variability. The >85% reduction in annotation cost significantly lowers the barrier for large-scale studies.
- Healthcare: Potential applications include assisting radiologists and clinicians in tasks like tumor delineation for treatment planning, organ volume quantification, surgical planning, and analyzing dynamic processes in cardiac videos. Its promptable nature allows clinicians to quickly segment regions of interest with minimal interaction.
The development of MedSAM2 represents a significant step towards general-purpose, efficient, and high-quality segmentation tools for 3D medical imaging and video analysis. Its foundation on SAM2, combined with extensive medical-specific fine-tuning and practical utility demonstrated via a large user study and platform integration, positions it as a valuable asset for the field.