
GroundingDINO-V1.5 with SAM-2

Updated 1 June 2026
  • The paper presents a novel framework that integrates an open-vocabulary detector with a high-fidelity mask segmentation model for prompt-driven ultrasound imaging.
  • It leverages a cross-modal decoder with free-form text prompts and selective LoRA fine-tuning to achieve accurate multi-organ segmentation.
  • The approach offers real-time inference and robust generalization across six organ systems, outperforming state-of-the-art methods in mask quality.

GroundingDINO v1.5 integrated with SAM-2, as instantiated in the GroundingDINO-US-SAM framework, is a prompt-driven vision-language architecture for text-prompted multi-organ segmentation in ultrasound imaging. The system combines an open-vocabulary object detector (GroundingDINO v1.5), adapted to the medical domain with Low-Rank Adaptation (LoRA), with a frozen, high-fidelity mask segmentation model (SAM-2). It takes free-form text prompts and delineates diverse anatomical regions with high generalizability. The approach is motivated by three challenges in medical ultrasound: anatomical heterogeneity, diverse imaging protocols, and the scarcity of annotated datasets. It enables scalable, prompt-driven segmentation across six organ systems while updating only a small fraction of the detector's parameters (Rasaee et al., 30 Jun 2025).

1. Model Architecture and Data Flow

The pipeline processes each image-text pair $(I, T)$ as follows:

  1. Input Preparation: Images are resized to $800 \times 800$ and paired with an unrestricted free-form text prompt (e.g., "segment renal capsule", "malignant lesion").
  2. Text Encoding: The text $T$ is tokenized and mapped into a sequence of embeddings $(\mathbf{t}_i)_{i=1}^L$ by a BERT-style encoder whose pretrained weights are frozen and augmented with context-aware LoRA adapters, producing $d=512$-dimensional representations.
  3. Visual Encoding: The image $I$ is processed by a frozen backbone (ResNet-50 or Swin-T from GroundingDINO v1.5), outputting multi-scale visual features further refined by a transformer "feature enhancer."
  4. Cross-Modal Grounding Decoder: A set of learnable object queries $\{\mathbf{q}_i\}_{i=1}^N$ interacts via cross-attention with both visual features and text embeddings. Each decoder layer outputs bounding-box parameters $b_i = (x, y, w, h)$ and open-vocabulary classification logits.
  5. Bounding-Box Head: A small MLP predicts $\hat b_i \in \mathbb{R}^4$ and logits $\hat y_i \in \mathbb{R}^{L+1}$ (token-by-token match plus "no object").
  6. Mask Prompting to SAM-2: The top-scoring predicted bounding boxes are used as spatial prompts for the frozen SAM-2 mask decoder, which returns high-resolution segmentation masks.
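
The hand-off in step 6 can be illustrated with a minimal sketch. It assumes that the detector emits normalized (cx, cy, w, h) boxes with one confidence score per query, and that the mask model expects pixel-space (x0, y0, x1, y1) box prompts. The names `cxcywh_to_xyxy`, `select_box_prompts`, `k`, and `min_score` are hypothetical, and the call into SAM-2 itself is omitted.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def cxcywh_to_xyxy(box: Box, width: int, height: int) -> Box:
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * width
    y0 = (cy - h / 2) * height
    x1 = (cx + w / 2) * width
    y1 = (cy + h / 2) * height
    # Clamp to the image so the prompt never leaves the frame.
    return (max(0.0, x0), max(0.0, y0),
            min(float(width), x1), min(float(height), y1))


def select_box_prompts(
    boxes: Sequence[Box],
    scores: Sequence[float],
    k: int,
    size: Tuple[int, int] = (800, 800),
    min_score: float = 0.0,
) -> List[Box]:
    """Keep the k highest-scoring boxes and return them in pixel coordinates."""
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    width, height = size
    return [cxcywh_to_xyxy(b, width, height)
            for s, b in ranked[:k] if s >= min_score]
```

Each returned box would then be passed as a spatial prompt to the frozen mask decoder, one mask per box.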

Modifications to the base architectures are as follows:

  • GroundingDINO v1.5: All pretrained weights are frozen; roughly 1.7% of the parameters are trainable, namely the LoRA adapters in the feature enhancer, cross-modal decoder, and text encoder modules, together with the bounding-box regression head, which remains fully trainable.
  • SAM-2: Used entirely off-the-shelf, with no fine-tuning; relies on predicted boxes as spatial prompts.
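
To make the parameter budget concrete, the sketch below counts the parameters that one LoRA adapter pair adds to a weight matrix and computes a trainable fraction. The function names are hypothetical and the numbers in the usage note are illustrative only; they are not the GroundingDINO v1.5 configuration.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in one adapter pair: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def trainable_fraction(frozen_params: int, trainable_params: int) -> float:
    """Share of all parameters that receive gradient updates."""
    return trainable_params / (frozen_params + trainable_params)
```

For a hypothetical 256 × 256 projection with rank 8, `lora_param_count(256, 256, 8)` gives 4,096 adapter parameters against 65,536 frozen ones, which is why adapting many layers can still leave the trainable share at a few percent.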

2. LoRA Fine-Tuning and Optimization

GroundingDINO adaptation is performed using LoRA [2], injecting low-rank adapters into selected self-attention and feed-forward modules. For any adapted matrix $W_0 \in \mathbb{R}^{d \times k}$, the output is

$$h = W_0 x + \Delta W x = W_0 x + B A x$$

with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and a low rank $r \ll \min(d, k)$ for all adapted layers.

Attention and FFN calculations are modified as:

  • Attention:

$$Q = (W_Q + B_Q A_Q)\,x, \quad K = (W_K + B_K A_K)\,x, \quad V = (W_V + B_V A_V)\,x$$

  • FFN:

$$\mathrm{FFN}(x) = (W_2 + B_2 A_2)\,\sigma\big((W_1 + B_1 A_1)\,x\big)$$
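
The low-rank update can be sketched in a few lines of dependency-free Python. This is an illustration of the general LoRA mechanism rather than the authors' implementation: the class name `LoRALinear` is hypothetical, the initialization (small Gaussian A, zero B) follows the usual LoRA convention, and the optional scaling factor used in many LoRA implementations is left out for brevity.

```python
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W0 (d x k) plus a trainable low-rank update B @ A."""

    def __init__(self, w0: Matrix, rank: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        d, k = len(w0), len(w0[0])
        self.w0 = w0  # frozen: never updated during fine-tuning
        # A (rank x k) starts small and random; B (d x rank) starts at zero,
        # so the adapted layer initially reproduces the frozen layer exactly.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d)]

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.w0, x)                    # W0 x
        update = matvec(self.b, matvec(self.a, x))   # B (A x)
        return [p + q for p, q in zip(base, update)]
```

Computing `B (A x)` instead of forming `B A` explicitly keeps the extra cost proportional to the rank, which is the source of LoRA's efficiency.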

Only the LoRA adapters and box head are updated during fine-tuning. Training uses the AdamW optimizer (learning rate and weight decay as reported by Rasaee et al., 30 Jun 2025), batch size 4, early stopping with a patience of 20 epochs, and data augmentation including random flips, cropping, erasing, and scale jittering.
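
The early-stopping rule can be sketched as follows. The source states only the patience of 20 epochs; the monitored quantity (validation loss here), the `min_delta` tolerance, and the class name `EarlyStopping` are assumptions made for illustration.

```python
class EarlyStopping:
    """Stop training once the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 20, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0   # improvement resets the counter
        else:
            self.bad_epochs += 1  # no improvement this epoch
        return self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once per epoch after validation, and the loop would exit the first time it returns `True`.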

The overall loss combines bounding-box regression, GIoU [5], and focal loss [6] for classification:

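  • $\mathcal{L} = \lambda_{\mathrm{box}}\,\mathcal{L}_{\mathrm{box}} + \lambda_{\mathrm{GIoU}}\,\mathcal{L}_{\mathrm{GIoU}} + \lambda_{\mathrm{focal}}\,\mathcal{L}_{\mathrm{focal}}$, where $\mathcal{L}_{\mathrm{box}}$ is the bounding-box regression loss, $\mathcal{L}_{\mathrm{GIoU}}$ the generalized IoU loss, $\mathcal{L}_{\mathrm{focal}}$ the focal classification loss, and the weights $\lambda$ are those reported by Rasaee et al. (30 Jun 2025).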
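
A minimal sketch of the three loss terms for a single matched prediction is given below. It assumes an L1 penalty for box regression, boxes in (x0, y0, x1, y1) form with positive area, and the focal-loss defaults from the original focal-loss paper (alpha = 0.25, gamma = 2). The weights `w_l1`, `w_giou`, and `w_focal` are hypothetical placeholders, not the values used in the paper.

```python
import math
from typing import Sequence


def l1_box_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def giou(a: Sequence[float], b: Sequence[float]) -> float:
    """Generalized IoU of two (x0, y0, x1, y1) boxes; ranges over (-1, 1]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def detection_loss(pred_box, gt_box, p, y,
                   w_l1: float = 1.0, w_giou: float = 1.0, w_focal: float = 1.0) -> float:
    """Weighted sum of the three terms (weights are illustrative)."""
    return (w_l1 * l1_box_loss(pred_box, gt_box)
            + w_giou * (1.0 - giou(pred_box, gt_box))
            + w_focal * focal_loss(p, y))
```

The GIoU term stays informative even when the boxes do not overlap, because the enclosing-box penalty still varies with their distance, whereas plain IoU would be zero everywhere.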

3. Text Prompt Design and Encoding

The system supports unbounded free-form prompts, facilitating segmentation of arbitrary organ subregions or pathological findings. Example prompts include "segment kidney cortex," "delineate renal capsule," "malignant lesion," "liver parenchyma," and "thyroid nodule." Prompt encoding consists of:

  1. Tokenization using GroundingDINO's BERT vocabulary.
  2. Embedding to 512-dimensional vectors.
  3. Processing through frozen self-attention layers, with selective LoRA adaptation.
  4. Cross-attention interaction with decoder queries, enabling open-vocabulary localization.
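
The token-level matching behind open-vocabulary localization can be illustrated with a small sketch. It assumes that each decoder query is scored against each text-token embedding by a dot product followed by a sigmoid, and that a phrase's score is the maximum over its tokens; both choices, and the function names, are assumptions for illustration. The "no object" logit is omitted.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def token_scores(query: Sequence[float],
                 token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of one object query to every text token, squashed to (0, 1)."""
    return [sigmoid(sum(q * t for q, t in zip(query, tok)))
            for tok in token_embeddings]


def phrase_score(query: Sequence[float],
                 token_embeddings: Sequence[Sequence[float]],
                 phrase_token_ids: Sequence[int]) -> float:
    """Score a multi-token phrase by its best-matching token."""
    scores = token_scores(query, token_embeddings)
    return max(scores[i] for i in phrase_token_ids)
```

Because the scores are computed against whatever tokens the prompt contains, the same decoder can localize phrases it was never given as fixed class labels.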

Sensitivity analyses show that prompt wording meaningfully affects segmentation performance. For kidney structures, for example, the prompts "segment renal cortex" and "segment kidney capsule" yield different scores (values reported by Rasaee et al., 30 Jun 2025).

4. Datasets, Training, and Evaluation

The system is trained and validated on 15 datasets spanning six anatomies (breast, thyroid, liver, prostate, kidney, muscle), with a total of 16,805 images, and is tested on three held-out datasets ("BUSBRA," "TNSCUI," "Luminous") for out-of-distribution generalization (2,808 images).

Evaluation metrics are:

  • Intersection over Union (IoU): $\mathrm{IoU} = \dfrac{|P \cap G|}{|P \cup G|}$
  • Dice Similarity Coefficient (DSC): $\mathrm{DSC} = \dfrac{2\,|P \cap G|}{|P| + |G|}$

Here $P$ is the predicted mask and $G$ the ground-truth mask.
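
Both metrics can be computed directly from binary masks, as in the sketch below. The function name `iou_and_dice` is hypothetical, masks are assumed to be equal-sized nested lists of 0/1 values, and returning 1.0 when both masks are empty is a convention chosen here, not one stated in the source.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]


def iou_and_dice(pred: Mask, truth: Mask) -> Tuple[float, float]:
    """Return (IoU, DSC) for two binary masks of identical shape."""
    inter = sum(p & t
                for row_p, row_t in zip(pred, truth)
                for p, t in zip(row_p, row_t))
    pred_area = sum(sum(row) for row in pred)
    truth_area = sum(sum(row) for row in truth)
    union = pred_area + truth_area - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match
    return inter / union, 2 * inter / (pred_area + truth_area)
```

For a single pair of masks the two metrics are monotonically related by DSC = 2·IoU / (1 + IoU), so DSC is never lower than IoU; dataset-level means of the two do not obey this identity exactly.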

Detection AP is not reported: the primary focus is final mask quality.

5. Quantitative Results

Experiments yield the following segmentation performance (DSC % / IoU %) on representative datasets (all values from Rasaee et al., 30 Jun 2025):

Ablation of Segmentation Back-End

| Organ | Dataset | DINO+SAM2 (no FT) | FT DINO+MedSAM | FT DINO+SAM2 |
|---|---|---|---|---|
| Breast | BrEaST | 23.9 / 18.1 | 49.6 / 37.3 | 76.7 / 66.2 |
| Breast | BUID | 44.1 / 36.1 | 61.4 / 49.4 | 82.9 / 75.8 |
| Prostate | MicroSeg | 61.4 / 47.5 | 56.8 / 42.8 | 88.6 / 81.2 |
| Average | | 34.5 ± 26 / 26.6 ± 24 | 50.4 ± 27 / 38.5 ± 23 | 74.0 ± 24 / 65.7 ± 24 |

Fine-tuning GroundingDINO with LoRA and coupling it to frozen SAM-2 gives the best average scores of the three configurations (74.0% DSC, 65.7% IoU) and is the best option across almost all anatomies.

Comparison With SOTA on Seen Datasets

| Method | Mean DSC (%) | Mean IoU (%) |
|---|---|---|
| UniverSeg | 47.3 | 35.8 |
| BiomedParse | 62.0 | 53.4 |
| SAMUS | 60.0 | 48.0 |
| MedCLIP-SAM | 62.8 | 50.5 |
| MedCLIP-SAMv2 | 68.9 | 59.1 |
| Ours | 79.0 | 71.1 |

Generalization to Unseen Datasets

| Method | Mean DSC (%) | Mean IoU (%) |
|---|---|---|
| UniverSeg | 26.6 | 18.3 |
| BiomedParse | 44.0 | 36.8 |
| SAMUS | 68.0 | 57.4 |
| MedCLIP-SAM | 56.6 | 42.5 |
| Ours | 72.8 | 68.5 |

The system generalizes well: on the held-out datasets it reaches a mean DSC of 72.8% and a mean IoU of 68.5% without further model updates, compared with 79.0% and 71.1% on the seen datasets.

6. Discussion, Strengths, and Limitations

Strengths

  • Supports zero-click, zero-point, text-prompted segmentation across six organ systems.
  • Only roughly 1.7% of GroundingDINO parameters are updated; all other weights, including SAM-2, are frozen.
  • Efficient inference (approximately 0.33 s/image on a Titan V).
  • Outperforms state-of-the-art segmentation methods (UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, SAMUS) on most seen and unseen datasets.

Weaknesses

  • Performance is contingent on the quality of the image-text alignment learned during pretraining; rare or highly specialized prompts may degrade performance.
  • Segmentation mask fidelity is tied to bounding-box proposal accuracy; if predicted boxes are poor, SAM-2 outputs degrade.
  • The absence of explicit shape priors or local refinement modules may limit the handling of complex topologies.

7. Future Directions and Broader Applications

Fine-tuning of LoRA adapters alone offers a lightweight pathway to adapt GroundingDINO-US-SAM for non-ultrasound modalities:

  • CT/MRI: Segmentation by prompt (e.g., "segment liver tumor") with minimal LoRA adaptation.
  • X-ray: Text prompts such as "rib" or "lung nodule" are possible with small-scale LoRA fine-tuning.
  • Digital Pathology: Prompts for "tumor region," "immune cell aggregate."
  • Clinical Integration: Incorporation with voice-driven prompts, PACS systems, and real-time procedural guidance.

These extensions leverage the parameter efficiency and open-vocabulary design of the core system, facilitating scalable model deployment with minimal dependence on large annotated datasets (Rasaee et al., 30 Jun 2025).

References (1)
