GroundingDINO-V1.5 with SAM-2

Updated 1 June 2026

The paper presents a novel framework that integrates an open-vocabulary detector with a high-fidelity mask segmentation model for prompt-driven ultrasound imaging.
It leverages a cross-modal decoder with free-form text prompts and selective LoRA fine-tuning to achieve accurate multi-organ segmentation.
The approach offers real-time inference and robust generalization across six organ systems, outperforming state-of-the-art methods in mask quality.

GroundingDINO v1.5 integrated with SAM-2, as instantiated in the GroundingDINO-US-SAM framework, is a prompt-driven vision-language architecture for text-prompted multi-organ segmentation in ultrasound imaging. The system combines an open-vocabulary object detector (GroundingDINO v1.5), adapted to the medical domain with Low-Rank Adaptation (LoRA), and a frozen, high-fidelity mask segmentation model (SAM-2), operating with free-form text prompts to delineate diverse anatomical regions with high generalizability. The approach is motivated by the challenge of anatomical heterogeneity, diverse imaging protocols, and scarcity of annotated datasets in medical ultrasound, enabling scalable, prompt-driven, zero-shot segmentation across six organ systems using only a small fraction of parameter updates for adaptation (Rasaee et al., 30 Jun 2025).

1. Model Architecture and Data Flow

The pipeline processes each image-text pair $(I, T)$ as follows:

Input Preparation: Images are resized to $800 \times 800$ and paired with an unrestricted free-form text prompt (e.g., "segment renal capsule", "malignant lesion").
Text Encoding: The text $T$ is tokenized and mapped into a sequence of embeddings $(\mathbf{t}_i)_{i=1}^L$ using a frozen BERT-style encoder, producing $d=512$-dimensional representations with context-aware LoRA adapters.
Visual Encoding: The image $I$ is processed by a frozen backbone (ResNet-50 or Swin-T from GroundingDINO v1.5), outputting multi-scale visual features further refined by a transformer “feature enhancer.”
Cross-Modal Grounding Decoder: A set of learnable object queries $\{\mathbf{q}_i\}_{i=1}^N$ interact via cross-attention with both visual features and text embeddings. Each decoder layer outputs bounding-box parameters $b_i = (x, y, w, h)$ and open-vocabulary classification logits.
Bounding-Box Head: A small MLP predicts $\hat b_i \in \mathbb{R}^4$ and logits $\hat y_i \in \mathbb{R}^{L+1}$ (token-by-token match plus "no object").
Mask Prompting to SAM-2: The top-$K$ predicted bounding boxes are used as spatial prompts for the frozen SAM-2 mask decoder, which returns high-resolution segmentation masks.
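
The hand-off from detector to segmenter in the last step can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' code: it assumes the detector emits boxes as normalized center-format $(c_x, c_y, w, h)$ tuples, as is common in DETR-style detectors, and that the mask model expects pixel-space corner boxes. The function names `cxcywh_to_xyxy` and `select_box_prompts`, the `threshold` parameter, and the default image size are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def cxcywh_to_xyxy(box: Box, width: int, height: int) -> Box:
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1), clipped to the image."""
    cx, cy, w, h = box
    x0, x1 = (cx - w / 2) * width, (cx + w / 2) * width
    y0, y1 = (cy - h / 2) * height, (cy + h / 2) * height

    def clip(value: float, upper: int) -> float:
        return min(max(value, 0.0), float(upper))

    return (clip(x0, width), clip(y0, height), clip(x1, width), clip(y1, height))


def select_box_prompts(
    boxes: Sequence[Box],
    scores: Sequence[float],
    k: int,
    threshold: float = 0.0,
    size: Tuple[int, int] = (800, 800),  # (width, height); assumed to match the resized input
) -> List[Box]:
    """Keep the k highest-scoring boxes at or above threshold, as pixel-space corner boxes."""
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    width, height = size
    return [cxcywh_to_xyxy(b, width, height) for s, b in ranked[:k] if s >= threshold]


# Example: three candidate boxes, keep the two most confident ones.
prompts = select_box_prompts(
    boxes=[(0.5, 0.5, 0.5, 0.5), (0.25, 0.25, 0.5, 0.5), (0.9, 0.9, 0.4, 0.4)],
    scores=[0.9, 0.2, 0.6],
    k=2,
)
```

Each returned box would then be passed to the frozen mask model as a spatial prompt; that call is omitted here because its exact interface is not described in the source.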

Modifications to the base architectures are as follows:

GroundingDINO v1.5: All pretrained weights are frozen except for about 1.7% of parameters (LoRA adapters) in the feature enhancer, cross-modal decoder, and text encoder modules; the bounding-box regression head remains fully trainable.
SAM-2: Used entirely off-the-shelf, with no fine-tuning; relies on predicted boxes as spatial prompts.

2. LoRA Fine-Tuning and Optimization

GroundingDINO adaptation is performed using LoRA [2], injecting low-rank adapters into selected self-attention and feed-forward modules. For any adapted matrix $W_0 \in \mathbb{R}^{d \times k}$, the output is

$h = W_0 x + B A x$

with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and a rank $r \ll \min(d, k)$ for all adapted layers.
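
To make the low-rank update concrete, the sketch below implements the standard LoRA forward pass $W_0 x + B(Ax)$ for a frozen weight $W_0 \in \mathbb{R}^{d \times k}$ with adapters $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, and counts the extra trainable parameters per adapted matrix. It is an illustration in plain Python, not the paper's implementation; the helper names and the rank of 8 in the example are assumptions chosen for illustration, not values taken from the source.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: List[float]) -> List[float]:
    """Return W0 x + B (A x); W0 stays frozen, only A and B would be trained."""
    column = [[value] for value in x]
    base = matmul(w0, column)
    update = matmul(b, matmul(a, column))  # never materializes the d x k product B A
    return [p[0] + q[0] for p, q in zip(base, update)]


def lora_param_fraction(d: int, k: int, r: int) -> float:
    """Trainable adapter parameters, r * (d + k), relative to the frozen d * k weight."""
    return r * (d + k) / (d * k)


# Example with d=4, k=3, r=2. B starts at zero (the usual LoRA initialization),
# so the adapted layer initially reproduces the frozen layer exactly.
w0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
a = [[1, 2, 3], [4, 5, 6]]
b_zero = [[0, 0], [0, 0], [0, 0], [0, 0]]
y = lora_forward(w0, a, b_zero, [1, 2, 3])

# Illustrative rank only: a 512 x 512 projection with r=8 adds 1/32 of its size.
fraction = lora_param_fraction(512, 512, 8)
```

Because the fraction scales as $r(d+k)/(dk)$, a small rank keeps the per-matrix overhead to a few percent, which is why adapting only selected modules can leave the overall trainable share very low.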

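Attention and FFN calculations are modified by adding a low-rank update $B_\bullet A_\bullet$ to each adapted frozen projection $W_\bullet$:

Attention:

$Q = X (W_Q + B_Q A_Q)^\top, \quad K = X (W_K + B_K A_K)^\top, \quad V = X (W_V + B_V A_V)^\top, \quad \mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$

FFN:

$\mathrm{FFN}(x) = (W_2 + B_2 A_2)\,\sigma\big((W_1 + B_1 A_1)\,x\big)$

Here the rows of $X$ are token features, $d_k$ is the key dimension, and $\sigma$ is the FFN activation.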

Only the LoRA adapters and box head are updated during fine-tuning. Training uses the AdamW optimizer, with the learning rate and weight decay reported in Rasaee et al. (30 Jun 2025), a batch size of 4, early stopping with a patience of 20 epochs, and data augmentation comprising random flips, cropping, erasing, and scale jittering.
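
The early-stopping rule mentioned above can be sketched in a few lines. This is a generic illustration, assuming the monitored quantity is a validation loss where lower is better; the source states only the patience of 20 epochs, so the class name `EarlyStopping` and the `min_delta` parameter are hypothetical.

```python
class EarlyStopping:
    """Signal a stop once the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 20, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta  # smallest decrease that counts as an improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Example with a short patience so the behavior is easy to follow.
stopper = EarlyStopping(patience=3)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.93, 0.92]]
```

In a training loop, `step` would be called once per epoch after validation, and the best checkpoint so far would typically be restored when it returns `True`.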

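The overall loss combines bounding-box regression, GIoU [5], and focal loss [6] for classification:

$\mathcal{L} = \lambda_{\text{box}}\,\mathcal{L}_{\text{box}} + \lambda_{\text{GIoU}}\,\mathcal{L}_{\text{GIoU}} + \lambda_{\text{cls}}\,\mathcal{L}_{\text{focal}}$

where $\lambda_{\text{box}}$, $\lambda_{\text{GIoU}}$, and $\lambda_{\text{cls}}$ are scalar loss weights.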
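
As a worked illustration of the three loss terms, the sketch below computes a box regression term, a GIoU term, and a binary focal term for a single prediction and sums them with scalar weights. Several choices here are assumptions rather than details from the source: the regression term is taken to be an L1 distance, the focal parameters default to the commonly used $\alpha = 0.25$ and $\gamma = 2$, the loss weights default to 1, and all function names are hypothetical. Boxes are assumed to have positive area and probabilities to lie strictly between 0 and 1.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def giou(a: Box, b: Box) -> float:
    """Generalized IoU: IoU minus the fraction of the enclosing box not covered by the union."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def focal_loss(p: float, target: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss; down-weights examples that are already classified well."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def l1_box(a: Box, b: Box) -> float:
    """Sum of absolute coordinate differences (assumed form of the regression term)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def detection_loss(pred: Box, gt: Box, p: float, target: int,
                   w_box: float = 1.0, w_giou: float = 1.0, w_cls: float = 1.0) -> float:
    """Weighted sum of the box, GIoU, and focal terms for one matched prediction."""
    return (w_box * l1_box(pred, gt)
            + w_giou * (1.0 - giou(pred, gt))
            + w_cls * focal_loss(p, target))
```

The GIoU term stays informative even when the predicted and ground-truth boxes do not overlap, since the enclosing-box penalty still varies with their distance, whereas plain IoU would be zero throughout.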

3. Text Prompt Design and Encoding

The system supports unbounded free-form prompts, facilitating segmentation of arbitrary organ subregions or pathological findings. Example prompts include "segment kidney cortex," "delineate renal capsule," "malignant lesion," "liver parenchyma," and "thyroid nodule." Prompt encoding consists of:

Tokenization using GroundingDINO's BERT vocabulary.
Embedding to 512-dimensional vectors.
Processing through frozen self-attention layers, with selective LoRA adaptation.
Cross-attention interaction with decoder queries, enabling open-vocabulary localization.

Sensitivity analyses show that prompt wording meaningfully affects segmentation performance. For kidney structures, for example, the prompts "segment renal cortex" and "segment kidney capsule" yield different segmentation scores (Rasaee et al., 30 Jun 2025).

4. Datasets, Training, and Evaluation

The system is trained and validated on 15 datasets spanning six anatomies (breast, thyroid, liver, prostate, kidney, muscle), with a total of 16,805 images, and is tested on three held-out datasets ("BUSBRA," "TNSCUI," "Luminous") for out-of-distribution generalization (2,808 images).

Evaluation metrics, for a predicted mask $P$ and a ground-truth mask $G$, are:

Intersection over Union (IoU): $\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$
Dice Similarity Coefficient (DSC): $\mathrm{DSC} = \frac{2\,|P \cap G|}{|P| + |G|}$

Detection AP is not reported: the primary focus is final mask quality.
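
Both mask metrics can be computed from the same three pixel counts. The following is a small, dependency-free sketch for binary masks stored as nested lists; the function name `iou_and_dice` is hypothetical, and returning a perfect score when both masks are empty is a convention assumed here, not something specified in the source.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 pixel labels


def iou_and_dice(pred: Mask, target: Mask) -> Tuple[float, float]:
    """Return (IoU, DSC) for two binary masks of identical shape."""
    inter = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            target_area += 1 if t else 0
    union = pred_area + target_area - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    return inter / union, 2 * inter / (pred_area + target_area)


# Example: the masks share 2 pixels, and each covers 3 pixels.
iou, dsc = iou_and_dice([[1, 1, 0], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]])
```

For a single pair of masks the two metrics are linked by $\mathrm{DSC} = 2\,\mathrm{IoU} / (1 + \mathrm{IoU})$, so DSC is never lower than IoU; this identity does not carry over to dataset-level means.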

5. Quantitative Results

Segmentation performance (DSC % / IoU %) on representative datasets is as follows (all values from Rasaee et al., 30 Jun 2025):

Ablation of Segmentation Back-End

Organ	Dataset	DINO+SAM2 (no FT)	FT DINO+MedSAM	FT DINO+SAM2
Breast	BrEaST	23.9 / 18.1	49.6 / 37.3	76.7 / 66.2
Breast	BUID	44.1 / 36.1	61.4 / 49.4	82.9 / 75.8
Prostate	MicroSeg	61.4 / 47.5	56.8 / 42.8	88.6 / 81.2
Average		34.5 ± 26 / 26.6 ± 24	50.4 ± 27 / 38.5 ± 23	74.0 ± 24 / 65.7 ± 24

Fine-tuning GroundingDINO with LoRA and coupling to frozen SAM-2 is optimal across almost all anatomies.

Comparison With SOTA on Seen Datasets

Method	DSC Mean	IoU Mean
UniverSeg	47.3	35.8
BiomedParse	62.0	53.4
SAMUS	60.0	48.0
MedCLIP-SAM	62.8	50.5
MedCLIP-SAMv2	68.9	59.1
Ours	79.0	71.1

Generalization to Unseen Datasets

Method	DSC Mean	IoU Mean
UniverSeg	26.6	18.3
BiomedParse	44.0	36.8
SAMUS	68.0	57.4
MedCLIP-SAM	56.6	42.5
Ours	72.8	68.5

The segmentation system achieves strong generalization, maintaining robust performance on unseen distributions without further model updates.

6. Discussion, Strengths, and Limitations

Strengths

Supports zero-click, zero-point, text-prompted segmentation across six organ systems.
Only about 1.7% of GroundingDINO parameters are updated; all other weights, including SAM-2, are frozen.
Efficient, real-time inference (about 0.33 s/image on a Titan V).
Outperforms state-of-the-art segmentation methods (UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, SAMUS) on most seen and unseen datasets.

Weaknesses

Performance is contingent on the quality of the image-text alignment learned during pretraining; rare or highly specialized prompts may degrade performance.
Segmentation mask fidelity is tied to bounding-box proposal accuracy; if predicted boxes are poor, SAM-2 outputs degrade.
Absence of explicit shape priors or local refinement modules affects complex topology handling.

7. Future Directions and Broader Applications

Fine-tuning of LoRA adapters alone offers a lightweight pathway to adapt GroundingDINO-US-SAM for non-ultrasound modalities:

CT/MRI: Segmentation by prompt (e.g., "segment liver tumor") with minimal LoRA adaptation.
X-ray: Text prompts such as "rib," "lung nodule" possible with small-scale LoRA fine-tuning.
Digital Pathology: Prompts for "tumor region," "immune cell aggregate."
Clinical Integration: Incorporation with voice-driven prompts, PACS systems, and real-time procedural guidance.

These extensions leverage the parameter efficiency and open-vocabulary design of the core system, facilitating scalable model deployment with minimal dependence on large annotated datasets (Rasaee et al., 30 Jun 2025).

References

Rasaee et al. (2025). GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models.
