GroundingDINO-V1.5 with SAM-2
- The paper presents a novel framework that integrates an open-vocabulary detector with a high-fidelity mask segmentation model for prompt-driven ultrasound imaging.
- It leverages a cross-modal decoder with free-form text prompts and selective LoRA fine-tuning to achieve accurate multi-organ segmentation.
- The approach offers real-time inference and robust generalization across six organ systems, outperforming state-of-the-art methods in mask quality.
GroundingDINO v1.5 integrated with SAM-2, as instantiated in the GroundingDINO-US-SAM framework, is a prompt-driven vision-language architecture for text-prompted multi-organ segmentation in ultrasound imaging. The system combines an open-vocabulary object detector (GroundingDINO v1.5), adapted to the medical domain with Low-Rank Adaptation (LoRA), and a frozen, high-fidelity mask segmentation model (SAM-2), operating with free-form text prompts to delineate diverse anatomical regions with high generalizability. The approach is motivated by anatomical heterogeneity, diverse imaging protocols, and the scarcity of annotated datasets in medical ultrasound. It enables scalable, prompt-driven, zero-shot segmentation across six organ systems while updating only a small fraction of the parameters during adaptation (Rasaee et al., 30 Jun 2025).
1. Model Architecture and Data Flow
The pipeline processes each image-text pair as follows:
- Input Preparation: Images are resized to the model's input resolution and paired with an unrestricted free-form text prompt (e.g., "segment renal capsule", "malignant lesion").
- Text Encoding: The text is tokenized and mapped into a sequence of context-aware embeddings by a BERT-style encoder whose pretrained weights are frozen and augmented with LoRA adapters.
- Visual Encoding: The image is processed by a frozen backbone (ResNet-50 or Swin-T from GroundingDINO v1.5), outputting multi-scale visual features further refined by a transformer “feature enhancer.”
- Cross-Modal Grounding Decoder: A set of learnable object queries interacts via cross-attention with both visual features and text embeddings. Each decoder layer outputs bounding-box parameters and open-vocabulary classification logits.
- Bounding-Box Head: A small MLP predicts the bounding-box coordinates and the classification logits (token-by-token match plus "no object").
- Mask Prompting to SAM-2: The top-scoring predicted bounding boxes are used as spatial prompts for the frozen SAM-2 mask decoder, which returns high-resolution segmentation masks.
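The hand-off from detector to segmenter in the last step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the detector emits normalized `(cx, cy, w, h)` boxes with one confidence score each, and that the segmenter expects pixel-space `(x0, y0, x1, y1)` box prompts. The function names and the `top_k` and `min_score` parameters are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def cxcywh_to_xyxy(box: Sequence[float], width: int, height: int) -> Box:
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1), clipped to the image."""
    cx, cy, w, h = box
    x0 = max(0.0, (cx - w / 2) * width)
    y0 = max(0.0, (cy - h / 2) * height)
    x1 = min(float(width), (cx + w / 2) * width)
    y1 = min(float(height), (cy + h / 2) * height)
    return (x0, y0, x1, y1)


def select_box_prompts(
    boxes: Sequence[Sequence[float]],
    scores: Sequence[float],
    width: int,
    height: int,
    top_k: int = 1,          # hypothetical default, not taken from the paper
    min_score: float = 0.0,  # hypothetical confidence threshold
) -> List[Box]:
    """Keep the top_k highest-scoring boxes and return them as pixel-space prompts."""
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    kept = [box for score, box in ranked if score >= min_score][:top_k]
    return [cxcywh_to_xyxy(box, width, height) for box in kept]


if __name__ == "__main__":
    detections = [(0.5, 0.5, 0.2, 0.4), (0.25, 0.25, 0.5, 0.5)]
    confidences = [0.3, 0.9]
    print(select_box_prompts(detections, confidences, width=100, height=200))
```

Each returned box would then be passed to the frozen segmenter as a spatial prompt. Because the segmenter sees only the box, any localization error at this stage propagates directly into the mask.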
Modifications to the base architectures are as follows:
- GroundingDINO v1.5: All pretrained weights are frozen. LoRA adapters, amounting to roughly 1.7% of the parameters, are inserted in the feature enhancer, cross-modal decoder, and text encoder modules; the bounding-box regression head remains fully trainable.
- SAM-2: Used entirely off-the-shelf, with no fine-tuning; relies on predicted boxes as spatial prompts.
2. LoRA Fine-Tuning and Optimization
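GroundingDINO adaptation is performed using LoRA [2], injecting low-rank adapters into selected self-attention and feed-forward modules. For any adapted weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the output is
$$h = W_0 x + \Delta W x = W_0 x + B A x,$$
with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the same rank $r \ll \min(d, k)$ for all adapted layers.
Attention and FFN calculations are modified as:
- Attention:
$$Q = (W_Q + B_Q A_Q)\,x, \quad K = (W_K + B_K A_K)\,x, \quad V = (W_V + B_V A_V)\,x, \quad \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
- FFN:
$$\mathrm{FFN}(x) = (W_2 + B_2 A_2)\,\sigma\big((W_1 + B_1 A_1)\,x\big)$$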
Only the LoRA adapters and box head are updated during fine-tuning. Training uses the AdamW optimizer with weight decay, a batch size of 4, early stopping with a patience of 20 epochs, and data augmentation including random flips, cropping, erasing, and scale jittering.
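To make the low-rank update concrete, the sketch below applies a LoRA-adapted linear map in plain Python and counts how few parameters the adapter adds for a single layer. It follows the standard LoRA formulation (frozen `W0` plus a trainable product `B A`, with `B` initialized to zero) and is not the authors' implementation. The layer dimensions and rank are hypothetical, and the per-layer fraction computed here is not the model-wide figure reported in the paper.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def init_lora(d: int, k: int, r: int, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Return (A, B): A is r x k with small Gaussian entries, B is d x r and all zeros."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d)]
    return a, b


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: Vector) -> Vector:
    """Compute h = W0 x + B (A x); only A and B would receive gradients."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, update)]


def trainable_fraction(d: int, k: int, r: int) -> float:
    """Share of one adapted layer's parameters that belongs to the adapter."""
    adapter = r * (d + k)
    return adapter / (d * k + adapter)


if __name__ == "__main__":
    d, k, r = 4, 3, 2  # hypothetical sizes, chosen only for illustration
    w0 = [[float(i + j) for j in range(k)] for i in range(d)]
    a, b = init_lora(d, k, r)
    x = [1.0, 2.0, 3.0]
    print(lora_forward(w0, a, b, x))
    print(trainable_fraction(256, 256, 4))
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, so fine-tuning begins from the pretrained behavior and moves away from it only as the adapters are trained.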
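The overall loss combines bounding-box regression, GIoU [5], and focal loss [6] for classification:
- $\mathcal{L} = \lambda_{\text{box}}\,\mathcal{L}_{\text{box}} + \lambda_{\text{GIoU}}\,\mathcal{L}_{\text{GIoU}} + \lambda_{\text{cls}}\,\mathcal{L}_{\text{focal}}$, where each $\lambda$ is a scalar weight on the corresponding term.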
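The three loss terms can be illustrated for a single matched prediction. The sketch below is an assumption-laden example rather than the paper's training code: the box term is taken to be an L1 distance, the focal-loss defaults (`alpha=0.25`, `gamma=2.0`) are those of the original focal-loss paper, and the weights `w_l1`, `w_giou`, and `w_cls` are hypothetical placeholders. Boxes are `(x0, y0, x1, y1)` with positive area.

```python
import math
from typing import Sequence


def giou(a: Sequence[float], b: Sequence[float]) -> float:
    """Generalized IoU of two (x0, y0, x1, y1) boxes with positive area; range (-1, 1]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def l1_box(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def focal_loss(logit: float, target: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one logit; target is 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


def detection_loss(pred_box, gt_box, logit, target,
                   w_l1=1.0, w_giou=1.0, w_cls=1.0) -> float:
    """Weighted sum of L1, (1 - GIoU), and focal terms; weights are placeholders."""
    return (w_l1 * l1_box(pred_box, gt_box)
            + w_giou * (1.0 - giou(pred_box, gt_box))
            + w_cls * focal_loss(logit, target))


if __name__ == "__main__":
    print(detection_loss((0.1, 0.1, 0.5, 0.5), (0.2, 0.2, 0.6, 0.6), logit=1.0, target=1))
```

The GIoU term still provides a useful signal when the predicted and ground-truth boxes do not overlap at all, a case in which plain IoU is zero regardless of how far apart the boxes are.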
3. Text Prompt Design and Encoding
The system supports unbounded free-form prompts, facilitating segmentation of arbitrary organ subregions or pathological findings. Example prompts include "segment kidney cortex," "delineate renal capsule," "malignant lesion," "liver parenchyma," and "thyroid nodule." Prompt encoding consists of:
- Tokenization using GroundingDINO's BERT vocabulary.
- Embedding to 512-dimensional vectors.
- Processing through frozen self-attention layers, with selective LoRA adaptation.
- Cross-attention interaction with decoder queries, enabling open-vocabulary localization.
Sensitivity analyses demonstrate that prompt wording meaningfully affects segmentation performance. For kidney structures, for example, the prompts "segment renal cortex" and "segment kidney capsule" yield different DSC and IoU scores.
4. Datasets, Training, and Evaluation
The system is trained and validated on 15 datasets spanning six anatomies (breast, thyroid, liver, prostate, kidney, muscle), with a total of 16,805 images, and is tested on three held-out datasets ("BUSBRA," "TNSCUI," "Luminous") for out-of-distribution generalization (2,808 images).
Evaluation metrics are:
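- Intersection over Union (IoU): $\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$, where $P$ is the predicted mask and $G$ is the ground-truth mask.
- Dice Similarity Coefficient (DSC): $\mathrm{DSC} = \frac{2\,|P \cap G|}{|P| + |G|}$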
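Both metrics can be computed directly from binary masks, as in the short sketch below. Masks are represented as nested lists of 0/1 values purely for illustration, and the convention of returning 1.0 when both masks are empty is an assumption, since the source does not say how that case is handled.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]


def iou_and_dice(pred: Mask, gt: Mask) -> Tuple[float, float]:
    """Return (IoU, DSC) for two binary masks of identical shape."""
    inter = sum(
        1
        for pred_row, gt_row in zip(pred, gt)
        for p, g in zip(pred_row, gt_row)
        if p and g
    )
    pred_area = sum(1 for row in pred for p in row if p)
    gt_area = sum(1 for row in gt for g in row if g)
    union = pred_area + gt_area - inter
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    return inter / union, 2 * inter / (pred_area + gt_area)


if __name__ == "__main__":
    prediction = [[1, 1, 0], [0, 0, 0]]
    ground_truth = [[0, 1, 1], [0, 0, 0]]
    iou, dice = iou_and_dice(prediction, ground_truth)
    print(iou, dice)
```

For a single pair of masks the two scores are related by DSC = 2 IoU / (1 + IoU), so DSC is never lower than IoU, which is consistent with the DSC / IoU pairs in the results tables.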
Detection AP is not reported: the primary focus is final mask quality.
5. Quantitative Results
Comprehensive experiments demonstrate the following segmentation performance (DSC % / IoU %) on representative datasets (all values from Rasaee et al., 30 Jun 2025):
Ablation of Segmentation Back-End
| Organ | Dataset | DINO+SAM2 (no FT) | FT DINO+MedSAM | FT DINO+SAM2 |
|---|---|---|---|---|
| Breast | BrEaST | 23.9 / 18.1 | 49.6 / 37.3 | 76.7 / 66.2 |
| Breast | BUID | 44.1 / 36.1 | 61.4 / 49.4 | 82.9 / 75.8 |
| Prostate | MicroSeg | 61.4 / 47.5 | 56.8 / 42.8 | 88.6 / 81.2 |
| Average | | 34.5 ± 26 / 26.6 ± 24 | 50.4 ± 27 / 38.5 ± 23 | 74.0 ± 24 / 65.7 ± 24 |
Fine-tuning GroundingDINO with LoRA and coupling it to a frozen SAM-2 gives the best results on almost all anatomies.
Comparison With SOTA on Seen Datasets
| Method | DSC Mean | IoU Mean |
|---|---|---|
| UniverSeg | 47.3 | 35.8 |
| BiomedParse | 62.0 | 53.4 |
| SAMUS | 60.0 | 48.0 |
| MedCLIP-SAM | 62.8 | 50.5 |
| MedCLIP-SAMv2 | 68.9 | 59.1 |
| Ours | 79.0 | 71.1 |
Generalization to Unseen Datasets
| Method | DSC Mean | IoU Mean |
|---|---|---|
| UniverSeg | 26.6 | 18.3 |
| BiomedParse | 44.0 | 36.8 |
| SAMUS | 68.0 | 57.4 |
| MedCLIP-SAM | 56.6 | 42.5 |
| Ours | 72.8 | 68.5 |
The segmentation system achieves strong generalization, maintaining robust performance on unseen distributions without further model updates.
6. Discussion, Strengths, and Limitations
Strengths
- Supports zero-click, zero-point, text-prompted segmentation across six organ systems.
- Only about 1.7% of GroundingDINO parameters are updated; all other weights, including SAM-2, are frozen.
- Efficient, real-time inference (approximately 0.33 s/image on Titan V).
- Outperforms state-of-the-art segmentation methods (UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, SAMUS) on most seen and unseen datasets.
Weaknesses
- Performance is contingent on the quality of the image-text alignment learned during pretraining; rare or highly specialized prompts may degrade performance.
- Segmentation mask fidelity is tied to bounding-box proposal accuracy; if predicted boxes are poor, SAM-2 outputs degrade.
- The absence of explicit shape priors or local refinement modules limits the handling of complex topologies.
7. Future Directions and Broader Applications
Fine-tuning of LoRA adapters alone offers a lightweight pathway to adapt GroundingDINO-US-SAM for non-ultrasound modalities:
- CT/MRI: Segmentation by prompt (e.g., "segment liver tumor") with minimal LoRA adaptation.
- X-ray: Text prompts such as "rib," "lung nodule" possible with small-scale LoRA fine-tuning.
- Digital Pathology: Prompts for "tumor region," "immune cell aggregate."
- Clinical Integration: Incorporation with voice-driven prompts, PACS systems, and real-time procedural guidance.
These extensions leverage the parameter efficiency and open-vocabulary design of the core system, facilitating scalable model deployment with minimal dependence on large annotated datasets (Rasaee et al., 30 Jun 2025).